ISO:Operations

From Hornbill
Revision as of 09:11, 26 September 2018 by Keiths

Capacity management

Network utilisation, disk utilisation and server load are monitored by Nagios, which raises automatic alerts when preset thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues. The network has sufficient capacity for current and anticipated needs; this is kept under review by the Chief Technical Officer.
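As an illustration of how a threshold alert of this kind can work, the following is a minimal sketch of a Nagios-style disk utilisation check. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention. The function names and the 80% and 90% thresholds are assumptions made for the example, not the thresholds actually configured.

```python
import shutil

# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a utilisation percentage to a Nagios status code.

    The default thresholds are illustrative assumptions only.
    """
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status, message) for disk utilisation at the given path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The path is missing or unreadable, so the state cannot be determined.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent_used = 100.0 * usage.used / usage.total
    status = classify(percent_used, warn, crit)
    label = ("OK", "WARNING", "CRITICAL")[status]
    return status, f"DISK {label} - {percent_used:.1f}% used on {path}"
```

A plugin built on this sketch would print the message and exit with the status code, which is how Nagios decides whether to raise an alert.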


Monitoring

All instances, services and hardware are monitored from several locations around the world (each monitoring server acts as a backup to the primary, and the results are compared). We check over 100 different metrics per instance (and anything that an instance may require) every 5 minutes to ensure all is well. Any warning is logged and escalated to the Cloud Team.
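The page does not say how results from the different monitoring locations are compared. One plausible approach, sketched here purely as an assumption, is a majority vote: the status reported by most monitors is treated as the agreed state, and any monitor that disagrees is flagged for review. The monitor names are hypothetical.

```python
from collections import Counter


def consensus(results):
    """Return the status reported by a strict majority of monitors.

    `results` maps a monitor name to the status it reported.
    Returns None when there are no results or no strict majority.
    """
    if not results:
        return None
    status, votes = Counter(results.values()).most_common(1)[0]
    return status if votes > len(results) / 2 else None


def disagreeing_monitors(results):
    """List the monitors whose result differs from the majority.

    When no majority exists, every monitor is returned for review.
    """
    agreed = consensus(results)
    if agreed is None:
        return sorted(results)
    return sorted(name for name, status in results.items() if status != agreed)


# Hypothetical results for one check, as seen from three locations.
sample = {"monitor-eu": "OK", "monitor-us": "OK", "monitor-ap": "CRITICAL"}
agreed_status = consensus(sample)
outliers = disagreeing_monitors(sample)
```

A disagreement can mean a regional network problem rather than a fault on the instance, which is one reason to compare locations before escalating.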

Checks include (not a comprehensive list):

  • Performance (pings, DNS propagation, API response times, CPU load, RAM load, disk I/O, network load, etc.)
  • Hardware (availability, temperature, SMART, SNMP, etc.)
  • Capacity (disk space, CPU, RAM, etc.)
  • Availability (ping, DNS propagation, API tests, host controller checks, etc.)
  • Security (automated log file reviews, traffic review, pattern analysis, etc.)
  • IDS (intrusion detection; suspicious or malicious traffic analysis, including packet, bandwidth, source and traffic monitoring)
  • Data leakage (packet, bandwidth, source and destination, and traffic monitoring and analysis)
  • Backups (sync checks, replication checks, off-instance checks, etc.)
  • Sanity (checks for mail queues, expected load, etc.)

Backups

All databases are replicated in real time to a separate data center, and all files are replicated off site within 5 minutes. These replicas are then backed up each evening (each as an individual secure archive encrypted with a one-time key) and stored off site in S3 for the retention period specified in contracts. The backups are taken without any interruption of services.
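The two time limits above (file replication within 5 minutes, and a contractual retention period) lend themselves to simple automated checks. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the retention period is passed in as a number of days because the actual value is set per contract.

```python
from datetime import date, datetime, timedelta

# From the text: files are replicated off site within 5 minutes.
MAX_REPLICA_LAG = timedelta(minutes=5)


def replica_is_current(last_sync, now, max_lag=MAX_REPLICA_LAG):
    """True if the last successful sync is within the allowed lag."""
    return now - last_sync <= max_lag


def expired_backups(backup_dates, retention_days, today):
    """Return the backup dates that are older than the retention period.

    `retention_days` is an assumed input; the real period is contractual.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)


# Hypothetical example values.
now = datetime(2018, 9, 26, 9, 11)
fresh = replica_is_current(now - timedelta(minutes=3), now)
stale = replica_is_current(now - timedelta(minutes=6), now)
to_remove = expired_backups(
    [date(2018, 8, 1), date(2018, 9, 20)],
    retention_days=30,
    today=date(2018, 9, 26),
)
```

Here `fresh` is true and `stale` is false, and only the 1 August backup falls outside the assumed 30-day retention window.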