Difference between revisions of "ISO:Operations"

From Hornbill

Revision as of 13:42, 26 September 2018

Capacity management

Network utilisation, disk utilisation and server load are monitored by Nagios. This tool provides automatic alerts when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.

We have hardware available for the expected growth of Hornbill. This capacity is reviewed every 3 months and increased by purchasing additional hypervisors or rack space. If required, we can also quickly create an instance, or a complete replica, of the Hornbill infrastructure in AWS, so that capacity does not limit scalability. This scalability, together with the underlying server code, also removes limits on user growth, because new servers can be added as demand increases.

Network utilisation, disk utilisation and server load are monitored in real time by collecting over 100 data points (CPU, RAM and HDD utilisation for all services, etc.) for use in graphing; these data points are additionally collected by Nagios. These tools, charts and engines provide automatic alerts when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.
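
As an illustration of threshold-based alerting, the following is a minimal sketch and not the actual Hornbill or Nagios configuration. The status codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the metric names, the threshold values and the helper names `Threshold`, `classify` and `alerts` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Nagios plugin convention for check results.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class Threshold:
    warn: float  # reading at or above this value raises WARNING
    crit: float  # reading at or above this value raises CRITICAL


def classify(value: float, threshold: Threshold) -> int:
    """Map one utilisation reading to a Nagios-style status code."""
    if value >= threshold.crit:
        return CRITICAL
    if value >= threshold.warn:
        return WARNING
    return OK


def alerts(readings: dict[str, float],
           thresholds: dict[str, Threshold]) -> dict[str, int]:
    """Return only the metrics whose reading breaches a threshold."""
    breached = {}
    for name, value in readings.items():
        status = classify(value, thresholds[name])
        if status != OK:
            breached[name] = status
    return breached


# Hypothetical thresholds, in percent utilisation.
THRESHOLDS = {
    "cpu": Threshold(warn=80.0, crit=95.0),
    "ram": Threshold(warn=85.0, crit=95.0),
    "disk": Threshold(warn=75.0, crit=90.0),
}

if __name__ == "__main__":
    # Hypothetical readings for one server.
    print(alerts({"cpu": 42.0, "ram": 88.5, "disk": 91.0}, THRESHOLDS))
```

Keeping the thresholds in one table, separate from the classification logic, makes it straightforward to review and adjust them without touching the check itself.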

Monitoring

All instances, services and hardware are monitored from several locations around the world (each monitor server acts as a backup to the primary, and the results are compared). We check over 100 different metrics per instance (and anything that an instance may require) every 5 minutes. Any warning is logged and escalated to the Cloud Team.
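
One way the results from several monitor locations could be compared is sketched below. This page does not describe the comparison method, so the majority-vote approach, the function name `compare_results` and the location names are assumptions for illustration only.

```python
from collections import Counter


def compare_results(results: dict[str, str]) -> tuple[str, list[str]]:
    """Return the majority status and the locations that disagree with it.

    `results` maps a monitor location name to the status that location
    reported for a single check, for example "OK" or "WARNING".
    """
    if not results:
        raise ValueError("no monitor results to compare")
    counts = Counter(results.values())
    majority, _ = counts.most_common(1)[0]
    dissenters = sorted(
        location for location, status in results.items() if status != majority
    )
    return majority, dissenters


if __name__ == "__main__":
    # Hypothetical monitor locations and statuses.
    status, dissenters = compare_results(
        {"london": "OK", "frankfurt": "OK", "virginia": "WARNING"}
    )
    print(status, dissenters)
```

A location that disagrees with the others may indicate a regional network problem rather than a fault in the instance itself, which is one reason to check from more than one place.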

Checks include (not a comprehensive list):

  • Performance (pings, DNS propagation, API response times, CPU load, RAM load, disk I/O, network load, etc.)
  • Hardware (availability, temperature, SMART, SNMP, etc.)
  • Capacity (disk space, CPU, RAM, etc.)
  • Availability (ping, DNS propagation, API tests, host controller checks, etc.)
  • Security (automated log file reviews, traffic review, pattern analysis, etc.)
  • IDS (intrusion detection; suspicious or malicious traffic analysis, including packet, bandwidth, source and traffic monitoring)
  • Data leakage (packet, bandwidth, source and destination, and traffic monitoring and analysis)
  • Backups (sync checks, replication checks, off-instance checks, etc.)
  • Sanity (checks for mail queues, expected load, etc.)

Backups

All databases are replicated in real time to a separate data center, and all files are replicated off site within 5 minutes. These replicas are then backed up each evening (each as an individual secure archive encrypted with a one-time key) and stored off site in S3 for the retention period specified in contracts. The backups are taken without any interruption of services.
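
The two time limits above (file replication within 5 minutes, and a contract-specific retention period) could be checked along the following lines. This is a sketch under stated assumptions, not the actual backup tooling: the function names and the 30-day retention figure are hypothetical, and only the 5-minute limit comes from this page.

```python
from datetime import datetime, timedelta

# From this page: files are replicated off site within 5 minutes.
MAX_FILE_REPLICATION_LAG = timedelta(minutes=5)


def replication_is_fresh(last_sync: datetime, now: datetime) -> bool:
    """True if the last successful file sync is within the allowed lag."""
    return now - last_sync <= MAX_FILE_REPLICATION_LAG


def expired_backups(taken_at: list[datetime], retention_days: int,
                    now: datetime) -> list[datetime]:
    """Return the backup timestamps that are older than the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [taken for taken in taken_at if taken < cutoff]


if __name__ == "__main__":
    now = datetime(2018, 9, 26, 13, 42)
    print(replication_is_fresh(datetime(2018, 9, 26, 13, 39), now))
    # Hypothetical 30-day retention period; the real value is set per contract.
    evenings = [datetime(2018, 8, 1, 22, 0), datetime(2018, 9, 20, 22, 0)]
    print(expired_backups(evenings, retention_days=30, now=now))
```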