ISO:Operations

From Hornbill
Revision as of 08:17, 13 August 2020 by AbdiH

Capacity management

Network utilisation, disk utilisation and server load are monitored by Nagios. This tool raises automatic alerts when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.

We have hardware available for the expected growth of Hornbill; this is reviewed and increased every 3 months through the purchase of additional hypervisors and rack space. If required, we can also create an instance, or a complete replica of the Hornbill infrastructure, in AWS at short notice, so that capacity and scalability do not become a constraint. This scalability, together with the underlying server code, also removes limits on user growth, as new servers can be added as demand increases.

Network utilisation, disk utilisation and server load are monitored in real time by collecting over 100 data points (CPU, RAM and HDD utilisation for all services, etc.) for use in graphing; the same data is additionally collected by Nagios. These tools, charts and engines provide automatic alerts when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.
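As an illustration only, threshold-based alerting of this kind can be sketched in a few lines of Python. The metric names, the threshold values and the `classify` and `alerts` helpers are hypothetical; the actual pre-set thresholds and the Nagios configuration are not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # percent utilisation above which a warning is raised
    critical: float  # percent utilisation above which a critical alert is raised


# Hypothetical thresholds; the real pre-set values are not given in this document.
THRESHOLDS = {
    "cpu": Threshold(warning=80.0, critical=95.0),
    "ram": Threshold(warning=85.0, critical=95.0),
    "disk": Threshold(warning=75.0, critical=90.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'OK', 'WARNING' or 'CRITICAL' for one sampled data point."""
    limit = THRESHOLDS[metric]
    if value > limit.critical:
        return "CRITICAL"
    if value > limit.warning:
        return "WARNING"
    return "OK"


def alerts(sample: dict) -> dict:
    """Return only those metrics in a sample that exceed a threshold."""
    states = {name: classify(name, value) for name, value in sample.items()}
    return {name: state for name, state in states.items() if state != "OK"}
```

For example, `alerts({"cpu": 50.0, "ram": 90.0, "disk": 96.0})` reports a warning for `ram` and a critical alert for `disk`, while `cpu` is omitted because it is within limits.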

Monitoring

All instances, services and hardware are monitored from several locations around the world (each monitor server acts as a backup to the primary, and the results are compared). We check over 100 different metrics per instance (and anything that an instance may require) every 5 minutes to ensure all is well. Any warning is logged and escalated to the Cloud Team.

Checks include (not a comprehensive list):

  • Performance (Pings, DNS Propagation, Response times from API, CPU Load, RAM Load, Disk IO, Network Load etc)
  • Hardware (Availability, Temperature, SMART, SNMP etc)
  • Capacity (Disk Space, CPU, RAM etc)
  • Availability (Ping, DNS Propagation, API Tests, Host controller checks etc)
  • Security (Automated Log file reviews, Traffic review, Pattern analysis etc)
  • IDS (Intrusion Detection; Suspicious or Malicious Traffic Analysis, including packet, bandwidth, source and traffic monitoring)
  • Data Leakage (Packet, bandwidth, source and destination, and traffic monitoring and analysis)
  • Backups (Sync checks, Replication checks, Off instance checks etc)
  • Sanity (Checks for Mail Queues, Expected load etc)
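The comparison of results between monitor locations mentioned above could, as a rough sketch, look like the following. The `compare_results` helper, the three state names and the location names are assumptions made for illustration; the page does not describe how the monitor servers actually reconcile their results.

```python
def compare_results(results: dict) -> str:
    """Combine one check's pass/fail result from several monitor locations.

    `results` maps a (hypothetical) location name to True if the check passed
    when run from that location.
    """
    if not results:
        raise ValueError("at least one monitor result is required")
    failures = sorted(location for location, ok in results.items() if not ok)
    if not failures:
        return "OK"
    if len(failures) == len(results):
        # Every location sees the failure, so the fault is likely on the service.
        return "FAILED"
    # Only some locations see the failure, which may point at a network path
    # or at a monitor server rather than at the service itself.
    return "DISAGREE: " + ", ".join(failures)
```

In this sketch both non-OK states would still be logged and escalated, in line with the text; distinguishing them only helps whoever picks up the warning decide where to look first.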

Backups

All databases are replicated in real time to a separate data center, and all files are replicated off site within 5 minutes. These replicas are then backed up each evening (each as an individual secure archive encrypted with a one-time key) and stored off site in S3 for the retention period specified in contracts. The backups are taken without any interruption of services.

Our 'Maximum Data Loss Time Period', or RPO (Recovery Point Objective), is a maximum of 24 hours (that is, the time back to the last 23:00 backup). However, we aim for 15 minutes, as we replicate customer data at this frequency. Hornbill's RTO (Recovery Time Objective), in the event that Hornbill has to invoke its DR (Disaster Recovery) plan, is defined as:

  • Emergency response to assess the level of damage, decide whether to invoke the plan and at what level, notify staff etc. (to be completed within 1 to 2 business hours of the disaster)
  • Provision of an emergency level of service (within 4 business hours of the disaster)
  • Restoration of key services (within 8 business hours of the disaster)
  • Recovery to business as normal (within one week of the disaster)

The emergency level of service ensures that our customers and their customers can use the Hornbill services and applications with minimal disruption. To this end, all applications and databases will be restored; however, file attachments (associated with emails, Workspaces, Document Manager or requests) might not be available, and search functionality will be limited.

Restoration of key services provides the customer with a fully working system that is no different from what they had before the DR plan was activated: all applications, databases, file attachments and functionality are restored.

Recovery to business as normal would only be needed should a true disaster occur, that is, the total loss of one or more data centers AND the Hornbill offices at the same time. Recovery to business as normal would ensure that all Hornbill services (both customer facing and internal) are fully restored.

In order to achieve the RPO and RTO targets, we replicate the files of customer instances (and all servers) to an off-site location at least every 15 minutes and replicate databases in real time (again off site). Both actions are monitored, and any delay of over 1 hour is flagged as critical. Nightly backups are then taken from these replicas and stored both locally and off site (at a third location).

Therefore, should the primary hardware fail, we can recover from the replicated files (at most 15 minutes old) or, in a complete disaster, from the tertiary (third-location) backups.
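To make the figures above concrete, the sketch below shows one way the replication delay check and the worst-case data loss window could be computed. The 15-minute interval, the 1-hour critical limit and the 23:00 backup time come from the text; the function names, the intermediate "WARNING" state and the use of naive local timestamps are assumptions made for illustration.

```python
from datetime import datetime, time, timedelta

FILE_REPLICATION_INTERVAL = timedelta(minutes=15)  # replication target from the text
CRITICAL_DELAY = timedelta(hours=1)                # delay flagged as critical in the text
BACKUP_TIME = time(23, 0)                          # nightly backup time from the text


def replication_state(last_success: datetime, now: datetime) -> str:
    """Classify how far replication has fallen behind."""
    lag = now - last_success
    if lag > CRITICAL_DELAY:
        return "CRITICAL"
    if lag > FILE_REPLICATION_INTERVAL:
        # Assumption: the text only defines the critical level.
        return "WARNING"
    return "OK"


def last_nightly_backup(failure: datetime) -> datetime:
    """Return the most recent 23:00 backup at or before the failure time."""
    candidate = datetime.combine(failure.date(), BACKUP_TIME)
    if candidate > failure:
        candidate -= timedelta(days=1)
    return candidate


def data_loss_window(failure: datetime) -> timedelta:
    """Data lost if recovery has to fall back to the nightly backup."""
    return failure - last_nightly_backup(failure)
```

Because a backup is taken every 24 hours, `data_loss_window` is always below 24 hours, which matches the stated maximum RPO; recovery from the replicas instead would limit the loss to the replication lag.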

Access

All logins to systems processing customer data automatically send a report to the Hornbill Audit Logs mailbox. Each login must then be associated with a given Service Manager request or Hornbill workspace post, via the Cloud Login Audit catalogue item, to ensure that a valid reason exists for the login. These logins are then audited by the Security Manager to ensure that no unauthorised access has taken place.
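The reconciliation step of such an audit can be pictured as a simple join between login reports and recorded justifications. The following is only a sketch: the `Login` record, its fields and the `unjustified_logins` helper are hypothetical and do not describe the actual Cloud Login Audit item or mailbox format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Login:
    login_id: str  # hypothetical unique identifier of the login report
    user: str
    host: str


def unjustified_logins(logins: list, justifications: dict) -> list:
    """Return logins with no associated request or workspace post.

    `justifications` maps a login_id to the reference of the request or
    workspace post that explains it; a missing or blank reference counts
    as unjustified.
    """
    return [
        login
        for login in logins
        if not justifications.get(login.login_id, "").strip()
    ]
```

Anything returned by `unjustified_logins` would be a candidate for follow-up during the audit.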


Backups are restored nightly to ZIP (which also tests the restore process) before being pushed to the offsite location, and a random backup restore is performed on a scheduled basis to ensure that backups are correct and valid.

Temporary Files

All temporary files on systems that process customer data are deleted after 24 hours. All other systems are set to clear their temp folders on reboot. Temporary files are purged from all nodes nightly at 01:00.
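A nightly purge of this kind could be sketched as follows. The 24-hour limit is taken from the text; the helper names, the use of file modification time as the age measure and the directory passed in are assumptions, and the scheduling at 01:00 (for example by cron) is outside the sketch.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # 24-hour limit from the text


def is_expired(mtime: float, now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if a file last modified at `mtime` is older than `max_age`."""
    return now - mtime > max_age


def purge_temp_files(root: Path, now: Optional[float] = None) -> list:
    """Delete expired files below `root` and return the paths that were removed."""
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so that deleting does not disturb iteration.
    for path in list(root.rglob("*")):
        if path.is_file() and is_expired(path.stat().st_mtime, now):
            path.unlink()
            removed.append(path)
    return removed
```

Passing `now` explicitly keeps the age check deterministic, which makes the purge easy to test without waiting for files to age.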

Customer Access to Logs

Customers can access their own logs via the Admin portal. These logs are restricted to their instance, and shared logs (such as Web front end or DataService logs) are not available. This ensures that a cloud service customer can only access records that relate to that customer's own activities and cannot access any log records that relate to the activities of other cloud service customers. Customer-accessible logs are available via the portal for the current day, and on request from Hornbill for 2 months. (Note that this differs from audit logs, such as security logs, which are available for 6 months.) Requests should be submitted to data.processor@live.hornbill.com by the nominated contacts for a given instance (Technical or Data Security).
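The scoping rules above amount to two filters: one on the owning instance and one on the age of the record. The sketch below illustrates this under stated assumptions: the `LogRecord` shape, the instance names and the approximation of 2 months as 60 days are all hypothetical and are not taken from the Hornbill platform.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LogRecord:
    instance: str  # owning customer instance, or "shared" for shared logs
    day: date
    message: str


# Assumption: "2 months" is approximated as 60 days for this sketch.
REQUEST_RETENTION = timedelta(days=60)


def portal_logs(records: list, instance: str, today: date) -> list:
    """Records a customer can see in the portal: own instance, current day only."""
    return [r for r in records if r.instance == instance and r.day == today]


def requestable_logs(records: list, instance: str, today: date) -> list:
    """Records a customer can obtain on request: own instance, within retention."""
    cutoff = today - REQUEST_RETENTION
    return [
        r for r in records
        if r.instance == instance and cutoff <= r.day <= today
    ]
```

Because both filters match on the exact instance name, records tagged as shared or belonging to another customer are never returned.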

Software

Only approved software may be installed on desktops and servers utilised by Hornbill. The installation source for this software is stored in a central repository to ensure that, even in a disaster, we can install it as required. All software utilised by Hornbill is reviewed on a scheduled basis.

Hardware

Only hardware provided by the IT team and obtained via existing approved vendors may be used to access the management or customer networks.