ISO:Operations

From Hornbill

Capacity management

Network utilisation, disk utilisation and server load are monitored, and automatic alerts are raised when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.

We have hardware available for the expected growth of Hornbill; this is reviewed and, where needed, increased every 3 months through the purchase of additional hypervisors\rack space. If required, we can also create an instance, or a complete replica of the Hornbill infrastructure, in AWS at short notice, meaning that capacity and scalability are not a constraint. This scalability, along with the underlying server code, also removes limits on user growth, as new servers can be added as demand increases.

Network utilisation, disk utilisation and server load are monitored in real time by collecting over 1000 data points (CPU\RAM\HDD utilisation for all services etc) for use in graphing\real-time monitoring. These tools\charts\engines provide automatic alerts when pre-set thresholds are exceeded. The Chief Technical Officer is responsible for ensuring that this monitoring is conducted and that action is taken to resolve any issues.
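The threshold alerting described above can be illustrated with a minimal sketch. It is not Hornbill's monitoring tooling: the metric names, the threshold values and the function `find_breaches` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical thresholds (percent utilisation); the real values are not stated in this document.
THRESHOLDS: Dict[str, float] = {"cpu": 90.0, "ram": 85.0, "disk": 80.0}


def find_breaches(sample: Dict[str, float],
                  thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return one alert message for every metric strictly above its threshold."""
    alerts = []
    for metric, limit in thresholds.items():
        value = sample.get(metric)
        # Metrics missing from the sample are skipped rather than treated as breaches.
        if value is not None and value > limit:
            alerts.append(f"{metric} at {value:.1f}% exceeds threshold {limit:.1f}%")
    return alerts
```

For example, a sample of `{"cpu": 95.0, "ram": 40.0}` would produce a single alert for `cpu`, which a real system would then route to whoever is responsible for resolving it.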

Monitoring

All instances, services and hardware are monitored from several locations around the world (each monitor server acts as a backup to the primary, and results are compared). We check over 1000 different metrics per instance (and anything that an instance may require) every 5 minutes to ensure all is well. Any warning is logged and escalated to the Cloud Team.

Checks include (not a comprehensive list):

  • Performance (Pings, DNS Propagation, Response times from API, CPU Load, RAM Load, Disk IO, Network Load etc)
  • Hardware (Availability, Temperature, SMART, SNMP etc)
  • Capacity (Disk Space, CPU, RAM etc)
  • Availability (Ping, DNS Propagation, API Tests, Host controller checks etc)
  • Security (Automated Log file reviews, Traffic review, Pattern analysis etc)
  • IDS (Intrusion Detection, Suspicious or Malicious Traffic Analysis, including packet\bandwidth\source\traffic monitoring)
  • Data Leakage (Packet\bandwidth\Source & Destination\traffic monitoring and analysis)
  • Backups (Sync checks, replication checks, Off instance checks etc)
  • Sanity (Checks for Mail Queues, Expected load etc)
  • SIEM (APIs\Resource Usage\Network Traffic and DB Access\Requests)

Backups

All databases are replicated in real time to a separate data center and all files are replicated off site within 5 minutes. These replicas are then backed up each evening (as individual secure archives, each encrypted with a one-time key) and stored off site within S3 for the retention period specified in contracts. The backups are taken without any interruption of services.

Our 'Maximum Data Loss Time Period', or RPO (Recovery Point Objective), is a maximum of 24 hours (or the time back to the last 23:00 backup). However, we aim for 15 minutes, as we replicate customer data at this frequency. Hornbill's RTO (Recovery Time Objective), in the event that Hornbill has to invoke its DR (Disaster Recovery) plan, is defined as follows:

  • Emergency response to assess the level of damage, decide whether to invoke the plan and at what level, notify staff etc. (to be completed within 1 to 2 business hours of the disaster)
  • Provision of an emergency level of service (within 4 business hours of the disaster)
  • Restoration of key services (within 8 business hours of the disaster)
  • Recovery to business as normal (within one week of the disaster)

The emergency level of service is intended to ensure that our customers and their customers can use the Hornbill services and applications with minimal disruption. To this end, all applications and databases will be restored; however, file attachments (associated with Emails, Workspaces, Document Manager or Requests) might not be available, and search functionality will be limited.

Restoration of key services will provide the customer with a fully working system that is no different from what they had before the DR plan was activated, with all applications, databases, file attachments and functionality restored.

Recovery to business as normal would only ever be needed should a true disaster occur. This would include the total loss of 1 or more data centers AND Hornbill offices at the same time. Recovery to business as normal would ensure that all Hornbill services (both customer-facing and internal) were fully restored.

In order to achieve the RPO and RTO targets, we perform file replication of customer instances (and all servers) to an off-site location at least every 15 minutes, as well as real-time database replication (again off-site). Both actions are monitored and any delay over 1 hour is flagged as critical. Nightly backups are then taken from the above locations and stored locally and offsite (3rd location).
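As an illustration of the replication monitoring described above, the sketch below classifies replication lag. The 15-minute target and the 1-hour critical limit come from the text; the intermediate "warning" level and the function name `classify_replication` are assumptions, not part of Hornbill's tooling.

```python
from datetime import datetime, timedelta

TARGET_LAG = timedelta(minutes=15)  # replication frequency stated in the text
CRITICAL_LAG = timedelta(hours=1)   # "any delay over 1 hour is flagged as critical"


def classify_replication(last_success: datetime, now: datetime) -> str:
    """Classify how far replication has fallen behind its last successful run."""
    lag = now - last_success
    if lag > CRITICAL_LAG:
        return "critical"
    if lag > TARGET_LAG:
        # Hypothetical intermediate level: behind target but not yet critical.
        return "warning"
    return "ok"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test against fixed timestamps.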

Therefore, should a failure occur on primary hardware, we can recover from replicated files (at most 15 minutes old) or, in a complete disaster, from the tertiary (3rd location) backups.

Backups are checked for integrity automatically at the time they are taken, on upload to S3, and at different levels on either a weekly or monthly basis.
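One common way to check an archive's integrity is to compare a checksum recorded when the backup was taken with one recomputed later (for example, after upload). The sketch below assumes SHA-256 and uses hypothetical function names; the document does not state which integrity mechanism Hornbill actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_digest: str) -> bool:
    """Return True when the archive still matches the digest recorded at backup time."""
    return sha256_of(path) == expected_digest
```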

Access

All logins to systems processing customer data automatically send a report to the Hornbill Audit Logs mailbox. Each login must then be associated with a given Service Manager request or Hornbill workspace post, via the Cloud Login Audit catalogue item, to ensure a valid reason exists for the login. These logins are then audited by the Security Manager to ensure no unauthorised access took place.

Backups are restored (and the restore process therefore tested) nightly to ZIP before being pushed to the offsite location, and a random backup restore is performed on a scheduled basis to ensure that backups are correct\valid.
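The reconciliation of logins against recorded justifications can be sketched as follows. The `LoginEvent` fields, the `login_id` key and the function `unjustified_logins` are hypothetical; the document does not describe the format of the audit reports.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class LoginEvent:
    login_id: str  # hypothetical identifier carried in the audit report
    user: str
    host: str


def unjustified_logins(logins: Iterable[LoginEvent],
                       justified_ids: Set[str]) -> List[LoginEvent]:
    """Return logins with no associated request or workspace post, for review."""
    return [event for event in logins if event.login_id not in justified_ids]
```

Any event returned by this check would be one for the auditor to follow up, since no valid reason has been recorded for it.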

Temporary Files

All temporary files on systems that process customer data are deleted after 24 hours. All other systems are set to clear temp folders on reboot. Temporary files are purged nightly at 01:00 from all nodes.
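A purge of files older than 24 hours could look like the sketch below. The function name `purge_old_files`, the use of file modification time as the age, and the dry-run default are assumptions for illustration; the actual purge job is not described in this document.

```python
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour limit stated in the text


def purge_old_files(temp_dir: Path, now: Optional[float] = None,
                    dry_run: bool = True) -> List[Path]:
    """Find files under temp_dir older than 24 hours; delete them unless dry_run is set."""
    now = time.time() if now is None else now
    expired = []
    # Materialise the listing first so deletions do not disturb the directory walk.
    for path in list(temp_dir.rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > MAX_AGE_SECONDS:
            expired.append(path)
            if not dry_run:
                path.unlink()
    return sorted(expired)
```

Defaulting to a dry run means the function only reports what it would delete until a caller explicitly passes `dry_run=False`.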

Customer Access to Logs

Customers can access their own logs via the Admin portal. These logs are restricted to their instance, and any shared logs (such as Web front end, DataService logs etc) are not available. This ensures that a cloud service customer can only access records that relate to that customer's activities and cannot access any log records that relate to the activities of other cloud service customers. Customer-accessible logs are available for the current day via the portal (note that this differs from Audit Logs, such as security logs, which are available for 6 months) and on request for 2 months from Hornbill. Requests should be submitted to data.processor-hornbill@live.hornbill.com by the nominated contacts for a given instance (Technical or Data Security).

Customer Access to Audit and Access Logs

The above logs are aimed at identifying issues or misconfiguration in processes or other aspects of the application, rather than at Audit\Security or Access. The logs used for Audit\Security or Access are typically far larger because of the sheer volume of data, and are kept for 7 days. Customers should therefore export them (via a scheduled report or other integration) to a repository of their choice.

The logs

  • Primary Security Log - Contains all Login Requests and the source IP, timestamp, target (portal, live, admin etc), result of the action and Unique ID.
  • Primary Audit Log - Contains all pages\entities accessed and the Unique ID of actor and timestamp of action.
  • Application Audit Log - Each application has its own Log table containing timestamp, action (Insert, Update, Delete records etc), Unique ID (linked to above), result of the action and previous and subsequent values. See each application's documentation for a full list of Audited actions.

Software

Only approved software may be installed on desktops\servers utilised by Hornbill. The source for this software is stored within a central repository to ensure that, even in a disaster, we can install it as required. All software utilised by Hornbill is reviewed on a scheduled basis.

Software is managed\deployed through central systems (Ansible\Hornbill Tools\Hornbill ITOM) to ensure correct deployment and configuration.

All software is hardened in line with Vendor, Industry and Hornbill's own policies\standards. This includes installing only required software\services per machine, locking down ports, and using individual user and service accounts etc. All hardening is confirmed via monitoring, and any unsanctioned change is automatically escalated and automatically reverted within 5 minutes.
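Confirming hardening via monitoring amounts to comparing the observed configuration with an approved baseline. The sketch below shows only the comparison step; the setting names, the flat key/value representation and the function `detect_drift` are assumptions, and escalation and reverting are left out.

```python
from typing import Dict, Optional, Tuple


def detect_drift(baseline: Dict[str, str],
                 observed: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each drifted setting to (expected, actual); None marks a missing side."""
    drift = {}
    # The union of keys catches settings that were removed as well as ones that were added.
    for key in baseline.keys() | observed.keys():
        expected, actual = baseline.get(key), observed.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

An empty result means the machine matches its baseline; any entry is a candidate for escalation and for restoring the `expected` value.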

Hardware

Only hardware provided by the IT team and obtained via existing approved vendors may be used to access the management or customer networks. All clocks are synchronized with NTP and checked to be within 1 minute of primary servers. All default passwords are changed. All hardening is in line with Vendor, Industry and Hornbill's own policies\standards. All hardening is confirmed via monitoring, and any unsanctioned change is automatically escalated and automatically reverted within 5 minutes.
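The clock check can be sketched as a comparison of each host's measured offset from the primary servers against the 1-minute limit stated above. How the offsets are measured is not covered here; the input format and the function `hosts_out_of_sync` are assumptions.

```python
from typing import Dict, List

MAX_OFFSET_SECONDS = 60.0  # "within 1 minute of primary servers"


def hosts_out_of_sync(offsets: Dict[str, float],
                      limit: float = MAX_OFFSET_SECONDS) -> List[str]:
    """Return hosts whose clock offset (in seconds, either direction) exceeds the limit."""
    return sorted(host for host, offset in offsets.items() if abs(offset) > limit)
```

Using the absolute value treats a clock that runs behind the same as one that runs ahead.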