Content Addressable File Store

From Hornbill

Content Addressable File Storage - Technology Overview

The Hornbill platform implements a custom file storage scheme that allows our platform and applications to store and retrieve file-based artefacts in a consistent and reliable way. It is often desirable within an application to store files or other static objects against specific business objects, such as contacts, requests or assets. It is important that storage is implemented in a way that lends itself to working within a distributed environment, and it is highly desirable that content is stored only once to optimise the use of space. The strategy of using a "content addressable" file store meets these criteria.

A content addressable file store is different to a typical disk-based file store. Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. It is typically used for high-speed storage and retrieval of fixed content, such as documents stored for compliance with government regulations. Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory. It works by using a cryptographic hash function to create a key that represents the actual content. You can think of the key as a fingerprint of the content: files with identical content share the same key, and files with different content get different keys.

The CAFS strategy addresses each file using a unique key, typically a digest generated by a cryptographic hash function (such as MD5 or SHA-1) from the document it refers to. If the hash function is weak, this method could be subject to collisions in an adversarial environment (different documents returning the same hash). The main advantage of this CAS strategy, though, is that the location of the actual data and the number of copies are unknown to the user, which underpins the ability to distribute content in numerous ways. SHA-xxx is one of the most cryptographically secure hash algorithms in use today. Practical collisions have been demonstrated for MD5 and SHA-1, while none are publicly known for the SHA-2 family. Subversion, for example, uses CAS techniques and bases its hashes on MD5, which is weaker than SHA-xxx, demonstrating that the strategy still works well in practice.
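As an illustration of how a content key can be derived, here is a minimal Python sketch. The function name `content_key` and the choice of SHA-256 are assumptions made for this example; the page does not state which digest or API Hornbill actually uses.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return a hex digest that serves as the address of the content.

    SHA-256 is an assumption for this sketch, not a statement about
    the digest used by the platform.
    """
    return hashlib.sha256(data).hexdigest()


original = b"quarterly report, final version"
copy = bytes(original)  # same bytes, as if stored under another file name
edited = b"quarterly report, final version 2"

# Identical content always maps to the same key.
print(content_key(original) == content_key(copy))
# A change to the content produces a different key.
print(content_key(original) == content_key(edited))
```

The key depends only on the bytes of the file, not on its name or location, which is what allows the store to place and replicate the data wherever it likes.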

Space Efficiency

In today’s world of computing, storage is cheap, so it has become common for files to get bigger and for stored content to become bloated and unmanaged. In fact, storage is so cheap that it is often cheaper to buy more disks than to spend time managing what you store. At Hornbill, however, we like to come up with elegant and innovative technology solutions. Our goal was for our content to be, at least to some degree, self-organising, so that we can offer a service that keeps your content online forever.

One of the main things we wanted to address (apart from a distributed storage architecture) was the problem of content duplication in storage. Suppose, for example, that we receive an e-mail with a 10 MB attachment and that mail is forwarded on to 20 people. Each person now has an exact copy of the attachment in their mailbox, so the same content is stored 20 times. If the attachments are stored in our CAFS scheme instead, the same scenario results in only one stored copy of each attachment, because identical content produces the same key. De-duplication of content is therefore completely automatic. As a cloud service provider we get more efficient use of our storage resources, and our customers get a lower cost service as a result.
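The attachment scenario can be sketched with a small in-memory store. This is a hypothetical illustration only: the `ContentStore` class, its method names and the use of SHA-256 are assumptions for the example and do not describe Hornbill's actual implementation.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory content-addressable store (illustration only)."""

    def __init__(self) -> None:
        self._blobs = {}  # maps content key -> stored bytes

    def put(self, data: bytes) -> str:
        """Store data if its key is new and return the key either way."""
        key = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
        self._blobs.setdefault(key, data)  # an existing blob is left untouched
        return key

    def get(self, key: str) -> bytes:
        """Retrieve content by its key rather than by a location."""
        return self._blobs[key]

    def blob_count(self) -> int:
        return len(self._blobs)

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = ContentStore()
attachment = b"x" * (10 * 1024 * 1024)  # stand-in for a 10 MB attachment

# Twenty mailboxes each "store" the forwarded attachment.
mailboxes = {f"user{i}": store.put(attachment) for i in range(20)}

print(store.blob_count())    # number of distinct blobs held
print(store.stored_bytes())  # bytes actually held by the store
```

Each mailbox keeps only the key, so the twenty references all resolve to the same single blob. A real store would also need to track references before deleting content, which this sketch omits.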