CAS-based generic data store

by Ketil Malde; August 3, 2016

Bioinformatics projects routinely generate terabytes of sequencing data, and the inevitable analysis that follows can easily increase this by an order of magnitude or more. Not everything is worth keeping, but in order to ensure reproducibility and to be able to reuse data in new projects, it is important to store what needs to be kept in a structured way.

I have previously described and implemented a generic data store, called medusa. Following the eXtreme Programming principle of always starting with the simplest implementation that could possibly work, the system was designed around a storage based on files and directories. This has worked reasonably well, and makes data discoverable and accessible both directly in the file system, and through web-based services providing browsing, metadata search (with free text and keyword based indexes), BLAST search, and so forth.

Here, I explore the concept of content-addressable storage (CAS), which derives unique names for data objects from their content.

The CAS principle

The essence of any storage system is being able to store objects under some kind of key (or label, or ID), and being able to retrieve them again using the same key. What distinguishes content-addressable storage from other storage systems is that the key is generated from the entire data object, typically using a cryptographic hash function like MD5 or SHA1.

This means that a given object will always be stored under the same key, and that modifications to an object will also change its key, essentially creating a new object.
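
As a toy illustration, using `sha1sum` from coreutils (the actual hash function and invocation in medusa may differ), the key is derived from the bytes alone, so even a one-character change yields a completely different key:

    printf 'ACGT\n'  | sha1sum   # one key
    printf 'ACGTT\n' | sha1sum   # an entirely unrelated key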

A layered model

Using CAS more clearly separates the storage model from the semantics of data sets. This gives us a layered architecture for the complete system, and services are implemented on top of these layers as independent and modular programs.

The object store

The object store is conceptually simple. It provides an interface consisting of the following primitive operations:

- `put` a data object into the store
- `list` the keys that refer to data objects
- `get` a data object using its key

The storage itself is completely oblivious to the actual contents of data objects, and it has no concept of hierarchy or other relationships between objects.
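
A minimal sketch of these primitives over a plain directory of objects (the `objects/` layout and the use of SHA1 are assumptions for illustration, not necessarily how the actual store is implemented):

    # objects/ holds one file per object, named by its SHA1 key
    put()  { tmp=$(mktemp) && cat > "$tmp" &&
             key=$(sha1sum "$tmp" | cut -d' ' -f1) &&
             mv "$tmp" "objects/$key" && echo "$key"; }
    list() { ls objects/; }
    get()  { cat "objects/$1"; }

A file is then stored with `put < reads.fastq`, which prints its key, and retrieved again with `get <key>`.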

Metadata semantics

When we organize data, we do of course want to include relationships between objects, and also between data objects and external entities and concepts. This is the province of metadata. Metadata semantics are provided by special metadata objects which live in the object store like any other objects. Each metadata object defines and describes a specific data set. As in former incarnations of the system, metadata is structured as XML documents, and provides information about (and the identity of) the data objects constituting the data set. It also describes the relationship between data sets, for instance allowing new versions to obsolete older ones.

The metadata objects are primarily free-form text objects, allowing users to include whatever information they deem relevant and important. The purpose of using XML is to make specific parts of the information computationally accessible, unambiguous, and standardized. For instance, structured references (i.e. specific XML elements) to data objects by their keys allow automatic retrieval of the complete data set. In addition to referencing objects in the object store, similar structures allow unambiguous references to external entities, for instance species and citations of scientific works, as well as uniform formatting of dates and geographic locations.
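
A hypothetical fragment may make this concrete; note that the element names below are invented for this example and do not necessarily match the actual schema:

    <dataset>
      <description>Assembled contigs from the 2016 sequencing run.</description>
      <species>Gadus morhua</species>
      <date>2016-08-03</date>
      <object key="1b2e...a9f0" format="fasta">contigs.fasta</object>
      <basedon key="77c1...03be"/>
    </dataset>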

A command line interface to the metadata is provided through the `mdz` command, which supports a variety of operations on data sets, including listing, importing, exporting, and synchronizing with other repositories. In addition, the system implements a web-based front end to the data, as well as metadata indexing via Xapian.

Data objects and services

As shown in the previous sections, the system can be conceptually divided into three levels: the object store, the metadata level, and the data semantic level. A service typically accesses data on one or more of these levels. For instance, a (hypothetical) service to ensure distributed redundancy may only need to access the object store, oblivious to the contents of the objects. Other services, like the (existing) functionality to import data sets or transfer data sets between different servers, need to understand the metadata format. Even more specific services may also need to understand the format of the data objects themselves; for example, the BLAST service scans metadata to find FASTA-formatted sequence data and integrates them into its own database. The important principles that services adhere to are: 1) a service can ignore anything that is irrelevant to it, and 2) it can reconstruct its entire state from the contents of the object store.
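
As a sketch of the first kind of service, a naive mirroring script needs nothing beyond the object store primitives shown earlier (the remote host and directory layout here are hypothetical):

    # copy any object that is missing from a remote mirror of the store
    for key in $(list); do
        ssh mirror test -e "store/objects/$key" ||
            get "$key" | ssh mirror "cat > store/objects/$key"
    done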

Discussion

CAS Advantages

Perhaps the primary advantage of using the hash value as the ID for data objects is that it allows the system to be entirely distributed. The crucial point is that keys are, by definition, unique to the data. With user-selected keys, the user must somehow ensure the uniqueness of each key, which requires a central authority or at the very least an agreed-upon naming scheme. In contrast, names for objects in CAS depend only on the contents, and the system can be implemented with no central oversight.

That keys depend on contents further means that data are immutable: storing a modified data object results in a different key. Immutability is central to reproducibility (you won't get the same results if you run your analysis with different data). Previously, this was maintained by keeping a separate registry of metadata checksums, and by including checksums for data objects in the metadata; this made it possible to verify correctness as long as the registry was available and correct. With CAS, verification becomes even easier, since the checksum is the same as the name used to retrieve the data object.
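
For example, verifying a retrieved object amounts to recomputing the hash and comparing it to the key it was fetched under (assuming the SHA1-based sketch above):

    get "$key" > data && test "$(sha1sum data | cut -d' ' -f1)" = "$key" && echo OK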

Another benefit is deduplication of data objects: objects with the same contents are always stored under the same key, so deduplication is automatic. This also makes it easier to track files across renames. Analyses tend to produce output files with generic names like "contigs.fasta", and it is often useful to give these files more descriptive names; with CAS it becomes trivial to check whether a given file already exists in the storage.
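
Under the same assumptions as the earlier sketch, checking whether a freshly produced contigs.fasta is already in the store is a one-liner:

    key=$(sha1sum contigs.fasta | cut -d' ' -f1)
    test -e "objects/$key" && echo "already stored as $key"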

Decoupling the data from a fixed filesystem layout introduces another level of abstraction, and this makes it easier to change the underlying storage model. In recent years, key-value storage models have replaced relational databases in many applications, in particular where high scalability is more important than structured data. Consequently, we have seen a plethora of so-called "NoSQL" databases emerge, including CouchDB, Cassandra, and many others, any of which could be plugged in as an alternative back-end storage. Storage "in the cloud", like Amazon's S3 or Google's Cloud Storage, is also a good alternative.

The added opacity makes it less likely (but still technically possible) for users with sufficient privileges to perform "illegal" operations on data (for instance, modification or removal).

Disadvantages

The implicit assumption behind CAS is that different data objects hash to different hash values. In an absolute sense, this is trivially false (since there exist only 2^160 possible hash values, but an infinity of possible data objects). It is true in a probabilistic sense, however, and we can calculate the probability of collisions from the birthday paradox: with n objects, the probability of any collision is roughly n^2/2^161, so even a repository of 10^12 objects gives a probability on the order of 10^-25. For practical purposes, any collision is extremely unlikely, and as in the revision control system git (which is also CAS-based), the system checks for collisions so that they can be dealt with manually if they should ever occur.

Abstracting out the storage layer can be an advantage, but it also makes the system more opaque. And although the human ability to select misleading or confusing names should not be underestimated, even a poorly chosen name is usually more informative than the hexadecimal representation of a hash value.

Mixed blessings

Previous versions used a fixed directory structure, where each data set included a metadata file and an arbitrary set of data files. Using a content-addressable object store is more flexible, and there is nothing preventing the implementation of a parallel metadata scheme sharing the same object store, even referring to the same data objects. One could also create metadata objects that refer to other metadata objects. As always, fewer restrictions also mean more opportunities for confusion and increased complexity.

Perhaps the most drastic change is in how data sets can have their status changed, e.g. be marked as obsolete or invalid. Previously, metadata was versioned, meaning there could exist a (linear) sequence of metadata for the same data set. This was enforced by convention only, and also required central synchronization of metadata updates to avoid name and version collisions. Since the object store only allows the addition of new objects, and in particular not modification, status updates can only be achieved by adding new objects. Metadata objects can refer to other data sets and specify a context; for instance, a data set containing analysis results can specify that it is based on the data set containing the input data for the analysis. Status changes are now implemented using this mechanism, with data sets referring to other data sets as "invalidated" or "obsoleted".
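
A hypothetical status-update object could thus look like this (element names again invented for illustration):

    <dataset>
      <description>Reassembly with corrected adapter trimming.</description>
      <obsoletes key="77c1...03be"/>
      <object key="9d4a...51c2" format="fasta">contigs.fasta</object>
    </dataset>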

Current Status and Availability

The system is currently running on my internal systems. It is based on standard components (mostly shell scripts), and although one might expect some rough edges, it should be fairly easy to deploy.

Do let me know if you are interested.

Feedback? Please email ketil@malde.org.