Compressing biological sequences

by Ketil Malde; June 22, 2011

Recently, I came across a discussion about compression methods for DNA. Interestingly, standard compression algorithms tend to perform rather poorly, giving compression rates similar to just packing the DNA sequence using two bits per letter (the alphabet being A, C, G, and T). One effort suggests using a known reference sequence and effectively storing only the differences, which can potentially save a lot of space. Of course, a large part of the data is quality information, for which the reference isn’t going to help, so lossy compression is suggested to deal with that.

Although lossy compression is used successfully in many cases, it depends on knowing in advance which data will not be needed in the future. And using a reference means additional complexity: you need to ensure that the link to the correct reference is maintained. Thus, I made a small experiment to see how far bog-standard, plain old and boring compression would go.

Compressing FASTQ sequences

The FASTQ format is pretty simple. A file is just a series of concatenated records, where each record consists of a header (starting with an ‘@’) giving the read name, then the sequence data, then a repetition of the header (this time starting with a ‘+’), and then the quality data, encoded as ASCII values (using an offset to make them printable characters). So it looks something like this:

@read1
ACGT...
+read1
fffffhhgbbb...
@read2
:
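
To make the record layout concrete, here is a minimal sketch of a FASTQ reader in Python. It assumes plain four-line records (no line wrapping) and a hypothetical file name:

# Minimal FASTQ reader sketch: assumes four-line records without
# line wrapping; "reads.fastq" is a hypothetical file name.
def read_fastq(path):
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:                 # empty string means end of file
                break
            seq = f.readline().rstrip()
            f.readline()                   # the '+' line, repeating the name
            qual = f.readline().rstrip()
            yield header[1:], seq, qual    # drop the leading '@'

for name, seq, qual in read_fastq("reads.fastq"):
    print(name, len(seq), len(qual))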

To check how this compresses, I took a file from a typical Illumina run containing 100M reads and weighing in at about 25GB, and ran various compression programs on it. Here are the results:

Compression   Size (MB)   Relative
none              25899       100%
gzip               8012        31%
bzip2              6421        25%
lzma               6412        25%

So we see the choice of compression algorithm doesn’t matter all that much, although both bzip2 and lzma do better than plain gzip.
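
For the curious, the same comparison can be sketched with Python’s standard-library bindings for the three algorithms. The numbers above come from the command-line tools, the file name is hypothetical, and reading a 25GB file into memory obviously won’t fly, so treat this as the idea only, suitable for a smallish test file:

# Compress the same data with the gzip, bzip2 and lzma algorithms
# via Python's standard library, and report relative sizes.
import gzip, bz2, lzma

data = open("reads.fastq", "rb").read()    # hypothetical (small) input file

for name, compress in [("gzip",  gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma",  lzma.compress)]:
    size = len(compress(data))
    print("%-6s %12d bytes  %3d%%" % (name, size, 100 * size // len(data)))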

Compressing different data separately

Now, the FASTQ format intermixes the different data types, and since they are very different, this decreases the redundancy within a block of a given size and makes a compressor’s dictionary less effective. Let’s instead extract the different components and compress them separately (a small sketch of the splitting follows below). And let’s ignore the +-header lines, since they just repeat the @-headers. This gives us the following table:

Compression   Heads   Seqs   Quals   Relative
none           3790   9184    9136        85%
gzip            460   2691    3602        84%
bzip2           345   2534    3157        94%
lzma            254   2157    3126        86%

The percentages are the sums of the sizes relative to the compressed whole file. Note that the whole file essentially includes two copies of the headers. Except for bzip2, which already managed to compress the whole file very well, the gain is roughly proportional to the size of the removed headers. This is slightly depressing, since you’d expect a good compressor to eliminate the header redundancy anyway.
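
The splitting itself is trivial. A minimal sketch, again assuming plain four-line records and hypothetical file names; the three output files are then compressed separately with gzip, bzip2 and lzma as before:

# Split a FASTQ file into its three components so each can be
# compressed on its own; the '+' lines are dropped since they only
# repeat the '@' headers.  File names are hypothetical.
with open("reads.fastq") as fq, \
     open("heads.txt", "w") as heads, \
     open("seqs.txt", "w") as seqs, \
     open("quals.txt", "w") as quals:
    while True:
        header = fq.readline()
        if not header:             # end of file
            break
        heads.write(header)
        seqs.write(fq.readline())
        fq.readline()              # skip the '+' line
        quals.write(fq.readline())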

Ordering before compression

One downside here is that the order of the sequences is arbitrary. You’d expect compression ratios to improve if similar sequences could be placed close to each other. Let’s try a very simple heuristic for that, namely sorting the sequences (a small sketch of this follows below):

Compression   Seqs (MB)   Relative
none               9184       100%
gzip               1721        64%
bzip2              1965        78%
lzma               1384        64%

Percentages are sizes relative to the size of the compressed, unsorted sequences. Perhaps unsurprisingly, this helps quite a bit, especially for the dictionary-based compression algorithms.
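
The sorting step itself is equally simple. For a 100M-line file the Unix sort command is the natural tool, but as an in-memory sketch in the same vein as above (file names again hypothetical):

# Sort the extracted sequence lines so that identical and similar
# reads end up next to each other before compression.
with open("seqs.txt") as f:
    seqs = f.readlines()

seqs.sort()                        # plain lexicographic sort

with open("seqs.sorted.txt", "w") as out:
    out.writelines(seqs)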

Results

Compression   Heads   Seqs   Quals   Total   Relative
gzip            460   1721    3602    5783        22%
bzip2           345   2534    3157    6036        23%
lzma            254   1384    3126    4764        18%

The clear winner is lzma: it is better in all categories, and produces a total output that is about four fifths the size of the others. More surprising, perhaps, is that gzip produces better compression than bzip2 here, while also being faster and more ubiquitous.

Also interesting: by a straightforward rearrangement of the sequence data, we gain 33% compared to plain compression. A more advanced grouping or clustering might improve this even further; one possibility is aligning the sequences to a reference and ordering them by alignment position. This would likely improve compressibility, but without introducing any explicit dependency on the reference.

Feedback? Please email ketil@malde.org.