Compressing biological sequences

by Ketil Malde; June 22, 2011

Recently, I came across a discussion about compression methods for DNA. Interestingly, standard compression algorithms tend to perform rather poorly, giving compression rates similar to just packing the DNA sequence using two bits per letter (the alphabet being A, C, G, and T). One effort suggests using a known reference sequence and effectively storing only the differences, which can potentially save a lot of space. Of course, a large part of the data is quality information, for which the reference isn’t going to help, so lossy compression is suggested to deal with that.

Although lossy compression is used successfully in many cases, it depends on knowing in advance which data will not be needed in the future. And using a reference means additional complexity: you need to ensure that the link to the correct reference is maintained. Thus, I made a small experiment to see how far bog-standard, plain old and boring compression would go.

Compressing FASTQ sequences

The FASTQ format is pretty simple. A file is just a series of concatenated records, where each record consists of a header (starting with an ‘@’) giving the read name, then the sequence data, then a repetition of the header (this time starting with a ‘+’), and then the quality data, encoded as ASCII values (using an offset to make them printable characters). So it looks something like this:

@read1
ACGT...
+read1
fffffhhgbbb...
@read2
:
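
To make the record layout concrete, here is a minimal sketch of a FASTQ reader in Python. It assumes plain four-line records (no line wrapping) and a hypothetical file name:

# Minimal FASTQ reader sketch: assumes four-line records without
# line wrapping; "reads.fastq" is a hypothetical file name.
def read_fastq(path):
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:                 # empty string means end of file
                break
            seq = f.readline().rstrip()
            f.readline()                   # the '+' line, repeating the name
            qual = f.readline().rstrip()
            yield header[1:], seq, qual    # drop the leading '@'

for name, seq, qual in read_fastq("reads.fastq"):
    print(name, len(seq), len(qual))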

To check how this compresses, I took a file from a typical Illumina run containing 100M reads and weighing in at about 25GB, and ran various compression programs on it. Here are the results:

Compression   Size (MB)   Relative
none              25899       100%
gzip               8012        31%
bzip2              6421        25%
lzma               6412        25%

So we see the choice of compression algorithm doesn’t matter all that much, although both bzip2 and lzma do better than plain gzip.
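
For the curious, the same comparison can be sketched with Python’s standard-library bindings for the three algorithms. The numbers above come from the command-line tools, the file name is hypothetical, and reading a 25GB file into memory obviously won’t fly, so treat this as the idea only, suitable for a smallish test file:

# Compress the same data with the gzip, bzip2 and lzma algorithms
# via Python's standard library, and report relative sizes.
import gzip, bz2, lzma

data = open("reads.fastq", "rb").read()    # hypothetical (small) input file

for name, compress in [("gzip",  gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma",  lzma.compress)]:
    size = len(compress(data))
    print("%-6s %12d bytes  %3d%%" % (name, size, 100 * size // len(data)))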

Compressing different data separately

Now, the FASTQ format intermixes the different data types, and since they are very different, this decreases the redundancy within a block of a given size and makes a compressor’s dictionary less effective. Let’s instead extract the different components and compress them separately (a small sketch of the splitting follows below). And let’s ignore the +-header lines, since they just repeat the @-headers. This gives us the following table:

Compression   Heads   Seqs   Quals   Relative
none           3790   9184    9136        85%
gzip            460   2691    3602        84%
bzip2           345   2534    3157        94%
lzma            254   2157    3126        86%

The percentages are the sums of the sizes relative to the compressed whole file. Note that the whole file essentially includes two copies of the headers. Except for bzip2, which already managed to compress the whole file very well, the gain is roughly proportional to the size of the removed headers. This is slightly depressing, since you’d expect a good compressor to eliminate the header redundancy anyway.
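
The splitting itself is trivial. A minimal sketch, again assuming plain four-line records and hypothetical file names; the three output files are then compressed separately with gzip, bzip2 and lzma as before:

# Split a FASTQ file into its three components so each can be
# compressed on its own; the '+' lines are dropped since they only
# repeat the '@' headers.  File names are hypothetical.
with open("reads.fastq") as fq, \
     open("heads.txt", "w") as heads, \
     open("seqs.txt", "w") as seqs, \
     open("quals.txt", "w") as quals:
    while True:
        header = fq.readline()
        if not header:             # end of file
            break
        heads.write(header)
        seqs.write(fq.readline())
        fq.readline()              # skip the '+' line
        quals.write(fq.readline())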

Ordering before compression

One downside here is that the order of the sequences is arbitrary. You’d expect compression ratios to improve if similar sequences could be placed close to each other. Let’s try a very simple heuristic for that, namely sorting the sequences (a small sketch of this follows below):

Compression   Seqs (MB)   Relative
none               9184       100%
gzip               1721        64%
bzip2              1965        78%
lzma               1384        64%

Percentages are sizes relative to the size of the compressed, unsorted sequences. Perhaps unsurprisingly, this helps quite a bit, especially for the dictionary-based compression algorithms.
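
The sorting step itself is equally simple. For a 100M-line file the Unix sort command is the natural tool, but as an in-memory sketch in the same vein as above (file names again hypothetical):

# Sort the extracted sequence lines so that identical and similar
# reads end up next to each other before compression.
with open("seqs.txt") as f:
    seqs = f.readlines()

seqs.sort()                        # plain lexicographic sort

with open("seqs.sorted.txt", "w") as out:
    out.writelines(seqs)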

Results

Compression   Heads   Seqs   Quals   Total   Relative
gzip            460   1721    3602    5783        22%
bzip2           345   2534    3157    6036        23%
lzma            254   1384    3126    4764        18%

The clear winner is lzma: it is better in all categories, and produces a total output that is about four fifths the size of the others. More surprising, perhaps, is that gzip produces better compression than bzip2 here, while also being faster and more ubiquitous.

Also interesting: by a straightforward rearrangement of the sequence data, we gain 33% compared to plain compression. A more advanced grouping or clustering might improve this even further; one possibility is aligning the sequences to a reference and ordering them by alignment position. This would likely improve compressibility, but without introducing any explicit dependency on the reference.

Feedback? Please email ketil@malde.org.