The FASTQ file format for sequences

by Ketil; September 9, 2008

It was just brought to my attention that people have started to use a new file format for sequences. This format, called ‘FastQ’ combines both the sequence data itself and the quality data in one file. That’s a nice idea, and I implemented support for it, tests, docs and all, in the bio library. Runs fast, too. Basically, the format is a sequence of records, each one similar to this:

@{sequence header}
{sequence data}
+{sequence header}
{quality data}

Note that the sequence header is repeated in there, apparently somebody thought that would be a good idea. The {sequence data} part looks like it does in a Fasta file, except that here it has to be on a single line. The {quality data} is ASCII, each letter representing the quality value 33 lower than it’s ASCII value. This opens up another possibility of getting it wrong, since the line of quality data can (and will!) start out with + or @ occasionally.

Anyway, the implementation seems to be pretty efficient, I wrote a simple program to count the number of sequences in a file:

module Main where

import Bio.Sequence
import System

main = do
   [f] <- getArgs
   print . length =<<; readFastQ f

Testing it on this 440MB file ran in (first cold, then warm cache):

./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.77s user 0.41s system 53% cpu 4.091 total
./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.26s user 0.19s system 89% cpu 1.631 total

For comparison, I also tried it with grep:

grep '^@' ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  5.61s user 0.46s system 98% cpu 6.181 total

..and in addition to being 50% slower, it gives the wrong answer, since the @ delimiter may occur in the quality data. Another nice thing is that the Haskell program is IO bound, so it would be even faster if I had a better disk. (Note to self: talk to boss about getting an SSD for my laptop).

Update: I did a quick test, comparing it to the old Fasta format. First, to convert, I replaced the last line in the program above with

readFastQ f >>= writeFasta (f++".fasta")

and compiled it as convertFQ. Running this:

 ./convertFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq 9.51s user 2.76s system 40% cpu 30.037 total

Then, I made a countFa by changing the last line to:

print . length =<< readFasta f

Running this on the Fasta file generated just now (282MB), I get:

./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq 1.26s user 0.19s system 89% cpu 1.631 total

Here, grep takes 2.25 seconds user time (and gives the correct answer), we’re still faster.

The only cloud on the horizon is that there is some disagreement about the format. For example, somebody thinks quality always should start with a ‘!’ (viz zero). Maq developers think it’s okay to drop the repeated sequence name. Rockefeller thinks the quality data should be text digits, like the Qual format. And the good people at Solexa had to go and slightly alter…not the format itself, but it’s interpretation, using a different formula to calculate the error probabilities from the quality values. So basically, given a file, there’s no way to know whether it uses Solexa-style quality information, or regular, Phred-style quality. If you get quality values above 60, then maybe you’re interpreting it wrong. Then again, maybe not. Sigh.

comments powered by Disqus

Feedback? Please email ketil@malde.org.