454 sequencing and parsing the SFF binary format

by Ketil Malde; November 14, 2008

Roche’s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude. Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify – and maybe also reduce the number or severity of – errors. This means reading the binary SFF format. Below, we’ll dissect the SFF format, and describe a Haskell implementation.

The SFF file format is documented in the GS FLX documentation, pages 445-448. There also exists an open C implementation in io_lib from the Staden package, in addition to the proprietary implementation provided to Roche's customers, bundled with the sequencing machine.

Structure and implementation

The SFF format is a relatively straightforward one, consisting of a header (the common header), followed by a number of read blocks corresponding to the individual sequences. Each read block contains a header block and a data block. There is also an optional index, whose format is not defined as part of the SFF format.

Here’s the direct translation to Haskell:

data SFF = SFF CommonHeader [ReadBlock]

data CommonHeader = CommonHeader {
      magic                                   :: Word32
    , version                                 :: Word32
    , index_offset                            :: Int64
    , index_length, num_reads                 :: Int32
    , cheader_length, key_length, flow_length :: Int16
    , flowgram_fmt                            :: Word8
    , flow, key                               :: ByteString
    }

data ReadHeader = ReadHeader {
      rheader_length, name_length           :: Int16
    , num_bases                             :: Int32
    , clip_qual_left, clip_qual_right
    , clip_adapter_left, clip_adapter_right :: Int16
    , read_name                             :: ByteString
    }

data ReadBlock = ReadBlock {
      read_header                  :: ReadHeader
      -- The data block
    , flowgram                     :: [Int16]
    , flow_index, bases, quality   :: ByteString
    }

We can almost get by with these data structures and straightforward Binary instances. One slightly complicating feature is that each block must be aligned to an 8-byte boundary. Another is that in some places we need information (lengths and counts) from previously read data, since none of the arrays are size-prefixed.
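The alignment requirement can be handled with a small helper in Data.Binary's Get monad. This is a sketch, not the library's actual code; the name alignTo8 is hypothetical:

```haskell
import Data.Binary.Get (Get, bytesRead, runGet, skip)
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: skip padding bytes so that the next field
-- starts on an 8-byte boundary, as the SFF format requires.
alignTo8 :: Get ()
alignTo8 = do
  pos <- bytesRead
  skip (fromIntegral ((8 - pos `mod` 8) `mod` 8))
```

Calling alignTo8 after each variable-length block keeps the parser in step with the padded on-disk layout.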

Note that the data structures above represent the data on disk very closely; later on, static items like the magic number and version will be removed and hardwired into the code instead.
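For instance, the magic number can be verified during decoding instead of being stored in the record. A sketch (checkMagic is a hypothetical name; the SFF magic number is the string ".sff" read as a big-endian Word32):

```haskell
import Control.Monad (when)
import Data.Binary.Get (Get, getWord32be, runGet)
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical sketch: fail early unless the file starts with the
-- SFF magic number 0x2E736666, i.e. ".sff" as a big-endian Word32.
checkMagic :: Get ()
checkMagic = do
  m <- getWord32be
  when (m /= 0x2E736666) $ fail "not an SFF file"
```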

Testing and optimization

While I love to do automated unit testing with QuickCheck, it’s not so straightforward to generate random instances for complex structures like flowgrams. So in order to test the program, I added two functions to the library: ‘test’ and ‘convert’. The former just prints the information from the CommonHeader and the first two ReadBlocks; I can then use sffinfo or a similar tool to check that the two programs agree. The latter reads an SFF file, builds the SFF data structure, and then serializes it back to a file (modulo the index).
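The idea behind ‘convert’ can be stated as a roundtrip property: decoding and then re-encoding should reproduce the input bytes (for SFF, modulo the optional index). A minimal, self-contained illustration, with a plain [Int] standing in for the SFF structure:

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

-- Decoding followed by encoding should be the identity on the bytes.
-- Here [Int] is a stand-in for the SFF structure, which would need
-- its own Binary instance.
roundtrips :: BL.ByteString -> Bool
roundtrips bs = encode (decode bs :: [Int]) == bs
```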

I also wrote a small application, ‘flower’, that reads SFF files and provides various outputs. The first is just a table with all the flow values, which will be useful for statistical analysis of these data. This immediately revealed a performance problem (the file contains about 120 000 sequences, each with 168 flow values):

./flower ../biolib/DZX0XNV01.sff > /dev/null 265.68s user 1.08s system 99% cpu 4:28.76 total

The good news is that reading the SFF file is quite fast, the bad news is that formatting the information takes a relatively large amount of time. Here’s the offending code:

showread :: CommonHeader -> ReadBlock -> [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map (\(p,c,v) -> printf "%s\t%d\t%s\t%1.2f" rn p [c] (fi v))
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -> Double
fi = (/100) . fromIntegral

The main culprit turned out to be ‘printf’, which I suspect needs to interpret the format string every time. Without printf, we also need another way to format the floating point values:

showread :: CommonHeader -> ReadBlock -> [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map ((p,c,v) -> concat [rn,"t",show p,"t",[c],"t",fi v]) -- printf "%st%dt%st%1.2f" rn p [c] (fi v))
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -> String
fi = (\x -> showFFloat (Just 2) x "") . (/100) . fromIntegral

Running this gives us:

./flower ../biolib/DZX0XNV01.sff > /dev/null 192.65s user 0.69s system 99% cpu 3:15.01 total

Of course, we’re still building up Strings – that is, linked lists of unicode characters – which is very often a performance problem. Let’s try the universal string fastifier: Data.ByteString. And while we could add a ‘pack’ to the floating point formatting, we’ll try to build the ByteString directly:

showread :: CommonHeader -> ReadBlock -> [ByteString]
showread h rd = let rn = read_name $ read_header rd
                in map (\(p,c,v) -> B.concat [rn,t,B.pack (show p),t,B.pack [c],t,fi v])
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -> ByteString
fi i = let (a,x) = i `divMod` 1000
           (b,y) = x `divMod` 100
           (c,d) = y `divMod` 10
           mkdigit = chr . (+48) . fromIntegral
       in B.pack [mkdigit a,mkdigit b,'.',mkdigit c,mkdigit d]

The variable t is just a ByteString containing a tab character (t = B.pack "\t"). Let’s benchmark it again:

./flower ../biolib/DZX0XNV01.sff > /dev/null 55.70s user 0.35s system 99% cpu 56.217 total

It turns out that floating point formatting is still a problem.  Let’s replace it with a table lookup:

fi = (!) farray

farray :: Array Int16 ByteString
farray = listArray (0,9999) [B.pack (showFFloat (Just 2) (fromIntegral i / 100 :: Double) "") | i <- [0..9999::Int]]

This is about as fast as it gets:

./flower -f ../biolib/DZX0XNV01.sff > /dev/null 39.08s user 0.26s system 99% cpu 39.573 total

Note that it generates about half a gigabyte of output, at a rate of more than 10MB/s – faster than wc can count it. Flower can also extract the reads (Fasta format, takes 0.7 seconds, approximately 200 000 sequences per second), reads with quality (Fastq format, about the same speed), or a histogram of flow value frequencies (somewhat slower at 2.2 seconds).


The bioinformatics library and the flower source code are available here, and there’s also a linux executable for flower, which I may be updating as development progresses.

Feedback? Please email ketil@malde.org.