Parsing ints
A recurring theme on the Haskell mailing lists is how to quicly parse a file consisting of integers. Often, this comes up in the contest of benchmarking, but a real example of integer-filled files are the quality data that often accompanies Fasta sequence files. When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with readFasta from Bio.Sequence.Fasta) processes at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using readFastaQual) was much slower, about 2-3 MB/s. After some investigation and a few rewrites, it’s up to about 15MB/s, but still pretty far from plain sequence. Below are three (and a half) different versions, and the hope somebody can improve on it even furter.
The original version
I’ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn’t necessery. Bascially, the task is to take a list of ByteString input lines consisting of whitespace separated decimal integers, and build a ByteString consisting of single byte quality values corresponding to those integers.
Below is how the naive quality parsing function might look, unpacking each word to String in order to use read. Note that B is lazy ByteString.Char8, BB is lazy ByteString, that is, the Word8 version.
BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
We don’t expect this to do terribly well, and I guess it’s no surprise when this parses my test file of about 10MB in 24 seconds.
Improved versions
The key to improved performance is first and foremost to avoid the unpacking and parsing of strings. Thankfully, the ByteString library provides a readInt function.
BB.pack [lookup x | x <- concatMap B.words ls]
where
lookup x = case B.readInt x of Just (v,_) -> fromIntegral v
Nothing -> error "Unparsable qual value"
This isn’t a lot more complicated than our initial attempt, but the speed increas is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement. The ByteString implementation will share the storage of the separate words with the original strings, but since readInt gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:
BB.pack $ readInts $ B.unlines ls
where readInts xs = case B.readInt xs of
Just (i,rest) -> fromIntegral i : readInts (B.dropWhile isSpace rest)
Nothing -> []
This turns out to be a bit faster, time is now 1.6 seconds. Another 20% shaved off.
Final version
We’re really only interested in Word8 values, since quality values always are small, and since that’s what gets encoded in the result anyway. The previous versions takes a detour by reading Ints and using fromIntegral to convert them to the desired size. It bears noting that there is no error checking involved, fromIntegral will happily and silently truncate any number beyond its target range. So lets do things explicitly, using Word8s throughout the computation:
BB.pack $ go 0 ls
where
isDigit x = x <= 58 && x >= 48
go i (s:ss) = case BB.uncons s of
Just (c,rs) -> if isDigit c then go (c - 48 + 10*i) (rs:ss)
else let rs' = BB.dropWhile (not . isDigit) rs
in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
Nothing -> i : go 0 ss
go _ [] = []
This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best readInt version, and about 25 times faster than the naive version. Still, 10MB/s is well below the average hard disk.
So is there more room for improvement? The most obvious wart to me is the rather artificial splitting into lines. This is mostly an artifact of some early design desicions, and it should be possible to eliminate the splitting earlier on and saving even more time by simplify this function quite a bit.
If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.
Edit: Since some people have asked, I’ve wrapped up a simple test program along with some test files at http://malde.org/~ketil/biohaskell/qualparsetest. This is a simplified version, if you want to be really helpful, you could always look at Bio.Sequence.Fasta in the Bioinformatics library and see if you can speed up e.g. dephd -i input.fasta input.qual -F /dev/null.
Leave a Reply
You must be logged in to post a comment.