BHLog

bioinformatics and haskell

24  10 2008

Optimization again: befuddled by bytestrings

I’ve been spending the last couple of weeks working on an indexing scheme for sequences, using Bryan O’Sullivan’s Bloom filters.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn’t have been so strange, were it not for the fact that I saw the expected linear time usage when I ran the code.  Some more digging about revealed that my laptop also showed quadratic scaling.  Profiling showed the  culprit to be a simple pipeline-style function:

swords s = take (fromIntegral (seqlength s)+1-k) . map (B.take k) . B.tails . B.map toUpper $ seqdata s

Here, seqdata returns a lazy bytestring, which is also what’s hiding behind the B. qualifier.  Basically, this just builds the list of all lenght-k substrings.  This should, in my opinion, stream nicely, and run in constant space and linear time.  What on earth makes this quadratic?  On only some systems?   Time to dissect the systems in question:

Comparing our environments, one obvious difference was that the fast system was built using GHC 6.8.1, while the slow ones were 6.8.2.  So we checked with a 6.8.1 snapshot - no difference, still slow.  We checked the various source repositories involved for some stray patch that somehow had avoided distribution - nothing turned up. Comparing Cabal setup files didn’t reveal anything obvious, except that the newer Cabal emitted a lot more information.

Somewhere, something had to be different. Could it be bloomfilter not being a good consumer for the generated words?  The bio library messing up FASTA parsing?  Something else entirely?

Looking at the swords function, it looks like a good candidate for deforestation and/or fusion.  Could there be some optimization not being performed in some cases, for some strange reasons?  Since I remember fusion being mentioned occasionally in the context of bytestrings, I checked the libraries again, and found something I’d missed previously:  Bytestring 0.9.1 on the fast system, 0.9.0.1 on the slow ones.  While the announcement didn’t promise more than a few percent better performance, no stone could be left unturned.  And this proved to be it: after upgrading bytestring, and recompiling binary, biolib, and bloomfilter to use it, my laptop was as fast as the server.

I’m not sure exactly what caused this, but apparently there were some issues with bytestring fusion, and fusion is disabled in current bytestring versions. At any rate, it seems the newer version is safer, this is now the minimum requirement in bio.cabal.


Comments are closed.

« The FastQ file format for sequences 454 sequencing and parsing the SFF binary format »