Optimization week: making Haskell go faster

by Ketil Malde; May 18, 2008

It seems to be optimization week on the haskell-café mailing list. Inspired by a combination of Don Stewart’s blog post about how to optimize for speed and the sorry performance of my xml2x program, I thought this would be a good time to see if things could be improved. In its current state, xml2x takes a few hours to process a few days’ worth of BLASTX output, so it’s far from critical, but faster is always better, and reading Don’s post, the intermediate output from the compiler, a.k.a. GHC core, didn’t really look so scary.

Profiling with Haskell

First we need to identify the actual bottlenecks, using GHC’s profiling options. This means compiling the program with -prof -auto-all, and the Makefile shipped with xml2x already has a target for this. (I could probably do it with Cabal as well, but I never bothered to find out. Lazy me.) One nice thing about the bioinformatics library is that its Cabal file contains this line:

ghc-options:         -Wall -O2 -fexcess-precision -funbox-strict-fields -auto-all

The final -auto-all ensures that when the library is compiled with profiling support, cost centres are assigned to its internal functions too, so the profiling run reveals in more detail where time is being spent. This practice is probably frowned upon by people who are into that kind of thing, but here it is crucial, as the profile reveals this:

Fri May 16 18:28 2008 Time and Allocation Profiling Report  (Final)

xml2x.p +RTS -p -sxml2x.p2.stats -RTS -v -C --reg --go-def=GO.terms_and_ids
    --anno=gene_association.goa_uniprot small.xml -o small3.csv

total time  =     4073.12 secs   (203656 ticks @ 20 ms)
total alloc = 1,082,832,292,360 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

mkAnn                          Bio.Sequence.GOA        60.8   56.2
readXML                        Bio.Alignment.BlastXML  27.4   38.4
hsp2match                      Bio.Alignment.BlastXML   3.1    1.7
readGOA                        Bio.Sequence.GOA         1.9    0.4
hit2hit                        Bio.Alignment.BlastXML   1.6    1.8
protTerms                      Main                     1.6    0.0
sequence'                      Bio.Util                 1.1    0.2

In short, the function named mkAnn consumes almost two thirds of the time here, and is an obvious target for closer inspection. The function looks like this:

data Annotation = Ann !UniProtAcc !GoTerm !EvidenceCode deriving (Show)
newtype GoTerm = GO Int deriving (Eq,Ord)
type UniProtAcc = ByteString

mkAnn :: ByteString -> Annotation
mkAnn = pick . B.words
    where pick (_db:up:rest) = pick' up $ findGo rest
          pick' up' (go:_:ev:_) = Ann (copy up') (read $ unpack go) (read $ unpack ev)
          findGo = dropWhile (not . B.isPrefixOf (pack "GO:"))

This is a slightly clunky way of extracting a bit of information from a line of text; the text in question is GOA, a mapping between proteins and standard Gene Ontology terms. It’s a text file with about 20 million lines, so we use lazy bytestrings to deal with it. Each line is broken into words; the second word (“up”) is the protein name. A variable number of words later, the GO term itself appears, and the final item we want, the evidence code, is the second word after the GO term (the pattern in pick' skips one word in between). Note the copy, which ensures that we don’t keep a large part of the input file alive just to hold on to one short substring.
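To make that concrete, here is roughly what such a line looks like and what mkAnn extracts from it. The line itself is a made-up illustration; real GOA files are tab-separated and carry more columns:

-- A hypothetical GOA-style line (fields abbreviated for illustration):
exampleAnn :: Annotation
exampleAnn = mkAnn (pack "UniProtKB Q4VCS5 AMOT GO:0005515 PMID:11257124 IPI")
-- B.words splits the line, 'up' binds the accession "Q4VCS5", findGo skips
-- ahead to "GO:0005515", and the evidence code "IPI" is picked out one word
-- later, giving, in effect, Ann "Q4VCS5" (GO 5515) IPI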

So what’s wrong here? When I asked in the café, Don suggested that unpacking bytestrings is a less than recommendable practice. Still, I find it a useful idiom, and the characters produced by read $ unpack s should be garbage-collected as soon as they are consumed; ideally, the intermediate string would be deforested, or fused, away. I happily agree that this is a lazy programmer’s approach and not optimal - but on the other hand, it shouldn’t be that costly?
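For concreteness, the idiom under suspicion boils down to this (a minimal sketch, using the names from the code above):

-- unpack lazily converts the ByteString into an ordinary String,
-- allocating one list cell per character, and 'read' then consumes
-- that String. There is no fusion between the two.
readTerm :: ByteString -> GoTerm
readTerm = read . unpack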

GHC core to the rescue - not.

This is the stage where I looked at GHC core. I guess it’s one of those things that looks easy when professionals do it, but when you get home and try it on your own, you just have no idea. The three-line function above resulted in four pages of cryptic core output. Call me lazy and ignorant, but I decided to take just one more little peek at the source first.
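For the record, the core came out of GHC with something along these lines (the module path here is my assumption):

ghc -O2 -ddump-simpl Bio/Sequence/GOA.hs > GOA.core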

Let’s check out those read-unpack constructs. The read instance for GoTerm checks that the string starts with “GO:”, and then reads an Int. The second reads an EvidenceCode, and since these are defined as short abbreviations in all upper case, I just wrote a data type with corresponding nullary constructors, and derived Read for it.

data EvidenceCode = IC  -- Inferred by Curator
                  | IDA -- Inferred from Direct Assay
                  | IEA -- Inferred from Electronic Annotation
                  | IEP -- Inferred from Expression Pattern
                  | IGC -- Inferred from Genomic Context
                  | IGI -- Inferred from Genetic Interaction
                  | IMP -- Inferred from Mutant Phenotype
                  | IPI -- Inferred from Physical Interaction
                  | ISS -- Inferred from Sequence or Structural Similarity
                  | NAS -- Non-traceable Author Statement
                  | ND  -- No biological Data available
                  | RCA -- Inferred from Reviewed Computational Analysis
                  | TAS -- Traceable Author Statement
                  | NR  -- Not Recorded
   deriving (Read,Show,Eq)
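For reference, the derived instance is what the read $ unpack ev call above exercises; a trivial sketch:

-- Derived Read goes through the generic ReadS machinery, matching
-- constructor names as strings; flexible, but allocation-heavy.
parseEC :: String -> EvidenceCode
parseEC = read   -- e.g. parseEC "IEA" == IEA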

Maybe this wasn’t such a good idea after all? Let’s try with custom, bytestring-based parsers for EvidenceCode and GoTerm:

getEC :: ByteString -> EvidenceCode
getEC s = case B.uncons s of
            Just ('I',s') -> case B.uncons s' of
                               Just ('C',_) -> IC
                               Just ('D',_) -> IDA
                               Just ('E',s'') -> case B.head s'' of
                                                   'A' -> IEA
                                                   'P' -> IEP
                                                   _   -> e 1
                               Just ('G',s'') -> case B.head s'' of
                                                   'C' -> IGC
                                                   'I' -> IGI
                                                   _   -> e 2
                               Just ('M',_) -> IMP
                               Just ('P',_) -> IPI
                               Just ('S',_) -> ISS
                               _ -> e 3
            Just ('N',s') -> case B.head s' of
                               'A' -> NAS
                               'D' -> ND
                               'R' -> NR
                               _   -> e 4
            Just ('R',_) -> RCA
            Just ('T',_) -> TAS
            _ -> e 5
  where e :: Int -> a
        e n = error ("Illegal GO evidence code ("++show n++"): "++unpack s)

getGo :: ByteString -> GoTerm
getGo bs = GO $ fst $ maybe e id (B.readInt $ B.drop 3 bs)
    where e = error ("Unable to parse GO term: "++unpack bs)
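A quick sanity check of the hand-rolled parsers (the expected values follow from the format described earlier):

testParsers :: Bool
testParsers = getEC (pack "IEA") == IEA              -- dispatches on raw bytes
           && getGo (pack "GO:0005515") == GO 5515   -- readInt, no String detour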

Time to re-run with profiling:

Sat May 17 19:25 2008 Time and Allocation Profiling Report  (Final)

xml2x.p +RTS -sxml2x.2pp.stats -p -RTS -v -C --reg --go-def=GO.terms_and_ids
--anno=gene_association.goa_uniprot small.xml -o small5p.csv

total time  =     2022.36 secs   (101118 ticks @ 20 ms)
total alloc = 622,807,651,828 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

readXML                        Bio.Alignment.BlastXML  48.6   66.7
mkAnn                          Bio.Sequence.GOA        29.8   23.6
hsp2match                      Bio.Alignment.BlastXML   5.2    3.0
readGOA                        Bio.Sequence.GOA         3.6    0.7
hit2hit                        Bio.Alignment.BlastXML   3.2    3.0
protTerms                      Main                     3.1    0.0
sequence'                      Bio.Util                 2.1    0.4
countIO                        Bio.Util                 1.4    0.6
getFrom                        Bio.Alignment.BlastXML   1.1    0.7

We see that mkAnn has fallen well behind the XML parsing in time consumption, and until Neil gets around to doing a bytestring version of his otherwise excellent tagsoup library, there isn’t all that much left to do. Total time has been cut in half, from over an hour to 35 minutes or so. There’s still mkAnn’s 30% of the run time that could be shaved more closely; perhaps you can suggest the remaining culprit? My guess would be the findGo function. How should it be rewritten?
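Since that is left as an open question, here is one untested guess rather than a measured fix: make sure the packed "GO:" prefix is built exactly once, by hoisting it to the top level:

-- An untested sketch: whether GHC already shares the pack "GO:" inside
-- the original findGo is precisely the kind of thing the core dump
-- would reveal; hoisting it removes the doubt.
goPrefix :: ByteString
goPrefix = pack "GO:"

findGo' :: [ByteString] -> [ByteString]
findGo' = dropWhile (not . B.isPrefixOf goPrefix)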

Lessons learned

Profiling is nice for pinpointing the hotspots. Bytestrings are the way to go for fast code. GHC generates crappy derived parsers for data types, perhaps especially for those with many constructors. Writing manual parsers is tedious, but a bit of tedium can go a long way.

Unfortunately, I am still not entirely happy with xml2x. The profiling numbers lie, as they don’t – I believe – include GC time (the +RTS -s statistics report that separately), and this particular program uses an inordinate amount of time collecting garbage. I suspect it is because I store GO terms as 16-bit integers in unboxed arrays inside an insufficiently strict Data.Map, but I will have to investigate this more closely.

But not today.

Feedback? Please email ketil@malde.org.