Optimization week: making Haskell go faster
by Ketil Malde; May 18, 2008
It seems to be optimization week on the Haskell café mailing list. Inspired by a combination of Don Stewart's blog post about how to optimize for speed and the sorry performance of my xml2x program, I thought this would be a good time to see if things could be improved. In its current state, xml2x takes a few hours to process a few days' worth of BLASTX output, so it's far from critical, but faster is always better, and from reading Don's post, the intermediate output from the compiler, a.k.a. GHC core, didn't really look so scary.
Profiling with Haskell
First we need to identify the actual bottlenecks, using GHC's profiling options. This means compiling the program with -prof -auto-all, and the Makefile shipped with xml2x already has a target for this. (I could probably do it with Cabal as well, but I never bothered to find out. Lazy me.) One nice thing about the bioinformatics library is that its Cabal file contains this line:
ghc-options: -Wall -O2 -fexcess-precision -funbox-strict-fields -auto-all
The final -auto-all ensures that when the library gets compiled with profiling support, cost centres will be assigned to its internal functions too, so the profiling run will reveal in more detail where time is being spent. This practice is probably frowned upon by people who are into that kind of thing, but here it is crucial, as the profile reveals:
        Fri May 16 18:28 2008 Time and Allocation Profiling Report  (Final)

           xml2x.p +RTS -p -sxml2x.p2.stats -RTS -v -C --reg --go-def=GO.terms_and_ids
                   --anno=gene_association.goa_uniprot small.xml -o small3.csv

        total time  =    4073.12 secs   (203656 ticks @ 20 ms)
        total alloc = 1,082,832,292,360 bytes  (excludes profiling overheads)

COST CENTRE   MODULE                   %time  %alloc

mkAnn         Bio.Sequence.GOA          60.8    56.2
readXML       Bio.Alignment.BlastXML    27.4    38.4
hsp2match     Bio.Alignment.BlastXML     3.1     1.7
readGOA       Bio.Sequence.GOA           1.9     0.4
hit2hit       Bio.Alignment.BlastXML     1.6     1.8
protTerms     Main                       1.6     0.0
sequence'     Bio.Util                   1.1     0.2
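As an aside: -auto-all is not the only way to get cost centres; they can also be placed by hand with GHC's SCC pragma. A minimal sketch, using a made-up function rather than anything from xml2x:

-- A made-up example of a hand-placed cost centre, as an alternative to -auto-all.
module Main where

import Data.List (foldl')

mySum :: [Int] -> Int
mySum xs = {-# SCC "mySum" #-} foldl' (+) 0 xs

main :: IO ()
main = print (mySum [1 .. 1000000])

Compiled with -prof and run with +RTS -p, the "mySum" cost centre shows up in the resulting .prof file just like the ones generated by -auto-all.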
In short, the function named mkAnn consumes almost two thirds of the time here, and is an obvious target for closer inspection. The function looks like this:
data Annotation = Ann !UniProtAcc !GoTerm !EvidenceCode deriving (Show)
newtype GoTerm = GO Int deriving (Eq,Ord)
type UniProtAcc = ByteString

mkAnn :: ByteString -> Annotation
mkAnn = pick . B.words
    where pick (_db:up:rest)    = pick' up $ findGo rest
          pick' up' (go:_:ev:_) = Ann (copy up') (read $ unpack go) (read $ unpack ev)
          findGo = dropWhile (not . B.isPrefixOf (pack "GO:"))
This is a slightly clunky way of extracting a bit of information from a line of text; the text in question is GOA, a mapping between proteins and standard Gene Ontology terms. It's a text file with about 20 million lines, so we use lazy bytestrings to deal with it. Each line is broken into words; the second word ("up") is the protein name. A variable number of words later, the GO term itself appears, and the final item we want is the third word after the GO term. Note the copy, ensuring that we don't keep around a large part of the input file.
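To see why the copy matters, here is a minimal sketch (keepAccession is a made-up helper, not part of xml2x): each word produced by B.words is a slice sharing the underlying chunk of the input line, so storing it as-is would keep the whole chunk alive, while B.copy gives us a small, independent bytestring.

import qualified Data.ByteString.Lazy.Char8 as B

-- Made-up helper: keep only the second field of a line, without retaining
-- the rest of the underlying chunk.
keepAccession :: B.ByteString -> B.ByteString
keepAccession line = B.copy (B.words line !! 1)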
So what's wrong here? Asking in the café, Don suggested that unpacking bytestrings is a less than recommendable practice. Still, I find it a useful idiom, and the read $ unpack s should let the characters be garbage-collected as soon as they are consumed. Ideally, the string should be deforested or fused away entirely. I happily agree that this is a lazy programmer's approach and not optimal - but on the other hand, should it really be that costly?
GHC core to the rescue - not.
This is the stage where I looked at GHC core. I guess it’s one of those things that looks easy when professionals do it, but when you get home and try it on your own, you just have no idea. The three-line function above resulted in four pages of cryptic core output. Call me lazy and ignorant, but I decided to take just one more little peek at the source first.
Let's check out those read-unpack constructs. The Read instance for GoTerm checks that the string starts with "GO:", and then reads an Int. The second read is for an EvidenceCode; since these are defined as short abbreviations in all upper case, I just wrote a data type with corresponding nullary constructors and derived Read for it.
data EvidenceCode = IC  -- Inferred by Curator
                  | IDA -- Inferred from Direct Assay
                  | IEA -- Inferred from Electronic Annotation
                  | IEP -- Inferred from Expression Pattern
                  | IGC -- Inferred from Genomic Context
                  | IGI -- Inferred from Genetic Interaction
                  | IMP -- Inferred from Mutant Phenotype
                  | IPI -- Inferred from Physical Interaction
                  | ISS -- Inferred from Sequence or Structural Similarity
                  | NAS -- Non-traceable Author Statement
                  | ND  -- No biological Data available
                  | RCA -- Inferred from Reviewed Computational Analysis
                  | TAS -- Traceable Author Statement
                  | NR  -- Not Recorded
                    deriving (Read,Show,Eq)
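For reference, the Read instance for GoTerm described above might look roughly like this (a sketch of the behaviour, not the library's actual code):

-- A sketch, assuming the GoTerm newtype from above is in scope.
instance Read GoTerm where
  readsPrec _ s = case splitAt 3 s of
                    ("GO:", rest) -> [ (GO n, r) | (n, r) <- reads rest ]
                    _             -> []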
Maybe relying on Read wasn't such a good idea after all? Let's try custom, bytestring-based parsers for EvidenceCode and GoTerm instead:
getEC :: ByteString -> EvidenceCode
getEC s = case B.uncons s of
    Just ('I',s') -> case B.uncons s' of
        Just ('C',_)   -> IC
        Just ('D',_)   -> IDA
        Just ('E',s'') -> case B.head s'' of 'A' -> IEA
                                             'P' -> IEP
                                             _   -> e 1
        Just ('G',s'') -> case B.head s'' of 'C' -> IGC
                                             'I' -> IGI
                                             _   -> e 2
        Just ('M',_)   -> IMP
        Just ('P',_)   -> IPI
        Just ('S',_)   -> ISS
        _              -> e 3
    Just ('N',s') -> case B.head s' of 'A' -> NAS
                                       'D' -> ND
                                       'R' -> NR
                                       _   -> e 4
    Just ('R',_)  -> RCA
    Just ('T',_)  -> TAS
    _             -> e 5
  where e :: Int -> a
        e n = error ("Illegal GO evidence code ("++show n++"): "++unpack s)

getGo :: ByteString -> GoTerm
getGo bs = GO $ fst $ maybe e id (B.readInt $ B.drop 3 bs)
  where e = error ("Unable to parse GO term "++unpack bs)
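As a quick sanity check, the new parsers can be exercised on a couple of hand-made inputs (a hypothetical test, not part of xml2x; it assumes the definitions above are in scope):

-- Hypothetical test values; both comparisons should evaluate to True.
testParsers :: Bool
testParsers = getEC (B.pack "IEA") == IEA
           && getGo (B.pack "GO:0005515") == GO 5515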
Time to re-run with profiling:
        Sat May 17 19:25 2008 Time and Allocation Profiling Report  (Final)

           xml2x.p +RTS -sxml2x.2pp.stats -p -RTS -v -C --reg --go-def=GO.terms_and_ids
                   --anno=gene_association.goa_uniprot small.xml -o small5p.csv

        total time  =    2022.36 secs   (101118 ticks @ 20 ms)
        total alloc = 622,807,651,828 bytes  (excludes profiling overheads)

COST CENTRE   MODULE                   %time  %alloc

readXML       Bio.Alignment.BlastXML    48.6    66.7
mkAnn         Bio.Sequence.GOA          29.8    23.6
hsp2match     Bio.Alignment.BlastXML     5.2     3.0
readGOA       Bio.Sequence.GOA           3.6     0.7
hit2hit       Bio.Alignment.BlastXML     3.2     3.0
protTerms     Main                       3.1     0.0
sequence'     Bio.Util                   2.1     0.4
countIO       Bio.Util                   1.4     0.6
getFrom       Bio.Alignment.BlastXML     1.1     0.7
We see that mkAnn has fallen well behind the XML parsing in time consumption, and until Neil gets around to doing a bytestring version of his otherwise excellent tagsoup library, there isn't all that much left to do about that. Total time has been cut in half, from over an hour to 35 minutes or so. mkAnn still accounts for about 30% of the run time that could be shaved more closely; perhaps you can suggest the remaining culprit? My guess would be the findGo function. How should it be rewritten?
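For what it's worth, one possibility is to hoist the packed prefix to the top level so it is guaranteed to be built only once and shared, rather than relying on the compiler to float it out of findGo. An untested sketch:

-- Untested sketch of a possible findGo rewrite; the behaviour is unchanged.
goPrefix :: ByteString
goPrefix = pack "GO:"

findGo :: [ByteString] -> [ByteString]
findGo = dropWhile (not . B.isPrefixOf goPrefix)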
Lessons learned
Profiling is nice for pinpointing the hotspots. Bytestrings are the way to go for fast code. GHC generates crappy derived Read parsers for data types, perhaps especially so for data types with many constructors. Writing manual parsers is tedious, but a bit of tedium can go a long way.
Unfortunately, I am still not entirely happy with xml2x. The profiling numbers lie, as they don't (I believe) include GC time, and this particular program spends an inordinate amount of time collecting garbage. I suspect it is because I store GO terms as 16-bit integers in unboxed arrays inside an insufficiently strict Data.Map, but I will have to investigate this more closely.
But not today.