<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell &#187; Optimization</title>
	<atom:link href="http://blog.malde.org/index.php/category/optimization/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Tue, 20 Jul 2010 15:04:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Parsing ints</title>
		<link>http://blog.malde.org/index.php/2009/08/31/parsing-ints/</link>
		<comments>http://blog.malde.org/index.php/2009/08/31/parsing-ints/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 13:06:35 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/08/31/parsing-ints/</guid>
		<description><![CDATA[A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/f/ff/Cempedak_opened1.JPG" alt="Artocarpus integer" align="right" width="376" height="250" />A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the <em>quality</em> data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with <strong>readFasta</strong> from <strong>Bio.Sequence.Fasta</strong>) runs at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using <strong>readFastaQual</strong>) was much slower, about 2-3 MB/s.   After some investigation and a few rewrites, it&#8217;s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on it even further.</p>
<p><span id="more-48"></span></p>
<h3>The original version</h3>
<p>I&#8217;ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn&#8217;t necessary.  Basically, the task is to take a list of ByteString input lines consisting of whitespace-separated decimal integers, and build a ByteString consisting of single-byte quality values corresponding to those integers.</p>
<p>Below is how the naive quality parsing function might look, unpacking each word to <strong>String</strong> in order to use <strong>read</strong>.  Note that <strong>B</strong> is lazy ByteString.Char8, <strong>BB</strong> is lazy ByteString, that is, the <strong>Word8</strong> version.</p>
<pre><tt>BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
</tt></pre>
<p>We don&#8217;t expect this to do terribly well, and I guess it&#8217;s no surprise when this parses my test file of about 10MB in 24 seconds.</p>
<h3>Improved versions</h3>
<p>The key to improved performance is first and foremost to avoid the unpacking and parsing of strings.  Thankfully, the ByteString library provides a <strong>readInt</strong> function.</p>
<pre><tt>
BB.pack [lookup x | x &lt;- concatMap B.words ls]
    where
    lookup x = case B.readInt x of Just (v,_) -&gt; fromIntegral v
                                   Nothing -&gt; error "Unparsable qual value"
</tt></pre>
<p>This isn&#8217;t a lot more complicated than our initial attempt, but the speed increase is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement.  The ByteString implementation will share the storage of the separate words with the original strings, but since <strong>readInt</strong> gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:</p>
<pre><tt>
BB.pack $ readInts $ B.unlines ls
    where readInts xs = case B.readInt xs of
                          Just (i,rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
                          Nothing -&gt; []
</tt></pre>
<p>This turns out to be a bit faster, time is now 1.6 seconds.  Another 20% shaved off.</p>
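<p>For anyone who wants to run the variants above outside the library, here is a minimal self-contained sketch.  It assumes that <strong>B</strong> and <strong>BB</strong> are imported as described earlier; the names <strong>naive</strong>, <strong>viaReadInt</strong> and <strong>sample</strong> are made up for illustration and are not part of the library.</p>
<pre><tt>
import qualified Data.ByteString.Lazy       as BB  -- Word8 view
import qualified Data.ByteString.Lazy.Char8 as B   -- Char view
import Data.Char (isSpace)

-- Naive variant: unpack every word to a String and use read.
naive :: [B.ByteString] -&gt; BB.ByteString
naive ls = BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls

-- readInt variant: parse from the joined input and continue
-- with whatever readInt leaves unparsed.
viaReadInt :: [B.ByteString] -&gt; BB.ByteString
viaReadInt ls = BB.pack $ readInts $ B.unlines ls
  where
    readInts xs = case B.readInt xs of
      Just (i, rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
      Nothing        -&gt; []

-- Hypothetical input: two lines of quality values.
sample :: [B.ByteString]
sample = map B.pack ["40 40 38", "12 7"]
</tt></pre>
<p>The two functions agree on well-formed input, but they fail differently: <strong>naive</strong> throws on a word that is not a number, while <strong>viaReadInt</strong> silently stops at the first thing it cannot parse.</p>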
<h3>Final version</h3>
<p>We&#8217;re really only interested in <strong>Word8</strong> values, since quality values are always small, and since that&#8217;s what gets encoded in the result anyway.  The previous versions take a detour by reading <strong>Int</strong>s and using <strong>fromIntegral</strong> to convert them to the desired size.  It bears noting that there is no error checking involved: <strong>fromIntegral</strong> will happily and silently truncate any number beyond its target range.  So let&#8217;s do things explicitly, using <strong>Word8</strong>s throughout the computation:</p>
<pre><tt>
BB.pack $ go 0 ls
    where
    -- ASCII '0'..'9' are 48..57 (58 would be ':')
    isDigit x = x &gt;= 48 &amp;&amp; x &lt;= 57
    go i (s:ss) = case BB.uncons s of
                    Just (c,rs) -&gt; if isDigit c then go (c - 48 + 10*i) (rs:ss)
                                   else let rs' = BB.dropWhile (not . isDigit) rs
                                        in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
                    Nothing -&gt; i : go 0 ss
    go _ [] = []
</tt></pre>
<p>This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best <strong>readInt</strong> version, and about 25 times faster than the naive version.  Still, 10MB/s is well below the average hard disk.</p>
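<p>None of the versions here check the range of the parsed values.  If that matters, a checked conversion is easy to add to the <strong>readInt</strong> variants.  The sketch below is only an illustration; <strong>toQual</strong> is a hypothetical helper, not something from the library.</p>
<pre><tt>
import Data.Word (Word8)

-- What fromIntegral does on its own: the value wraps modulo 256.
wrapped :: Word8
wrapped = fromIntegral (300 :: Int)

-- Checked alternative: reject anything that does not fit in a byte.
toQual :: Int -&gt; Either String Word8
toQual v
  | v &gt;= 0 &amp;&amp; v &lt;= 255 = Right (fromIntegral v)
  | otherwise          = Left ("quality value out of range: " ++ show v)
</tt></pre>
<p>The extra comparison costs a little time per value, so whether it belongs in the inner loop is a judgment call.</p>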
<p>So is there more room for improvement?  The most obvious wart to me is the rather artificial splitting into lines.  This is mostly an artifact of some early design decisions, and it should be possible to eliminate the splitting earlier on, saving even more time and simplifying this function quite a bit.</p>
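<p>As a rough idea of what a version without the line splitting could look like, here is a sketch that works on one lazy ByteString and treats every non-digit byte as a separator.  It is untimed and not what the library does; <strong>parseQuals</strong> is a made-up name.</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB
import Data.List (unfoldr)
import Data.Word (Word8)

-- Parse all runs of ASCII digits in the input into single bytes.
-- Values above 255 wrap around, just as in the versions above.
parseQuals :: BB.ByteString -&gt; BB.ByteString
parseQuals = BB.pack . unfoldr step
  where
    isDigit :: Word8 -&gt; Bool
    isDigit c = c &gt;= 48 &amp;&amp; c &lt;= 57
    step s =
      let s' = BB.dropWhile (not . isDigit) s
      in if BB.null s'
           then Nothing
           else let (ds, rest) = BB.span isDigit s'
                    v = BB.foldl' (\acc d -&gt; 10 * acc + (d - 48)) 0 ds
                in Just (v, rest)
</tt></pre>
<p>Unlike the line-based version, this one never emits a value for an empty line or for leading whitespace, since it only produces output when it has seen at least one digit.</p>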
<p>If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.</p>
<p><strong>Edit:</strong> Since some people have asked, I&#8217;ve wrapped up a simple test program along with some test files at <a href="http://malde.org/~ketil/biohaskell/qualparsetest">http://malde.org/~ketil/biohaskell/qualparsetest</a>.  This is a simplified version; if you want to be <em>really</em> helpful, you could always look at <strong>Bio.Sequence.Fasta</strong> in the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Bioinformatics library</a> and see if you can speed up e.g. <em>dephd -i input.fasta input.qual -F /dev/null</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/08/31/parsing-ints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>454 sequencing and parsing the SFF binary format</title>
		<link>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/</link>
		<comments>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/#comments</comments>
		<pubDate>Fri, 14 Nov 2008 13:13:33 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/</guid>
		<description><![CDATA[Roche&#8217;s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude.   Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify &#8212; and maybe also reduce the number or severity [...]]]></description>
			<content:encoded><![CDATA[<p>Roche&#8217;s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude.   Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify &#8212; and maybe also reduce the number or severity of &#8212; errors.  This means reading the binary SFF format. Below, we&#8217;ll dissect the SFF format, and describe a Haskell implementation.</p>
<p><span id="more-30"></span></p>
<p>The SFF file format is documented in the <a href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf" title="GS_FLX manual">GS FLX documentation</a>, pages 445-448.  There also exists an open C implementation in io_lib from the <a href="http://staden.sourceforge.net/" title="Staden package home page at sourceforge">Staden package</a>, in addition to the proprietary implementation provided to Roche&#8217;s customers bundled with the sequencing machine.</p>
<h2>Structure and implementation</h2>
<p>The SFF format is a relatively straightforward one, consisting of a header (the <em>common header</em>), followed by a number of <em>read blocks</em> corresponding to the individual sequences.  Each read block contains a <em>header block</em> and a <em>data block</em>. There is also an optional index, whose format is not defined as part of the SFF format.</p>
<p>Here&#8217;s the direct translation to Haskell:</p>
<blockquote>
<pre>
data SFF = SFF CommonHeader [ReadBlock]

data CommonHeader = CommonHeader {
          magic                                   :: Word32
        , version                                 :: Word32
        , index_offset                            :: Int64
        , index_length, num_reads                 :: Int32
        , cheader_length, key_length, flow_length :: Int16
        , flowgram_fmt                            :: Word8
        , flow, key                               :: ByteString
     }

data ReadHeader = ReadHeader {
      rheader_length, name_length           :: Int16
    , num_bases                             :: Int32
    , clip_qual_left, clip_qual_right
    , clip_adapter_left, clip_adapter_right :: Int16
    , read_name                             :: ByteString
}

data ReadBlock = ReadBlock {
      read_header                :: ReadHeader
    -- The data block
    , flowgram                   :: [Int16]
    , flow_index, bases, quality :: ByteString
    }</pre>
</blockquote>
<p>We can almost get by with these data structures and the straightforward Binary instances.  One slightly complicating feature is that each block must be aligned to an 8-byte boundary.  Another one is that we will in some places need some information (lengths and counts) from previously read data, as none of the arrays are size-prefixed.</p>
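<p>The alignment requirement can be handled with a small helper in the <tt>Get</tt> monad.  The following is a sketch under the assumption that the decoder is started at the beginning of the file, so that <tt>bytesRead</tt> equals the file offset; <tt>padding</tt>, <tt>alignTo8</tt> and <tt>demo</tt> are illustrative names, not necessarily what the library uses.</p>
<blockquote>
<pre>
import Data.Binary.Get (Get, bytesRead, getWord8, runGet, skip)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

-- Number of pad bytes needed to reach the next 8-byte boundary.
padding :: Int64 -&gt; Int64
padding n = (8 - n `mod` 8) `mod` 8

-- Skip forward to the next 8-byte boundary (no-op if already aligned).
alignTo8 :: Get ()
alignTo8 = do
  n &lt;- bytesRead
  skip (fromIntegral (padding n))

-- Read one byte, align, and report the resulting offset.
demo :: Int64
demo = runGet (getWord8 &gt;&gt; alignTo8 &gt;&gt; bytesRead) (BL.replicate 16 0)</pre>
</blockquote>
<p>Calling <tt>alignTo8</tt> at the end of each block parser keeps the padding logic out of the individual field readers.</p>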
<p>Note that the data structures above represent the data on disk very closely; later on, static items like the magic number and version will be removed and hardwired into the code instead.</p>
<h2>Testing and optimization</h2>
<p>While I love to do automated unit testing with QuickCheck, it&#8217;s not so straightforward to generate random instances for complex structures like flowgrams.  So in order to test the program, I added two functions to the library: &#8216;test&#8217; and &#8216;convert&#8217;. The former just prints the information from the CommonHeader and the two first ReadBlocks.  I can then use sffinfo or a similar tool to check that the two programs are in agreement.   The latter reads an SFF file building the SFF data structure, and then serializes it back to a file (modulo the index).</p>
<p>I also wrote a small application, &#8216;flower&#8217;, that reads SFF files and provides various outputs.  The first one is just a table with all the flow values, which will be useful for statistical analysis of these data.  This immediately revealed a performance problem.    (The file contains about 120 000 sequences, each with 168 flow values):</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  265.68s user 1.08s system 99% cpu 4:28.76 total</pre>
<p>The good news is that reading the SFF file is quite fast, the bad news is that formatting the information takes a relatively large amount of time.  Here&#8217;s the offending code:</p>
<blockquote>
<pre>
showread :: CommonHeader -&gt; ReadBlock -&gt; [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map (\(p,c,v) -&gt; printf "%s\t%d\t%s\t%1.2f" rn p [c] (fi v))
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -&gt; Double
fi = (/100) . fromIntegral</pre>
</blockquote>
<p>The main culprit turned out to be &#8216;printf&#8217;, which I suspect needs to interpret the format string every time.  Also, we&#8217;ll need a way to format floating point values.</p>
<blockquote>
<pre>
showread :: CommonHeader -&gt; ReadBlock -&gt; [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map (\(p,c,v) -&gt; concat [rn,"\t",show p,"\t",[c],"\t",fi v]) -- was: printf "%s\t%d\t%s\t%1.2f" rn p [c] (fi v)
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -&gt; String
fi = (\x -&gt; showFFloat (Just 2) x "") . (/100) . fromIntegral</pre>
</blockquote>
<p>Running this gives us:</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  192.65s user 0.69s system 99% cpu 3:15.01 total</pre>
<p>Of course, we&#8217;re still building up Strings &#8212; that is, linked lists of Unicode characters, which is very often a performance problem.  Let&#8217;s try the universal string fastifier: Data.ByteString.  And while we could add a &#8216;pack&#8217; to the floating point formatting, we&#8217;ll try to build it directly:</p>
<blockquote>
<pre>
showread :: CommonHeader -&gt; ReadBlock -&gt; [ByteString]
showread h rd = let rn = read_name $ read_header rd
                in map (\(p,c,v) -&gt; B.concat [rn,t,B.pack (show p),t,B.pack [c],t,fi v])
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

fi :: Int16 -&gt; ByteString
fi i = let (a,x) = i `divMod` 1000
           (b,y) = x `divMod` 100
           (c,d) = y `divMod` 10
           mkdigit = chr . (+48) . fromIntegral
       in B.pack [mkdigit a,mkdigit b,'.',mkdigit c,mkdigit d]</pre>
</blockquote>
<p>The variable t is just a bytestring tab.  Let&#8217;s benchmark it again:</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  55.70s user 0.35s system 99% cpu 56.217 total</pre>
<p>It turns out that floating point formatting is still a problem.   Let&#8217;s replace it with a table lookup:</p>
<blockquote>
<pre>
fi = (!) farray

farray :: Array Int16 ByteString
farray = listArray (0,10000) [B.pack (showFFloat (Just 2) i "") | i &lt;- [0,0.01..99.99::Double]]</pre>
</blockquote>
<p>This is about as fast as it gets:</p>
<pre>
./flower -f ../biolib/DZX0XNV01.sff &gt; /dev/null  39.08s user 0.26s system 99% cpu 39.573 total</pre>
<p>Note that it generates about half a gigabyte of output, or a rate of more than 10MB/s &#8212; faster than wc can count it.  Flower can also extract the reads (Fasta format, takes 0.7 seconds, approximately 200 000 sequences per second), reads with quality (Fastq format, about the same speed), or a histogram of flow value frequencies (somewhat slower at 2.2 seconds).</p>
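<p>A variation on the lookup table is to build it from integer indices, so that the array bounds and the number of entries match exactly and no floating point enumeration is involved.  This is a sketch, assuming flow values in the range 0 to 9999 and the lazy Char8 ByteString; <tt>flowTable</tt> and <tt>fiTable</tt> are made-up names, and I have not timed it against the version above.</p>
<blockquote>
<pre>
import Data.Array (Array, listArray, (!))
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Int (Int16)
import Numeric (showFFloat)

-- One preformatted entry per possible flow value, 0.00 .. 99.99.
flowTable :: Array Int16 B.ByteString
flowTable = listArray (0, 9999)
  [ B.pack (showFFloat (Just 2) (fromIntegral i / 100 :: Double) "")
  | i &lt;- [0 .. 9999 :: Int] ]

-- Formatting is now a single array lookup.
-- Values outside 0..9999 raise an out-of-bounds error.
fiTable :: Int16 -&gt; B.ByteString
fiTable = (flowTable !)</pre>
</blockquote>
<p>The table is built lazily on first use and shared afterwards, so the formatting cost is paid at most once per distinct value.</p>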
<h2>Availability</h2>
<p>The bioinformatics library and the flower source code are available <a href="http://malde.org/~ketil/biohaskell/" title="darcs repositories">here</a>, and there&#8217;s also a <a href="http://malde.org/~ketil/flower" title="flower linux x86">Linux executable</a> for flower, which I may be updating as development progresses.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Optimization again: befuddled by bytestrings</title>
		<link>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</link>
		<comments>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 08:13:01 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</guid>
		<description><![CDATA[I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using Bryan O&#8217;Sullivan&#8217;s Bloom filters.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan&#8217;s</a> <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter" title="bloomfilter at hackageDB">Bloom filters</a>.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for the fact that I saw the expected linear time usage when I ran the code.  Some more digging about revealed that my laptop also showed quadratic scaling.  Profiling showed the  culprit to be a simple pipeline-style function:</p>
<blockquote>
<pre>swords s = take (fromIntegral (seqlength s)+1-k) . map (B.take k) . B.tails . B.map toUpper $ seqdata s</pre>
</blockquote>
<p>Here, seqdata returns a lazy bytestring, which is also what&#8217;s hiding behind the <tt>B.</tt> qualifier.  Basically, this just builds the list of all length-<em>k</em> substrings.  This should, in my opinion, stream nicely, and run in constant space and linear time.  What on earth makes this quadratic?  On only some systems?   Time to dissect the systems in question:</p>
<p><span id="more-28"></span>Comparing our environments, one obvious difference was that the fast system was built using GHC 6.8.1, while the slow ones were 6.8.2.  So we checked with a 6.8.1 snapshot &#8211; no difference, still slow.  We checked the various source repositories involved for some stray patch that somehow had avoided distribution &#8211; nothing turned up. Comparing Cabal setup files didn&#8217;t reveal anything obvious, except that the newer Cabal emitted a lot more information.</p>
<p>Somewhere, something had to be different. Could it be bloomfilter not being a good consumer for the generated words?  The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">bio library</a> messing up FASTA parsing?  Something else entirely?</p>
<p>Looking at the <tt>swords</tt> function, it looks like a good candidate for deforestation and/or fusion.  Could there be some optimization not being performed in some cases, for some strange reasons?  Since I remember <a href="http://www.cse.unsw.edu.au/~dons/papers/CLS07.html" title="Coutts et al (2007): Stream fusion: from lists to streams to nothing at all">fusion being mentioned</a> occasionally in the context of <a href="http://www.cse.unsw.edu.au/~dons/fps.html" title="bytestring home page">bytestring</a>s, I checked the libraries again, and found something I&#8217;d missed previously:  Bytestring 0.9.1 on the fast system, 0.9.0.1 on the slow ones.  While the <a href="http://article.gmane.org/gmane.comp.lang.haskell.cafe/38992">announcement</a> didn&#8217;t promise more than a few percent better performance, no stone could be left unturned.  And this proved to be it: after upgrading bytestring, and recompiling binary, biolib, and bloomfilter to use it, my laptop was as fast as the server.</p>
<p>I&#8217;m not sure exactly what caused this, but apparently there were some issues with bytestring fusion, and fusion is disabled in current bytestring versions. At any rate, the newer version seems safer, so it is now the minimum requirement in <tt>bio.cabal</tt>.</p>
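<p>To check how a particular bytestring version behaves, a stand-alone variant of <tt>swords</tt> can be timed on inputs of growing length.  The sketch below assumes the lazy Char8 module behind <tt>B.</tt>; <tt>kwords</tt> is a made-up name, and it takes the length from the string itself, whereas the original uses the <tt>seqlength</tt> field of the sequence record.</p>
<blockquote>
<pre>
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char (toUpper)
import Data.Int (Int64)

-- All length-k substrings of the upper-cased input, in order.
-- B.length forces the whole input, so this variant does not stream
-- the way a version with a precomputed length can.
kwords :: Int64 -&gt; B.ByteString -&gt; [B.ByteString]
kwords k s = take n . map (B.take k) . B.tails . B.map toUpper $ s
  where n = max 0 (fromIntegral (B.length s) + 1 - fromIntegral k)</pre>
</blockquote>
<p>Doubling the input length should roughly double the running time; if it quadruples instead, the installation shows the quadratic behaviour described above.</p>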
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A plan for Bloom filters</title>
		<link>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</link>
		<comments>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 20:00:44 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</guid>
		<description><![CDATA[Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until Bryan O&#8217;Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on [...]]]></description>
			<content:encoded><![CDATA[<p>Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan</a> posted a <a href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg41876.html" title="[Haskell-cafe] [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters">message</a> to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a <a href="http://book.realworldhaskell.org/beta/bloomfilter.html" title="Real World Haskell: Chapter 27">chapter in the upcoming book</a>.  You can read all about Bloom filters on <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a>, but the executive summary is that it is a structure similar to Data.Set, except that it is probabilistic, and may occasionally claim a value is a member when it&#8217;s not.  On the positive side, the Bloom filter is very fast, and speed is independent of the size &#8212; in other words, lookup and insert are <em>O</em>(1) where Data.Set is <em>O</em>(log <em>n</em>).</p>
<p>Comparing sequences to find similarity is a common occurrence in bioinformatics.  For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index the sequences in question as overlapping substrings (<em>k</em>-tuples, <em>q</em>-grams).  This index seems like an obvious target for Bloom filters &#8212; large data, time critical, some false positives anyway &#8212; but for some reason, there are <a href="http://hpcb.wustl.edu/pubs/mercuryblastn.pdf" title="Buhler et al.: MercuryBLASTN: Faster DNA sequence comparison using a streaming hardware architecture. (unpublished?)">almost</a> no applications that use them. Until now.</p>
<p><span id="more-15"></span></p>
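<p>To make the Data.Set comparison concrete, here is a toy Bloom filter.  It is only meant to show the behaviour (no false negatives, occasional false positives) and has nothing to do with how the bloomfilter package is implemented: the two hash functions are arbitrary polynomial hashes chosen for illustration, and the bit vector is an <tt>Integer</tt>, so inserts copy the whole vector instead of being constant time.</p>
<blockquote>
<pre>
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- Number of bits, and the bits themselves.
data Bloom = Bloom Int Integer

emptyBloom :: Int -&gt; Bloom
emptyBloom m = Bloom m 0

-- Two bit positions per word, from two simple string hashes.
positions :: Int -&gt; String -&gt; [Int]
positions m s = [h 31 7, h 131 13]
  where h base seed = foldl' (\acc c -&gt; (acc * base + ord c) `mod` m) seed s

insertB :: String -&gt; Bloom -&gt; Bloom
insertB s (Bloom m bits) = Bloom m (foldl' setBit bits (positions m s))

-- True for everything inserted; may also be True for other words.
memberB :: String -&gt; Bloom -&gt; Bool
memberB s (Bloom m bits) = all (testBit bits) (positions m s)</pre>
</blockquote>
<p>Both operations touch a fixed number of bits regardless of how many words are stored, which is where the size-independent cost comes from in a real implementation backed by a mutable bit array.</p>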
<h2>Sequence clustering</h2>
<p>Sequence clustering is a commonly used technique which is usually based on sequence similarity.  I&#8217;ve written one sequence clusterer, <em>xsact</em>, which is based on blocks of exact matches. There are many others, and another example is <a href="http://bioinformatics.oxfordjournals.org/cgi/reprint/19/3/421.pdf" title="Burke et al.: ">d2cluster</a>, which is based on occurrence of fixed length words &#8212; which is right up Bloom alley, right?</p>
<p>So a straightforward way to build a Bloom filter based sequence clusterer is to represent each cluster as a set of words &#8212; stored as a Bloom filter.  Now, adding a new sequence to the clusters is a simple matter of extracting the words from the sequence, identifying the cluster(s) containing a sufficient number of these words, and adding the remaining words to that cluster (or the union of the clusters, in the case multiple clusters match).</p>
<p>The interesting thing about this approach is that the whole thing becomes <em>O</em>(<em>kn</em>), for <em>k</em> clusters and data size <em>n</em>.  I think all other clustering algorithms are based on sequence pairs, which makes them <em>O</em>(<em>n²</em>) &#8212; in the worst case, you need to check all pairs. (However, a straightforward similarity-based clustering will have worst-case behavior when no sequences match each other, while suffix-based methods like <em>xsact</em> will have worst case when all sequences match &#8212; so perhaps there is room for a better middle ground?)</p>
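<p>The clustering step described above can be sketched as follows.  To keep the example self-contained it uses <tt>Data.Set</tt> as an exact stand-in for the Bloom filter, and it adds all of the words of the new sequence to the merged cluster; <tt>qwords</tt>, <tt>addSeq</tt> and the <tt>minHits</tt> threshold are illustrative names and parameters, not an existing implementation.</p>
<blockquote>
<pre>
import Data.List (partition, tails)
import qualified Data.Set as S

-- A cluster is represented by the set of words it contains.
type Cluster = S.Set String

-- All overlapping words of length k.
qwords :: Int -&gt; String -&gt; [String]
qwords k s = [w | t &lt;- tails s, let w = take k t, length w == k]

-- Add one sequence: merge it with every cluster sharing at least
-- minHits words, or start a new cluster if none does.
addSeq :: Int -&gt; Int -&gt; String -&gt; [Cluster] -&gt; [Cluster]
addSeq k minHits s clusters = merged : others
  where
    ws                 = S.fromList (qwords k s)
    hits c             = S.size (S.intersection ws c)
    (matching, others) = partition ((&gt;= minHits) . hits) clusters
    merged             = S.unions (ws : matching)</pre>
</blockquote>
<p>Each new sequence is compared against every cluster once, which is the source of the <em>O</em>(<em>kn</em>) behaviour; with real Bloom filters the intersection count would be replaced by membership queries.</p>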
<p>Anyway &#8212; while the above looks promising, there is one snag: EST sequences can occur as a copy of the gene, or, due to various properties of the <a href="http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/" title="Cleaning up sequences">manufacturing process</a>, as the gene&#8217;s <em>reverse complement</em>.  Thus, we need to be able to compare sequences in both directions simultaneously. This could be achieved with a slightly creative hashing function, but to keep things simple, we&#8217;ll stick with the ossified mental sweat in the provided implementation.</p>
<h2>Index and search</h2>
<p>A somewhat related area is indexing and search.  Let&#8217;s say you have a bunch of DNA sequences (one for each chromosome, perhaps), and a set of ESTs, which you&#8217;ll remember are gene fragments in an unpredictable mix of forward and reverse-complement directions.  Here&#8217;s the plan:</p>
<ol>
<li>index by building a Bloom filter containing the <em>q</em>-tuples for each chromosome</li>
<li> for each EST, look up each <em>q</em>-tuple (first forward, then rev.comp.) against the filters, and assign it to the chromosome containing the most <em>q</em>-tuples</li>
<li>align each EST against the designated chromosome using traditional methods</li>
</ol>
<p>As far as I can tell at this point, for word size <em>q</em>, <em>c</em> chromosomes of length <em>m</em> and <em>e</em> ESTs of length <em>n</em>, this should run in something like  <em>O</em>(<em>qn</em>) + <em>O</em>(<em>qenc</em>) + <em>O</em>(<em>emn</em>), compared to just aligning directly at <em>O</em>(<em>ecmn</em>). Note that <em>mn</em> is the big factor here, so on a large scale, we&#8217;re reducing the total work by a factor of <em>c</em>.  (Of course, in real life, <em>nm</em> would be too large, and you&#8217;d use a heuristic, subquadratic alignment step, but this is for illustration purposes.  Try to be a bit generous, will you?)</p>
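<p>The first two steps of the plan can be sketched like this.  Again <tt>Data.Set</tt> stands in for the Bloom filter, sequences are plain <tt>String</tt>s, and all names (<tt>qgrams</tt>, <tt>buildIndex</tt>, <tt>score</tt>, <tt>bestChromosome</tt>) are made up for the example.  The alignment step is left out.</p>
<blockquote>
<pre>
import Data.List (maximumBy, tails)
import Data.Ord (comparing)
import qualified Data.Set as S

type Index = S.Set String   -- stand-in for a Bloom filter

qgrams :: Int -&gt; String -&gt; [String]
qgrams q s = [w | t &lt;- tails s, let w = take q t, length w == q]

revComp :: String -&gt; String
revComp = reverse . map comp
  where comp 'A' = 'T'
        comp 'C' = 'G'
        comp 'G' = 'C'
        comp 'T' = 'A'
        comp c   = c

-- Step 1: index the q-tuples of one chromosome.
buildIndex :: Int -&gt; String -&gt; Index
buildIndex q = S.fromList . qgrams q

-- Step 2: count hits in forward and reverse-complement direction.
score :: Int -&gt; Index -&gt; String -&gt; Int
score q ix est = max (hits est) (hits (revComp est))
  where hits s = length (filter (`S.member` ix) (qgrams q s))

-- Pick the chromosome with most hits (needs a non-empty index list).
bestChromosome :: Int -&gt; [(String, Index)] -&gt; String -&gt; String
bestChromosome q ixs est =
  fst (maximumBy (comparing (\(_, ix) -&gt; score q ix est)) ixs)</pre>
</blockquote>
<p>With a probabilistic filter the hit counts become upper bounds, so a few false positives only matter when two chromosomes score almost the same.</p>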
<h2>Further plans</h2>
<p>That concludes the current plan, but there are certainly improvements that can be made, two obvious ones are</p>
<ul>
<li>hash function that hashes a word and its reverse complement to the same value</li>
<li>break long sequences into partially (1/3) overlapping regions to speed things up even more, as well as give more accurate placements</li>
</ul>
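<p>One simple way to get the first item is to hash a canonical form of each word, for example the lexicographically smaller of the word and its reverse complement.  The sketch below shows the idea with a throwaway polynomial hash; <tt>canonical</tt> and <tt>hashCanonical</tt> are hypothetical names, and a real implementation would plug the canonical word into whatever hash the Bloom filter uses.</p>
<blockquote>
<pre>
import Data.Char (ord)
import Data.List (foldl')

revComp :: String -&gt; String
revComp = reverse . map comp
  where comp 'A' = 'T'
        comp 'C' = 'G'
        comp 'G' = 'C'
        comp 'T' = 'A'
        comp c   = c

-- The same representative for a word and its reverse complement.
canonical :: String -&gt; String
canonical w = min w (revComp w)

-- Any hash of the canonical form is direction-independent.
hashCanonical :: String -&gt; Int
hashCanonical = foldl' (\h c -&gt; h * 31 + ord c) 7 . canonical</pre>
</blockquote>
<p>The price is one extra reverse complement per word at hashing time, in exchange for a single lookup instead of two.</p>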
<p>Another thing that struck me is that in my <a href="http://malde.org/~ketil/biohaskell/xml2x/" title="xml2x -- annotating EST sequences from BLASTX hits">annotation tool</a>, I could probably use this as a faster way to store the set of matching proteins before extracting GO terms.  Currently, the performance is limited by XML parsing, so it&#8217;s probably not worth the bother at the moment.</p>
<p>More is likely to come up, but now it&#8217;s time to implement something!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Optimization week: making Haskell go faster</title>
		<link>http://blog.malde.org/index.php/2008/05/18/optimization-week-making-haskell-go-faster/</link>
		<comments>http://blog.malde.org/index.php/2008/05/18/optimization-week-making-haskell-go-faster/#comments</comments>
		<pubDate>Sun, 18 May 2008 15:26:31 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/05/18/optimization-week-making-haskell-go-faster/</guid>
		<description><![CDATA[It seems to be optimization week on the haskell café mailing list.  Inspired by a combination of Don Stewart&#8217;s blog post about how to optimize for speed and  the sorry performance of my xml2x program, I thought this would be a good time to see if things could be improved.  In its current [...]]]></description>
			<content:encoded><![CDATA[<p>It seems to be optimization week on the haskell café mailing list.  Inspired by a combination of Don Stewart&#8217;s <a href="http://cgi.cse.unsw.edu.au/~dons/blog/2008/05" title="Write Haskell as fast as C">blog post</a> about how to optimize for speed and  the sorry performance of my <a href="http://malde.org/~ketil/biohaskell/xml2x" title="xml2x - post processor for BLAST results">xml2x</a> program, I thought this would be a good time to see if things could be improved.  In its current state, xml2x takes a few hours to process a few days&#8217; worth of BLASTX output, so it&#8217;s far from critical, but faster is always better, and reading Don&#8217;s post, the intermediate output from the compiler, a.k.a. <em>ghc core</em>, didn&#8217;t really look so scary.<span id="more-10"></span></p>
<h2>Profiling with Haskell</h2>
<p>First we need to identify the actual bottlenecks, using GHC&#8217;s profiling options.  This means compiling the program with <tt>-prof -auto-all</tt>, and the <tt>Makefile</tt> shipped with xml2x already has a target for this.  (I could probably do it with Cabal as well, but I never bothered to find out.  Lazy me.)  A &#8220;feature&#8221; of the <a href="http://malde.org/~ketil/biohaskell/biolib">bioinformatics library</a> is that its Cabal file contains this line:</p>
<p><code>ghc-options:         -Wall -O2 -fexcess-precision -funbox-strict-fields -auto-all<br />
</code></p>
<p>The final &#8220;auto-all&#8221; ensures that when the library gets compiled with profiling support, cost centres will be assigned to its internal functions too, thus the profiling run will reveal where time is being spent in more detail.  This practice is probably frowned upon by people who are into that kind   of thing, but here it is crucial, as the profile reveals this:</p>
<blockquote>
<pre>
Fri May 16 18:28 2008 Time and Allocation Profiling Report  (Final)

xml2x.p +RTS -p -sxml2x.p2.stats -RTS -v -C --reg --go-def=GO.terms_and_ids
  --anno=gene_association.goa_uniprot small.xml -o small3.csv

total time  =     4073.12 secs   (203656 ticks @ 20 ms)
total alloc = 1082,832,292,360 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

mkAnn                          Bio.Sequence.GOA        60.8   56.2
readXML                        Bio.Alignment.BlastXML  27.4   38.4
hsp2match                      Bio.Alignment.BlastXML   3.1    1.7
readGOA                        Bio.Sequence.GOA         1.9    0.4
hit2hit                        Bio.Alignment.BlastXML   1.6    1.8
protTerms                      Main                     1.6    0.0
sequence'                      Bio.Util                 1.1    0.2</pre>
</blockquote>
<p>In short, the function named mkAnn consumes almost two thirds of the time here, and is an obvious target for closer inspection.  The function looks like this:</p>
<blockquote>
<pre>
data Annotation = Ann !UniProtAcc !GoTerm !EvidenceCode deriving (Show)
newtype GoTerm = GO Int deriving (Eq,Ord)
type UniProtAcc = ByteString

mkAnn :: ByteString -&gt; Annotation
mkAnn = pick . B.words
    where pick (_db:up:rest) = pick' up $ findGo rest
          pick' up' (go:_:ev:_) = Ann (copy up') (read $ unpack go) (read $ unpack ev)
          findGo = dropWhile (not . B.isPrefixOf (pack "GO:"))</pre>
</blockquote>
<p>This is a slightly clunky way of extracting a bit of information from a line of text; the text in question is <a href="http://www.geneontology.org/">GOA</a>, a mapping between proteins and standard Gene Ontology terms.  It&#8217;s a text file with about 20 million lines, so we use lazy bytestrings to deal with it.  Each line is broken into words; the second word (&#8220;up&#8221;) is the protein name.  Then, a variable number of words later, the GO term itself appears, and the final item we want is the second word after the GO term.  Note the <code>copy</code>, which ensures that we don&#8217;t keep a large part of the input file around.</p>
<p>So what&#8217;s wrong here?  Asking in the café, Don suggested that unpacking bytestrings is a less than recommendable practice.  Still, I find it is a useful idiom, and the <code>read $ unpack s</code> should garbage-collect the characters as soon as they are consumed.  Ideally, the string should be deforested or fused away.  I happily agree that this is a lazy programmer&#8217;s approach and not optimal &#8211; but on the other hand, it shouldn&#8217;t be <em>that</em> costly?</p>
<h2>GHC core to the rescue &#8211; not.</h2>
<p>This is the stage where I looked at GHC core.  I guess it&#8217;s one of those things that looks easy when professionals do it, but when you get home and try it on your own, you just have no idea.  The three-line function above resulted in four pages of cryptic core output.   Call me lazy and ignorant, but I decided to take just one more little peek at the source first.</p>
<p>Let&#8217;s check out those read-unpack constructs.  The read instance for <code>GoTerm</code> checks that the string starts with &#8220;GO:&#8221;, and then reads an <code>Int</code>.  The second reads an <a href="http://www.geneontology.org/GO.evidence.shtml"><code>EvidenceCode</code></a>, and since these are defined as short abbreviations in all upper case, I just wrote a data type with corresponding nullary constructors, and derived <code>Read</code> for it.</p>
<blockquote>
<pre>
data EvidenceCode = IC  -- Inferred by Curator
	          | IDA -- Inferred from Direct Assay
	          | IEA -- Inferred from Electronic Annotation
	          | IEP -- Inferred from Expression Pattern
	          | IGC -- Inferred from Genomic Context
	          | IGI -- Inferred from Genetic Interaction
	          | IMP -- Inferred from Mutant Phenotype
	          | IPI -- Inferred from Physical Interaction
	          | ISS -- Inferred from Sequence or Structural Similarity
	          | NAS -- Non-traceable Author Statement
	          | ND  -- No biological Data available
	          | RCA -- inferred from Reviewed Computational Analysis
	          | TAS -- Traceable Author Statement
	          | NR  -- Not Recorded
     deriving (Read,Show,Eq)</pre>
</blockquote>
<p>Maybe this wasn&#8217;t such a good idea after all?  Let&#8217;s try with custom, bytestring-based parsers for EvidenceCode and GoTerm:</p>
<blockquote>
<pre>
getEC :: ByteString -&gt; EvidenceCode
getEC s = case B.uncons s of
            Just ('I',s') -&gt; case B.uncons s' of
                               Just ('C',_) -&gt; IC
                               Just ('D',_) -&gt; IDA
                               Just ('E',s'') -&gt; case B.head s'' of 'A' -&gt; IEA
                                                                    'P' -&gt; IEP
                                                                    _ -&gt; e 1
                               Just ('G',s'') -&gt; case B.head s'' of 'C' -&gt; IGC
                                                                    'I' -&gt; IGI
                                                                    _ -&gt; e 2
                               Just ('M',_) -&gt; IMP
                               Just ('P',_) -&gt; IPI
                               Just ('S',_) -&gt; ISS
                               _ -&gt; e 3
            Just ('N',s') -&gt; case B.head s' of 'A' -&gt; NAS
                                               'D' -&gt; ND
                                               'R' -&gt; NR
                                               _ -&gt; e 4
            Just ('R',_) -&gt; RCA
            Just ('T',_) -&gt; TAS
            _ -&gt; e 5
    where e :: Int -&gt; a
          e n = error ("Illegal GO evidence code ("++show n++"): "++unpack s)

getGo :: ByteString -&gt; GoTerm
getGo bs = GO $ fst $ maybe e id (B.readInt $ B.drop 3 bs)
    where e = error ("Unable to parse GO term: "++unpack bs)</pre>
</blockquote>
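<p>One thing to note is that <code>getGo</code> simply drops the first three characters without checking that they are &#8220;GO:&#8221;, unlike the <code>Read</code> instance it replaces.  A minimal sketch of a stricter variant might look like the following; the name <code>parseGo</code>, the <code>Maybe</code> result and the derived <code>Show</code> instance are assumptions made for illustration, not part of the library:</p>
<blockquote>
<pre>
import qualified Data.ByteString.Lazy.Char8 as B

-- Same shape as the library's GoTerm; Show is derived here only for the sketch.
newtype GoTerm = GO Int deriving (Eq, Ord, Show)

-- Hypothetical total variant of getGo: succeeds only if the input is
-- "GO:" followed by an integer and nothing else.
parseGo :: B.ByteString -&gt; Maybe GoTerm
parseGo bs
  | B.pack "GO:" `B.isPrefixOf` bs =
      case B.readInt (B.drop 3 bs) of
        Just (n, rest) | B.null rest -&gt; Just (GO n)
        _                            -&gt; Nothing
  | otherwise = Nothing</pre>
</blockquote>
<p>With this, <code>parseGo (B.pack "GO:0005524")</code> gives <code>Just (GO 5524)</code>, while input with a different prefix or trailing junk gives <code>Nothing</code> instead of a runtime error.  The extra prefix check costs a three-byte comparison per line, so it is unlikely to matter next to the rest of the parsing, but that is a guess until it has been profiled.</p>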
<p>Time to re-run with profiling:</p>
<blockquote>
<pre>
Sat May 17 19:25 2008 Time and Allocation Profiling Report  (Final)

xml2x.p +RTS -sxml2x.2pp.stats -p -RTS -v -C --reg --go-def=GO.terms_and_ids
    --anno=gene_association.goa_uniprot small.xml -o small5p.csv

        total time  =     2022.36 secs   (101118 ticks @ 20 ms)
        total alloc = 622,807,651,828 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

readXML                        Bio.Alignment.BlastXML  48.6   66.7
mkAnn                          Bio.Sequence.GOA        29.8   23.6
hsp2match                      Bio.Alignment.BlastXML   5.2    3.0
readGOA                        Bio.Sequence.GOA         3.6    0.7
hit2hit                        Bio.Alignment.BlastXML   3.2    3.0
protTerms                      Main                     3.1    0.0
sequence'                      Bio.Util                 2.1    0.4
countIO                        Bio.Util                 1.4    0.6
getFrom                        Bio.Alignment.BlastXML   1.1    0.7</pre>
</blockquote>
<p>We see that <code>mkAnn</code> has fallen way behind the XML parsing in time consumption, and until Neil gets around to doing a bytestring version of his otherwise excellent <a href="http://www-users.cs.york.ac.uk/~ndm/tagsoup/" title="Tagsoup - tolerant XML parsing">tagsoup</a> library, there isn&#8217;t all that much left to do on that front.  Total time has been cut in half, from over an hour to 35 minutes or so.  Still, <code>mkAnn</code> accounts for about 30% of the run time, so there is more to shave off; perhaps you can suggest the remaining culprit?  My guess would be the <code>findGo</code> function.  How should it be rewritten?</p>
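<p>One possible direction, sketched under an unverified assumption: if the GOA lines are tab-separated with fields in fixed columns (GO term in the fifth column, evidence code in the seventh), then splitting on tabs keeps empty fields in place, and the wanted fields can be picked by position without scanning for the &#8220;GO:&#8221; prefix.  The helper name <code>goaFields</code> and the column layout are assumptions, and should be checked against the actual file:</p>
<blockquote>
<pre>
import qualified Data.ByteString.Lazy.Char8 as B

-- Hypothetical helper: return (protein accession, GO term, evidence code)
-- as raw fields, picked by column position.  Assumes tab-separated input
-- where empty columns are preserved as empty fields.
goaFields :: B.ByteString -&gt; Maybe (B.ByteString, B.ByteString, B.ByteString)
goaFields line =
  case B.split '\t' line of
    (_db : up : _sym : _qual : go : _ref : ev : _) -&gt; Just (up, go, ev)
    _                                              -&gt; Nothing</pre>
</blockquote>
<p>The three fields could then be handed to <code>copy</code>, <code>getGo</code> and <code>getEC</code> as before.  Whether this is actually faster than <code>B.words</code> followed by <code>dropWhile</code> would have to be confirmed with another profiling run.</p>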
<h2>Lessons learned</h2>
<p>Profiling is nice for pinpointing the hotspots.  Bytestrings are <em>the</em> way to go for fast code.  GHC generates crappy derived parsers for data types, perhaps especially so for data types with many constructors.  Writing manual parsers is tedious, but a bit of tedium can go a long way.</p>
<p>Unfortunately, I am still not entirely happy with xml2x.  The profiling numbers lie, as they don&#8217;t, I believe, include GC time.  This particular program uses an inordinate amount of time collecting garbage.  I <em>suspect</em> it is because I store GO terms as 16 bit integers in unboxed arrays in an insufficiently strict Data.Map, but I will have to investigate this more closely.</p>
<p>But not today.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/05/18/optimization-week-making-haskell-go-faster/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
