<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell &#187; Examples</title>
	<atom:link href="http://blog.malde.org/index.php/category/examples/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Tue, 20 Jul 2010 15:04:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Tools for pyrosequencing analysis</title>
		<link>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/</link>
		<comments>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:28:27 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=98</guid>
		<description><![CDATA[I recently did a brief presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.
flowers
]]></description>
			<content:encoded><![CDATA[<p>I recently did a brief<img class="alignright" title="Hokusai poppies" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Hokusai_Poppies.jpg/800px-Hokusai_Poppies.jpg" alt="" width="260" height="179" /> presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.</p>
<p><a title="A presentation of tools for working with SFF files" href="http://blog.malde.org/wp-content/uploads/2010/02/flowers2.pdf">flowers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Searching for poly(A) tails</title>
		<link>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/</link>
		<comments>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:03:28 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=65</guid>
		<description><![CDATA[I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too trivial?  So, like many other &#8220;trivial&#8221; tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.</p>
<p>Here&#8217;s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.</p>
<p><span id="more-65"></span></p>
<h3>A quick introduction</h3>
<p>Although the definitions of what constitutes a gene vary considerably, we&#8217;ll use the term to refer to a region of DNA that gets <em><a title="Wikipedia entry for &quot;transcription&quot;" href="http://en.wikipedia.org/wiki/Transcription_%28genetics%29">transcribed</a></em>, that is, copied from DNA into an <a title="Wikipedia entry for &quot;Messenger RNA&quot;" href="http://en.wikipedia.org/wiki/Messenger_RNA">mRNA</a> molecule, which in turn will be used as a blueprint for assembling a protein.  After transcription, the mRNA molecule undergoes <a title="Wikipedia entry for &quot;polyadenylation&quot;" href="http://en.wikipedia.org/wiki/Polyadenylation"><em>polyadenylation</em></a>, a process where a string of adenines (the &#8216;A&#8217; of the nucleotide alphabet) gets appended to the mRNA molecule before it is exported from the nucleus.</p>
<p>Identifying poly-A tails is important for several reasons.</p>
<ol>
<li>It positively identifies the end of the transcript.  If you don&#8217;t have a poly-A in your sequence, you have no way to know how far the molecule extends beyond the end of the sequence.  You can also find alternatively terminated transcripts this way.</li>
<li>It marks the boundary between transcript and foreign sequence.  Anything after the poly-A tail is linker or vector sequence, and can safely be trimmed off, even if, as is often the case, it is of too low quality to be recognized by your average vector masking software.</li>
<li>It provides useful information about the transcript, as the poly-A tail is important for things like protecting the mRNA from degradation.</li>
</ol>
<p>Unfortunately, utilities often trim poly-A tails by default (e.g. SeqClean), or just ignore them (e.g. BLAST&#8217;s low-complexity filter).</p>
<h3>Quality based alignment</h3>
<p>When a molecule is sequenced, the analog output from the sequencing machine is stored as a <em>chromatogram</em>.  In order to be useful, the sequence is <em>called</em>, that is, translated to a string of letters from the familiar {A,C,G,T} nucleotide alphabet.  In addition, the base caller will associate each letter with a <em>quality value</em>.  This is derived from an estimate of the probability of the call being incorrect, and for quality value <em>Q</em>, the error probability estimate is</p>
<pre style="padding-left: 30px;">ϵ = 10<sup>-Q/10</sup></pre>
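<p>As a quick sanity check of the formula, here is a minimal Haskell sketch that turns a quality value into its error probability estimate.  The name <em>errorProb</em> is made up for this illustration and is not part of any library.</p>
<pre style="padding-left: 30px;">-- Error probability estimate for a quality value q (illustrative helper).
errorProb :: Double -&gt; Double
errorProb q = 10 ** (-q / 10)

main :: IO ()
main = mapM_ (\q -&gt; putStrLn (show q ++ " -&gt; " ++ show (errorProb q))) [10, 20, 30]</pre>
<p>So Q=10 corresponds to one expected error in ten calls, Q=20 to one in a hundred, and Q=30 to one in a thousand.</p>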
<p>Traditionally sequence alignment simply aligns the string of characters using a fixed positive score (reward) for aligning similar characters, and fixed negative scores (penalties) for either substituting a different character, opening a gap, or extending a previous gap.</p>
<p>However, taking into account the quality value, <a title="My paper on using sequence quality to improve alignments." href="http://bioinformatics.oxfordjournals.org/cgi/content/full/24/7/897">we can do better</a>, and instead of fixed scores, we can adjust the scores dynamically according to quality.</p>
<p>Using this method, the penalty for e.g. aligning two different characters will depend on the quality of the characters: high quality means a high penalty, low quality &#8212; lower penalty (since there&#8217;s a greater chance one of them was incorrectly called).</p>
<h3>Scoring of alignments</h3>
<p>When calculating the score of an alignment, we really want to answer the question: how likely is this sequence to be a real poly-A sequence, as opposed to just a random string?  In other words, we are comparing our sequence against two models: the poly-A model, and the background model. Our score will use the <em>ratio</em> of probabilities of the string being produced by the two models.</p>
<p>For the poly-A model, only As are allowed, so the probability of a character occurring is 1 for As and 0 for the others.  For the background model, we&#8217;ll just take a uniform distribution of nucleotides, each getting a probability of 0.25.</p>
<p>Using this scheme, each A in a string s contributes a factor of 1/0.25 = 4 to the score, and every other character a factor of 0/0.25 = 0.  We usually work with the logarithm of these numbers to make them more manageable.</p>
<p>The optimal alignment is then simply the longest run of As, since as soon as you multiply with a zero (or add -infinity, if you use <em>log</em>-scores), you lose the whole score.</p>
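<p>Under this simplified model, finding the tail boils down to finding the longest run of As.  A minimal sketch, assuming the sequence is a plain <em>String</em> (the function name <em>longestARun</em> is hypothetical, chosen for this example), which returns the 0-based start position and the length of that run:</p>
<pre style="padding-left: 30px;">import Data.Char (toUpper)
import Data.List (group)

-- (start, length) of the longest run of 'A' (case-insensitive), if any.
-- On ties, the earliest run wins.
longestARun :: String -&gt; Maybe (Int, Int)
longestARun s
  | null runs = Nothing
  | otherwise = Just (foldr1 better runs)
  where
    gs     = group (map toUpper s)            -- runs of identical characters
    starts = scanl (+) 0 (map length gs)      -- start offset of each run
    runs   = [ (st, length g) | (st, g) &lt;- zip starts gs, take 1 g == "A" ]
    better a b = if snd a &gt;= snd b then a else b

main :: IO ()
main = print (longestARun "ccgtAAAAggAAAAAAAA")</pre>
<p>A single miscalled base splits a run in two here, which is exactly the weakness the quality-adjusted scores are meant to address.</p>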
<h3>Adding quality to the mix</h3>
<p>Of course, the actual sequence isn&#8217;t perfect, and even the poly-A tail is likely to contain the odd G, C, or T.  To determine exactly <em>how</em> likely is where the quality value enters the picture. Using the formula above,  we can calculate the error estimate and decide what the penalty for a mismatch and reward for a match should be.</p>
<p>For the poly-A model, the probability for a match (that is, an actual &#8216;A&#8217; in the sequence) is <em>1-ϵ</em>, and the probability of a mismatch (a non-A) is <em>ϵ/3</em> (since only one of the three possible substitutions is an A, and for simplicity, we give them equal probability).  Dividing these by the background probability of 0.25, taking logarithms, and using the formula for <em>ϵ</em> as a function of <em>Q</em> (and hopefully not introducing any errors), I get the scores to be:</p>
<pre style="padding-left: 30px;">match q = log (4*(1-1/10**(q/10)))
mismatch q = log 4 - log 3 - q/10*log 10</pre>
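<p>To get a feel for the numbers, the two formulas can be evaluated for a few quality values.  This is just the definitions above wrapped in a small program; natural logarithms are assumed, as with Haskell&#8217;s <em>log</em>.</p>
<pre style="padding-left: 30px;">match, mismatch :: Double -&gt; Double
match q    = log (4 * (1 - 1 / 10 ** (q / 10)))
mismatch q = log 4 - log 3 - q / 10 * log 10

main :: IO ()
main = mapM_ row [10, 20, 30]
  where row q = putStrLn (show q ++ ": match " ++ show (match q)
                                   ++ ", mismatch " ++ show (mismatch q))</pre>
<p>At Q=20, a match scores about 1.38 and a mismatch about -4.32.  Each further ten points of quality makes a mismatch about 2.3 (that is, log 10) more costly, while the match score barely moves.</p>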
<p>Now, we can use this to do a standard Smith-Waterman alignment, calculating a dynamic programming matrix, and searching for an optimal local alignment.</p>
<p>However, since we&#8217;re aligning against a repeated nucleotide, there&#8217;s no real need for a second dimension, and we can use the following recurrence to calculate the &#8220;polyA-score&#8221; <em>M</em> for each position <em>i</em>:</p>
<p style="padding-left: 30px;"><em>M<sub>i</sub> = max (0, S<sub>i</sub> + M<sub>i-1</sub>)</em></p>
<p>To implement this, we first define the list of scores by applying match and mismatch to the list of (nucleotide,quality) pairs.  We also define a scanl-based function to calculate a list of cumulative scores:</p>
<pre style="padding-left: 30px;">scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0</pre>
<p>The only remaining thing is to identify the maximal value, which marks the end of the poly-A tail, and the corresponding 0 value that indicates the start.   I wrote a recursive function called &#8220;findmax&#8221; for this, but a better programmer will probably be able to do this with a fold.</p>
<p>Including the parts discussed briefly above, the whole thing looks like this:</p>
<pre style="padding-left: 30px;">-- Returns the 1-based, inclusive (start,end) positions of the best-scoring
-- poly-A stretch, or Nothing if its score does not exceed the threshold.
findPolyA :: Sequence Nuc -&gt; Maybe (Int,Int)
findPolyA (Seq _ d mq) =
    let qd = zip (B.unpack d) (maybe (repeat 15) BB.unpack mq)
        scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
        match x' = let x = fromIntegral x' in log (4*(1-1/10**(x/10)))
        mismatch x' = let x = fromIntegral x' in log 4 - log 3 - x/10*log 10
        cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0
        (zi,mi,maxscore) = findmax $ cumulative scores
    in if maxscore &gt; 12 then Just (zi+1,mi) else Nothing  -- arbitrary constant alert!

-- Returns (index of the last zero before the maximum, index of the maximum, maximum).
findmax :: [Double] -&gt; (Int,Int,Double)
findmax = go 0 (0,0,0) . zip [0..]
    where go _ cm [] = cm
          go _ cm ((i,0):rest) = go i cm rest
          go last_z (cmz,cmi,cmx) ((i,x):rest) =
              if x &gt; cmx then go last_z (last_z,i,x) rest
                         else go last_z (cmz,cmi,cmx) rest</pre>
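<p>The fold alluded to above could look something like the following sketch.  It threads the position of the most recent zero together with the best (start, end, score) triple seen so far, and is meant to behave like the recursive findmax.  The name <em>findmaxFold</em> is only for illustration.</p>
<pre style="padding-left: 30px;">import Data.List (foldl')

findmaxFold :: [Double] -&gt; (Int,Int,Double)
findmaxFold = snd . foldl' step (0,(0,0,0)) . zip [0..]
  where
    step (lastZ, best@(_,_,bestX)) (i,x)
      | x == 0    = (i, best)              -- a zero marks a possible new start
      | x &gt; bestX = (lastZ, (lastZ,i,x))   -- new maximum: remember start and end
      | otherwise = (lastZ, best)

main :: IO ()
main = print (findmaxFold [0,1,2,0,1,2,3,1])</pre>
<p>For the toy input above this prints (3,6,3.0): the best stretch starts after the zero at index 3 and peaks at index 6.</p>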
<h3>Availability</h3>
<p>This method is implemented in a simple tool called &#8220;trimpolya&#8221; (<a title="Darcs repository for 'trimpolya'" href="http://malde.org/~ketil/biohaskell/trimpolya">darcs repo</a>), and also in the more general &#8220;dephd&#8221; (<a title="Darcs repository for 'dephd'" href="http://malde.org/~ketil/biohaskell/dephd">darcs</a>, <a title="Dephd at HackageDB" href="http://hackage.haskell.org/package/dephd">hackage</a>) sequence analysis package.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Installing the software on Ubuntu 9.10</title>
		<link>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</link>
		<comments>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 13:26:08 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Installation]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</guid>
		<description><![CDATA[Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread,  taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around, into the great unknown, but it [...]]]></description>
			<content:encoded><![CDATA[<p>Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread,  taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around, into the great unknown, but it all went even smoother than the previous one.  And on the plus side, ghc is now, <em>finally</em>, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well.  Great work by Joachim Breitner and his <a href="http://lists.debian.org/debian-haskell/" title="debian-haskell mailing list">army of debianizers</a>.  So I&#8217;m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed.  I&#8217;ll make notes here as I go along, and hopefully this will be useful also for users of other Linux distributions.</p>
<p><span id="more-54"></span></p>
<h3>Installing biolib</h3>
<p>First I need to install the bioinformatics library.  I&#8217;m about to release 0.4.1, but this is also a good opportunity to check that everything works with 0.4 (which is what you&#8217;ll find on Hackage), so let&#8217;s do that first.  Using darcs, I pull the repo up to the 0.4 tag (but you can of course get the tarball from <a href="http://hackage.haskell.org/package/bio">Hackage</a>):</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
Setup.hs: At least the following dependencies are missing:<br />
QuickCheck &lt;2, binary -any<br />
</code></p>
<p>(Side note: you may notice that I run <tt>Setup.hs</tt> directly, as opposed to using <tt>runhaskell</tt>.  I prefer it this way, but you may have to do a <tt>chmod +x Setup.hs</tt> if you downloaded this from the darcs repository or similar.)</p>
<p>Since we want to use the system libraries as far as possible, these libraries are just an apt-get away:</p>
<p><code>sudo apt-get install libghc6-quickcheck1\*<br />
sudo apt-get install libghc6-binary-\*<br />
</code></p>
<p>Now, let&#8217;s try again:</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
% ./Setup.hs build<br />
Preprocessing library bio-0.4...<br />
Building bio-0.4...<br />
Binary: Int64 truncated to fit in 32 bit Int<br />
ghc: panic! (the 'impossible' happened)<br />
(GHC version 6.10.4 for i386-unknown-linux):<br />
Prelude.chr: bad argument</code></p>
<p><code>Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug</code></p>
<p>Okay, this was not what was supposed to happen.  As always, dropping to #haskell on IRC is the first thing to do, and sure enough:</p>
<p><code>&lt;sereven&gt; ketil: that's also shown up for xmonad users when .hi and .o files weren't cleaned<br />
between rebuilds mixing different versions. usually between ghc updates IIRC.<br />
</code></p>
<p>Let&#8217;s try to get rid of old cruft lying about, polluting directories:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>Sure enough, this time it worked.  For good measure, we&#8217;ll run the unit tests:</p>
<p><code>make test</code></p>
<p>After a zillion tests, we notice that everything is go, great!</p>
<h3>Applications</h3>
<p>Next, it is time to go through the list of bioinformatics applications.  Since my working directory is a mess of branches and versions, we&#8217;ll just go over the published applications and versions on Hackage.</p>
<p><strong>xsact</strong> is an application to do sequence (in particular EST sequence) clustering.  It predates and thus doesn&#8217;t actually use the bioinformatics library, but we&#8217;ll check it anyway.  So we try the familiar command line:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>And things compile.  However, the version on Hackage is outdated, so we&#8217;ll upload a new version, 1.7.1.  One test case still fails, but I can&#8217;t imagine anybody is using it to generate Newick-formatted trees &#8212; I am certainly not &#8212; and since there are many equally correct outputs (including tree rotations and rounding modes), output is likely correct anyway.  Holler if you need it!</p>
<p><strong>rbr</strong> is an application to mask repeats in sequence data.  Normally, this is done using a library of known repeats, but this application tries to do it using statistics, making the &#8212; I think justifiable &#8212; assumption that repeats are going to be more common than non-repeats. The version on Hackage is old, and only works with the library prior to 0.4, so again this is a good time to push the latest changes out into the limelight.  Compiling this works great, by the way.</p>
<p><strong>cluster_tools</strong> is a package that contains a bunch of binaries, useful for working with the results of sequence clustering, including extracting various information from ACE files.  This uses another library, called <em>simpleargs</em>, that simplifies command line argument parsing for simple cases.  Again, the Hackage version is for bio&lt;0.4, so a new version will be pushed.  At the same time, we make a mental note to push version 0.2 of simpleargs to Hackage as well, instead of keeping age-old modifications buried forever.</p>
<p><strong>dephd</strong> is my Swiss-army-knife of sequence analysis, and lets you do various things like converting between formats, plotting and trimming by quality.  This is a more live project than most of the others (I&#8217;m currently working on improved quality trimming and automatic generation of files for submission to GenBank), but the currently available version also compiles without incident.</p>
<p><strong>estreps</strong> is a couple of programs I needed for repeat analysis, perhaps not tremendously interesting, but at least <tt>rselect</tt>, which lets you select randomized subsets from Fasta files, might be of interest to some?  We try the usual invocation to compile, and get:</p>
<p><code>src/Unigene.hs:24:23:<br />
Couldn't match expected type `a' against inferred type `Unknown'<br />
`a' is a rigid type variable bound by<br />
the type signature for `clusters' at src/Unigene.hs:22:41<br />
</code></p>
<p>This error arises from the phantom types for identifying sequences that were introduced in bio 0.4.  Unfortunately, this version of <em>estreps</em> contains some adaptation to this model, so it won&#8217;t compile against older versions either.  So it looks like yet another sdist for Hackage.  Look for version 0.3.1.</p>
<p><strong>flower</strong> is a utility for extracting information from SFF files (containing sequences from Roche&#8217;s 454 machines).  Although a new version is around the corner, the old 0.2 just works.</p>
<p><strong>xml2x</strong> is a utility for converting BLAST results in XML format into CSV, which somehow is more compatible with biologists.  Trying to compile it fails, with the following error:<br />
<code><br />
src/Xml2X.hs:152:49:<br />
Couldn't match expected type `[b]'<br />
against inferred type `Maybe Bio.Sequence.KEGG.KO'<br />
In the first argument of `concatMap', namely `(flip M.lookup ks)'<br />
In the first argument of `($)', namely<br />
`concatMap (flip M.lookup ks)'<br />
In the second argument of `($)', namely<br />
`concatMap (flip M.lookup ks) $ map chop $ map subject fs'<br />
</code></p>
<p>It turns out that somewhere along the way, the <tt>lookup</tt> function from <tt>Data.Map</tt> was <a href="http://hackage.haskell.org/trac/ghc/ticket/2309" title="GHC ticket">de-generalized</a> from working on arbitrary monads to just returning a <tt>Maybe</tt>.  I was using this to return a list, using the empty list to signal an unsuccessful lookup.  This is easily remedied, but that means yet another sdist for Hackage.</p>
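<p>One way to restore the old behaviour is to turn the <tt>Maybe</tt> back into a list.  The sketch below uses a toy map with <tt>String</tt> values in place of the real KEGG types, so the names <tt>ks</tt>, <tt>lookupAll</tt> and <tt>lookupAll'</tt> are purely illustrative and not taken from xml2x.</p>
<pre>import qualified Data.Map as M
import Data.Maybe (mapMaybe, maybeToList)

-- Toy stand-in for the real lookup table.
ks :: M.Map String String
ks = M.fromList [("gene1", "K00001"), ("gene3", "K00003")]

-- Old style, repaired: each failed lookup contributes an empty list.
lookupAll :: [String] -&gt; [String]
lookupAll = concatMap (maybeToList . flip M.lookup ks)

-- Equivalent and shorter: mapMaybe drops the failed lookups directly.
lookupAll' :: [String] -&gt; [String]
lookupAll' = mapMaybe (flip M.lookup ks)

main :: IO ()
main = print (lookupAll ["gene1", "gene2", "gene3"])</pre>
<p>Both variants skip the missing key and print ["K00001","K00003"] for this input.</p>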
<p><strong>korfu</strong> is a utility for identifying open reading frames in sequence data.  It hasn&#8217;t yet been ported to version 0.4 of the library, but works if you install 0.3.5.  I updated this too, since it didn&#8217;t have a category.  Now it too resides in the bioinformatics section.</p>
<h3>Summing up</h3>
<p>In retrospect, it seems like a good idea to give old code a thorough spring cleaning once in a while.  Although nothing really critical or difficult happened, a good number of small annoyances were discovered, and a bunch of new sdists are now ready to be uploaded to Hackage.   Next will be converting all this into debian packages.</p>
<p>The important question is of course, how do we avoid this in the future?  During development, it is important to be able to modify libraries and applications, but installing a new version of the biolib, say, overwrites the old one, and suddenly I&#8217;m compiling and testing everything against a different library than Joe Random Hackage User is going to find.  I have some thoughts on how to avoid this, but if you have a method that works nicely, I&#8217;m all ears.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A (too) brief Biohaskell presentation</title>
		<link>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</link>
		<comments>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 12:27:00 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</guid>
		<description><![CDATA[I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that I managed to communicate some of the ideas, and submit the slides and other material here for posterity.  (I&#8217;m happy to receive comments, too, just in case I&#8217;ll do a revised version of the talk later on).</p>
<ul>
<li> <a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.pdf" title="biohaskell slides">My biohaskell slides</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.tex" title="slides’ LaTeX source, using beamer.cls">slides’ LaTeX source, using beamer.cls</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/lpssm.ps" title="Lazy PSSM  paper">Lazy PSSM JFP-paper</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing ints</title>
		<link>http://blog.malde.org/index.php/2009/08/31/parsing-ints/</link>
		<comments>http://blog.malde.org/index.php/2009/08/31/parsing-ints/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 13:06:35 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/08/31/parsing-ints/</guid>
		<description><![CDATA[A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/f/ff/Cempedak_opened1.JPG" alt="Artocarpus integer" align="right" width="376" height="250" />A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the <em>quality</em> data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with <strong>readFasta</strong> from <strong>Bio.Sequence.Fasta</strong>) processes at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using <strong>readFastaQual</strong>) was much slower, about 2-3 MB/s.   After some investigation and a few rewrites, it&#8217;s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on them even further.</p>
<p><span id="more-48"></span></p>
<h3>The original version</h3>
<p>I&#8217;ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn&#8217;t necessary.  Basically, the task is to take a list of ByteString input lines consisting of whitespace separated decimal integers, and build a ByteString consisting of single byte quality values corresponding to those integers.</p>
<p>Below is how the naive quality parsing function might look, unpacking each word to <strong>String</strong> in order to use <strong>read</strong>.  Note that <strong>B</strong> is lazy ByteString.Char8, <strong>BB</strong> is lazy ByteString, that is, the <strong>Word8</strong> version.</p>
<pre><tt>BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
</tt></pre>
<p>We don&#8217;t expect this to do terribly well, and I guess it&#8217;s no surprise when this parses my test file of about 10MB in 24 seconds.</p>
<h3>Improved versions</h3>
<p>The key to improved performance is first and foremost to avoid the unpacking and parsing of strings.  Thankfully, the ByteString library provides a <strong>readInt</strong> function.</p>
<pre><tt>
BB.pack [lookup x | x &lt;- concatMap B.words ls]
    where
    lookup x = case B.readInt x of Just (v,_) -&gt; fromIntegral v
                                   Nothing -&gt; error "Unparsable qual value"
</tt></pre>
<p>This isn&#8217;t a lot more complicated than our initial attempt, but the speed increase is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement.  The ByteString implementation will share the storage of the separate words with the original strings, but since <strong>readInt</strong> gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:</p>
<pre><tt>
BB.pack $ readInts $ B.unlines ls
    where readInts xs = case B.readInt xs of
                          Just (i,rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
                          Nothing -&gt; []
</tt></pre>
<p>This turns out to be a bit faster, time is now 1.6 seconds.  Another 20% shaved off.</p>
<h3>Final version</h3>
<p>We&#8217;re really only interested in <strong>Word8</strong> values, since quality values are always small, and since that&#8217;s what gets encoded in the result anyway.  The previous versions take a detour by reading <strong>Int</strong>s and using <strong>fromIntegral</strong> to convert them to the desired size.  It bears noting that there is no error checking involved: <strong>fromIntegral</strong> will happily and silently truncate any number beyond its target range.  So let&#8217;s do things explicitly, using <strong>Word8</strong>s throughout the computation:</p>
<pre><tt>
BB.pack $ go 0 ls
    where
    isDigit x = x &lt;= 57 &amp;&amp; x &gt;= 48   -- ASCII '0'..'9'
    go i (s:ss) = case BB.uncons s of
                    Just (c,rs) -&gt; if isDigit c then go (c - 48 + 10*i) (rs:ss)
                                   else let rs' = BB.dropWhile (not . isDigit) rs
                                        in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
                    Nothing -&gt; i : go 0 ss
    go _ [] = []
</tt></pre>
<p>This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best <strong>readInt</strong> version, and about 25 times faster than the naive version.  Still, 10MB/s is well below the average hard disk.</p>
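<p>As an aside, the silent truncation by <strong>fromIntegral</strong> mentioned earlier is easy to demonstrate.  The sketch below also shows one possible checked conversion; the helper name <strong>toWord8</strong> is made up for this example.  Note that plain <strong>Word8</strong> arithmetic wraps around in the same way, so working in <strong>Word8</strong> throughout does not add any range checking by itself.</p>
<pre><tt>
import Data.Word (Word8)

-- fromIntegral wraps around silently: 300 becomes 300 `mod` 256 = 44.
truncated :: Word8
truncated = fromIntegral (300 :: Int)

-- A checked alternative (illustrative): Nothing for values outside 0..255.
toWord8 :: Int -&gt; Maybe Word8
toWord8 n
  | n &gt;= 0 &amp;&amp; n &lt;= 255 = Just (fromIntegral n)
  | otherwise          = Nothing

main :: IO ()
main = do
  print truncated
  print (map toWord8 [40, 255, 256, -1])
</tt></pre>
<p>This prints 44, followed by [Just 40,Just 255,Nothing,Nothing].</p>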
<p>So is there more room for improvement?  The most obvious wart to me is the rather artificial splitting into lines.  This is mostly an artifact of some early design decisions, and it should be possible to eliminate the splitting earlier on, saving even more time and simplifying this function quite a bit.</p>
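<p>A rough sketch of what a version without the line splitting might look like, working on one lazy ByteString and treating anything that isn&#8217;t a digit as a separator.  This is not the library code, the name <strong>parseQuals</strong> is hypothetical, and I make no claims about its speed.</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Word (Word8)

isDigit :: Word8 -&gt; Bool
isDigit x = x &gt;= 48 &amp;&amp; x &lt;= 57   -- ASCII '0'..'9'

-- Parse every decimal number in the input into one byte each.
-- Invariant: the state passed to next is empty or starts with a digit.
parseQuals :: BB.ByteString -&gt; BB.ByteString
parseQuals = BB.unfoldr next . BB.dropWhile (not . isDigit)
  where
    next s
      | BB.null s = Nothing
      | otherwise =
          let (ds, rest) = BB.span isDigit s
              v = BB.foldl' (\acc d -&gt; 10*acc + (d - 48)) 0 ds
          in Just (v, BB.dropWhile (not . isDigit) rest)

main :: IO ()
main = print (BB.unpack (parseQuals (B.pack "40 40 7\n12 0")))
</tt></pre>
<p>For the small input above this prints [40,40,7,12,0].  As with the final version, values above 255 wrap around silently.</p>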
<p>If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.</p>
<p><strong>Edit:</strong> Since some people have asked, I&#8217;ve wrapped up a simple test program along with some test files at <a href="http://malde.org/~ketil/biohaskell/qualparsetest">http://malde.org/~ketil/biohaskell/qualparsetest</a>.  This is a simplified version; if you want to be <em>really</em> helpful, you could always look at <strong>Bio.Sequence.Fasta</strong> in the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Bioinformatics library</a> and see if you can speed up e.g. <em>dephd -i input.fasta input.qual -F /dev/null</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/08/31/parsing-ints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>454 sequencing and parsing the SFF binary format</title>
		<link>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/</link>
		<comments>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/#comments</comments>
		<pubDate>Fri, 14 Nov 2008 13:13:33 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/</guid>
		<description><![CDATA[Roche&#8217;s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude.   Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify &#8212; and maybe also reduce the number or severity [...]]]></description>
			<content:encoded><![CDATA[<p>Roche&#8217;s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude.   Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify &#8212; and maybe also reduce the number or severity of &#8212; errors.  This means reading the binary SFF format. Below, we&#8217;ll dissect the SFF format, and describe a Haskell implementation.</p>
<p><span id="more-30"></span></p>
<p>The SFF file format is documented in the <a href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf" title="GS_FLX manual">GS FLX documentation</a>, pages 445-448.  There also exists an open C implementation in io_lib from the <a href="http://staden.sourceforge.net/" title="Staden package home page at sourceforge">Staden package</a>, in addition to the proprietary implementation provided to Roche&#8217;s customers bundled with the sequencing machine.</p>
<h2>Structure and implementation</h2>
<p>The SFF format is a relatively straightforward one, consisting of a header (the <em>common header</em>), followed by a number of <em>read blocks</em> corresponding to the individual sequences.  Each read block contains a <em>header block</em> and a <em>data block</em>. There is also an optional index, whose format is not defined as part of the SFF format.</p>
<p>Here&#8217;s the direct translation to Haskell:</p>
<blockquote>
<pre>
data SFF = SFF CommonHeader [ReadBlock]

data CommonHeader = CommonHeader {
          magic                                   :: Word32
        , version                                 :: Word32
        , index_offset                            :: Int64
        , index_length, num_reads                 :: Int32
        , cheader_length, key_length, flow_length :: Int16
        , flowgram_fmt                            :: Word8
        , flow, key                               :: ByteString
     }

data ReadHeader = ReadHeader {
      rheader_length, name_length           :: Int16
    , num_bases                             :: Int32
    , clip_qual_left, clip_qual_right
    , clip_adapter_left, clip_adapter_right :: Int16
    , read_name                             :: ByteString
}

data ReadBlock = ReadBlock {
      read_header                :: ReadHeader
    -- The data block
    , flowgram                   :: [Int16]
    , flow_index, bases, quality :: ByteString
    }</pre>
</blockquote>
<p>We can almost get by with these data structures and the straightforward Binary instances.  One slightly complicating feature is that each block must be aligned to an 8-byte boundary.  Another one is that we will in some places need some information (lengths and counts) from previously read data, as none of the arrays are size-prefixed.</p>
<p>Note that the data structures above represent the data on disk very closely.  Later on, static items like the magic number and version will be removed and hardwired into the code instead.</p>
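<p>The padding and the externally supplied lengths can be handled with a small helper in the <tt>Get</tt> monad.  What follows is only a minimal sketch, not the library&#8217;s actual code: the names <tt>alignTo8</tt> and <tt>getPadded</tt> are made up for illustration, and it assumes that parsing starts at an 8-byte-aligned offset.</p>
<blockquote>
<pre>
import Data.Binary.Get (Get, bytesRead, getByteString, runGet, skip)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- Skip forward to the next 8-byte boundary (does nothing if already aligned).
alignTo8 :: Get ()
alignTo8 = do
  n &lt;- bytesRead
  skip (fromIntegral ((8 - n `mod` 8) `mod` 8))

-- Read an array whose length was found in a previously read header,
-- then consume the padding that follows it.
getPadded :: Int -&gt; Get B.ByteString
getPadded len = do
  bs &lt;- getByteString len
  alignTo8
  return bs

main :: IO ()
main = do
  -- Four bytes of payload, four bytes of padding, then the next block.
  let input = BL.pack ("TACG" ++ replicate 4 '\0' ++ "next")
      parser = do
        k &lt;- getPadded 4
        p &lt;- bytesRead
        return (k, p)
  print (runGet parser input)   -- prints ("TACG",8)</pre>
</blockquote>
<p>Keeping the alignment in one helper means each block parser only has to call it once, after its last variable-length field.</p>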
<h2>Testing and optimization</h2>
<p>While I love to do automated unit testing with QuickCheck, it&#8217;s not so straightforward to generate random instances for complex structures like flowgrams.  So in order to test the program, I added two functions to the library: &#8216;test&#8217; and &#8216;convert&#8217;. The former just prints the information from the CommonHeader and the first two ReadBlocks.  I can then use sffinfo or a similar tool to check that the two programs are in agreement.   The latter reads an SFF file, building the SFF data structure, and then serializes it back to a file (modulo the index).</p>
<p>I also wrote a small application, &#8216;flower&#8217;, that reads SFF files and provides various outputs.  The first one is just a table with all the flow values, which will be useful for statistical analysis of these data.  This immediately revealed a performance problem.    (The file contains about 120 000 sequences, each with 168 flow values):</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  265.68s user 1.08s system 99% cpu 4:28.76 total</pre>
<p>The good news is that reading the SFF file is quite fast; the bad news is that formatting the information takes a relatively large amount of time.  Here&#8217;s the offending code:</p>
<blockquote>
<pre>
-- One tab-separated line per flow: read name, position, flow character, value.
showread :: CommonHeader -&gt; ReadBlock -&gt; [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map (\(p,c,v) -&gt; printf "%s\t%d\t%s\t%1.2f" rn p [c] (fi v))
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

-- Flow values are stored as integers in units of 1/100.
fi :: Int16 -&gt; Double
fi = (/100) . fromIntegral</pre>
</blockquote>
<p>The main culprit turned out to be &#8216;printf&#8217;, which I suspect needs to interpret the format string every time.  Also, we&#8217;ll need a way to format floating point values.</p>
<blockquote>
<pre>
showread :: CommonHeader -&gt; ReadBlock -&gt; [String]
showread h rd = let rn = unpack (read_name $ read_header rd)
                in map (\(p,c,v) -&gt; concat [rn,"\t",show p,"\t",[c],"\t",fi v])
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

-- Format the flow value with two decimals, without going through printf.
fi :: Int16 -&gt; String
fi = (\x -&gt; showFFloat (Just 2) x "") . (/100) . fromIntegral</pre>
</blockquote>
<p>Running this gives us:</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  192.65s user 0.69s system 99% cpu 3:15.01 total</pre>
<p>Of course, we&#8217;re still building up Strings &#8212; that is, linked lists of Unicode characters, which is very often a performance problem.  Let&#8217;s try the universal string fastifier: Data.ByteString.  And while we could add a &#8216;pack&#8217; to the floating point formatting, we&#8217;ll try to build it directly:</p>
<blockquote>
<pre>
showread :: CommonHeader -&gt; ReadBlock -&gt; [ByteString]
showread h rd = let rn = read_name $ read_header rd
                in map (\(p,c,v) -&gt; B.concat [rn,t,B.pack (show p),t,B.pack [c],t,fi v])
                   $ zip3 [(1::Int)..] (unpack $ flow h) (flowgram rd)

-- Build the four digits of "ab.cd" directly from the integer flow value.
fi :: Int16 -&gt; ByteString
fi i = let (a,x) = i `divMod` 1000
           (b,y) = x `divMod` 100
           (c,d) = y `divMod` 10
           mkdigit = chr . (+48) . fromIntegral
       in B.pack [mkdigit a,mkdigit b,'.',mkdigit c,mkdigit d]</pre>
</blockquote>
<p>The variable t is just a bytestring tab.  Let&#8217;s benchmark it again:</p>
<pre>
./flower ../biolib/DZX0XNV01.sff &gt; /dev/null  55.70s user 0.35s system 99% cpu 56.217 total</pre>
<p>It turns out that floating point formatting is still a problem.   Let&#8217;s replace it with a table lookup:</p>
<blockquote>
<pre>
fi = (!) farray

farray :: Array Int16 ByteString
farray = listArray (0,10000) [B.pack (showFFloat (Just 2) i "") | i &lt;- [0,0.01..99.99::Double]]</pre>
</blockquote>
<p>This is about as fast as it gets:</p>
<pre>
./flower -f ../biolib/DZX0XNV01.sff &gt; /dev/null  39.08s user 0.26s system 99% cpu 39.573 total</pre>
<p>Note that it generates about half a gigabyte of output, or a rate of more than 10MB/s &#8212; faster than wc can count it.  Flower can also extract the reads (Fasta format, takes 0.7 seconds, approximately 200 000 sequences per second), reads with quality (Fastq format, about the same speed), or a histogram of flow value frequencies (somewhat slower at 2.2 seconds).</p>
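<p>For readers who want to try the table lookup in isolation, here is a self-contained sketch.  It is a variation on the code above, not the version in flower: the names <tt>flowTable</tt> and <tt>formatFlow</tt> are invented, the table is built from integer indices rather than a floating point enumeration, and it assumes that flow values stay within 0 to 9999.</p>
<blockquote>
<pre>
import Data.Array (Array, listArray, (!))
import qualified Data.ByteString.Char8 as B
import Data.Int (Int16)
import Numeric (showFFloat)

-- One preformatted string for every flow value from 0 (0.00) to 9999 (99.99).
flowTable :: Array Int16 B.ByteString
flowTable = listArray (0, 9999)
  [ B.pack (showFFloat (Just 2) (fromIntegral i / 100 :: Double) "")
  | i &lt;- [0 .. 9999 :: Int] ]

-- Formatting is now a single array lookup; values outside the table are an error.
formatFlow :: Int16 -&gt; B.ByteString
formatFlow = (flowTable !)

main :: IO ()
main = mapM_ (B.putStrLn . formatFlow) [0, 7, 105, 9999]</pre>
</blockquote>
<p>Driving the list comprehension from integers makes the number of entries match the array bounds exactly, at the cost of one division per entry when the table is first built.</p>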
<h2>Availability</h2>
<p>The bioinformatics library and the flower source code are available <a href="http://malde.org/~ketil/biohaskell/" title="darcs repositories">here</a>, and there&#8217;s also a <a href="http://malde.org/~ketil/flower" title="flower linux x86">linux executable</a> for flower, which I may be updating as development progresses.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/11/14/454-sequencing-and-parsing-the-sff-binary-format/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Optimization again: befuddled by bytestrings</title>
		<link>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</link>
		<comments>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 08:13:01 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</guid>
		<description><![CDATA[I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using Bryan O&#8217;Sullivan&#8217;s Bloom filters.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan&#8217;s</a> <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter" title="bloomfilter at hackageDB">Bloom filters</a>.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for the fact that I saw the expected linear time usage when I ran the code.  Some more digging about revealed that my laptop also showed quadratic scaling.  Profiling showed the  culprit to be a simple pipeline-style function:</p>
<blockquote>
<pre>swords s = take (fromIntegral (seqlength s)+1-k) . map (B.take k) . B.tails . B.map toUpper $ seqdata s</pre>
</blockquote>
<p>Here, seqdata returns a lazy bytestring, which is also what&#8217;s hiding behind the <tt>B.</tt> qualifier.  Basically, this just builds the list of all length-<em>k</em> substrings.  This should, in my opinion, stream nicely, and run in constant space and linear time.  What on earth makes this quadratic?  On only some systems?   Time to dissect the systems in question:</p>
<p><span id="more-28"></span>Comparing our environments, one obvious difference was that the fast system was built using GHC 6.8.1, while the slow ones were 6.8.2.  So we checked with a 6.8.1 snapshot &#8211; no difference, still slow.  We checked the various source repositories involved for some stray patch that somehow had avoided distribution &#8211; nothing turned up. Comparing Cabal setup files didn&#8217;t reveal anything obvious, except that the newer Cabal emitted a lot more information.</p>
<p>Somewhere, something had to be different. Could it be bloomfilter not being a good consumer for the generated words?  The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">bio library</a> messing up FASTA parsing?  Something else entirely?</p>
<p>Looking at the <tt>swords</tt> function, it looks like a good candidate for deforestation and/or fusion.  Could there be some optimization not being performed in some cases, for some strange reasons?  Since I remember <a href="http://www.cse.unsw.edu.au/~dons/papers/CLS07.html" title="Coutts et al (2007): Stream fusion: from lists to streams to nothing at all">fusion being mentioned</a> occasionally in the context of <a href="http://www.cse.unsw.edu.au/~dons/fps.html" title="bytestring home page">bytestring</a>s, I checked the libraries again, and found something I&#8217;d missed previously:  Bytestring 0.9.1 on the fast system, 0.9.0.1 on the slow ones.  While the <a href="http://article.gmane.org/gmane.comp.lang.haskell.cafe/38992">announcement</a> didn&#8217;t promise more than a few percent better performance, no stone could be left unturned.  And this proved to be it: after upgrading bytestring, and recompiling binary, biolib, and bloomfilter to use it, my laptop was as fast as the server.</p>
<p>I&#8217;m not sure exactly what caused this, but apparently there were some issues with bytestring fusion, and fusion is disabled in current bytestring versions. At any rate, it seems the newer version is safer, so this is now the minimum requirement in <tt>bio.cabal</tt>.</p>
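<p>For anyone who wants to time the pipeline on their own setup, here is a stand-alone sketch of the same idea.  It is not the library code: <tt>kwords</tt> is a hypothetical name, it takes <em>k</em> as an explicit argument, and it works on a plain lazy bytestring instead of the library&#8217;s sequence type.</p>
<blockquote>
<pre>
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char (toUpper)

-- All overlapping words of length k, upper-cased, in order of occurrence.
kwords :: Int -&gt; B.ByteString -&gt; [B.ByteString]
kwords k s = take (fromIntegral (B.length s) + 1 - k)
           . map (B.take (fromIntegral k))
           . B.tails
           . B.map toUpper
           $ s

main :: IO ()
main = mapM_ B.putStrLn (kwords 3 (B.pack "acgtac"))
-- prints ACG, CGT, GTA and TAC, one per line</pre>
</blockquote>
<p>Since <tt>B.tails</tt> and <tt>B.take</tt> only share slices of the original chunks, the amount of work per word should depend on <em>k</em> and not on the sequence length.</p>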
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The FastQ file format for sequences</title>
		<link>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/</link>
		<comments>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 13:32:16 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/</guid>
		<description><![CDATA[It was just brought to my attention that people have started to use a new file format for sequences.  This format, called &#8216;FastQ&#8217; combines both the sequence data itself and the quality data in one file.  That&#8217;s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs [...]]]></description>
			<content:encoded><![CDATA[<p>It was just brought to my attention that people have started to use a new file format for sequences.  This format, <a href="http://www.bioperl.org/wiki/FASTQ_sequence_format" title="BioPerl Wiki: definition of FastQ">called &#8216;FastQ&#8217;</a>, combines both the sequence data itself and the quality data in one file.  That&#8217;s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs fast, too.  Basically, the format is a sequence of records, each one similar to this:<br />
<span id="more-29"></span></p>
<blockquote>
<pre>
@{sequence header}
{sequence data}
+{sequence header}
{quality data}</pre>
</blockquote>
<p>Note that the sequence header is repeated in there; apparently somebody thought that would be a good idea.   The <tt>{sequence data}</tt> part looks like it does in a Fasta file, except that here it has to be on a single line.  The <tt>{quality data}</tt> is ASCII, each letter representing the quality value 33 lower than its ASCII value.  This opens up another possibility of getting it wrong, since the line of quality data can (and will!) start out with &#8216;+&#8217; or &#8216;@&#8217; occasionally.</p>
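<p>To make the record layout and the quality encoding concrete, here is a small sketch that relies on line position instead of on the first character of a line.  It is not the bio library&#8217;s parser: the <tt>Record</tt> type and <tt>parseRecords</tt> are hypothetical, and it assumes that every record is well-formed and exactly four lines long.</p>
<blockquote>
<pre>
import Data.Char (ord)

data Record = Record
  { header :: String   -- text after the '@'
  , bases  :: String   -- sequence data, on a single line
  , quals  :: [Int]    -- quality values, ASCII code minus 33
  } deriving Show

-- Consume four lines at a time; stop at the first group that does not fit.
parseRecords :: [String] -&gt; [Record]
parseRecords (('@':h) : s : ('+':_) : q : rest) =
  Record h s (map (\c -&gt; ord c - 33) q) : parseRecords rest
parseRecords _ = []

main :: IO ()
main = mapM_ print (parseRecords (lines "@r1\nACGT\n+r1\n@+I!\n"))
-- the quality line "@+I!" decodes to [31,10,40,0]</pre>
</blockquote>
<p>The quality line in the example starts with &#8216;@&#8217; on purpose: since the parser only looks at the marker characters on the first and third line of each group, it is not fooled by it.</p>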
<p>Anyway, the implementation seems to be pretty efficient.  I wrote a simple program to count the number of sequences in a file:</p>
<blockquote>
<pre>
module Main where

import Bio.Sequence
import System.Environment (getArgs)

main = do
  [f] &lt;- getArgs
  print . length =&lt;&lt; readFastQ f</pre>
</blockquote>
<p>Testing it on <a href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271/fastq/200x36x36-071113_EAS56_0053-s_1_1.fastq.gz">this 440MB file</a> ran in (first cold, then warm cache):</p>
<blockquote>
<pre>
./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.77s user 0.41s system 53% cpu 4.091 total
./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.26s user 0.19s system 89% cpu 1.631 total</pre>
</blockquote>
<p>For comparison, I also tried it with &#8216;grep&#8217;:</p>
<blockquote>
<pre>
grep '^@' ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  5.61s user 0.46s system 98% cpu 6.181 total</pre>
</blockquote>
<p>..and in addition to being 50% slower, it gives the wrong answer, since the &#8216;@&#8217; delimiter may occur in the quality data.  Another nice thing is that the Haskell program is IO bound, so it would be even faster if I had a better disk.  (Note to self: talk to boss about getting an SSD for my laptop).</p>
<p><strong>Update</strong>: I did a quick test, comparing it to the old Fasta format.  First, to convert, I replaced the last line in the program above with</p>
<blockquote>
<pre>
readFastQ f &gt;&gt;= writeFasta (f++".fasta")</pre>
</blockquote>
<p>and compiled it as <tt>convertFQ</tt>.  Running this:</p>
<blockquote>
<pre>
./convertFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  9.51s user 2.76s system 40% cpu 30.037 total</pre>
</blockquote>
<p>Then, I made a <tt>countFa</tt> by changing the last line to:</p>
<blockquote>
<pre>print . length =&lt;&lt; readFasta f</pre>
</blockquote>
<p>Running this on the Fasta file generated just now (282MB), I get:</p>
<blockquote>
<pre>./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.26s user 0.19s system 89% cpu 1.631 total</pre>
</blockquote>
<p>Here, grep takes 2.25 seconds user time (and gives the correct answer); we&#8217;re still faster.</p>
<p>The only cloud on the horizon is that there is some disagreement about the format.  For example, <a href="http://may2005.archive.ensembl.org/Docs/Pdoc/bioperl-live/Bio/SeqIO/fastq.html" title="Minority report?">somebody thinks quality</a> should always start with a &#8216;!&#8217; (<em>viz</em>. zero).  <a href="http://maq.sourceforge.net/fastq.shtml">Maq developers think</a> it&#8217;s okay to drop the repeated sequence name.  <a href="http://www.rockefeller.edu/genomics/solexa.php">Rockefeller thinks</a> the quality data should be text digits, like the Qual format. And the good people at Solexa had to go and slightly alter&#8230;not the format itself, but <a href="http://maq.sourceforge.net/fastq.shtml" title="FastQ format description at sourceforge">its interpretation,</a> using a different formula to calculate the error probabilities from the quality values. So basically, given a file, there&#8217;s no way to know whether it uses Solexa-style quality information, or regular, Phred-style quality.  If you get <a href="http://maq.sourceforge.net/fastq.shtml">quality values above 60</a>, then maybe you&#8217;re interpreting it wrong.  Then again, maybe not.  Sigh.</p>
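<p>The difference between the two interpretations is easy to show in code.  The sketch below uses the commonly cited definitions, which are an addition here and not taken from the bio library: Phred quality is -10 log10(<em>p</em>), while Solexa quality is -10 log10(<em>p</em>/(1-<em>p</em>)), where <em>p</em> is the error probability.  The function names are made up.</p>
<blockquote>
<pre>
-- Error probability from a Phred-style quality: Q = -10 * log10 p
phredToProb :: Double -&gt; Double
phredToProb q = 10 ** (-q / 10)

-- Error probability from a Solexa-style quality: Q = -10 * log10 (p / (1 - p))
solexaToProb :: Double -&gt; Double
solexaToProb q = odds / (1 + odds)
  where odds = 10 ** (-q / 10)

main :: IO ()
main = mapM_ row [0, 10, 20, 40]
  where row q = print (q, phredToProb q, solexaToProb q)</pre>
</blockquote>
<p>For high qualities the two columns are nearly identical, which is exactly why a misinterpreted file is hard to spot; the difference only becomes large for low quality values.</p>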
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A plan for Bloom filters</title>
		<link>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</link>
		<comments>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 20:00:44 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</guid>
		<description><![CDATA[Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until Bryan O&#8217;Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on [...]]]></description>
			<content:encoded><![CDATA[<p>Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan</a> posted a <a href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg41876.html" title="[Haskell-cafe] [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters">message</a> to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a <a href="http://book.realworldhaskell.org/beta/bloomfilter.html" title="Real World Haskell: Chapter 27">chapter in the upcoming book</a>.  You can read all about Bloom filters on <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a>, but the executive summary is that a Bloom filter is a structure similar to Data.Set, except that it is probabilistic, and may occasionally claim a value is a member when it&#8217;s not.  On the positive side, the Bloom filter is very fast, and its speed is independent of the size &#8212; in other words, lookup and insert are <em>O</em>(1) where Data.Set is <em>O</em>(log <em>n</em>).</p>
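<p>To make the idea concrete, here is a toy Bloom filter.  It is emphatically not the API of the bloomfilter package: all names are invented, the three hash functions are simple polynomial hashes picked for illustration, and an <tt>IntSet</tt> of set positions stands in for the fixed-size bit array a real implementation would use.</p>
<blockquote>
<pre>
import Data.Char (ord)
import qualified Data.IntSet as IS
import Data.List (foldl')

-- A toy Bloom filter: the number of bit positions, and the positions that are set.
data Bloom = Bloom Int IS.IntSet

empty :: Int -&gt; Bloom
empty m = Bloom m IS.empty

-- Three simple polynomial string hashes, each reduced to a bit position.
positions :: Int -&gt; String -&gt; [Int]
positions m w = [ foldl' (step seed) 7 w `mod` m | seed &lt;- [31, 131, 1313] ]
  where step seed acc c = (acc * seed + ord c) `mod` 1000000007

-- Inserting a word sets all of its positions.
insert :: String -&gt; Bloom -&gt; Bloom
insert w (Bloom m bits) = Bloom m (foldr IS.insert bits (positions m w))

-- May answer True for a word that was never inserted,
-- but never False for a word that was.
member :: String -&gt; Bloom -&gt; Bool
member w (Bloom m bits) = all (`IS.member` bits) (positions m w)

main :: IO ()
main = do
  let b = insert "ACGT" (insert "GGCC" (empty 1024))
  print (member "ACGT" b, member "TTTT" b)</pre>
</blockquote>
<p>The first component of the result is always <tt>True</tt>.  The second one is <tt>False</tt> unless all three positions of the absent word happen to be set already, which is the false positive case mentioned above.</p>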
<p>Comparing sequences to find similarity is a common occurrence in bioinformatics.  For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index the sequences in question as overlapping substrings (<em>k</em>-tuples, <em>q</em>-grams).  This index seems like an obvious target for Bloom filters &#8212; large data, time critical, some false positives anyway &#8212; but for some reason, there are <a href="http://hpcb.wustl.edu/pubs/mercuryblastn.pdf" title="Buhler et al.: MercuryBLASTN: Faster DNA sequence comparison using a streaming hardware architecture. (unpublished?)">almost</a> no applications that use them. Until now.</p>
<p><span id="more-15"></span></p>
<h2>Sequence clustering</h2>
<p>Sequence clustering is a commonly used technique which is usually based on sequence similarity.  I&#8217;ve written one sequence clusterer, <em>xsact</em>, which is based on blocks of exact matches. There are many others, and another example is <a href="http://bioinformatics.oxfordjournals.org/cgi/reprint/19/3/421.pdf" title="Burke et al.: ">d2cluster</a>, which is based on occurrence of fixed length words &#8212; which is right up Bloom alley, right?</p>
<p>So a straightforward way to build a Bloom filter based sequence clusterer is to represent each cluster as a set of words &#8212; stored as a Bloom filter.  Now, adding a new sequence to the clusters is a simple matter of extracting the words from the sequence, identifying the cluster(s) containing a sufficient number of these words, and adding the remaining words to that cluster (or the union of the clusters, in the case multiple clusters match).</p>
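<p>The clustering step can be sketched in a few lines.  This is an illustration of the idea only: <tt>Data.Set</tt> stands in for the Bloom filter, the names <tt>kWords</tt> and <tt>addSeq</tt> and the threshold parameter are invented, and reverse complements are ignored.</p>
<blockquote>
<pre>
import Data.List (foldl', partition, tails)
import qualified Data.Set as S

-- A cluster is represented by the set of words seen in its sequences.
type Cluster = S.Set String

-- All overlapping words of length k.
kWords :: Int -&gt; String -&gt; [String]
kWords k s = [ w | t &lt;- tails s, let w = take k t, length w == k ]

-- Add one sequence: merge it with every cluster that already contains at
-- least minHits of its words, or start a new cluster if none does.
addSeq :: Int -&gt; Int -&gt; [Cluster] -&gt; String -&gt; [Cluster]
addSeq k minHits clusters s = S.unions (S.fromList ws : matching) : rest
  where
    ws = kWords k s
    hits c = length (filter (`S.member` c) ws)
    (matching, rest) = partition (\c -&gt; hits c &gt;= minHits) clusters

main :: IO ()
main = mapM_ (print . S.toList)
             (foldl' (addSeq 3 2) [] ["ACGTAC", "CGTACG", "TTTTTT"])</pre>
</blockquote>
<p>Only membership tests and insertions are used on the clusters while matching, which are the operations a Bloom filter offers; merging two clusters corresponds to taking the union of two filters of the same size.</p>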
<p>The interesting thing about this approach is that the whole thing becomes <em>O</em>(<em>kn</em>), for <em>k</em> clusters and data size <em>n</em>.  I think all other clustering algorithms are based on sequence pairs, which makes them <em>O</em>(<em>n²</em>) &#8212; in the worst case, you need to check all pairs. (However, a straightforward similarity-based clustering will have worst-case behavior when no sequences match each other, while suffix-based methods like <em>xsact</em> will have worst case when all sequences match &#8212; so perhaps there is room for a better middle ground?)</p>
<p>Anyway &#8212; while the above looks promising, there is one snag: EST sequences can occur as a copy of the gene, or, due to various properties of the <a href="http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/" title="Cleaning up sequences">manufacturing process</a>, the gene&#8217;s <em>reverse complement</em>.  Thus, we need to be able to compare sequences in both directions simultaneously. This could be achieved with a slightly creative hashing function, but to keep things simple, we&#8217;ll stick with the ossified mental sweat in the provided implementation.</p>
<h2>Index and search</h2>
<p>A somewhat related area is indexing and search.  Let&#8217;s say you have a bunch of DNA sequences (one for each chromosome, perhaps), and a set of ESTs, which you&#8217;ll remember are gene fragments in an unpredictable mix of forward and reverse-complement directions.  Here&#8217;s the plan:</p>
<ol>
<li>index by building a Bloom filter containing the <em>q</em>-tuples for each chromosome</li>
<li> for each EST, look up each <em>q</em>-tuple (first forward, then rev.comp.) against the filters, and assign it to the chromosome containing the most <em>q</em>-tuples</li>
<li>align each EST against the designated chromosome using traditional methods</li>
</ol>
<p>As far as I can tell at this point, for word size <em>q</em>, <em>c</em> chromosomes of length <em>m</em> and <em>e</em> ESTs of length <em>n</em>, this should run in something like <em>O</em>(<em>qcm</em>) + <em>O</em>(<em>qenc</em>) + <em>O</em>(<em>emn</em>), compared to just aligning directly at <em>O</em>(<em>ecmn</em>). Note that <em>mn</em> is the big factor here, so on a large scale, we&#8217;re reducing the total work by a factor of <em>c</em>.  (Of course, in real life, <em>nm</em> would be too large, and you&#8217;d use a heuristic, subquadratic alignment step, but this is for illustration purposes.  Try to be a bit generous, will you?)</p>
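<p>The first two steps of the plan can be sketched as follows.  Again, <tt>Data.Set</tt> is only a stand-in for the Bloom filter, all function names and the toy sequences are invented, and the alignment step is left out.</p>
<blockquote>
<pre>
import Data.List (maximumBy, tails)
import Data.Ord (comparing)
import qualified Data.Set as S

-- All overlapping words of length q.
qGrams :: Int -&gt; String -&gt; [String]
qGrams q s = [ w | t &lt;- tails s, let w = take q t, length w == q ]

revComp :: String -&gt; String
revComp = reverse . map comp
  where comp 'A' = 'T'; comp 'T' = 'A'; comp 'C' = 'G'; comp 'G' = 'C'; comp c = c

-- Step 1: one word set per named chromosome.
buildIndex :: Int -&gt; [(String, String)] -&gt; [(String, S.Set String)]
buildIndex q chroms = [ (name, S.fromList (qGrams q s)) | (name, s) &lt;- chroms ]

-- Step 2: look the EST up in both directions and pick the chromosome
-- that contains the most of its words.
assign :: Int -&gt; [(String, S.Set String)] -&gt; String -&gt; String
assign q idx est = fst (maximumBy (comparing (score . snd)) idx)
  where
    hits ws set = length (filter (`S.member` set) ws)
    score set = max (hits (qGrams q est) set)
                    (hits (qGrams q (revComp est)) set)

main :: IO ()
main = do
  let idx = buildIndex 4 [("chr1", "ACGTACGTTTGACC"), ("chr2", "GGGCCCAAATTTGG")]
  mapM_ (putStrLn . assign 4 idx) ["CGTTTGA", "AATTTGGGCC"]
-- prints chr1, then chr2</pre>
</blockquote>
<p>The second EST is the reverse complement of a stretch of the second toy chromosome, so it is the reverse-complement lookup that gives it its best score.</p>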
<h2>Further plans</h2>
<p>That concludes the current plan, but there are certainly improvements that can be made.  Two obvious ones are:</p>
<ul>
<li>hash function that hashes a word and its reverse complement to the same value</li>
<li>break long sequences into partially (1/3) overlapping regions to speed things up even more, as well as give more accurate placements</li>
</ul>
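<p>One simple way to get the first of these, sketched here under the assumption that words are plain strings over A, C, G and T, is to replace each word by the lexicographically smaller of itself and its reverse complement before hashing it with whatever hash function is in use.  The name <tt>canonical</tt> is made up for the example.</p>
<blockquote>
<pre>
revComp :: String -&gt; String
revComp = reverse . map comp
  where comp 'A' = 'T'; comp 'T' = 'A'; comp 'C' = 'G'; comp 'G' = 'C'; comp c = c

-- A word and its reverse complement share the same canonical form,
-- so any hash applied to that form gives them the same value.
canonical :: String -&gt; String
canonical w = min w (revComp w)

main :: IO ()
main = print (canonical "AACG", canonical "CGTT")
-- prints ("AACG","AACG")</pre>
</blockquote>
<p>With this in place, each word needs to be looked up only once instead of once per direction.</p>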
<p>Another thing that struck me is that in my <a href="http://malde.org/~ketil/biohaskell/xml2x/" title="xml2x -- annotating EST sequences from BLASTX hits">annotation tool</a>, I could probably use this as a faster way to store the set of matching proteins before extracting GO terms.  Currently, the performance is limited by XML parsing, so it&#8217;s probably not worth the bother at the moment.</p>
<p>More is likely to come up, but now it&#8217;s time to implement something!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Functional bash: bracketing</title>
		<link>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/</link>
		<comments>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 08:23:36 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/</guid>
		<description><![CDATA[My current development project is an EST pipeline.  For various reasons, it is implemented in shell &#8212; bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.
As in any program, there are many occasions where [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.malde.org/wp-content/uploads/2008/07/hazard_lambda_cracked_2.png" title="Functional programming expands your mind"><img src="http://blog.malde.org/wp-content/uploads/2008/07/hazard_lambda_cracked_2.png" alt="Functional programming expands your mind" width="161" align="left" height="159" /></a>My current development project is an EST pipeline.  For various reasons, it is implemented in shell &#8212; bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.</p>
<p>As in any program, there are many occasions where you want to effect some particular change during some part of the program.  The archetypical example is allocation of local variables. After allocation, the variables are available to the program until they go out of scope, at which point they get deallocated automatically.  The technique can be generalized beyond this.  For instance, you (or rather I) may want to set a <tt>$STAGE</tt> variable that indicates the current processing stage, and which should be unset when the stage has finished executing.  Or you may want to run some processing in a different directory, in which case you <em>really</em> want to remember to return to the previous directory when you finish.  The purpose of <em>bracketing</em> is to wrap a section of code with an initial part to be run in advance, and a final part to be run afterwards.</p>
<p><span id="more-16"></span></p>
<h2>Some examples</h2>
<p>When I toyed with PHP ages ago, I&#8217;d often find myself building a section of a page by a) generating a header with some opening tags, b) generating some content, and c) generating a footer with some closing tags.  And when the exact contents of these pieces depend on various factors, and such sections would nest in complicated ways, it should not be a surprise that getting a) and c) to correspond exactly could be a challenge.  With absolutely no enforcement by the language (which may have improved in later years, I wouldn&#8217;t know), this was very fragile.</p>
<p>For <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/xhtml/Text-XHtml.html" title="The XHtml Haskell library">HTML</a> the solution is simple: instead of generating open and close tags separately, have a function take the tag and its contents, and output the contents appropriately surrounded by open and close tags.  If you build the whole document this way, you guarantee that each open tag will have a matching close tag, and that tags will be properly nested.  Another example is Common Lisp&#8217;s <tt>with-open-file</tt> macro.</p>
<p>In Haskell, there&#8217;s <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Exception.html#v%3Abracket" title="Control.Exception.bracket documentation">bracket (in Control.Exception)</a>, which in addition to being vastly more general also is a <a href="http://p-cos.blogspot.com/2007/02/what-is-point-of-macros.html" title="Pascal Costanza: What is the point of macros?">regular function</a>, thus once again proving Haskell&#8217;s vast technical and moral superiority over the more pedestrian languages&#8230;.but I digress.  I suspect the rather original name is supposed to allude to how brackets (as in those banana-shaped glyphs surrounding this text) consist of an opening bracket, some contents, and a closing bracket.  Anyway, we like Haskell, so we use Haskell terminology.</p>
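<p>For reference, here is what the &#8220;run something in another directory&#8221; case looks like with Haskell&#8217;s <tt>bracket</tt>.  The helper name <tt>withDirectory</tt> is made up for this sketch, and the temporary directory is just a convenient target that is known to exist.</p>
<blockquote>
<pre>
import Control.Exception (bracket)
import System.Directory (getCurrentDirectory, getTemporaryDirectory, setCurrentDirectory)

-- Run an action in the given directory and return to the original one
-- afterwards, even if the action throws an exception.
withDirectory :: FilePath -&gt; IO a -&gt; IO a
withDirectory dir action =
  bracket getCurrentDirectory      -- "open": remember where we are
          setCurrentDirectory      -- "close": go back there
          (\_ -&gt; setCurrentDirectory dir &gt;&gt; action)

main :: IO ()
main = do
  tmp &lt;- getTemporaryDirectory
  withDirectory tmp (getCurrentDirectory &gt;&gt;= putStrLn)
  getCurrentDirectory &gt;&gt;= putStrLn   -- back in the original directory</pre>
</blockquote>
<p>The value produced by the &#8220;open&#8221; action is handed to the &#8220;close&#8221; action, so no global state is needed to remember the original directory.</p>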
<p>As a final note, observe that brackets are similar to stack allocation, manual resource management is similar to manual memory management, while using finalizers/destructors is similar to garbage collection.  (It&#8217;s tempting to add &#8220;pick any two&#8221;.)</p>
<h2>Generalized bracket</h2>
<p>While implementing the EST pipeline, I found myself needing, and implementing, several bracket-like functions (including the two previously mentioned: setting and unsetting a variable, and running a subcomputation in a separate directory).  Thus, the question that poses itself is:  Is it possible to do this in a more general way, akin to Haskell&#8217;s bracket?  Here&#8217;s my currently best attempt:</p>
<blockquote>
<pre>bracket(){
    CLOSE=$2
    eval $1
    shift; shift
    eval $*
    eval $CLOSE
}</pre>
</blockquote>
<p>First we store the second parameter, which will be the &#8220;close&#8221; action in a variable.  We then execute the first parameter (the &#8220;open&#8221; action), using <tt>eval</tt> so that variables can be set etc.  We then skip the two first arguments, using shift twice, then <tt>eval</tt> the main action, and finally, <tt>eval</tt> the &#8220;close&#8221; action.</p>
<p>This allows stuff like:</p>
<blockquote>
<pre>bracket "mkdir mytmpdir &amp;&amp; pushd mytmpdir" "popd" mkfiles</pre>
</blockquote>
<p>where the <tt>mkfiles</tt> function is run inside a temporary directory, and where execution resumes in the original directory after completion.  Another example is</p>
<blockquote>
<pre>bracket "echo Entering first stage; STAGE=first" "STAGE=none" echo Current stage is '$STAGE'</pre>
</blockquote>
<p>Note the careful quoting of the variable with single quotes: we don&#8217;t want <tt>$STAGE</tt> to be evaluated before it is set in the bracket function.  In other words, the single quotes let us pass the literal string <tt>$STAGE</tt>, sort of pass by name semantics instead of the default pass by value.</p>
<h2>Perfection is the enemy&#8230;</h2>
<p>If you are an experienced shell programmer, you may at this point have formed an opinion that I am not.  And you&#8217;d be right, of course, but even I can see that there&#8217;s (at least) one obvious bug: we define a global variable named <tt>CLOSE</tt>.  Not only does this have the potential to clash with existing variables, it also prevents recursive calls to <tt>bracket</tt>.  Possibly, we should generate variable names, or have <tt>$CLOSE</tt> be a stack, or something&#8230;But hey, Mr. Know-it-all, if you&#8217;re so damn good, why not post a comment explaining how it&#8217;s <em>really</em> done?</p>
<p>In other words: feedback and comments are most welcome (although for spam-prevention you may have to register first)!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
