<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell</title>
	<atom:link href="http://blog.malde.org/index.php/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Fri, 19 Feb 2010 21:28:27 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Tools for pyrosequencing analysis</title>
		<link>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/</link>
		<comments>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:28:27 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=98</guid>
		<description><![CDATA[I recently did a brief presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.
flowers
]]></description>
			<content:encoded><![CDATA[<p>I recently did a brief<img class="alignright" title="Hokusai poppies" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Hokusai_Poppies.jpg/800px-Hokusai_Poppies.jpg" alt="" width="260" height="179" /> presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.</p>
<p><a title="A presentation of tools for working with SFF files" href="http://blog.malde.org/wp-content/uploads/2010/02/flowers2.pdf">flowers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Searching for poly(A) tails</title>
		<link>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/</link>
		<comments>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:03:28 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=65</guid>
		<description><![CDATA[I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too trivial?  So, like many other &#8220;trivial&#8221; tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.</p>
<p>Here&#8217;s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.</p>
<p><span id="more-65"></span></p>
<h3>A quick introduction</h3>
<p>Although the definitions of what constitutes a gene vary considerably, we&#8217;ll use the term to refer to a region of DNA that gets <em><a title="Wikipedia entry for &quot;transcription&quot;" href="http://en.wikipedia.org/wiki/Transcription_%28genetics%29">transcribed</a></em>, that is, copied from DNA into an <a title="Wikipedia entry for &quot;Messenger RNA&quot;" href="http://en.wikipedia.org/wiki/Messenger_RNA">mRNA</a> molecule, which in turn will be used as a blueprint for assembling a protein.  After transcription, the mRNA molecule undergoes <a title="Wikipedia entry for &quot;polyadenylation&quot;" href="http://en.wikipedia.org/wiki/Polyadenylation"><em>polyadenylation</em></a>, a process where a string of adenines (the &#8216;A&#8217; of the nucleotide alphabet) gets appended to the mRNA molecule before it is exported from the nucleus.</p>
<p>Identifying poly-A tails is important for several reasons.</p>
<ol>
<li>It positively identifies the end of the transcript.  If you don&#8217;t have a poly-A in your sequence, you have no way to know how far the molecule extends beyond the end of the sequence.  You can also find alternatively terminated transcripts this way.</li>
<li>It marks where the transcript ends within the sequenced read.  Anything after the poly-A tail is linker or vector sequence and can safely be trimmed off, even if, as is often the case, it is of too low quality to be recognized by your average vector masking software.</li>
<li>It provides useful information about the transcript, as the poly-A tail is important for things like protecting the mRNA from degradation.</li>
</ol>
<p>Unfortunately, utilities often trim poly-A tails by default (e.g. SeqClean), or just ignore them (e.g. BLAST&#8217;s low-complexity filter).</p>
<h3>Quality based alignment</h3>
<p>When a molecule is sequenced, the analog output from the sequencing machine is stored as a <em>chromatogram</em>.  In order to be useful, the sequence is <em>called</em>, that is, translated to a string of letters from the familiar {A,C,G,T} nucleotide alphabet.  In addition, the base caller will associate each letter with a <em>quality value</em>.  This is derived from an estimate of the probability of the call being incorrect, and for quality value <em>Q</em>, the error probability estimate is</p>
<pre style="padding-left: 30px;">ϵ = 10<sup>-Q/10</sup></pre>
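<p>To get a feel for the scale, the conversion can be sketched in a couple of lines of Haskell.  This is only an illustration: the name <tt>errorProb</tt> is made up here and is not part of the bioinformatics library.</p>

```haskell
-- Hypothetical helper, for illustration only:
-- convert a quality value Q into the estimated error probability 10^(-Q/10).
errorProb :: Int -> Double
errorProb q = 10 ** (negate (fromIntegral q) / 10)
```

<p>Quality values of 10, 20 and 30 thus correspond to estimated error probabilities of 0.1, 0.01 and 0.001.</p>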
<p>Traditionally sequence alignment simply aligns the string of characters using a fixed positive score (reward) for aligning similar characters, and fixed negative scores (penalties) for either substituting a different character, opening a gap, or extending a previous gap.</p>
<p>However, taking into account the quality value, <a title="My paper on using sequence quality to improve alignments." href="http://bioinformatics.oxfordjournals.org/cgi/content/full/24/7/897">we can do better</a>, and instead of fixed scores, we can adjust the scores dynamically according to quality.</p>
<p>Using this method, the penalty for e.g. aligning two different characters will depend on the quality of the characters: high quality means a high penalty, low quality &#8212; lower penalty (since there&#8217;s a greater chance one of them was incorrectly called).</p>
<h3>Scoring of alignments</h3>
<p>When calculating the score of an alignment, we really want to answer the question: how likely is this sequence to be a real poly-A sequence, as opposed to just a random string?  In other words, we are comparing our sequence against two models: the poly-A model, and the background model. Our score will use the <em>ratio</em> of probabilities of the string being produced by the two models.</p>
<p>For the poly-A model, only As are allowed, so the probability of a character occurring is 1 for As and 0 for the others.  For the background model, we&#8217;ll just take a uniform distribution of nucleotides, each getting a probability of 0.25.</p>
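<p>Using this scheme, each character of a string s contributes a factor of 1/0.25 = 4 if it is an A, and 0/0.25 = 0 otherwise, and the score for s is the product of these factors.  We usually work with the logarithm of these numbers to make them more manageable, so that the factors are added rather than multiplied.</p>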
<p>The optimal alignment is then simply the longest run of As, since as soon as you multiply with a zero (or add -infinity, if you use <em>log</em>-scores), you lose the whole score.</p>
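<p>As a small sketch of this quality-free case, the optimal &#8220;alignment&#8221; reduces to finding the longest run of As.  The function names below are invented for the example and do not come from trimpolya or dephd.</p>

```haskell
import Data.Char (toUpper)
import Data.List (group)

-- Length of the longest uninterrupted run of 'A' (case-insensitive).
-- Hypothetical helper, for illustration only.
longestARun :: String -> Int
longestARun s = maximum (0 : [length g | g@('A' : _) <- group (map toUpper s)])

-- Log-odds score of a run of n As under the two models: n * log (1/0.25).
runScore :: Int -> Double
runScore n = fromIntegral n * log 4
```

<p>A single non-A character ends the run, which is exactly the brittleness that the quality values are meant to soften.</p>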
<h3>Adding quality to the mix</h3>
<p>Of course, the actual sequence isn&#8217;t perfect, and even the poly-A tail is likely to contain the odd G, C, or T.  To determine exactly <em>how</em> likely is where the quality value enters the picture. Using the formula above,  we can calculate the error estimate and decide what the penalty for a mismatch and reward for a match should be.</p>
<p>For the poly-A model, the probability for a match (that is, an actual &#8216;A&#8217; in the sequence) is <em>1-ϵ</em>, and the probability of a mismatch (a particular non-A) is <em>ϵ/3</em> (since a miscalled A can show up as any of the three other nucleotides, and for simplicity, we give them equal probability).  Using the formula for <em>ϵ</em> as a function of <em>Q</em> (and hopefully not introducing any errors), I get the scores to be:</p>
<pre style="padding-left: 30px;">match q = log (4*(1-1/10**(q/10)))
mismatch q = log 4 - log 3 - q/10*log 10</pre>
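<p>For reference, the two scores follow from dividing the poly-A model probability by the background probability of 1/4 and taking the logarithm.  The derivation below is a reconstruction from the definitions above, not something taken from the implementation:</p>

```latex
\begin{align*}
\mathrm{match}(Q)    &= \log\frac{1-\epsilon}{1/4}
                      = \log\bigl(4\,(1 - 10^{-Q/10})\bigr) \\
\mathrm{mismatch}(Q) &= \log\frac{\epsilon/3}{1/4}
                      = \log 4 - \log 3 + \log\epsilon
                      = \log 4 - \log 3 - \frac{Q}{10}\,\log 10
\end{align*}
```

<p>For Q = 15, ϵ is about 0.032, which gives roughly 1.35 for a match and -3.17 for a mismatch when using natural logarithms.</p>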
<p>Now, we can use this to do a standard Smith-Waterman alignment, calculating a dynamic programming matrix, and searching for an optimal local alignment.</p>
<p>However, since we&#8217;re aligning against a repeated nucleotide, there&#8217;s no real need for a second dimension, and we can use the following recurrence to calculate the &#8220;polyA-score&#8221; <em>M</em> for each position <em>i</em>:</p>
<p style="padding-left: 30px;"><em>M<sub>i</sub> = max (0, S<sub>i</sub> + M<sub>i-1</sub>)</em></p>
<p>To implement this, we first define the list of scores by applying match and mismatch to the list of (nucleotide,quality) pairs.  We also define a scanl-based function to calculate a list of cumulative scores:</p>
<pre style="padding-left: 30px;">scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0</pre>
<p>The only remaining thing is to identify the maximal value, which marks the end of the poly-A tail, and the corresponding 0 value that indicates the start.  I wrote a recursive function called &#8220;findmax&#8221; for this, but a better programmer will probably be able to do this with a fold.</p>
<p>Including the parts discussed briefly above, the whole thing looks like this:</p>
<pre style="padding-left: 30px;">findPolyA :: Sequence Nuc -&gt; Maybe (Int,Int)
findPolyA (Seq _ d mq) =
  let qd = zip (B.unpack d) (maybe (repeat 15) BB.unpack mq)
      scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
      match x' = let x = fromIntegral x' in log (4*(1-1/10**(x/10)))
      mismatch x' = let x = fromIntegral x' in log 4 - log 3 - x/10*log 10
      cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0
      (zi,mi,maxscore) = findmax $ cumulative scores
  in if maxscore &gt; 12 then Just (zi+1,mi) else Nothing  -- arbitrary constant alert!

findmax :: [Double] -&gt; (Int,Int,Double)
findmax = go 0 (0,0,0) . zip [0..]
  where go _ cm [] = cm
        go _ cm ((i,0):rest) = go i cm rest
        go last_z (cmz,cmi,cmx) ((i,x):rest) = if x &gt; cmx then go last_z (last_z,i,x) rest
                                               else go last_z (cmz,cmi,cmx) rest</pre>
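<p>As for doing it with a fold: one possible sketch is a strict left fold that carries the index of the last zero along with the best (start, end, score) triple seen so far.  The name <tt>findmaxFold</tt> is made up for this example, and the function is meant to mirror the recursive version above rather than replace what is in the repository.</p>

```haskell
import Data.List (foldl')

-- Fold-based variant of findmax (hypothetical name, illustration only).
-- The accumulator is (index of last zero, (start, end, best score)).
findmaxFold :: [Double] -> (Int, Int, Double)
findmaxFold = snd . foldl' step (0, (0, 0, 0)) . zip [0 ..]
  where
    step (lastZ, best@(_, _, bestX)) (i, x)
      | x == 0    = (i, best)               -- a zero restarts the candidate tail
      | x > bestX = (lastZ, (lastZ, i, x))  -- new maximum: remember where it started
      | otherwise = (lastZ, best)
```

<p>Like the original, it keeps the first of several equal maxima, since only a strictly larger score replaces the current best.</p>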
<h3>Availability</h3>
<p>This method is implemented in a simple tool called &#8220;trimpolya&#8221; (<a title="Darcs repository for 'trimpolya'" href="http://malde.org/~ketil/biohaskell/trimpolya">darcs repo</a>), and also in the more general &#8220;dephd&#8221; (<a title="Darcs repository for 'dephd'" href="http://malde.org/~ketil/biohaskell/dephd">darcs</a>, <a title="Dephd at HackageDB" href="http://hackage.haskell.org/package/dephd">hackage</a>) sequence analysis package.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Installing the software on Ubuntu 9.10</title>
		<link>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</link>
		<comments>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 13:26:08 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Installation]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</guid>
		<description><![CDATA[Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread,  taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around, into the great unknown, but it [...]]]></description>
			<content:encoded><![CDATA[<p>Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread, taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around into the great unknown, but it all went even smoother than the previous one.  And on the plus side, ghc is now, <em>finally</em>, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well.  Great work by Joachim Breitner and his <a href="http://lists.debian.org/debian-haskell/" title="debian-haskell mailing list">army of debianizers</a>.  So I&#8217;m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed.  I&#8217;ll make notes here as I go along, and hopefully this will be useful also for users of other Linux distributions.</p>
<p><span id="more-54"></span></p>
<h3>Installing biolib</h3>
<p>First I need to install the bioinformatics library.  I&#8217;m about to release 0.4.1, but this is also a good opportunity to check that everything works with 0.4 (which is what you&#8217;ll find on Hackage), so let&#8217;s do that first.  Using darcs, I pull the repo up to the 0.4 tag (but you can of course get the tarball from <a href="http://hackage.haskell.org/package/bio">Hackage</a>):</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
Setup.hs: At least the following dependencies are missing:<br />
QuickCheck &lt;2, binary -any<br />
</code></p>
<p>(Side note: you may notice that I run <tt>Setup.hs</tt> directly, as opposed to using <tt>runhaskell</tt>.  I prefer it this way, but you may have to do a <tt>chmod +x Setup.hs</tt> if you downloaded this from the darcs repository or similar.)</p>
<p>Since we want to use the system libraries as far as possible, these libraries are just an apt-get away:</p>
<p><code>sudo apt-get install libghc6-quickcheck1\*<br />
sudo apt-get install libghc6-binary-\*<br />
</code></p>
<p>Now, let&#8217;s try again:</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
% ./Setup.hs build<br />
Preprocessing library bio-0.4...<br />
Building bio-0.4...<br />
Binary: Int64 truncated to fit in 32 bit Int<br />
ghc: panic! (the 'impossible' happened)<br />
(GHC version 6.10.4 for i386-unknown-linux):<br />
Prelude.chr: bad argument</code></p>
<p><code>Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug</code></p>
<p>Okay, this was not what was supposed to happen.  As always, dropping to #haskell on IRC is the first thing to do, and sure enough:</p>
<p><code>&lt;sereven&gt; ketil: that's also shown up for xmonad users when .hi and .o files weren't cleaned<br />
between rebuilds mixing different versions. usually between ghc updates IIRC.<br />
</code></p>
<p>Let&#8217;s try to get rid of old cruft lying about, polluting directories:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>Sure enough, this time it worked.  For good measure, we&#8217;ll run the unit tests:</p>
<p><code>make test</code></p>
<p>After a zillion tests, we notice that everything is go, great!</p>
<h3>Applications</h3>
<p>Next, it is time to go through the list of bioinformatics applications.  Since my working directory is a mess of branches and versions, we&#8217;ll just go over the published applications and versions on Hackage.</p>
<p><strong>xsact</strong> is an application to do sequence (in particular EST sequence) clustering.  It predates and thus doesn&#8217;t actually use the bioinformatics library, but we&#8217;ll check it anyway.  So we try the familiar command line:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>And things compile.  However, the version on Hackage is outdated, so we&#8217;ll upload a new version, 1.7.1.  One test case still fails, but I can&#8217;t imagine anybody is using it to generate Newick-formatted trees &#8212; I am certainly not &#8212; and since there are many equally correct outputs (including tree rotations and rounding modes), output is likely correct anyway.  Holler if you need it!</p>
<p><strong>rbr</strong> is an application to mask repeats in sequence data.  Normally, this is done using a library of known repeats, but this application tries to do it using statistics, making the &#8212; I think justifiable &#8212; assumption that repeats are going to be more common than non-repeats. The version on Hackage is old, and only works with the library prior to 0.4, so again this is a good time to push the latest changes out in the limelight.  Compiling this works great, by the way.</p>
<p><strong>cluster_tools</strong> is a package that contains a bunch of binaries, useful for working with the results of sequence clustering, including extracting various information from ACE files.  This uses another library, called <em>simpleargs</em>, that simplifies command line argument parsing for simple cases.  Again, the Hackage version is for bio&lt;0.4, so a new version will be pushed.  At the same time, we make a mental note to push version 0.2 of simpleargs to Hackage as well, instead of keeping age-old modifications buried forever.</p>
<p><strong>dephd</strong> is my Swiss-army-knife of sequence analysis, and lets you do various things like converting between formats, plotting and trimming by quality.  This is a more live project than most of the others (I&#8217;m currently working on improved quality trimming and automatic generation of files for submission to GenBank), but the currently available version also compiles without incident.</p>
<p><strong>estreps</strong> is a couple of programs I needed for repeat analysis, perhaps not tremendously interesting, but at least <tt>rselect</tt>, which lets you select randomized subsets from Fasta files, might be of interest to some?  We try the usual invocation to compile, and get:</p>
<p><code>src/Unigene.hs:24:23:<br />
Couldn't match expected type `a' against inferred type `Unknown'<br />
`a' is a rigid type variable bound by<br />
the type signature for `clusters' at src/Unigene.hs:22:41<br />
</code></p>
<p>This error arises from the phantom types for identifying sequences that were introduced in bio 0.4.  Unfortunately, this version of <em>estreps</em> contains some adaptation to this model, so it won&#8217;t compile against older versions either.  So it looks like yet another sdist for Hackage.  Look for version 0.3.1.</p>
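<p>For readers unfamiliar with phantom types, here is a simplified sketch of the idea.  The definitions are illustrative only: the names <tt>readUnknown</tt>, <tt>castToNuc</tt>, <tt>revCompl</tt> and <tt>seqData</tt> and the string-based representation are assumptions made for the example, not the library&#8217;s actual API.</p>

```haskell
-- Simplified, hypothetical sketch; not the bio library's real definitions.
data Nuc      -- tag for nucleotide sequences (has no values)
data Amino    -- tag for protein sequences
data Unknown  -- tag for sequences of undetermined kind

-- The type parameter 't' never appears on the right-hand side: a phantom type.
newtype Sequence t = Seq String

-- A reader that cannot know what it is reading has to say so in its type.
-- Giving it the signature  String -> Sequence a  instead is rejected by GHC,
-- because the body can only promise 'Unknown', not every possible 'a'.
readUnknown :: String -> Sequence Unknown
readUnknown = Seq

-- Explicit, unchecked re-tagging for callers who know better.
castToNuc :: Sequence Unknown -> Sequence Nuc
castToNuc (Seq s) = Seq s

-- Only nucleotide sequences can be reverse-complemented.
revCompl :: Sequence Nuc -> Sequence Nuc
revCompl (Seq s) = Seq (reverse (map compl s))
  where
    compl 'A' = 'T'
    compl 'T' = 'A'
    compl 'C' = 'G'
    compl 'G' = 'C'
    compl c   = c

seqData :: Sequence t -> String
seqData (Seq s) = s
```

<p>The tag costs nothing at run time, but it lets the type checker catch attempts to, say, reverse-complement a protein, and it produces exactly the kind of &#8220;rigid type variable&#8221; complaint shown above when old signatures are too general.</p>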
<p><strong>flower</strong> is a utility for extracting information from SFF files (containing sequences from Roche&#8217;s 454 machines).  Although a new version is around the corner, the old 0.2 just works.</p>
<p><strong>xml2x</strong> is a utility for converting BLAST results in XML format into CSV, which somehow is more compatible with biologists.  Trying to compile it fails, with the following error:<br />
<code><br />
src/Xml2X.hs:152:49:<br />
Couldn't match expected type `[b]'<br />
against inferred type `Maybe Bio.Sequence.KEGG.KO'<br />
In the first argument of `concatMap', namely `(flip M.lookup ks)'<br />
In the first argument of `($)', namely<br />
`concatMap (flip M.lookup ks)'<br />
In the second argument of `($)', namely<br />
`concatMap (flip M.lookup ks) $ map chop $ map subject fs'<br />
</code></p>
<p>It turns out that somewhere along the way, the <tt>lookup</tt> function from <tt>Data.Map</tt> was <a href="http://hackage.haskell.org/trac/ghc/ticket/2309" title="GHC ticket">de-generalized</a> from working on arbitrary monads to just returning a <tt>Maybe</tt>.  I was using this to return a list, using the empty list to signal an unsuccessful lookup.  This is easily remedied, but that means yet another sdist for Hackage.</p>
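<p>One way to get the old list-returning behaviour back is to convert each <tt>Maybe</tt> into a list explicitly.  The sketch below uses simplified types and an invented function name (<tt>lookupAll</tt>); it shows the general pattern, not the actual change made to xml2x.</p>

```haskell
import qualified Data.Map as M
import Data.Maybe (maybeToList)

-- Collect the values for every key that is present, silently skipping misses.
-- Hypothetical stand-in for the failing expression, with simplified types.
lookupAll :: Ord k => M.Map k v -> [k] -> [v]
lookupAll m = concatMap (maybeToList . flip M.lookup m)
```

<p>An equivalent formulation is <tt>mapMaybe (flip M.lookup m)</tt>, using <tt>mapMaybe</tt> from <tt>Data.Maybe</tt>.</p>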
<p><strong>korfu</strong> is a utility for identifying open reading frames in sequence data.  It hasn&#8217;t yet been ported to version 0.4 of the library, but works if you install 0.3.5.  I updated this too, since it didn&#8217;t have a category.  Now it too resides in the bioinformatics section.</p>
<h3>Summing up</h3>
<p>In retrospect, it seems like giving old code a thorough spring cleaning once in a while is a good idea.  Although nothing really critical or difficult happened, a good number of small annoyances were discovered, and a bunch of new sdists are now ready to be uploaded to Hackage.  Next will be converting all this into debian packages.</p>
<p>The important question is of course, how do we avoid this in the future?  During development, it is important to be able to modify libraries and applications, but installing a new version of the biolib, say, overwrites the old one, and suddenly I&#8217;m compiling and testing everything against a different library than Joe Random Hackage User is going to find.  I have some thoughts on how to avoid this, but if you have a method that works nicely, I&#8217;m all ears.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A (too) brief Biohaskell presentation</title>
		<link>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</link>
		<comments>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 12:27:00 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</guid>
		<description><![CDATA[I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that I managed to communicate some of the ideas, and submit the slides and other material here for posterity.  (I&#8217;m happy to receive comments, too, just in case I&#8217;ll do a revised version of the talk later on).</p>
<ul>
<li> <a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.pdf" title="biohaskell slides">My biohaskell slides</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.tex" title="slides’ LaTeX source, using beamer.cls">slides’ LaTeX source, using beamer.cls</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/lpssm.ps" title="Lazy PSSM  paper">Lazy PSSM JFP-paper</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing ints</title>
		<link>http://blog.malde.org/index.php/2009/08/31/parsing-ints/</link>
		<comments>http://blog.malde.org/index.php/2009/08/31/parsing-ints/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 13:06:35 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/08/31/parsing-ints/</guid>
		<description><![CDATA[A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/f/ff/Cempedak_opened1.JPG" alt="Artocarpus integer" align="right" width="376" height="250" />A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the <em>quality</em> data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with <strong>readFasta</strong> from <strong>Bio.Sequence.Fasta</strong>) processes at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using <strong>readFastaQual</strong>) was much slower, about 2-3 MB/s.  After some investigation and a few rewrites, it&#8217;s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on them even further.</p>
<p><span id="more-48"></span></p>
<h3>The original version</h3>
<p>I&#8217;ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn&#8217;t necessary.  Basically, the task is to take a list of ByteString input lines consisting of whitespace separated decimal integers, and build a ByteString consisting of single byte quality values corresponding to those integers.</p>
<p>Below is how the naive quality parsing function might look, unpacking each word to <strong>String</strong> in order to use <strong>read</strong>.  Note that <strong>B</strong> is lazy ByteString.Char8, <strong>BB</strong> is lazy ByteString, that is, the <strong>Word8</strong> version.</p>
<pre><tt>BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
</tt></pre>
<p>We don&#8217;t expect this to do terribly well, and I guess it&#8217;s no surprise when this parses my test file of about 10MB in 24 seconds.</p>
<h3>Improved versions</h3>
<p>The key to improved performance is first and foremost to avoid the unpacking and parsing of strings.  Thankfully, the ByteString library provides a <strong>readInt</strong> function.</p>
<pre><tt>
BB.pack [lookup x | x &lt;- concatMap B.words ls]
    where
    lookup x = case B.readInt x of Just (v,_) -&gt; fromIntegral v
                                   Nothing -&gt; error "Unparsable qual value"
</tt></pre>
<p>This isn&#8217;t a lot more complicated than our initial attempt, but the speed increase is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement.  The ByteString implementation will share the storage of the separate words with the original strings, but since <strong>readInt</strong> gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:</p>
<pre><tt>
BB.pack $ readInts $ B.unlines ls
    where readInts xs = case B.readInt xs of
                          Just (i,rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
                          Nothing -&gt; []
</tt></pre>
<p>This turns out to be a bit faster, time is now 1.6 seconds.  Another 20% shaved off.</p>
<h3>Final version</h3>
<p>We&#8217;re really only interested in <strong>Word8</strong> values, since quality values always are small, and since that&#8217;s what gets encoded in the result anyway.  The previous versions take a detour by reading <strong>Int</strong>s and using <strong>fromIntegral</strong> to convert them to the desired size.  It bears noting that there is no error checking involved, <strong>fromIntegral</strong> will happily and silently truncate any number beyond its target range.  So let&#8217;s do things explicitly, using <strong>Word8</strong>s throughout the computation:</p>
<pre><tt>
BB.pack $ go 0 ls
    where
    isDigit x = x &lt;= 57 &amp;&amp; x &gt;= 48   -- ASCII '0'..'9'
    go i (s:ss) = case BB.uncons s of
                    Just (c,rs) -&gt; if isDigit c then go (c - 48 + 10*i) (rs:ss)
                                   else let rs' = BB.dropWhile (not . isDigit) rs
                                        in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
                    Nothing -&gt; i : go 0 ss
    go _ [] = []
</tt></pre>
<p>This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best <strong>readInt</strong> version, and about 25 times faster than the naive version.  Still, 10MB/s is well below the average hard disk.</p>
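<p>On the subject of silent truncation: if error checking matters more than raw speed, the <strong>readInt</strong>-based versions could validate the range before narrowing.  The helper below is a hypothetical sketch (the name <strong>toWord8</strong> is invented for the example), not something taken from <strong>Bio.Sequence.Fasta</strong>.</p>

```haskell
import Data.Word (Word8)

-- Range-checked narrowing from Int to Word8 (hypothetical helper).
-- Plain 'fromIntegral' would instead wrap 300 around to 44 (300 mod 256).
toWord8 :: Int -> Maybe Word8
toWord8 n
  | n >= 0 && n < 256 = Just (fromIntegral n)
  | otherwise         = Nothing
```

<p>A caller can then decide whether an out-of-range quality value is a fatal error or something to clamp, instead of getting a plausible-looking but wrong byte.</p>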
<p>So is there more room for improvement?  The most obvious wart to me is the rather artificial splitting into lines.  This is mostly an artifact of some early design decisions, and it should be possible to eliminate the splitting earlier on and save even more time by simplifying this function quite a bit.</p>
<p>If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.</p>
<p><strong>Edit:</strong> Since some people have asked, I&#8217;ve wrapped up a simple test program along with some test files at <a href="http://malde.org/~ketil/biohaskell/qualparsetest ">http://malde.org/~ketil/biohaskell/qualparsetest</a>.  This is a simplified version, if you want to be <em>really</em> helpful, you could always look at <strong>Bio.Sequence.Fasta</strong> in the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Bioinformatics library</a> and see if you can speed up e.g. <em>dephd -i input.fasta input.qual -F /dev/null</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/08/31/parsing-ints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A set of tools for working with 454 sequences</title>
		<link>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</link>
		<comments>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 21:32:10 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[sequence analysis]]></category>
		<category><![CDATA[sff]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</guid>
		<description><![CDATA[Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so is the type of data that results from it, and while it is possible [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Dhalia_flower.jpg/800px-Dhalia_flower.jpg" alt="Random flower courtesy of wikimedia" width="212" height="158" align="left" /><a title="Pyrosequencing from Wikipedia" href="http://en.wikipedia.org/wiki/Pyrosequencing">Pyrosequencing</a> is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional <a title="Old school sequencing, Wikipedia" href="http://en.wikipedia.org/wiki/DNA_sequencing">Sanger sequencing</a> as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so is the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.</p>
<p>The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Haskell bioinformatics library</a> has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche&#8217;s 454 sequencing.  Once the library functionality is in place, it is easy to develop small tools for doing the various chores.   After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort I put down in the library, I&#8217;ve instead written a few programs myself, including tools for information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frecover).  This is their story.<span id="more-45"></span></p>
<p>The bioinformatics library includes functions and data structures to read and write and extract data from SFF files.  It provides reasonable performance (meaning that in most cases, my disk is the limiting factor).  It is also a clean room implementation, built from <a title="454 manual (see page 528++)" href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf">official documentation</a>, but not based on Roche&#8217;s (or anybody else&#8217;s, for that matter) code.  It comes with an LGPL license, which I hope will make it useful while encouraging back contributions.</p>
<h3>Flower</h3>
<p><a title="Previous article on flower" href="http://blog.malde.org/index.php/flower/">This program</a> can extract various information from SFF files.  To some extent, it overlaps with &#8216;sffinfo&#8217; and &#8216;sfffile&#8217; from 454, but in addition to generating Fasta sequences (optionally with quality), it can also extract directly to the more compact FastQ format.  It can also output a (huge) table of flow values, the histogram of flow values (useful for estimating the flow distributions, and thus probabilities for the base calls), or a one-line summary of each read that includes some statistics on lengths and quality.  Here&#8217;s the usage info:</p>
<blockquote>
<pre>flower: Usage: flower -[f|q|r|R] &lt;file.sff&gt; [&lt;file2.sff&gt; ..]
  -r  output reads in Fasta format
  -R  output reads in Fasta format with associated .qual
      (generates files instead of writing to &lt;stdout&gt;)
  -q  output in FastQ format
  -f  output the flowgram in tabular format
  -h  output a histogram table of flow values
  -s  output a summary of each read</pre>
</blockquote>
<h3>FSelect</h3>
<p>FSelect takes an SFF file and produces a new SFF file containing a subset of the sequences, using the same statistics as Flower can output.  It has a small expression language built in, so that you can build more complex logical queries.  For instance, if you want to extract the sequences with length between 200 and 400, and with a K² score less than 0.7, you could do</p>
<blockquote>
<pre>fselect "And (Func LT k2 0.7) (And (Func GT len 200) (Func LT len 400))" FL61AHU01.sff</pre>
</blockquote>
<p>Okay, it&#8217;s a bit clunky, but the syntax should be reasonably straightforward: Logical operators are <tt>And</tt>, <tt>Not</tt>, and <tt>Or</tt>, while <tt>Func op f v</tt> defines a function using <tt>op</tt> (either <tt>LT</tt> or <tt>GT</tt>) to compare <tt>f</tt> (one of <tt>k2, len, tlen, ncount</tt>) with the value <tt>v</tt> for each read.  Output is generated in a file named &#8220;selected.sff&#8221;.<br />
FSelect can also select random reads, using the select expression <tt>"Rand p"</tt>, where p is the probability for selection.  (I.e. <tt>fselect "Rand 0.2" FL61AHU01.sff</tt> will select each read with a probability of 0.20, giving you approximately 20% of the reads.)  Random selection cannot be combined with other criteria at this point; if you want this, you&#8217;ll have to run <tt>fselect</tt> multiple times.</p>
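<p>To make the structure of these expressions concrete, here is a minimal Haskell sketch of how such a query could be represented and evaluated.  It is an illustration only and not the code in <tt>fselect</tt>: the <tt>ReadStats</tt> record, the <tt>Less</tt>/<tt>Greater</tt> constructors (standing in for <tt>LT</tt>/<tt>GT</tt>, which would clash with the Prelude) and the sample values are all made up for the example, and no parser is included.</p>
<pre>-- Hypothetical per-read statistics, named after the fields listed above.
data ReadStats = ReadStats
  { k2 :: Double, len :: Double, tlen :: Double, ncount :: Double }

-- Comparison operators (LT and GT in the fselect syntax).
data Op = Less | Greater

-- A selection expression: logical connectives over comparisons.
data Expr
  = And Expr Expr
  | Or Expr Expr
  | Not Expr
  | Func Op (ReadStats -&gt; Double) Double

-- Decide whether a read satisfies the expression.
eval :: Expr -&gt; ReadStats -&gt; Bool
eval (And a b) r = eval a r &amp;&amp; eval b r
eval (Or a b) r = eval a r || eval b r
eval (Not a) r = not (eval a r)
eval (Func Less f v) r = f r &lt; v
eval (Func Greater f v) r = f r &gt; v

main :: IO ()
main = do
  let query = And (Func Less k2 0.7)
                  (And (Func Greater len 200) (Func Less len 400))
      sample = [ ReadStats 0.5 250 240 0
               , ReadStats 0.9 250 240 0
               , ReadStats 0.5 150 140 0 ]
  print (map (eval query) sample)  -- prints [True,False,False]</pre>
<p>Keeping the query as plain data means the evaluator is a small recursive function, at the price of the prefix syntax seen above.</p>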
<h3>FRecover</h3>
<p>We had some issues with broken SFF files; specifically, there were blocks of zero bytes at random places. Both <tt>flower</tt> and <tt>sffinfo</tt> just terminate on encountering a broken read, so I implemented a simple utility that attempts to skip the broken block and continue extracting good reads beyond the trouble.</p>
<h3>PyroSim</h3>
<p>This attempts to simulate pyrosequencing, for now it only does the GS20  generation of 454 sequencing, but the other generations should be easy to add.  The main problem is that the algorithm for quality calling is insufficiently documented.  GS20 has the advantage that quality for a homopolymer is uniquely derived from the flow value (modulo rounding), so reverse engineering it is fairly straightforward.</p>
<p>Anyway, <tt>pyrosim</tt> takes a &#8216;generation&#8217;  (at the moment, this is only GS20) and a Fasta file as input parameters, picks random points in the Fasta file, and produces the corresponding flowgram, including a suitable perturbation of the values to introduce the expected measure of noise, calls the bases and quality, and outputs an SFF file.</p>
<h3>Availability</h3>
<p>Flower, FSelect and FRecover are part of the &#8220;flower&#8221; package, available from the <a title="Darcs revision control system" href="http://darcs.net/">darcs</a> archive at <a title="flower darcs repo" href="http://malde.org/~ketil/biohaskell/flower">http://malde.org/~ketil/biohaskell/flower</a></p>
<p>PyroSim is available separately (for now, at least) from the darcs archive at  <a title="pyrosim darcs repo" href="http://malde.org/~ketil/biohaskell/pyrosim">http://malde.org/~ketil/biohaskell/pyrosim</a></p>
<p>I try to upload what I consider stable versions to HackageDB, please check the <a title="Bioinformatics at Hackage" href="http://hackage.haskell.org/packages/archive/pkg-list.html#cat:bioinformatics">Bioinformatics</a> category there.  Currently, these programs are in a bit of flux, so going with the darcs repo is probably a good idea at this point.</p>
<p>I&#8217;d really like to have packages for the most common Linux distributions as well (i.e. .debs and .rpms), but I don&#8217;t know the details of how to produce them, and while I&#8217;ve made half-hearted attempts in the past, I guess I just don&#8217;t really desire it enough.  I&#8217;d be happy to see somebody package it up, so if you know how, please go ahead.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dephd updates</title>
		<link>http://blog.malde.org/index.php/2009/06/16/dephd-updates/</link>
		<comments>http://blog.malde.org/index.php/2009/06/16/dephd-updates/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:14:02 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/06/16/dephd-updates/</guid>
		<description><![CDATA[Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB, this is just a quick note describing new features.
Filtering out empty sequences.
Phred often produces [...]]]></description>
			<content:encoded><![CDATA[<p>Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the <a href="http://en.wikipedia.org/wiki/Phred_base_calling" title="Phred base calling on Wikipedia">basecaller <tt>phred</tt></a>, but it has since grown a bit beyond that.  A new update was just pushed onto <a href="http://hackage.haskell.org/package/dephd" title="dephd on Hackage">HackageDB</a>, this is just a quick note describing new features.<span id="more-44"></span></p>
<h3>Filtering out empty sequences.</h3>
<p><tt>Phred</tt> often produces zero-length sequences, and this confuses other programs.  While <tt>BLAST</tt> will just output a warning, <tt>SeqClean</tt> &#8212; or to be precise, <tt>cln2qual</tt> &#8212; will break down.   (My own code using the Bioinformatics library treats all sequences the same regardless of length, so zero-length sequences are perfectly okay). Anyway, you can now use <tt>dephd -z</tt> to eliminate them from the output.</p>
<h3>Sequence Clipping</h3>
<p>Sequence trimming or clipping is often necessary to remove contamination like vector sequence, or simply low quality sequence parts.  Typically, both of these occur at the ends of the sequences.  Many programs (including dephd, but also phred, lucy, seqclean and others) add trimming information to the sequence header.  Dephd is now able to act on this information and clip the sequences.  After clipping, the trimming information is obsolete as the coordinates have changed, so it is replaced with the clipping coordinates.   Dephd also provides its own quality assessment, and with the -q option, sequence ends where the sliding window average quality is below 15 will be clipped.  This is pretty heavy-handed, but it seems I get better EST clustering with this enabled.</p>
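<p>A rough Haskell sketch of this kind of quality clipping follows.  It rests on assumptions that are not taken from dephd itself: the window width of 3, the sample quality values, and the rule of keeping everything from the first to the last window whose average reaches the threshold are all chosen for illustration, and the actual implementation may well differ.</p>
<pre>import Data.List (tails)

-- Average quality of every full window of width w (w must be positive).
windowAverages :: Int -&gt; [Int] -&gt; [Double]
windowAverages w qs =
  [ fromIntegral (sum win) / fromIntegral w
  | win &lt;- map (take w) (tails qs), length win == w ]

-- Half-open range (start, end) of positions to keep;
-- (0, 0) if no window reaches the threshold.
clipRange :: Int -&gt; Double -&gt; [Int] -&gt; (Int, Int)
clipRange w threshold qs =
  case [ i | (i, avg) &lt;- zip [0 ..] (windowAverages w qs), avg &gt;= threshold ] of
    []             -&gt; (0, 0)
    good@(g : _)   -&gt; (g, last good + w)

main :: IO ()
main = do
  let quals = [4, 6, 8, 20, 30, 35, 30, 25, 9, 5, 4]
      (start, end) = clipRange 3 15 quals
  print (start, end)                             -- prints (2,9)
  print (take (end - start) (drop start quals))  -- the retained qualities</pre>
<p>Note that a window-based rule like this one can keep a low-quality base or two at each end, since only the window average is tested.</p>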
<h3>Old features</h3>
<p>Of course we retain the old features: reading PHD and Fasta/Qual files, masking (to lower case/N but without clipping) by quality, generating quality plots, outputting Fasta/Qual, and ranking sequences by quality.</p>
<p><em>Edit:</em> In the latest release, there&#8217;s now also a fix for a problem with drawing quality graphs with gnuplot.  It turns out that my shiny new Ubuntu ships with gnuplot 4.2, but your crappy old distribution ships with an older version, and that there are some incompatibilities in the input formats.  I&#8217;ve now reverted this to use old-style format only, so hopefully it should work with gnuplots back to 3.7 or so.   And for those SLS or MCC interim die-hards out there, I&#8217;ll even add an option to dump the gnuplot file itself, so that you can copy it to a floppy and generate the plots on a computer with a modern color display. How&#8217;s that for user friendly?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/06/16/dephd-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using a phantom type to label different kinds of sequences</title>
		<link>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/</link>
		<comments>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/#comments</comments>
		<pubDate>Thu, 14 May 2009 11:00:14 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[phantom types]]></category>
		<category><![CDATA[sequences]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/</guid>
		<description><![CDATA[Until now (version 0.3.5 of the bioinformatics library), the Sequence data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure.  This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you [...]]]></description>
			<content:encoded><![CDATA[<p>Until now (version 0.3.5 of the bioinformatics library), the Sequence data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure.  This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you from comparing a nucleotide sequence to a protein letter by letter.  We&#8217;d like some more safety without sacrificing flexibility, and by using phantom types we can get this.  Below is my attempt at implementing this.</p>
<p><span id="more-42"></span> The safety that we seek can be exemplified by sequence alignment or similarity scores, as calculated by e.g. the BLAST suite of programs.  Here we have e.g. blastn for comparing nucleotide sequences, blastp for comparing amino acid sequences, and blastx for comparing nucleotides to proteins by first translating the nucleotides to the possible corresponding amino acid sequences. We&#8217;ll aim for an align function that does the right thing for the various combinations of sequence types.</p>
<h3>The old approach</h3>
<p>Previously, there was only one Sequence data type, defined as:</p>
<pre>
data Sequence = Seq SeqLabel SeqData (Maybe QualData)</pre>
<p>(where <tt>SeqLabel</tt>, <tt>SeqData</tt> and <tt>QualData</tt> are represented as various <tt>ByteStrings,</tt> but you can think of them as synonyms for <tt>String</tt>)</p>
<p>As mentioned, this isn&#8217;t really leveraging the type system much, and at the very least, it&#8217;d be a good thing to be able to differentiate between nucleotide and  amino acid sequence (a.k.a. peptides or proteins).</p>
<h3>The algebraic data type approach</h3>
<p>The default functional programming approach would be to use an algebraic (sum) type, what the imperative programmers would call a union.</p>
<pre>
data Sequence = Nucleotide ... | Protein ...</pre>
<p>This is a good solution when the two branches have different structure, but here you&#8217;d essentially just repeat the structure.  Moreover, all functions will need to do run-time checks for each case, imposing a cost in function complexity, running time, and type safety.  Note that performing the alignment usually requires a score matrix describing the penalty for replacing any given character with any other.</p>
<pre>
align :: Matrix -&gt; Sequence -&gt; Sequence -&gt; Alignment

align mx (Nucleotide ...) (Nucleotide ...) =
align mx (Nucleotide ...) (Protein ...) =
align mx (Protein ...) (Nucleotide ...) =
align mx (Protein ...) (Protein ...) =</pre>
<p>This is not even complete, since a similar restriction applies to the score matrices &#8211; sometimes a matrix will contain replacement penalties for the nucleotide alphabet, and sometimes it will contain penalties for the amino alphabet, and the appropriate matrix must be used in each branch of the align function.</p>
<p>Another problem with this is that this requires you to over-specify the type.  Sometimes you don&#8217;t know or care what kind of sequence you have.  Say you are selecting sequences by name from a Fasta file.  Since the file format is the same you don&#8217;t really care what kind of sequence it is, and forcing it to be one or the other is&#8230;.immoral.</p>
<h3>The third way: phantom types</h3>
<p>The chosen approach is instead to tag the Sequence type with a phantom type parameter, phantom meaning it does not affect the actual representation of the data.  It looks like this:</p>
<pre>data Sequence t = Seq ....  -- but no data member of type 't'!</pre>
<p>Now, we can write our alignment functions, and make them safer to use:</p>
<pre>align :: Matrix t -&gt; Sequence t -&gt; Sequence t -&gt; Alignment
alignX :: Matrix Amino -&gt; Sequence Nuc -&gt; Sequence Amino -&gt; Alignment</pre>
<p>Note that we also phantom-typed the Matrix type.  With this approach,  incorrect usage like comparing sequences of different type with the generic align will be flagged by the compiler, so run time checks are no longer necessary.</p>
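<p>As a self-contained illustration of the idea, the following sketch uses simplified stand-ins rather than the library&#8217;s definitions: <tt>Sequence</tt> wraps a plain <tt>String</tt>, all tags are empty types, <tt>Matrix</tt> is just a scoring function, and <tt>score</tt> is a toy position-by-position comparison in place of a real alignment.</p>
<pre>-- Tag types: no constructors, they exist only at the type level.
data Nuc
data Amino

-- The parameter t is a phantom: it does not occur on the right-hand side.
newtype Sequence t = Seq String deriving Show
newtype Matrix t = Matrix (Char -&gt; Char -&gt; Int)

-- Toy ungapped comparison: sum the matrix scores position by position.
-- Matrix and both sequences must carry the same tag.
score :: Matrix t -&gt; Sequence t -&gt; Sequence t -&gt; Int
score (Matrix m) (Seq a) (Seq b) = sum (zipWith m a b)

-- A matrix that works for any alphabet: +1 for a match, -1 otherwise.
identity :: Matrix t
identity = Matrix (\x y -&gt; if x == y then 1 else -1)

main :: IO ()
main = do
  let n1 = Seq "ACGT" :: Sequence Nuc
      n2 = Seq "ACGA" :: Sequence Nuc
      p  = Seq "MKVL" :: Sequence Amino
  print (score identity n1 n2)  -- prints 2 (three matches, one mismatch)
  print p
  -- print (score identity n1 p)  -- rejected by the type checker</pre>
<p>Uncommenting the last line gives a compile-time type error, which is exactly the run-time check we no longer have to write.</p>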
<p>On the other hand, readFasta can now be given the type:</p>
<pre>readFasta :: FilePath -&gt; IO (Sequence a)</pre>
<p>or:</p>
<pre>readFasta :: FilePath -&gt; IO (Sequence Unknown)</pre>
<p>depending on how much you trust the programmer.  Since I&#8217;m writing most of the programs using this library, I know how much you can trust application programmers, so the second and safer method is chosen.</p>
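<p>With the <tt>Unknown</tt> tag, a program that wants to treat the data as nucleotides has to say so explicitly.  One way this could look is sketched below; the <tt>asNuc</tt> helper, the accepted alphabet and the simplified <tt>Sequence</tt> type are hypothetical and only meant to show the shape of such a conversion, not the library&#8217;s interface.</p>
<pre>import Data.Char (toUpper)

data Nuc
data Unknown

-- Simplified stand-in for the library type.
newtype Sequence t = Seq String deriving Show

-- Retag an untyped sequence as nucleotide, but only if every
-- character belongs to a (deliberately small) nucleotide alphabet.
asNuc :: Sequence Unknown -&gt; Maybe (Sequence Nuc)
asNuc (Seq s)
  | all ((`elem` "ACGTN") . toUpper) s = Just (Seq s)
  | otherwise                          = Nothing

main :: IO ()
main = do
  print (asNuc (Seq "acgtn"))  -- prints Just (Seq "acgtn")
  print (asNuc (Seq "MKVL"))   -- prints Nothing</pre>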
<h3> In practice</h3>
<p>The old biolib repository has now been replaced by two new ones: biolib-stable, currently containing version 0.3.5, and biolib-unstable at 0.4.0. Perhaps unsurprisingly, the latter version contains the phantomly typed Sequence definition.</p>
<p>Currently, three types are used for tags:  Amino, which is the type for the amino acid alphabet, and Nuc and Unknown, which are dummy types without any data constructors, and used solely for this purpose.</p>
<p>So what are the experiences so far?  Well, it does complicate things.   For some file types the sequence type is known, for instance the output of nucleotide sequencing machines in the form of ABI, SCF, or SFF files.  But often, file formats are agnostic with respect to sequence types, the most ubiquitous offender being the Fasta format. Currently, this is handled by the Unknown type tag, but I&#8217;m not entirely convinced this is the optimal solution.</p>
<p>Code needs to be updated, but mostly this is relatively easy.  The type system will spot the difficulties, and often you can get by by just replacing Sequence in the type signatures with Sequence t.  Just make sure that t isn&#8217;t already used in the signature &#8211; I stumbled into this one.</p>
<p>Of course, this is one step towards increased leveraging of the type system.  One could go further, and  tag different types of nucleotide sequences based on sequencing technology: Sanger sequencing has quite different error characteristics from 454 sequencing, and Solexa has a completely different interpretation of the quality values than either of those.  Also, one might want to include the presence or absence of quality data in the type as well.  There are also different amino acid alphabets &#8211; should they have different types?  This is a difficult design space &#8211; how much information should be encoded in the type?</p>
<p>While I think this is an improvement, I&#8217;m very curious how this works out in practice, or if there are other options I should consider.  Please comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Current developments&#8230;</title>
		<link>http://blog.malde.org/index.php/2009/03/13/current-developments/</link>
		<comments>http://blog.malde.org/index.php/2009/03/13/current-developments/#comments</comments>
		<pubDate>Fri, 13 Mar 2009 15:59:45 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/03/13/current-developments/</guid>
		<description><![CDATA[During my vacation, I experimented with phantom types for the Sequence data type.  Basically, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence.  A more detailed writeup is [...]]]></description>
			<content:encoded><![CDATA[<p>During my vacation, I experimented with phantom types for the Sequence data type.  Basically, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence.  A more detailed writeup is in the works, but currently, I&#8217;ve pushed the darcs repo to <a href="http://malde.org/~ketil/biohaskell/biolib-phantom/">http://malde.org/~ketil/biohaskell/biolib-phantom/</a> so if everything works out, this will be the next release (0.4).  (Note to self: we now have a stable and a development branch.  Almost like a serious and all grown up software project. Professionality &#8211; Yay!)</p>
<p>Also, since short reads are all the rage, and my flower program appears to be used a bit, I&#8217;ve done a quick writeup of its features and use as a <a href="http://blog.malde.org/index.php/flower/" title="Flower page">static page</a>.  I&#8217;ll try to keep it updated as things progress.  Popularity &#8211; Yay!</p>
<p>Finally, I got some help compiling everything on some less mainstream operating systems (&#8220;Windows&#8221;, I think it is called).  Mostly, things appear to work, and some improvements &#8211; albeit portability-neutral ones &#8211; were made.  Portability  &#8211; Yay!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/03/13/current-developments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from PADL</title>
		<link>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/</link>
		<comments>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 09:10:06 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/02/11/notes-from-padl/</guid>
		<description><![CDATA[Time flees.  It&#8217;s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks on topics I mostly managed to follow and meet interesting and interested people.  Thanks to the organizers and committees making it all possible.  I presented a paper on Bloom filters that Bryan O&#8217;Sullivan and I wrote, [...]]]></description>
			<content:encoded><![CDATA[<p>Time flees.  It&#8217;s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks on topics I mostly managed to follow and meet interesting and interested people.  Thanks to the organizers and committees making it all possible.  I presented a paper on Bloom filters that Bryan O&#8217;Sullivan and I wrote, and thought I&#8217;d make the paper available along with the slides (expanded somewhat), and a couple of ideas for extending Bloom filters that I think are original (or &#8220;novel&#8221;, as they say in science).</p>
<p><span id="more-31"></span>First things first, here are the files:</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/padl.pdf" title="Bloom filters for bioinformatics">Bloom filters for bioinformatics</a> &#8211; paper</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloomfilter.pdf" title="Bloom filters for bioinformatics">Bloom filters for bioinformatics</a> &#8211; slides</p>
<p>The new addition to the slides consists of the counting Bloom filters, and a brief overview on how to locate matches in a Bloom filter.  The standard counting Bloom filter consists of replacing the array of bits with an array of bit buckets (of size <em>b</em>, say).  This lets the filter count up to <em>2^b</em> &#8722; 1 occurrences of each element, typically using saturating counts.  If you want to retain the false positive rate, this means you&#8217;ll expand the size of the Bloom filter by a factor of <em>b</em>, which can be quite significant. In the example below, <em>b</em> is set to 3, and the filter saturates at a count of 7.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting.png" title="counting bloom filter"><img src="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting.png" alt="counting bloom filter" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
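<p>The standard counting filter can be sketched in a few lines of Haskell.  This is a toy under stated assumptions and not the code from the paper: the filter size, the number of hashes, the double-hashing scheme in <tt>hashes</tt> and the sparse <tt>IntMap</tt> representation (from the <tt>containers</tt> package) are all picked for the example.</p>
<pre>import qualified Data.IntMap.Strict as M
import Data.Char (ord)
import Data.List (foldl')

-- Toy parameters, chosen for the example.
size, k, b, maxCount :: Int
size = 1024           -- number of buckets
k = 3                 -- hash functions per element
b = 3                 -- bits per bucket
maxCount = 2 ^ b - 1  -- buckets saturate here (7 for b = 3)

-- An infinite sequence of toy hash functions (double hashing).
hashes :: String -&gt; [Int]
hashes s = [ (h1 + i * h2) `mod` size | i &lt;- [0 ..] ]
  where
    h1 = foldl' (\acc c -&gt; (acc * 31 + ord c) `mod` 1000003) 7 s
    h2 = 1 + 2 * foldl' (\acc c -&gt; (acc * 37 + ord c) `mod` 1000003) 11 s

-- Buckets are stored sparsely; a missing key means a count of zero.
type Counting = M.IntMap Int

-- Increment the k buckets of an element, saturating at maxCount.
insert :: String -&gt; Counting -&gt; Counting
insert s f = foldr bump f (take k (hashes s))
  where bump i = M.insertWith (\_ old -&gt; min maxCount (old + 1)) i 1

-- The count estimate is the smallest of the k buckets.
count :: Counting -&gt; String -&gt; Int
count f s = minimum [ M.findWithDefault 0 i f | i &lt;- take k (hashes s) ]

main :: IO ()
main = do
  let f = foldr insert M.empty (replicate 9 "x")
  print (count f "x", count f "y")  -- prints (7,0)</pre>
<p>Nine insertions of the same value are reported as seven, which is the saturation described above.</p>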
<p>Therefore, I propose a different kind of counting Bloom filter, using a distributed count.  Keep in mind that we have an infinite sequence of hash functions.  When inserting a new value, we can check the <em>k</em> hash functions to see if it&#8217;s already there.  If so, we calculate some <em>additional</em> hash functions to represent the count of two for this value.  And so on, until we find a count (and corresponding set of hashes) that has at least one zero bit, set those bits to one, and move on to the next value.  It may make sense here to decrease the number of hash values as the count goes up.  In the example below, we see that x is already inserted, and two new hash values are calculated to increase the count.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting2.png" title="distributed counting bloom filter"><img src="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting2.png" alt="distributed counting bloom filter" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
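<p>A minimal Haskell sketch of the distributed count, again as a toy under assumptions of my choosing: the filter is modelled as an <tt>IntSet</tt> of the positions whose bit is one, every count level uses the same number <em>k</em> of hashes (rather than fewer at higher counts), and <tt>hashes</tt> is a simple double-hashing scheme standing in for a real family of hash functions.</p>
<pre>import qualified Data.IntSet as S
import Data.Char (ord)
import Data.List (foldl')

size, k :: Int
size = 1024  -- number of bits in the filter
k = 3        -- hash functions per count level

-- An infinite sequence of toy hash functions (double hashing).
hashes :: String -&gt; [Int]
hashes s = [ (h1 + i * h2) `mod` size | i &lt;- [0 ..] ]
  where
    h1 = foldl' (\acc c -&gt; (acc * 31 + ord c) `mod` 1000003) 7 s
    h2 = 1 + 2 * foldl' (\acc c -&gt; (acc * 37 + ord c) `mod` 1000003) 11 s

-- The filter, modelled as the set of positions whose bit is one.
type Filter = S.IntSet

-- The k hash values that represent "seen at least n times".
level :: Int -&gt; String -&gt; [Int]
level n s = take k (drop ((n - 1) * k) (hashes s))

-- Count by walking the levels until one of them has a zero bit.
-- (This would not terminate on a filter with every bit set.)
count :: Filter -&gt; String -&gt; Int
count f s = length (takeWhile present [1 ..])
  where present n = all (`S.member` f) (level n s)

-- Insert by setting the bits of the first level that is not yet full.
insert :: String -&gt; Filter -&gt; Filter
insert s f = foldr S.insert f (level (count f s + 1) s)

main :: IO ()
main = do
  let f = foldr insert S.empty ["x", "y", "x"]
  print (map (count f) ["x", "y", "z"])  -- prints [2,1,0]</pre>
<p>Both <tt>insert</tt> and <tt>count</tt> walk one level per occurrence, so their cost grows with the count of the value.</p>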
<p>There&#8217;s a trade off, of course.  Insertion and lookup are no longer (as) constant, but  proportional to value count. However, the Bloom filter is likely to be more compact &#8211; given <em>n</em> values in total, where <em>u</em> are unique, the standard counting filter needs a size proportional to <em>bu</em>, while the distributed counting filter needs size proportional to <em>n</em>.  If you want to count accurately, <em>b</em> needs to be set to the logarithm of the max count of any value.</p>
<p>Finally, it&#8217;s also possible to use Bloom filters for searching &#8211; for instance, locating unique words in a genome. A lookup in a Bloom filter basically gives one bit of information &#8211; present or not present.  We use this with a series of <em>m</em> Bloom filters, each indexing the regions of the genome corresponding to one bit of location information.  Looking up a unique word in the set of filters reveals the location with<em> m</em> bits of precision.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/locating.png" title="locating bloom filters"><img src="http://blog.malde.org/wp-content/uploads/2009/02/locating.png" alt="locating bloom filters" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
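<p>The lookup scheme can be sketched as follows.  To keep the example short, an exact <tt>Set</tt> (from the <tt>containers</tt> package) stands in for each Bloom filter, and the words and locations are invented; with real Bloom filters, a false positive in any one filter would flip the corresponding bit of the answer.</p>
<pre>import qualified Data.Set as Set
import Data.Bits (setBit, testBit)

-- Stand-in for a Bloom filter: exact membership, no false positives.
type Filter = Set.Set String

-- Filter j holds the words whose location has bit j set.
buildFilters :: Int -&gt; [(String, Int)] -&gt; [Filter]
buildFilters m ws =
  [ Set.fromList [ w | (w, loc) &lt;- ws, testBit loc j ] | j &lt;- [0 .. m - 1] ]

-- Reassemble the location from one membership query per filter.
locate :: [Filter] -&gt; String -&gt; Int
locate fs w = foldr step 0 (zip [0 ..] fs)
  where step (j, f) acc = if Set.member w f then setBit acc j else acc

main :: IO ()
main = do
  let index = buildFilters 3 [("ACGTAC", 5), ("TTGACA", 2), ("GGCCAT", 7)]
  print (map (locate index) ["ACGTAC", "TTGACA", "GGCCAT"])  -- prints [5,2,7]</pre>
<p>In this encoding a word at location 0 looks the same as a word that is absent, so a separate presence filter (or locations starting at 1) would be needed to tell the two apart.</p>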
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
