<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell &#187; EST analysis</title>
	<atom:link href="http://blog.malde.org/index.php/category/est-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Tue, 20 Jul 2010 15:04:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Searching for poly(A) tails</title>
		<link>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/</link>
		<comments>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:03:28 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=65</guid>
		<description><![CDATA[I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too trivial?  So, like many other &#8220;trivial&#8221; tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.</p>
<p>Here&#8217;s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.</p>
<p><span id="more-65"></span></p>
<h3>A quick introduction</h3>
<p>Although the definitions of what constitutes a gene vary considerably, we&#8217;ll use the term to refer to a region of DNA that gets <em><a title="Wikipedia entry for &quot;transcription&quot;" href="http://en.wikipedia.org/wiki/Transcription_%28genetics%29">transcribed</a></em>, that is, copied from DNA into an <a title="Wikipedia entry for &quot;Messenger RNA&quot;" href="http://en.wikipedia.org/wiki/Messenger_RNA">mRNA</a> molecule, which in turn will be used as a blueprint for assembling a protein.  After transcription, the mRNA molecule undergoes <a title="Wikipedia entry for &quot;polyadenylation&quot;" href="http://en.wikipedia.org/wiki/Polyadenylation"><em>polyadenylation</em></a>, a process where a string of adenines (the &#8216;A&#8217; of the nucleotide alphabet) gets appended to the mRNA molecule before it is exported from the nucleus.</p>
<p>Identifying poly-A tails is important for several reasons.</p>
<ol>
<li>It positively identifies the end of the transcript.  If you don&#8217;t have a poly-A in your sequence, you have no way to know how far the molecule extends beyond the end of the sequence.  You can also find alternatively terminated transcripts this way.</li>
<li>It marks the boundary of the usable sequence.  Anything after the poly-A tail is linker or vector sequence and can safely be trimmed off, even if, as is often the case, it is too low quality to be recognized by your average vector masking software.</li>
<li>It provides useful information about the transcript, as the poly-A tail is important for things like protecting the mRNA from degradation.</li>
</ol>
<p>Unfortunately, utilities often trim poly-A tails by default (e.g. SeqClean), or just ignore them (e.g. BLAST&#8217;s low-complexity filter).</p>
<h3>Quality based alignment</h3>
<p>When a molecule is sequenced, the analog output from the sequencing machine is stored as a <em>chromatogram</em>.  In order to be useful, the sequence is <em>called</em>, that is, translated to a string of letters from the familiar {A,C,G,T} nucleotide alphabet.  In addition, the base caller will associate each letter with a <em>quality value</em>.  This is derived from an estimate of the probability of the call being incorrect, and for quality value <em>Q</em>, the error probability estimate is</p>
<pre style="padding-left: 30px;">ϵ = 10<sup>-Q/10</sup></pre>
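<p>As a quick sanity check of the formula, here is a minimal Haskell sketch (the name <tt>errorProb</tt> is made up for this example and is not part of any library):</p>
<pre style="padding-left: 30px;">-- Error probability estimate for a phred-style quality value q.
errorProb :: Int -&gt; Double
errorProb q = 10 ** (negate (fromIntegral q) / 10)

main :: IO ()
main = mapM_ (\q -&gt; putStrLn (show q ++ "\t" ++ show (errorProb q))) [10, 20, 30]</pre>
<p>So Q=10 corresponds to an estimated one error in ten calls, Q=20 to one in a hundred, and Q=30 to one in a thousand.</p>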
<p>Traditionally sequence alignment simply aligns the string of characters using a fixed positive score (reward) for aligning similar characters, and fixed negative scores (penalties) for either substituting a different character, opening a gap, or extending a previous gap.</p>
<p>However, taking into account the quality value, <a title="My paper on using sequence quality to improve alignments." href="http://bioinformatics.oxfordjournals.org/cgi/content/full/24/7/897">we can do better</a>, and instead of fixed scores, we can adjust the scores dynamically according to quality.</p>
<p>Using this method, the penalty for e.g. aligning two different characters will depend on the quality of the characters: high quality means a high penalty, low quality &#8212; lower penalty (since there&#8217;s a greater chance one of them was incorrectly called).</p>
<h3>Scoring of alignments</h3>
<p>When calculating the score of an alignment, we really want to answer the question: how likely is this sequence to be a real poly-A sequence, as opposed to just a random string?  In other words, we are comparing our sequence against two models: the poly-A model, and the background model. Our score will use the <em>ratio</em> of probabilities of the string being produced by the two models.</p>
<p>For the poly-A model, only As are allowed, so the probability of a character occurring is 1 for As and 0 for the others.  For the background model, we&#8217;ll just take a uniform distribution of nucleotides, each getting a probability of 0.25.</p>
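<p>Using this scheme, each character of a string <em>s</em> contributes a factor of 1/0.25 = 4 if it is an A, and 0/0.25 = 0 otherwise, and the score of the string is the product of these factors.  We usually work with the logarithm of these numbers to make them more manageable, so that the factors are added instead of multiplied.</p>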
<p>The optimal alignment is then simply the longest run of As, since as soon as you multiply with a zero (or add -infinity, if you use <em>log</em>-scores), you lose the whole score.</p>
<h3>Adding quality to the mix</h3>
<p>Of course, the actual sequence isn&#8217;t perfect, and even the poly-A tail is likely to contain the odd G, C, or T.  To determine exactly <em>how</em> likely is where the quality value enters the picture. Using the formula above,  we can calculate the error estimate and decide what the penalty for a mismatch and reward for a match should be.</p>
<p>For the poly-A model, the probability for a match (that is, an actual &#8216;A&#8217; in the sequence) is <em>1-ϵ</em>, and the probability of a particular mismatch (a non-A) is <em>ϵ/3</em> (since a miscalled A turns into one of three other nucleotides, and for simplicity, we give them equal probability).  Dividing by the background probability of 0.25, taking logarithms, and using the formula for <em>ϵ</em> as a function of <em>Q</em> (and hopefully not introducing any errors), I get the scores to be:</p>
<pre style="padding-left: 30px;">match q = log (4*(1-1/10**(q/10)))
mismatch q = log 4 - log 3 - q/10*log 10</pre>
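<p>To get a feel for the numbers, the two formulas can be tabulated for a few quality values.  This is just a sketch for playing around, using natural logarithms and <tt>Double</tt> arguments; the choice of quality values is arbitrary:</p>
<pre style="padding-left: 30px;">-- Log-odds scores for a quality value q, as in the formulas above.
match, mismatch :: Double -&gt; Double
match q    = log (4 * (1 - 1 / 10 ** (q / 10)))
mismatch q = log 4 - log 3 - q / 10 * log 10

main :: IO ()
main = mapM_ row [5, 15, 30, 40]
  where row q = putStrLn (show q ++ "\t" ++ show (match q) ++ "\t" ++ show (mismatch q))</pre>
<p>At Q=15 this gives about 1.35 for a match and about -3.17 for a mismatch, so a single mismatch costs a bit more than two matches.  As the quality increases, the mismatch penalty grows linearly with Q, while the match reward levels off just below log 4 (about 1.39).</p>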
<p>Now, we can use this to do a standard Smith-Waterman alignment, calculating a dynamic programming matrix, and searching for an optimal local alignment.</p>
<p>However, since we&#8217;re aligning against a repeated nucleotide, there&#8217;s no real need for a second dimension, and we can use the following recurrence to calculate the &#8220;polyA-score&#8221; <em>M</em> for each position <em>i</em>, where <em>S<sub>i</sub></em> is the match or mismatch score for position <em>i</em> and <em>M<sub>0</sub></em> = 0:</p>
<p style="padding-left: 30px;"><em>M<sub>i</sub> = max (0, S<sub>i</sub> + M<sub>i-1</sub>)</em></p>
<p>To implement this, we first define the list of scores by applying match and mismatch to the list of (nucleotide,quality) pairs.  We also define a scanl-based function to calculate a list of cumulative scores:</p>
<pre style="padding-left: 30px;">scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0</pre>
<p>The only remaining thing is to identify the maximal value which marks the end of the poly-A tail, and the corresponding 0 value that indicates the start.  I wrote a recursive function called &#8220;findmax&#8221; for this, but a better programmer will probably be able to do this with a fold.</p>
<p>Including the parts discussed briefly above, the whole thing looks like this:</p>
<pre style="padding-left: 30px;">findPolyA :: Sequence Nuc -&gt; Maybe (Int,Int)
findPolyA (Seq _ d mq) =
  let qd = zip (B.unpack d) (maybe (repeat 15) BB.unpack mq)
      scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
      match x' = let x = fromIntegral x' in log (4*(1-1/10**(x/10)))
      mismatch x' = let x = fromIntegral x' in log 4 - log 3 - x/10*log 10
      cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0
      (zi,mi,maxscore) = findmax $ cumulative scores
  in if maxscore &gt; 12 then Just (zi+1,mi) else Nothing  -- arbitrary constant alert!

findmax :: [Double] -&gt; (Int,Int,Double)
findmax = go 0 (0,0,0) . zip [0..]
  where go _ cm [] = cm
        go _ cm ((i,0):rest) = go i cm rest
        go last_z (cmz,cmi,cmx) ((i,x):rest) =
          if x &gt; cmx then go last_z (last_z,i,x) rest
                     else go last_z (cmz,cmi,cmx) rest</pre>
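<p>For experimenting without the bioinformatics library, here is a self-contained sketch of the same idea.  It works on a plain <tt>String</tt> and a list of quality values instead of the library&#8217;s sequence type, takes the threshold as a parameter, and uses a left fold where the code above uses <tt>findmax</tt>.  The names <tt>score</tt>, <tt>best</tt> and <tt>polyA</tt> are made up for this example:</p>
<pre style="padding-left: 30px;">import Data.Char (toUpper)

-- Per-position log-odds score for a (nucleotide, quality) pair.
score :: (Char, Int) -&gt; Double
score (c, q')
  | toUpper c == 'A' = log (4 * (1 - 1 / 10 ** (q / 10)))
  | otherwise        = log 4 - log 3 - q / 10 * log 10
  where q = fromIntegral q'

-- (index of the last zero before the maximum, index of the maximum, maximum),
-- found with a single left fold over the cumulative scores.
best :: [Double] -&gt; (Int, Int, Double)
best xs = result
  where
    (_, result) = foldl step (0, (0, 0, 0)) (zip [0 ..] xs)
    step (lastZ, cur@(_, _, mx)) (i, x)
      | x == 0    = (i, cur)
      | x &gt; mx    = (lastZ, (lastZ, i, x))
      | otherwise = (lastZ, cur)

-- 1-based start and end of the poly-A tail, if it scores above the threshold.
polyA :: Double -&gt; String -&gt; [Int] -&gt; Maybe (Int, Int)
polyA threshold nucs quals
  | mx &gt; threshold = Just (z + 1, m)
  | otherwise      = Nothing
  where
    cumulative = scanl (\a b -&gt; max 0 (a + b)) 0 (map score (zip nucs quals))
    (z, m, mx) = best cumulative

main :: IO ()
main = print (polyA 12 ("GATTACAGC" ++ replicate 12 'A' ++ "GGTC") (repeat 15))</pre>
<p>With every quality value set to 15, this prints <tt>Just (10,21)</tt>, i.e. the twelve As at positions 10 to 21.  At that quality, each A adds about 1.35 to the score, so the threshold of 12 amounts to requiring at least nine uninterrupted As.</p>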
<h3>Availability</h3>
<p>This method is implemented in a simple tool called &#8220;trimpolya&#8221; (<a title="Darcs repository for 'trimpolya'" href="http://malde.org/~ketil/biohaskell/trimpolya">darcs repo</a>), and also in the more general &#8220;dephd&#8221; (<a title="Darcs repository for 'dephd'" href="http://malde.org/~ketil/biohaskell/dephd">darcs</a>, <a title="Dephd at HackageDB" href="http://hackage.haskell.org/package/dephd">hackage</a>) sequence analysis package.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Dephd updates</title>
		<link>http://blog.malde.org/index.php/2009/06/16/dephd-updates/</link>
		<comments>http://blog.malde.org/index.php/2009/06/16/dephd-updates/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:14:02 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/06/16/dephd-updates/</guid>
		<description><![CDATA[Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB; this is just a quick note describing new features.
Filtering out empty sequences.
Phred often produces [...]]]></description>
			<content:encoded><![CDATA[<p>Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the <a href="http://en.wikipedia.org/wiki/Phred_base_calling" title="Phred base calling on Wikipedia">basecaller <tt>phred</tt></a>, but it has since grown a bit beyond that.  A new update was just pushed onto <a href="http://hackage.haskell.org/package/dephd" title="dephd on Hackage">HackageDB</a>; this is just a quick note describing new features.<span id="more-44"></span></p>
<h3>Filtering out empty sequences.</h3>
<p><tt>Phred</tt> often produces zero-length sequences, and this confuses other programs.  While <tt>BLAST</tt> will just output a warning, <tt>SeqClean</tt> &#8212; or to be precise, <tt>cln2qual</tt> &#8212; will break down.   (My own code using the Bioinformatics library treats all sequences the same regardless of length, so zero-length sequences are perfectly okay). Anyway, you can now use <tt>dephd -z</tt> to eliminate them from the output.</p>
<h3>Sequence Clipping</h3>
<p>Sequence trimming or clipping is often necessary to remove contamination like vector sequence, or simply low quality sequence parts.  Typically, both of these occur at the ends of the sequences.  Many programs (including dephd, but also phred, lucy, seqclean and others) add trimming information to the sequence header.  Dephd is now able to act on this information and clip the sequences.  After clipping, the original trimming information is obsolete, since the coordinates have changed, so it is replaced with the clipping coordinates.  Dephd also provides its own quality assessment, and with the -q option, sequence ends where the sliding window average quality is below 15 will be clipped.  This is pretty heavy-handed, but it seems I get better EST clustering with this enabled.</p>
<h3>Old features</h3>
<p>Of course we retain the old features: reading PHD and Fasta/Qual files, masking by quality (to lower case/N, without clipping), generating quality plots, writing Fasta/Qual output, and ranking sequences by quality.</p>
<p><em>Edit:</em> In the latest release, there&#8217;s now also a fix for a problem with drawing quality graphs with gnuplot.  It turns out that my shiny new Ubuntu ships with gnuplot 4.2, but your crappy old distribution ships with an older version, and that there are some incompatibilities in the input formats.  I&#8217;ve now reverted this to use old-style format only, so hopefully it should work with gnuplots back to 3.7 or so.   And for those SLS or MCC Interim die-hards out there, I&#8217;ll even add an option to dump the gnuplot file itself, so that you can copy it to a floppy and generate the plots on a computer with a modern color display. How&#8217;s that for user friendly?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/06/16/dephd-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Optimization again: befuddled by bytestrings</title>
		<link>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</link>
		<comments>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 08:13:01 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/</guid>
		<description><![CDATA[I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using Bryan O&#8217;Sullivan&#8217;s Bloom filters.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been spending the last couple of weeks working on an indexing scheme for sequences, using <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan&#8217;s</a> <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter" title="bloomfilter at hackageDB">Bloom filters</a>.  Now, it turned out that when Bryan tested out the code, he found a curious problem:  Apparently, the indexing stage scaled quadratically with sequence length.  This wouldn&#8217;t have been so strange, were it not for the fact that I saw the expected linear time usage when I ran the code.  Some more digging about revealed that my laptop also showed quadratic scaling.  Profiling showed the  culprit to be a simple pipeline-style function:</p>
<blockquote>
<pre>swords s = take (fromIntegral (seqlength s)+1-k) . map (B.take k) . B.tails . B.map toUpper $ seqdata s</pre>
</blockquote>
<p>Here, seqdata returns a lazy bytestring, which is also what&#8217;s hiding behind the <tt>B.</tt> qualifier.  Basically, this just builds the list of all length-<em>k</em> substrings.  This should, in my opinion, stream nicely, and run in constant space and linear time.  What on earth makes this quadratic?  And on only some systems?   Time to dissect the systems in question:</p>
<p><span id="more-28"></span>Comparing our environments, one obvious difference was that the fast system was built using GHC 6.8.1, while the slow ones were 6.8.2.  So we checked with a 6.8.1 snapshot &#8211; no difference, still slow.  We checked the various source repositories involved for some stray patch that somehow had avoided distribution &#8211; nothing turned up. Comparing Cabal setup files didn&#8217;t reveal anything obvious, except that the newer Cabal emitted a lot more information.</p>
<p>Somewhere, something had to be different. Could it be bloomfilter not being a good consumer for the generated words?  The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">bio library</a> messing up FASTA parsing?  Something else entirely?</p>
<p>Looking at the <tt>swords</tt> function, it looks like a good candidate for deforestation and/or fusion.  Could there be some optimization not being performed in some cases, for some strange reasons?  Since I remember <a href="http://www.cse.unsw.edu.au/~dons/papers/CLS07.html" title="Coutts et al (2007): Streams fusion: from lists to streams to nothing at all">fusion being mentioned</a> occasionally in the context of <a href="http://www.cse.unsw.edu.au/~dons/fps.html" title="bytestring home page">bytestring</a>s, I checked the libraries again, and found something I&#8217;d missed previously:  Bytestring 0.9.1 on the fast system, 0.9.0.1 on the slow ones.  While the <a href="http://article.gmane.org/gmane.comp.lang.haskell.cafe/38992">announcement</a> didn&#8217;t promise more than a few percent better performance, no stone could be left unturned.  And this proved to be it: after upgrading bytestring, and recompiling binary, biolib, and bloomfilter to use it, my laptop was as fast as the server.</p>
<p>I&#8217;m not sure exactly what caused this, but apparently there were some issues with bytestring fusion, and fusion is disabled in current bytestring versions. At any rate, it seems the newer version is safer, so it is now the minimum requirement in <tt>bio.cabal</tt>.</p>
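<p>For anyone who wants to experiment with the word extraction on its own, here is a self-contained sketch along the lines of <tt>swords</tt>.  It is a variation rather than the library code: <tt>kwords</tt> is a made-up name, <em>k</em> is passed explicitly, and the list is cut off with <tt>takeWhile</tt> when the words get too short, so the sequence length is not needed up front:</p>
<blockquote>
<pre>import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char (toUpper)

-- All overlapping length-k words of a sequence, upper-cased.
kwords :: Int -&gt; B.ByteString -&gt; [B.ByteString]
kwords k = takeWhile ((== k') . B.length) . map (B.take k') . B.tails . B.map toUpper
  where k' = fromIntegral k

main :: IO ()
main = mapM_ B.putStrLn (kwords 3 (B.pack "gattaca"))</pre>
</blockquote>
<p>This prints the five words GAT, ATT, TTA, TAC and ACA, one per line.</p>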
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/10/24/optimization-again-befuddled-by-bytestrings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The FastQ file format for sequences</title>
		<link>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/</link>
		<comments>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 13:32:16 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/</guid>
		<description><![CDATA[It was just brought to my attention that people have started to use a new file format for sequences.  This format, called &#8216;FastQ&#8217; combines both the sequence data itself and the quality data in one file.  That&#8217;s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs [...]]]></description>
			<content:encoded><![CDATA[<p>It was just brought to my attention that people have started to use a new file format for sequences.  This format, <a href="http://www.bioperl.org/wiki/FASTQ_sequence_format" title="BioPerl Wiki: definition of FastQ">called &#8216;FastQ&#8217;</a>, combines both the sequence data itself and the quality data in one file.  That&#8217;s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs fast, too.  Basically, the format is a sequence of records, each one similar to this:<br />
<span id="more-29"></span></p>
<blockquote>
<pre>
@{sequence header}
{sequence data}
+{sequence header}
{quality data}</pre>
</blockquote>
<p>Note that the sequence header is repeated in there; apparently somebody thought that would be a good idea.   The <tt>{sequence data}</tt> part looks like it does in a Fasta file, except that here it has to be on a single line.  The <tt>{quality data}</tt> is ASCII, each letter representing the quality value 33 lower than its ASCII value.  This opens up another possibility of getting it wrong, since the line of quality data can (and will!) start out with &#8216;+&#8217; or &#8216;@&#8217; occasionally.</p>
<p>Anyway, the implementation seems to be pretty efficient.  I wrote a simple program to count the number of sequences in a file:</p>
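<p>To make the format concrete, here is a toy parser.  It is only a sketch and not the bio library&#8217;s implementation: the <tt>Record</tt> type and the function names are made up, it assumes the strict four-lines-per-record layout described above, and it silently stops at the first record that doesn&#8217;t fit:</p>
<blockquote>
<pre>import Data.Char (ord)

-- One FastQ record: header, sequence data, decoded quality values.
data Record = Record { header :: String, nucs :: String, quals :: [Int] }
  deriving (Show, Eq)

-- Phred-style decoding: quality value = ASCII code - 33.
decodeQual :: String -&gt; [Int]
decodeQual = map (\c -&gt; ord c - 33)

-- Parse by position (four lines per record) rather than by looking for
-- '@' at the start of a line, since quality lines may start with '@' too.
parseFastQ :: [String] -&gt; [Record]
parseFastQ (('@':h) : s : ('+':_) : q : rest) = Record h s (decodeQual q) : parseFastQ rest
parseFastQ _ = []

main :: IO ()
main = mapM_ print (parseFastQ (lines "@read1\nGATTACA\n+read1\n@II+5!I\n"))</pre>
</blockquote>
<p>Here the quality line itself starts with &#8216;@&#8217;, and decodes to the quality values 31, 40, 40, 10, 20, 0 and 40.</p>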
<blockquote>
<pre>
module Main where

import Bio.Sequence
import System.Environment (getArgs)

main = do
  [f] &lt;- getArgs
  print . length =&lt;&lt; readFastQ f</pre>
</blockquote>
<p>Testing it on <a href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271/fastq/200x36x36-071113_EAS56_0053-s_1_1.fastq.gz">this 440MB file</a> gave the following timings (first with a cold cache, then with a warm one):</p>
<blockquote>
<pre>
./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.77s user 0.41s system 53% cpu 4.091 total
./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.26s user 0.19s system 89% cpu 1.631 total</pre>
</blockquote>
<p>For comparison, I also tried it with &#8216;grep&#8217;:</p>
<blockquote>
<pre>
grep '^@' ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  5.61s user 0.46s system 98% cpu 6.181 total</pre>
</blockquote>
<p>&#8230;and in addition to being 50% slower, it gives the wrong answer, since the &#8216;@&#8217; delimiter may occur in the quality data.  Another nice thing is that the Haskell program is IO bound, so it would be even faster if I had a better disk.  (Note to self: talk to boss about getting an SSD for my laptop).</p>
<p><strong>Update</strong>: I did a quick test, comparing it to the old Fasta format.  First, to convert, I replaced the last line in the program above with</p>
<blockquote>
<pre>
readFastQ f &gt;&gt;= writeFasta (f++".fasta")</pre>
</blockquote>
<p>and compiled it as <tt>convertFQ</tt>.  Running this:</p>
<blockquote>
<pre>
./convertFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  9.51s user 2.76s system 40% cpu 30.037 total</pre>
</blockquote>
<p>Then, I made a <tt>countFa</tt> by changing the last line to:</p>
<blockquote>
<pre>print . length =&lt;&lt; readFasta f</pre>
</blockquote>
<p>Running this on the Fasta file generated just now (282MB), I get:</p>
<blockquote>
<pre>./countFQ ../download/200x36x36-071113_EAS56_0053-s_1_1.fastq  1.26s user 0.19s system 89% cpu 1.631 total</pre>
</blockquote>
<p>Here, grep takes 2.25 seconds user time (and gives the correct answer), so we&#8217;re still faster.</p>
<p>The only cloud on the horizon is that there is some disagreement about the format.  For example, <a href="http://may2005.archive.ensembl.org/Docs/Pdoc/bioperl-live/Bio/SeqIO/fastq.html" title="Minority report?">somebody thinks quality</a> always should start with a &#8216;!&#8217; (<em>viz</em>. zero).  <a href="http://maq.sourceforge.net/fastq.shtml">Maq developers think</a> it&#8217;s okay to drop the repeated sequence name.  <a href="http://www.rockefeller.edu/genomics/solexa.php">Rockefeller thinks</a> the quality data should be text digits, like the Qual format. And the good people at Solexa had to go and slightly alter&#8230;not the format itself, but <a href="http://maq.sourceforge.net/fastq.shtml" title="FastQ format description at sourceforge">its interpretation,</a> using a different formula to calculate the error probabilities from the quality values. So basically, given a file, there&#8217;s no way to know whether it uses Solexa-style quality information, or regular, Phred-style quality.  If you get <a href="http://maq.sourceforge.net/fastq.shtml">quality values above 60</a>, then maybe you&#8217;re interpreting it wrong.  Then again, maybe not.  Sigh.</p>
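<p>To illustrate the difference in interpretation, here is a small sketch comparing the two.  As I understand the Maq page linked above, the Solexa-style value is defined on the odds <em>p</em>/(1-<em>p</em>) where the Phred-style value uses <em>p</em> itself; the function names are made up for this example:</p>
<blockquote>
<pre>-- Error probability from a Phred-style quality value: Q = -10 log10 p.
phredToProb :: Double -&gt; Double
phredToProb q = 10 ** (-q / 10)

-- Error probability from a Solexa-style quality value,
-- assuming Q = -10 log10 (p / (1 - p)).
solexaToProb :: Double -&gt; Double
solexaToProb q = odds / (1 + odds)
  where odds = 10 ** (-q / 10)

main :: IO ()
main = mapM_ row [0, 10, 20, 40]
  where row q = putStrLn (show q ++ "\t" ++ show (phredToProb q) ++ "\t" ++ show (solexaToProb q))</pre>
</blockquote>
<p>The two agree closely for high quality values, but drift apart at the low end: a value of 0 means an error probability of 1 in the Phred interpretation, and 0.5 in the Solexa one.</p>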
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A plan for Bloom filters</title>
		<link>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</link>
		<comments>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 20:00:44 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/</guid>
		<description><![CDATA[Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until Bryan O&#8217;Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on [...]]]></description>
			<content:encoded><![CDATA[<p>Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until <a href="http://www.serpentine.com/blog/" title="bos' blog">Bryan O&#8217;Sullivan</a> posted a <a href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg41876.html" title="[Haskell-cafe] [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters">message</a> to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a <a href="http://book.realworldhaskell.org/beta/bloomfilter.html" title="Real World Haskell: Chapter 27">chapter in the upcoming book</a>.  You can read all about Bloom filters on <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a>, but the executive summary is that a Bloom filter is a structure similar to Data.Set, except that it is probabilistic, and may occasionally claim a value is a member when it&#8217;s not.  On the positive side, the Bloom filter is very fast, and its speed is independent of the size &#8212; in other words, lookup and insert are <em>O</em>(1) where Data.Set is <em>O</em>(log <em>n</em>).</p>
<p>Comparing sequences to find similarity is a common occurrence in bioinformatics.  For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index the sequences in question as overlapping substrings (<em>k</em>-tuples, <em>q</em>-grams).  This index seems like an obvious target for Bloom filters &#8212; large data, time critical, some false positives anyway &#8212; but for some reason, there are <a href="http://hpcb.wustl.edu/pubs/mercuryblastn.pdf" title="Buhler et al.: MercuryBLASTN: Faster DNA sequence comparison using a streaming hardware architecture. (unpublished?)">almost</a> no applications that use them. Until now.</p>
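<p>To illustrate the principle, here is a toy Bloom filter.  It has nothing to do with the API of the bloomfilter package: the type, the function names and the (rather poor) hash functions are all made up for this sketch, and the bit array is simply an <tt>Integer</tt>, so it is not even constant time:</p>
<blockquote>
<pre>import Data.Bits (setBit, testBit)
import Data.Char (ord)

-- A toy Bloom filter: the number of bits, and the bits themselves.
data Bloom = Bloom { size :: Int, bits :: Integer }

-- Three simple string hashes, one per seed (not serious hash functions).
hashes :: Int -&gt; String -&gt; [Int]
hashes m s = [ foldl (\h c -&gt; (h * seed + ord c) `mod` m) 7 s | seed &lt;- [31, 131, 1313] ]

emptyBloom :: Int -&gt; Bloom
emptyBloom m = Bloom m 0

-- Inserting sets one bit per hash function.
insertBloom :: String -&gt; Bloom -&gt; Bloom
insertBloom s (Bloom m b) = Bloom m (foldl setBit b (hashes m s))

-- May be True for elements never inserted, but is never False for inserted ones.
memberBloom :: String -&gt; Bloom -&gt; Bool
memberBloom s (Bloom m b) = all (testBit b) (hashes m s)

main :: IO ()
main = do
  let bf = foldr insertBloom (emptyBloom 1024) ["GATTACA", "ACGTACGT"]
  print (memberBloom "GATTACA" bf, memberBloom "ACGTACGT" bf)</pre>
</blockquote>
<p>This prints <tt>(True,True)</tt>.  Looking up a word that was never inserted will usually, but not always, give <tt>False</tt>.</p>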
<p><span id="more-15"></span></p>
<h2>Sequence clustering</h2>
<p>Sequence clustering is a commonly used technique which is usually based on sequence similarity.  I&#8217;ve written one sequence clusterer, <em>xsact</em>, which is based on blocks of exact matches. There are many others, and another example is <a href="http://bioinformatics.oxfordjournals.org/cgi/reprint/19/3/421.pdf" title="Burke et al.: ">d2cluster</a>, which is based on occurrence of fixed length words &#8212; which is right up Bloom alley, right?</p>
<p>So a straightforward way to build a Bloom filter based sequence clusterer is to represent each cluster as a set of words &#8212; stored as a Bloom filter.  Now, adding a new sequence to the clusters is a simple matter of extracting the words from the sequence, identifying the cluster(s) containing a sufficient number of these words, and adding the remaining words to that cluster (or the union of the clusters, in the case multiple clusters match).</p>
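<p>A rough sketch of that procedure could look like the following.  To keep it self-contained, it uses Data.Set as an exact stand-in for the Bloom filters, and the names, the word size and the hit threshold are all made up for illustration:</p>
<blockquote>
<pre>import qualified Data.Set as Set
import Data.List (partition, tails)

-- A cluster is represented only by the set of words it contains.
type Cluster = Set.Set String

-- All overlapping words of length k.
kwords :: Int -&gt; String -&gt; [String]
kwords k s = [ w | t &lt;- tails s, let w = take k t, length w == k ]

-- Add one sequence: merge it with every cluster that contains at least
-- minHits of its words, or start a new cluster if there is none.
addSeq :: Int -&gt; Int -&gt; [Cluster] -&gt; String -&gt; [Cluster]
addSeq k minHits clusters s = Set.unions (Set.fromList ws : matching) : others
  where
    ws = kwords k s
    (matching, others) = partition enough clusters
    enough c = length (filter (`Set.member` c) ws) &gt;= minHits

main :: IO ()
main = print (length (foldl (addSeq 4 2) [] ["ACGTACGT", "GTACGTTT", "CCCCCCCC"]))</pre>
</blockquote>
<p>The first two sequences share three words and end up together, while the third starts a cluster of its own, so this prints 2.  Note that only membership tests and inserts are needed on the word sets, which is what a Bloom filter provides.</p>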
<p>The interesting thing about this approach is that the whole thing becomes <em>O</em>(<em>kn</em>), for <em>k</em> clusters and data size <em>n</em>.  I think all other clustering algorithms are based on sequence pairs, which makes them <em>O</em>(<em>n²</em>) &#8212; in the worst case, you need to check all pairs. (However, a straightforward similarity-based clustering will have worst-case behavior when no sequences match each other, while suffix-based methods like <em>xsact</em> will have worst case when all sequences match &#8212; so perhaps there is room for a better middle ground?)</p>
<p>Anyway &#8212; while the above looks promising, there is one snag: EST sequences can occur as a copy of the gene or, due to various properties of the <a href="http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/" title="Cleaning up sequences">manufacturing process</a>, as the gene&#8217;s <em>reverse complement</em>.  Thus, we need to be able to compare sequences in both directions simultaneously. This could be achieved with a slightly creative hashing function, but to keep things simple, we&#8217;ll stick with the ossified mental sweat in the provided implementation.</p>
<h2>Index and search</h2>
<p>A somewhat related area is indexing and search.  Let&#8217;s say you have a bunch of DNA sequences (one for each chromosome, perhaps), and a set of ESTs, which you&#8217;ll remember are gene fragments in an unpredictable mix of forward and reverse-complement directions.  Here&#8217;s the plan:</p>
<ol>
<li>index by building a Bloom filter containing the <em>q</em>-tuples for each chromosome</li>
<li> for each EST, look up each <em>q</em>-tuple (first forward, then rev.comp.) against the filters, and assign it to the chromosome containing the most <em>q</em>-tuples</li>
<li>align each EST against the designated chromosome using traditional methods</li>
</ol>
<p>As far as I can tell at this point, for word size <em>q</em>, <em>c</em> chromosomes of length <em>m</em> and <em>e</em> ESTs of length <em>n</em>, this should run in something like  <em>O</em>(<em>qn</em>) + <em>O</em>(<em>qenc</em>) + <em>O</em>(<em>emn</em>), compared to just aligning directly at <em>O</em>(<em>ecmn</em>). Note that <em>mn</em> is the big factor here, so on a large scale, we&#8217;re reducing the total work by a factor of <em>c</em>.  (Of course, in real life, <em>nm</em> would be too large, and you&#8217;d use a heuristic, subquadratic alignment step, but this is for illustration purposes.  Try to be a bit generous, will you?)</p>
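<p>The first two steps of the plan can be sketched as follows, again with Data.Set as an exact stand-in for the Bloom filters.  The function names, the tiny &#8220;chromosomes&#8221; and the word size are invented for the example, and the alignment step is left out:</p>
<blockquote>
<pre>import qualified Data.Set as Set
import Data.List (maximumBy, tails)
import Data.Ord (comparing)

-- All overlapping words of length q.
qgrams :: Int -&gt; String -&gt; [String]
qgrams q s = [ w | t &lt;- tails s, let w = take q t, length w == q ]

revComp :: String -&gt; String
revComp = reverse . map comp
  where
    comp 'A' = 'T'
    comp 'C' = 'G'
    comp 'G' = 'C'
    comp 'T' = 'A'
    comp c   = c

-- Step 1: build one word set per named chromosome.
buildIndex :: Int -&gt; [(String, String)] -&gt; [(String, Set.Set String)]
buildIndex q chrs = [ (name, Set.fromList (qgrams q sq)) | (name, sq) &lt;- chrs ]

-- Step 2: count hits in both orientations, and pick the chromosome with the most.
assign :: Int -&gt; [(String, Set.Set String)] -&gt; String -&gt; String
assign q idx est = fst (maximumBy (comparing snd) [ (name, hits ws) | (name, ws) &lt;- idx ])
  where
    hits ws  = max (count ws est) (count ws (revComp est))
    count ws = length . filter (`Set.member` ws) . qgrams q

main :: IO ()
main = do
  let idx = buildIndex 4 [("chr1", "ACGTACGGTCAAGT"), ("chr2", "TTTTGGGGCCCCAA")]
  putStrLn (assign 4 idx "GACCGT")</pre>
</blockquote>
<p>The example EST only matches as a reverse complement (ACGGTC), and is assigned to chr1.</p>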
<h2>Further plans</h2>
<p>That concludes the current plan, but there are certainly improvements that can be made.  Two obvious ones are:</p>
<ul>
<li>hash function that hashes a word and its reverse complement to the same value</li>
<li>break long sequences into partially (1/3) overlapping regions to speed things up even more, as well as give more accurate placements</li>
</ul>
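<p>For the first point, one simple approach (a sketch, and only one of several possibilities) is to replace each word by a canonical form before hashing, for instance the lexicographically smaller of the word and its reverse complement.  Any hash function applied to the canonical form then gives the same value for both orientations.  The function names here are made up:</p>
<blockquote>
<pre>revComp :: String -&gt; String
revComp = reverse . map comp
  where
    comp 'A' = 'T'
    comp 'C' = 'G'
    comp 'G' = 'C'
    comp 'T' = 'A'
    comp c   = c

-- The smaller of a word and its reverse complement; hashing this instead of
-- the word itself makes both orientations hash to the same value.
canonical :: String -&gt; String
canonical w = min w (revComp w)

main :: IO ()
main = print (canonical "GACC", canonical "GGTC")</pre>
</blockquote>
<p>Since GGTC is the reverse complement of GACC, this prints <tt>("GACC","GACC")</tt>.</p>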
<p>Another thing that struck me is that in my <a href="http://malde.org/~ketil/biohaskell/xml2x/" title="xml2x -- annotating EST sequences from BLASTX hits">annotation tool</a>, I could probably use this as a faster way to store the set of matching proteins before extracting GO terms.  Currently, the performance is limited by XML parsing, so it&#8217;s probably not worth the bother at the moment.</p>
<p>More is likely to come up, but now it&#8217;s time to implement something!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/31/a-plan-for-bloom-filters/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Functional bash: bracketing</title>
		<link>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/</link>
		<comments>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 08:23:36 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/</guid>
		<description><![CDATA[My current development project is an EST pipeline.  For various reasons, it is implemented in shell &#8212; bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.
As in any program, there are many occasions where [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.malde.org/wp-content/uploads/2008/07/hazard_lambda_cracked_2.png" title="Functional programming expands your mind"><img src="http://blog.malde.org/wp-content/uploads/2008/07/hazard_lambda_cracked_2.png" alt="Functional programming expands your mind" width="161" align="left" height="159" /></a>My current development project is an EST pipeline.  For various reasons, it is implemented in shell &#8212; bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.</p>
<p>As in any program, there are many occasions where you want to effect some particular change during some part of the program.  The archetypical example is allocation of local variables.  After allocation, the variables are available to the program until they go out of scope, at which point they are deallocated automatically.  The technique can be generalized beyond this.  For instance, you (or rather I) may want to set a <tt>$STAGE</tt> variable that indicates the current processing stage, and which should be unset when the stage has finished executing.  Or you may want to run some processing in a different directory, in which case you <em>really</em> want to remember to return to the previous directory when you finish.  The purpose of <em>bracketing</em> is to wrap a section of code with an initial part to be run in advance, and a final part to be run afterwards.</p>
<p><span id="more-16"></span></p>
<h2>Some examples</h2>
<p>When I toyed with PHP ages ago, I&#8217;d often find myself building a section of a page by a) generating a header with some opening tags, b) generating some content, and c) generating a footer with some closing tags.  And when the exact contents of these pieces depend on various factors, and such sections would nest in complicated ways, it should not be a surprise that getting a) and c) to correspond exactly could be a challenge.  With absolutely no enforcement by the language (which may have improved in later years, I wouldn&#8217;t know), this was very fragile.</p>
<p>For <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/xhtml/Text-XHtml.html" title="The XHtml Haskell library">HTML</a> the solution is simple: instead of generating open and close tags separately, have a function take the tag and its contents, and output the contents appropriately surrounded by open and close tags.  If you build the whole document this way, you guarantee that each open tag will have a matching close tag, and that tags will be properly nested.  Another example is Common Lisp&#8217;s <tt>with-open-file</tt> macro.</p>
<p>In Haskell, there&#8217;s <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Exception.html#v%3Abracket" title="Control.Exception.bracket documentation">bracket (in Control.Exception)</a>, which in addition to being vastly more general also is a <a href="http://p-cos.blogspot.com/2007/02/what-is-point-of-macros.html" title="Pascal Constanza: What is the point of macros?">regular function</a>, thus once again proving Haskell&#8217;s vast technical and moral superiority over the more pedestrian languages&#8230;.but I digress.  I suspect the rather original name is supposed to allude to how brackets (as in those banana-shaped glyphs surrounding this text) consist of an opening bracket, some contents, and a closing bracket.  Anyway, we like Haskell, so we use Haskell terminology.</p>
<p>As a final note, observe that brackets are similar to stack allocation, manual resource management is similar to manual memory management, while using finalizers/destructors is similar to garbage collection.  (It&#8217;s tempting to add &#8220;pick any two&#8221;.)</p>
<h2>Generalized bracket</h2>
<p>While implementing the EST pipeline, I found myself needing, and implementing, several bracket-like functions (including the two previously mentioned: setting and unsetting a variable, and running a subcomputation in a separate directory).  Thus, the question that poses itself is:  Is it possible to do this in a more general way, akin to Haskell&#8217;s bracket?  Here&#8217;s my currently best attempt:</p>
<blockquote>
<pre>bracket(){
    CLOSE=$2
    eval $1
    shift; shift
    eval $*
    eval $CLOSE
}</pre>
</blockquote>
<p>First we store the second parameter, which will be the &#8220;close&#8221; action, in a variable.  We then execute the first parameter (the &#8220;open&#8221; action), using <tt>eval</tt> so that variables can be set etc.  We then skip the first two arguments by calling <tt>shift</tt> twice, <tt>eval</tt> the main action, and finally <tt>eval</tt> the &#8220;close&#8221; action.</p>
<p>This allows stuff like:</p>
<blockquote>
<pre>bracket "mkdir mytmpdir &amp;&amp; pushd mytmpdir" "popd" mkfiles</pre>
</blockquote>
<p>where the <tt>mkfiles</tt> function is run inside a temporary directory, and where execution resumes in the original directory after completion.  Another example is</p>
<blockquote>
<pre>bracket "echo Entering first stage; STAGE=first" "STAGE=none" echo Current stage is '$STAGE'</pre>
</blockquote>
<p>Note the careful quoting of the variable with single quotes: we don&#8217;t want <tt>$STAGE</tt> to be evaluated before it is set in the bracket function.  In other words, the single quotes let us pass the literal string <tt>$STAGE</tt>, giving a sort of pass-by-name semantics instead of the default pass-by-value.</p>
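<p>A quick way to see the difference, reusing the <tt>bracket</tt> definition from above unchanged (the starting value <tt>none</tt> is just an assumption for the demonstration):</p>
<blockquote>
<pre>bracket(){
    CLOSE=$2
    eval $1
    shift; shift
    eval $*
    eval $CLOSE
}

STAGE=none
# double quotes: the caller expands $STAGE before bracket ever runs
bracket "STAGE=first" "STAGE=none" echo "Current stage is $STAGE"
# single quotes: the literal $STAGE survives until the eval inside bracket
bracket "STAGE=first" "STAGE=none" echo 'Current stage is $STAGE'</pre>
</blockquote>
<p>The first call reports the stage as <tt>none</tt>, the second as <tt>first</tt>.</p>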
<h2>Perfection is the enemy&#8230;</h2>
<p>If you are an experienced shell programmer, you may at this point have formed an opinion that I am not.  And you&#8217;d be right, of course, but even I can see that there&#8217;s (at least) one obvious bug: we define a global variable named <tt>CLOSE</tt>.  Not only does this have the potential to clash with existing variables, it also prevents recursive calls to <tt>bracket</tt>.  Possibly, we should generate variable names, or have <tt>$CLOSE</tt> be a stack, or something&#8230;But hey, Mr. Know-it-all, if you&#8217;re so damn good, why not post a comment explaining how it&#8217;s <em>really</em> done?</p>
<p>In other words: feedback and comments are most welcome (although for spam-prevention you may have to register first)!</p>
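<p>For what it&#8217;s worth, here is a sketch of one possible fix, assuming bash&#8217;s <tt>local</tt> builtin is acceptable: keeping the actions in function-local variables lets calls nest.  The variable names and the decision to return the main action&#8217;s exit status are assumptions made for the example, and the open action still must not assign to <tt>open</tt>, <tt>close</tt> or <tt>status</tt> itself.</p>
<blockquote>
<pre>bracket() {
    # function-local copies, so a nested bracket cannot overwrite them
    local open=$1 close=$2 status=0
    shift 2
    eval "$open"
    # run the main action, remembering its exit status
    eval "$*" || status=$?
    eval "$close"
    return "$status"
}

# a nested use: inner runs its own bracket inside the outer one
inner() {
    bracket "INNER=yes" "INNER=no" echo 'inside: $STAGE $INNER'
}
bracket "STAGE=outer" "STAGE=none" inner
echo "afterwards: $STAGE $INNER"</pre>
</blockquote>
<p>This prints <tt>inside: outer yes</tt> followed by <tt>afterwards: none no</tt>.  With the global <tt>CLOSE</tt>, the inner call would have overwritten the outer close action, and <tt>STAGE</tt> would have been left at <tt>outer</tt>.</p>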
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/11/functional-bash-bracketing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cleaning up sequences</title>
		<link>http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/</link>
		<comments>http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/#comments</comments>
		<pubDate>Thu, 08 May 2008 10:55:50 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/</guid>
		<description><![CDATA[The first challenge when dealing with sequence data is removing vector and contaminants and other undesirable stuff.  I&#8217;ve been somewhat unhappy with the current state of my EST pipeline, and investigated more closely what is going on.  First, let&#8217;s review
The EST sequencing process

 mRNA is extracted from a sample, and a primer-linker is [...]]]></description>
			<content:encoded><![CDATA[<p>The first challenge when dealing with sequence data is removing vector and contaminants and other undesirable stuff.  I&#8217;ve been somewhat unhappy with the current state of my EST pipeline, and investigated more closely what is going on.  <span id="more-7"></span>First, let&#8217;s review</p>
<h2>The EST sequencing process</h2>
<ol>
<li> mRNA is extracted from a sample, and a <em>primer-linker</em> is attached.  Since mRNAs contain a poly-A tail (a string of As at the end), the primer contains a string of Ts that will hybridize to the poly-A tail, and the short double-stranded end is ready to initiate the duplication to cDNA.</li>
<li>Reverse transcriptase makes a reverse-complemented cDNA copy of the mRNA.  This process will go on for some length, but not necessarily to the end of the mRNA transcript &#8211; the resulting cDNA strand will thus often start some distance from the beginning of the mRNA.</li>
<li>the RNA is removed, and the cDNA methylated to make it more stable.</li>
<li>polymerase duplicates the cDNA strand, making it double stranded.</li>
<li><em>adapters</em> are attached to each end of the double-stranded DNA segment.  These are short sequences that are designed to match one of the <em>vector&#8217;s </em>(below) cloning sites, fixing the DNA segment as an <em>insert</em> in the vector.</li>
<li>the primer-linker contains a target for a restriction enzyme.  This enzyme now chops off part of the primer-linker, discarding the adapter at that end, and revealing the sequence that will ligate (bind) to the other cloning site in the vector.</li>
<li>The <em>vector</em> is a short, circular DNA sequence (plasmid or phagemid) that will work like a small chromosome when inserted into a suitable host, typically a bacterium like E. coli.  Our DNA segment is now ligated to the vector at both ends, and the whole thing is duplicated by growing a bacterial colony.</li>
</ol>
<p>At this point, there have been a number of opportunities for mishaps.  The primer-linker can ligate to something other than a poly-A tail (1), reverse transcription can be &#8211; and often is &#8211; cut short (2), cDNA can be incompletely methylated, making it a possible target for restriction enzymes that will chop it up (3), adapters can ligate with each other, forming <em>chimeric sequences</em> (5), the primer-linker can avoid being cut, leading to it being retained in the sequence (6), and the bacteria can assimilate two vectors, or none, the insert can end up in the wrong direction, or a vector can possibly acquire an insert from bacterial DNA (7).  And there&#8217;s probably more.</p>
<h2>Sequences and consequences</h2>
<p>What this means is that in addition to actual mRNA sequence, we will usually get sequences containing vector, and quite often primer, linker, adapter, or E. coli sequence as well.  This needs to be identified and discarded (masked or trimmed off) before further analysis.</p>
<p>I&#8217;ve been experimenting with two programs: <a href="http://compbio.dfci.harvard.edu/tgi/software/">SeqClean</a> by TIGR and <a href="http://sourceforge.net/projects/lucy" title="Lucy sourceforge site">Lucy</a>, by <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/12/1093" title="Paper, describing the implementation of Lucy">Chou and Holmes (2001)</a>.</p>
<h3>SeqClean</h3>
<p><strong>Pros</strong>: Masks against a database of vectors and a database of contaminants. Masks low complexity.</p>
<p><strong>Cons</strong>: Does not take into account sequence quality.   Inherent model expects vector in the first and second third of the sequence, sometimes failing to trim vector, and sometimes trimming valid sequence.  Trims sequences instead of marking them, making it harder to compare results.</p>
<h3>Lucy</h3>
<p><strong>Pros</strong>:  Uses knowledge of vector splice sites, giving it the opportunity for more accurate masking and differentiation between vector and spurious matches.  Takes quality into account.</p>
<p><strong>Cons</strong>: <em>Very</em> picky about vector input, and in particular splice site sequences must be in the correct orientation and sequence. A bug makes it fail on long Fasta headers (although I wrote a fix for that).  Can only deal with a single vector at a time, meaning you need to know for each sequence the vector used.</p>
<p>I&#8217;ve noticed that even after masking with either program, some vector remains.  And, it turns out, I&#8217;m not the only one: my buddies at CBU have had the same problem, and there&#8217;s also a paper (<a href="http://www.biomedcentral.com/1471-2164/8/416" title="Chen et al. 2007">Chen et al., 2007</a>).  Unfortunately, the remedy suggested there didn&#8217;t work for our sequences, and in fact, only seems to help for one particular vector type &#8211; that we don&#8217;t use.</p>
<p>Currently, I use a rather ugly hack, using <a href="http://malde.org/~ketil/biohaskell/rbr">RBR</a> to mask, first against E. coli contamination with high stringency, and then against vector, using more lenient settings.  Finally, I use SeqClean to chop off the masked-out bits, as RBR only <strong>X</strong>es out the offending parts, but doesn&#8217;t trim, and <a href="http://compbio.dfci.harvard.edu/tgi/software/">TGICL</a> doesn&#8217;t deal gracefully with masked but untrimmed sequences&#8230;but that&#8217;s another bug.</p>
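<p>Trimming masked ends does not strictly need a separate tool.  As a rough sketch, runs of <strong>X</strong> at either end of a sequence can be stripped with <tt>sed</tt>.  The <tt>trim_masked</tt> helper is hypothetical: it works on a bare sequence string, ignores Fasta headers and quality values, and makes no attempt to reproduce SeqClean&#8217;s behaviour.</p>
<blockquote>
<pre>trim_masked() {
    # delete a leading and a trailing run of X; internal masking is kept
    printf '%s\n' "$1" | sed -e 's/^[Xx]*//' -e 's/[Xx]*$//'
}

trim_masked XXXXACGTXXGATTACAXX   # prints ACGTXXGATTACA</pre>
</blockquote>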
<h2>The plan?</h2>
<p>It should be relatively easy to write a tool that combines the strengths of Lucy and SeqClean, while avoiding the weaknesses.  A useful tool could use BLAST to find all vectors, know about vector-adapter-linker-primer combinations, and identify them.  It could pinpoint sequences that do not follow the typical vector-linker-sequence-poly-A-linker-vector pattern, and perhaps identify chimeric sequences and other artifacts.</p>
<p>It could do a lot of things.  Unfortunately, at the moment the margin of my calendar is too small to fit it in&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
