<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell</title>
	<atom:link href="http://blog.malde.org/index.php/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Tue, 20 Jul 2010 15:04:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Updated software versions available</title>
		<link>http://blog.malde.org/index.php/2010/07/20/updated-software-versions-available/</link>
		<comments>http://blog.malde.org/index.php/2010/07/20/updated-software-versions-available/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 14:43:24 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=167</guid>
		<description><![CDATA[I&#8217;ve just uploaded new versions of various software to HackageDB.  If you have cabal-install on your system, it should now be possible to do e.g. cabal install flowsim to automatically download and compile the program and its dependencies.
bio-0.4.6: A bioinformatics library
The library provides functions and data structures to work with various kinds of bioinformatics data.  [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just uploaded new versions of various software to <a title="The Hackage software repository" href="http://hackage.haskell.org/">HackageDB</a>.  If you have <em>cabal-install</em> on your system, it should now be possible to do e.g. <em>cabal install flowsim</em> to automatically download and compile the program and its dependencies.</p>
<h3>bio-0.4.6: A bioinformatics library<a href="http://blog.malde.org/wp-content/uploads/2010/07/sunflowers.jpg"><img class="alignright size-medium wp-image-171" title="sunflowers" src="http://blog.malde.org/wp-content/uploads/2010/07/sunflowers-282x300.jpg" alt="Image of sunflowers in a military cooking utensil thingy" /></a></h3>
<p>The library provides functions and data structures to work with various kinds of bioinformatics data.  It has <a title="The Haskell Bioinformatics Library" href="/index.php/the-haskell-bioinformatics-library/">its own page here</a>.  New features include support for BLAT&#8217;s PSL format, fixes to SFF and limiting the binary dependency to &lt;0.5 to maintain necessary laziness.</p>
<h3>flowsim-0.2.6: A simulator for 454 pyrosequencing data</h3>
<p>Flowsim is new on Hackage, but also has <a title="Flowsim - a simulator for pyrosequencing data" href="index.php/flowsim/">its own page</a>.  The current version is split into two parts: <em>clonesim</em>, which simulates fragmenting of a genome, and <em>flowsim</em> proper, which simulates flowgrams from the sequences and does base and quality calling in order to output SFF files similar to those generated from the real thing.  Flowsim will be presented at <a title="ECCB10 web site" href="http://www.eccb10.org/">ECCB&#8217;10</a> in Ghent at the end of September.</p>
<h3>flower-0.3: A set of tools to work with pyrosequencing data</h3>
<p>This is a &#8211; dare I say bouquet? &#8211; of various tools for working on and with 454 data.  Flower itself extracts various information from SFF files, and is <a title="Flower - analysis and extraction from 454 SFF files" href="index.php/flower/">documented here</a>, but the package includes other tools as well, namely: the quite renamable <em>flowt</em>, which attempts to remove duplicate clones, an artifact that appears to occur frequently in these data; the more appropriately named <em>frename</em>, which relabels reads uniquely (useful if downstream software requires it); and <em>frecover</em>, which recovers corrupted SFF files, something that happened to us once, but so far hasn&#8217;t happened again.</p>
<p>They should all now be a <em>cabal install</em> away, so if you use these, please let me know how you fare, and whether you find them useful!  I also try to provide <a title="statically linked linux binaries" href="http://malde.org/~ketil/biohaskell/linux-binaries">Linux binaries</a> for recent versions, and hope to provide proper Linux distribution packages (you know, .debs and .rpms) in the future.  If you want to help out, there are also the darcs repositories for <a href="/~ketil/biohaskell/biolib">biolib</a>, <a href="/~ketil/biohaskell/flower">flower</a>, and <a href="/~ketil/biohaskell/flowsim">flowsim</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/07/20/updated-software-versions-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A quick count of popular Haskell libraries in Debian and Ubuntu</title>
		<link>http://blog.malde.org/index.php/2010/07/01/a-quick-count-of-popular-haskell-libraries-in-debian-and-ubuntu/</link>
		<comments>http://blog.malde.org/index.php/2010/07/01/a-quick-count-of-popular-haskell-libraries-in-debian-and-ubuntu/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 08:15:02 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Irrelevant]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=161</guid>
		<description><![CDATA[Don Stewart recently posted a summary of Hackage downloads, which can be seen as a metric of a package&#8217;s popularity.  Of course, there are many other ways to acquire libraries and applications, some may prefer to get the bleeding edge right from the source repository, others prefer the comfort of their distribution&#8217;s repositories.  For the [...]]]></description>
			<content:encoded><![CDATA[<p>Don Stewart recently <a title="Dons hackage summary" href="http://donsbot.wordpress.com/2010/06/30/popular-haskell-packages-q2-2010-report/">posted a summary</a> of Hackage downloads, which can be seen as a metric of a package&#8217;s popularity.  Of course, there are many other ways to acquire libraries and applications: some may prefer to get the bleeding edge right from the source repository, while others prefer the comfort of their distribution&#8217;s repositories.  For the latter, Debian (and by extension, Ubuntu) has a popularity contest package that inspects the system and reports back the list of installed packages and their status.  The data are readily available, so I downloaded the <a title="Debian popcon summary" href="http://popcon.debian.org/by_inst">summary from Debian</a> and <a title="Ubuntu popcon summary" href="http://popcon.ubuntu.com/by_inst">Ubuntu</a>.  Grepping for <tt>libghc6-.*-dev</tt> gives this:</p>
<p><span id="more-161"></span></p>
<pre>% grep libghc6 by_inst.ubuntu.txt | grep -- '-dev ' | head -20
9063  libghc6-mtl-dev                 8399   175  7991   228     5 (Ian Lynagh)
11976 libghc6-glib-dev                4305    25  4184    94     2 (Liyang Hu)
13542 libghc6-x11-dev                 3153    73  2980    99     1 (Ian Lynagh)
14725 libghc6-cairo-dev               2600    17  2535    47     1 (Liyang Hu)
14959 libghc6-xmonad-dev              2502    68  2350    83     1 (Joachim Breitner)
15290 libghc6-gtk-dev                 2367    17  2307    42     1 (Liyang Hu)
15486 libghc6-xmonad-contrib-dev      2288    68  2143    77     0 (Joachim Breitner)
16528 libghc6-quickcheck-dev          1947     2  1938     6     1 (Ian Lynagh)
16829 libghc6-utf8-string-dev         1859    13  1829    16     1 (Unknown)
16953 libghc6-opengl-dev              1825    21  1765    38     1 (Ian Lynagh)
17782 libghc6-network-dev             1622    25  1553    43     1 (Ian Lynagh)
18393 libghc6-gstreamer-dev           1505     9  1447    48     1 (Unknown)
19606 libghc6-xhtml-dev               1276    25  1194    56     1 (Ian Lynagh)
20109 libghc6-x11-xft-dev             1191    69  1042    80     0 (Unknown)
20159 libghc6-parsec-dev              1183     3  1177     1     2 (Ian Lynagh)
20900 libghc6-gtkglext-dev            1079     5  1055    19     0 (Liyang Hu)
21727 libghc6-parsec2-dev              967    41   858    68     0 (Unknown)
22701 libghc6-regex-base-dev           847    21   799    27     0 (Arjan Oosting)
22955 libghc6-pcre-light-dev           819     8   774    37     0 (Unknown)
23107 libghc6-hunit-dev                803    15   767    20     1 (Ian Lynagh)
</pre>
<pre>% grep libghc6 by_inst.debian.txt | grep -- '-dev ' | head -20
6207  libghc6-mtl-dev                 1058   223   649   185     1 (Debian Haskell Group)
7643  libghc6-x11-dev                  677   144   435    98     0 (Debian Haskell Group)
7949  libghc6-xmonad-dev               620   134   392    94     0 (Joachim Breitner)
8414  libghc6-xmonad-contrib-dev       548   125   329    94     0 (Joachim Breitner)
10664 libghc6-x11-xft-dev              349   122   197    30     0 (Debian Haskell Group)
11352 libghc6-network-dev              310    89   181    39     1 (Debian Haskell Group)
11353 libghc6-quickcheck-dev           310     9   287    13     1 (Ian Lynagh)
12947 libghc6-regex-base-dev           238    72   131    34     1 (Debian Haskell Group)
13182 libghc6-parsec2-dev              229    82   101    46     0 (Debian Haskell Group)
13376 libghc6-opengl-dev               222    70   125    27     0 (Debian Haskell Group)
13420 libghc6-regex-posix-dev          220    66   119    34     1 (Debian Haskell Group)
13731 libghc6-hunit-dev                208    74   106    27     1 (Debian Haskell Group)
13788 libghc6-xhtml-dev                206    64   110    32     0 (Debian Haskell Group)
13885 libghc6-regex-compat-dev         203    61   109    32     1 (Debian Haskell Group)
14128 libghc6-stm-dev                  195    71   104    19     1 (Debian Haskell Group)
14665 libghc6-http-dev                 178    60    91    26     1 (Debian Haskell Group)
14707 libghc6-parallel-dev             177    63    81    32     1 (Debian Haskell Group)
14864 libghc6-glut-dev                 173    62    89    22     0 (Debian Haskell Group)
15111 libghc6-html-dev                 166    58    91    16     1 (Debian Haskell Group)
15143 libghc6-zlib-dev                 165    57    80    27     1 (Debian Haskell Group)
</pre>
<p>Some observations:</p>
<ul>
<li> MTL is by far the most popular library, but it&#8217;s only number nine on Don&#8217;s list</li>
<li>GHC (not shown) has about twice the registrations of MTL, in both distributions</li>
<li>but its 17K registered users on Ubuntu is only 1% of the reports, and 2K registered on Debian is about 2%</li>
<li> the most popular libraries after that are graphics-related (X11, xmonad, and Gtk, Cairo and OpenGL on Ubuntu)</li>
</ul>
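<p>For those who would rather stay in Haskell, the grep pipelines above can be approximated by a small function.  This is only a sketch: the name <tt>ghcDevPackages</tt> is made up for illustration, and the column layout (rank, name, installations, and so on) is assumed from the listings above rather than taken from any popcon documentation.</p>
<pre>
```haskell
import Data.Char (isDigit)
import Data.List (isPrefixOf, isSuffixOf)

-- Pick out (package, installations) pairs for GHC library development
-- packages from popcon "by_inst" text. Assumed line layout:
--   rank name inst vote old recent no-files (maintainer)
ghcDevPackages :: String -> [(String, Int)]
ghcDevPackages = concatMap (parseLine . words) . lines
  where
    -- Keep a line only if its name looks like libghc6-*-dev and the
    -- installation count is a plain number; anything else is skipped.
    parseLine (_rank : name : inst : _)
      | "libghc6-" `isPrefixOf` name
      , "-dev" `isSuffixOf` name
      , not (null inst)
      , all isDigit inst = [(name, read inst)]
    parseLine _ = []
```
</pre>
<p>Applied to the contents of <tt>by_inst.ubuntu.txt</tt>, <tt>take 20 . ghcDevPackages</tt> should give roughly the package names and installation counts of the first listing.</p>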
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/07/01/a-quick-count-of-popular-haskell-libraries-in-debian-and-ubuntu/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Snagged!</title>
		<link>http://blog.malde.org/index.php/2010/05/22/snagged/</link>
		<comments>http://blog.malde.org/index.php/2010/05/22/snagged/#comments</comments>
		<pubDate>Sat, 22 May 2010 16:45:57 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[debugging]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=145</guid>
		<description><![CDATA[Recently, I&#8217;ve been burned by a couple of, eh, issues.  Not exactly bugs, but some hidden surprises that have taken some work to iron out.  Below I&#8217;ll make a quick writeup of symptoms, diagnoses, and remedies, in the hope that other people running into the same problems will find it useful.
Static binaries relying on iconv
Symptoms
I [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" title="Snagged (wetcanvas.com)" src="http://www.wetcanvas.com/Community/images/05-Mar-2007/22016-10._Snagged.jpg" alt="" width="204" height="245" />Recently, I&#8217;ve been burned by a couple of, eh, <em>issues</em>.  Not exactly bugs, but some hidden surprises that have taken some work to iron out.  Below I&#8217;ll make a quick writeup of symptoms, diagnoses, and remedies, in the hope that other people running into the same problems will find it useful.</p>
<h2><span id="more-145"></span>Static binaries relying on iconv</h2>
<h3>Symptoms</h3>
<p>I often build static binaries (you know, <tt>ghc -optl-static -optl-pthread</tt>) when I need to run some program on a system that is different from my build system.  Somehow, this seems easier than installing a full build setup.  When I recently tried to run such an executable on a different system, the system responded with <strong>mkTextEncoding: invalid argument</strong>.  <a title="Try it yourself" href="http://www.google.no/search?q=mkTextEncoding%3A+invalid+argument">Googling</a> didn&#8217;t exactly help, except for pointing me to something iconv-related as the culprit.</p>
<h3>Diagnosis</h3>
<p>It turns out that Haskell programs built with GHC now (version 6.12?) rely on iconv to &#8230;well, do some Unicode stuff.  Static linking bundles iconv in the binary, but apparently iconv goes off and reads some dynamic libraries anyway, embedding their paths (as they are on the build system) inside the executable.</p>
<p>Needless to say, this breaks when the target system decides to put these library bits elsewhere.  Specifically, Ubuntu puts these files in<br />
<tt>/usr/lib/gconv/</tt>, while Red Hat puts 32-bit versions in that directory, and 64-bit versions in <tt>/usr/lib64/gconv/</tt>.  A 64-bit binary built on an Ubuntu system thus tries to load 32-bit library code when run on Red Hat, and I guess we&#8217;re lucky we even get an error message.</p>
<h3>Remedy</h3>
<p>Far be it from me to question the wisdom of embedding paths to local library code in static executables; instead, let me just commend the developers for providing the <tt>GCONV_PATH</tt> environment variable, which, when set to point to the lib64 directory, made it possible to run my executable.</p>
<p><em><strong>Update</strong>:</em> Apparently, the exact error produced can vary, and I recently got <strong>openFile: invalid argument (Invalid argument)</strong> instead.  Using <tt>strace</tt> it was clear that the executable was loading the wrong library again, and setting <tt>GCONV_PATH</tt> solved the problem.</p>
<h2>Laziness change in <em>binary</em></h2>
<h3>Symptoms</h3>
<p>I have written a small <a title="Flower - analysis and extraction from 454 SFF files" href="http://blog.malde.org/index.php/flower/">program</a> to analyze and extract information from 454 sequencing files.  In order to efficiently process these files, which can be fairly large, I use the excellent <a title="Binary Haskell library" href="http://code.haskell.org/binary/"><strong>binary</strong> library</a> to decode SFF files at disk speeds.  On <a title="Woe is me" href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg62878.html">one occasion</a>, this program started to use a lot of memory, but with a little help from the <a title="John Lato to the rescue!" href="http://osdir.com/ml/haskell-cafe@haskell.org/2009-07/msg01062.html">community</a>, this got sorted out by some inlining.  Recently, the same thing appeared to happen again, but in spite of all my tampering with the previously offending code, I was unable to build any version without this behavior.</p>
<h3>Diagnosis</h3>
<p>Some quick testing showed that Data.Binary.decode was, contrary to earlier behavior, strictly evaluating its input before returning anything.  A quick mail to the binary developers confirmed that this behavior had been changed, as it sped up GHC itself.</p>
<p>Edit: I didn&#8217;t find it previously, but dons has <a title="Changing binary to make it stricter" href="http://donsbot.wordpress.com/2009/09/16/data-binary-performance-improvments-for-haskell-binary-parsing/">a nice writeup of the change</a>.</p>
<h3>Remedy</h3>
<p>Flower depends on being able to lazily process the list of sequences in its input, and the new version of binary not only caused it to use excessive memory, but also slowed it down.  I don&#8217;t really have any better solution to this (and my use case is probably not important enough to sacrifice GHC performance for <img src='http://blog.malde.org/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />) than to require a binary version less than 0.5.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/05/22/snagged/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tools for pyrosequencing analysis</title>
		<link>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/</link>
		<comments>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:28:27 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=98</guid>
		<description><![CDATA[I recently did a brief presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.
flowers
]]></description>
			<content:encoded><![CDATA[<p>I recently did a brief<img class="alignright" title="Hokusai poppies" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Hokusai_Poppies.jpg/800px-Hokusai_Poppies.jpg" alt="" width="260" height="179" /> presentation of the set of tools I&#8217;ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I&#8217;ll drop my slides below.</p>
<p><a title="A presentation of tools for working with SFF files" href="http://blog.malde.org/wp-content/uploads/2010/02/flowers2.pdf">flowers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/02/19/tools-for-pyrosequencing-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Searching for poly(A) tails</title>
		<link>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/</link>
		<comments>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:03:28 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=65</guid>
		<description><![CDATA[I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently involved in a project where we study, among other things, the 3&#8242;UTR and poly-A tails of certain genes.  For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn&#8217;t find any program or tool to do just that.  Presumably the task is considered too trivial?  So, like many other &#8220;trivial&#8221; tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.</p>
<p>Here&#8217;s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.</p>
<p><span id="more-65"></span></p>
<h3>A quick introduction</h3>
<p>Although the definitions of what constitutes a gene vary considerably, we&#8217;ll use the term to refer to a region of DNA that gets <em><a title="Wikipedia entry for &quot;transcription&quot;" href="http://en.wikipedia.org/wiki/Transcription_%28genetics%29">transcribed</a></em>, that is, copied from DNA into an <a title="Wikipedia entry for &quot;Messenger RNA&quot;" href="http://en.wikipedia.org/wiki/Messenger_RNA">mRNA</a> molecule, which in turn will be used as a blueprint for assembling a protein.  After transcription, the mRNA molecule then undergoes <a title="Wikipedia entry for &quot;polyadenylation&quot;" href="http://en.wikipedia.org/wiki/Polyadenylation"><em>polyadenylation</em></a>, a process where a string of adenines (the &#8216;A&#8217; of the nucleotide alphabet) gets appended to the mRNA molecule before it is exported from the nucleus.</p>
<p>Identifying poly-A tails is important for several reasons.</p>
<ol>
<li>It positively identifies the end of the transcript.  If you don&#8217;t have a poly-A in your sequence, you have no way to know how far the molecule extends beyond the end of the sequence.  You can also find alternatively terminated transcripts this way.</li>
<li>It positively identifies the end of the transcript.  Anything after the poly-A tail is linker or vector sequence and can safely be trimmed off, even if, as is often the case, it is too low quality to be recognized by your average vector masking software.</li>
<li>It provides useful information about the transcript, as the poly-A tail is important for things like protecting the mRNA from degradation.</li>
</ol>
<p>Unfortunately, utilities often trim poly-A tails by default (e.g. SeqClean), or just ignore them (e.g. BLAST&#8217;s low-complexity filter).</p>
<h3>Quality based alignment</h3>
<p>When a molecule is sequenced, the analog output from the sequencing machine is stored as a <em>chromatogram</em>.  In order to be useful, the sequence is <em>called</em>, that is, translated to a string of letters from the familiar {A,C,G,T} nucleotides alphabet.  In addition, the base caller will associate each letter with a <em>quality value</em>.  This is derived from an estimate of the probability of the call being incorrect, and for quality value <em>Q,</em> the error probability estimate is</p>
<pre style="padding-left: 30px;">ϵ = 10<sup>-Q/10</sup></pre>
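<p>As a quick sanity check of the formula, here is a minimal Haskell sketch that converts between quality values and error estimates.  The names <tt>errorProb</tt> and <tt>qualityOf</tt> are mine and do not come from any library.  A quality of 20 corresponds to an estimated one-in-a-hundred chance of a wrong call, and 30 to one in a thousand.</p>
<pre style="padding-left: 30px;">
```haskell
-- Convert a quality value Q to the estimated probability that the
-- base call is wrong: 10^(-Q/10).
errorProb :: Double -> Double
errorProb q = 10 ** (negate q / 10)

-- The inverse: the quality value corresponding to an error probability.
qualityOf :: Double -> Double
qualityOf e = negate 10 * logBase 10 e
```
</pre>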
<p>Traditionally sequence alignment simply aligns the string of characters using a fixed positive score (reward) for aligning similar characters, and fixed negative scores (penalties) for either substituting a different character, opening a gap, or extending a previous gap.</p>
<p>However, taking into account the quality value, <a title="My paper on using sequence quality to improve alignments." href="http://bioinformatics.oxfordjournals.org/cgi/content/full/24/7/897">we can do better</a>, and instead of fixed scores, we can adjust the scores dynamically according to quality.</p>
<p>Using this method, the penalty for e.g. aligning two different characters will depend on the quality of the characters: high quality means a high penalty, low quality &#8212; lower penalty (since there&#8217;s a greater chance one of them was incorrectly called).</p>
<h3>Scoring of alignments</h3>
<p>When calculating the score of an alignment, we really want to answer the question: how likely is this sequence to be a real poly-A sequence, as opposed to just a random string?  In other words, we are comparing our sequence against two models: the poly-A model, and the background model. Our score will use the <em>ratio</em> of probabilities of the string being produced by the two models.</p>
<p>For the poly-A model, only As are allowed, so the probability of a character occurring is 1 for As and 0 for the others.  For the background model, we&#8217;ll just take a uniform distribution of nucleotides, each getting a probability of 0.25.</p>
<p>Using this scheme, each character of a string s contributes a ratio of 1/0.25 = 4 if it is an A, and 0/0.25 = 0 otherwise, and the score for s is the product of these ratios.  We usually work with the logarithm of these numbers to make them more manageable.</p>
<p>The optimal alignment is then simply the longest run of As, since as soon as you multiply with a zero (or add -infinity, if you use <em>log</em>-scores), you lose the whole score.</p>
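<p>To make the quality-free case concrete, here is a small sketch in Haskell.  The names <tt>charScore</tt> and <tt>longestRunOfA</tt> are invented for this illustration and are not part of the tools described here.  It shows the per-character log score, and that finding the best &#8220;alignment&#8221; reduces to finding the longest run of As.</p>
<pre style="padding-left: 30px;">
```haskell
import Data.Char (toUpper)
import Data.List (group)

-- Log-odds score of one character under the quality-free models:
-- log 4 for an A, and log 0 (negative infinity) for anything else.
charScore :: Char -> Double
charScore c
  | toUpper c == 'A' = log (1 / 0.25)
  | otherwise        = log 0

-- Length of the longest uninterrupted run of As (case-insensitive).
longestRunOfA :: String -> Int
longestRunOfA s = maximum (0 : map length (filter isARun runs))
  where
    runs = group (map toUpper s)   -- maximal runs of identical letters
    isARun g = take 1 g == "A"
```
</pre>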
<h3>Adding quality to the mix</h3>
<p>Of course, the actual sequence isn&#8217;t perfect, and even the poly-A tail is likely to contain the odd G, C, or T.  To determine exactly <em>how</em> likely is where the quality value enters the picture. Using the formula above,  we can calculate the error estimate and decide what the penalty for a mismatch and reward for a match should be.</p>
<p>For the poly-A model, the probability for a match (that is, an actual &#8216;A&#8217; in the sequence) is <em>1-ϵ</em>, and the probability of a mismatch (a non-A) is <em>ϵ/3</em> (since only one of the three possible substitutions is an A, and for simplicity, we give them equal probability).  Dividing by the background probability of 0.25, taking logarithms, and using the formula for <em>ϵ</em> as a function of <em>Q</em> (and hopefully not introducing any errors), I get the scores to be:</p>
<pre style="padding-left: 30px;">match q = log (4*(1-1/10**(q/10)))
mismatch q = log 4 - log 3 - q/10*log 10</pre>
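<p>To get a feeling for the numbers, the two scores can be written directly as log-ratios and tabulated for a few qualities.  This is just an illustrative sketch: <tt>errorProb</tt> and <tt>scoreTable</tt> are names I made up, and the qualities in the table are arbitrary picks.</p>
<pre style="padding-left: 30px;">
```haskell
-- Error estimate for quality q: 10^(-q/10).
errorProb :: Double -> Double
errorProb q = 10 ** (negate q / 10)

-- Log-odds of an observed A: poly-A model (1 - e) against background (1/4).
match :: Double -> Double
match q = log ((1 - errorProb q) / 0.25)

-- Log-odds of an observed non-A: poly-A model (e / 3) against background (1/4).
mismatch :: Double -> Double
mismatch q = log ((errorProb q / 3) / 0.25)

-- (quality, match score, mismatch score) for a few arbitrary qualities.
scoreTable :: [(Double, Double, Double)]
scoreTable = map (\q -> (q, match q, mismatch q)) [5, 10, 20, 30, 40]
```
</pre>
<p>At quality 20, a match scores about 1.38 and a mismatch about -4.32, so one confident non-A cancels roughly three As.  At quality 5, the figures are about 1.01 and -0.86, so a dubious non-A costs less than a single match earns.</p>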
<p>Now, we can use this to do a standard Smith-Waterman alignment, calculating a dynamic programming matrix, and searching for an optimal local alignment.</p>
<p>However, since we&#8217;re aligning against a repeated nucleotide, there&#8217;s no real need for a second dimension, and we can use the following recurrence to calculate the &#8220;polyA-score&#8221; <em>M</em> for each position <em>i</em>:</p>
<p style="padding-left: 30px;"><em>M<sub>i</sub> = max (0, S<sub>i</sub> + M<sub>i-1</sub>)</em></p>
<p>To implement this, we first define the list of scores by applying match and mismatch to the list of (nucleotide,quality) pairs.  We also define a scanl-based function to calculate a list of cumulative scores:</p>
<pre style="padding-left: 30px;">scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0</pre>
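<p>A toy run may help to see what the recurrence does.  In this sketch the scores are made-up round numbers rather than real log-odds values, and <tt>cumulativeScores</tt> and <tt>toyScores</tt> are names chosen for the example.</p>
<pre style="padding-left: 30px;">
```haskell
-- The recurrence M_i = max (0, S_i + M_(i-1)), started from M_0 = 0.
cumulativeScores :: [Double] -> [Double]
cumulativeScores = scanl (\m s -> max 0 (m + s)) 0

-- Made-up per-position scores: two matches, one costly mismatch,
-- then three more matches.
toyScores :: [Double]
toyScores = [1, 1, -3, 1, 1, 1]
```
</pre>
<p>Here <tt>cumulativeScores toyScores</tt> evaluates to <tt>[0,1,2,0,1,2,3]</tt>: the mismatch resets the running score to zero, and the best-scoring stretch is the one that ends at the final 3 and starts just after the last zero before it.</p>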
<p>The only remaining thing is to identify the maximal value, which marks the end of the poly-A tail, and the corresponding 0 value that indicates the start.  I wrote a recursive function called &#8220;findmax&#8221; for this, but a better programmer will probably be able to do this with a fold.</p>
<p>Including the parts discussed briefly above, the whole thing looks like this:</p>
<pre style="padding-left: 30px;">findPolyA :: Sequence Nuc -&gt; Maybe (Int,Int)
findPolyA (Seq _ d mq) =
  let qd = zip (B.unpack d) (maybe (repeat 15) BB.unpack mq)
      scores = map (\(c,q) -&gt; if toUpper c=='A' then match q else mismatch q) qd
      match x' = let x = fromIntegral x' in log (4*(1-1/10**(x/10)))
      mismatch x' = let x = fromIntegral x' in log 4 - log 3 - x/10*log 10
      cumulative = scanl (\a b -&gt; let r = a + b in max 0 r) 0
      (zi,mi,maxscore) = findmax $ cumulative scores
  in if maxscore &gt; 12 then Just (zi+1,mi) else Nothing  -- arbitrary constant alert!

findmax :: [Double] -&gt; (Int,Int,Double)
findmax = go 0 (0,0,0) . zip [0..]
  where go _ cm [] = cm
        go _ cm ((i,0):rest) = go i cm rest
        go last_z (cmz,cmi,cmx) ((i,x):rest) = if x &gt; cmx then go last_z (last_z,i,x) rest
                                               else go last_z (cmz,cmi,cmx) rest</pre>
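<p>As for doing <tt>findmax</tt> with a fold, one possible version is sketched below.  It is meant to return the same triple as the recursive function above (the index of the last zero before the maximum, the index of the maximum, and the maximum itself), but the name <tt>findmaxFold</tt> is mine and this is not the code used in the actual tool.</p>
<pre style="padding-left: 30px;">
```haskell
import Data.List (foldl')

-- Fold-based variant of findmax. The accumulator carries the index of
-- the most recent zero together with the best (zero index, max index,
-- max score) triple seen so far.
findmaxFold :: [Double] -> (Int, Int, Double)
findmaxFold = snd . foldl' step (0, (0, 0, 0)) . zip [0 ..]
  where
    step (lastZero, best@(_, _, bestScore)) (i, x)
      | x == 0        = (i, best)                    -- a new candidate start
      | x > bestScore = (lastZero, (lastZero, i, x)) -- a new best end point
      | otherwise     = (lastZero, best)
```
</pre>
<p>On the cumulative list <tt>[0,1,2,0,1,2,3]</tt> this gives <tt>(3,6,3.0)</tt>: the best stretch starts after the zero at index 3 and ends at index 6.</p>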
<h3>Availability</h3>
<p>This method is implemented in a simple tool called &#8220;trimpolya&#8221; (<a title="Darcs repository for 'trimpolya'" href="http://malde.org/~ketil/biohaskell/trimpolya">darcs repo</a>), and also in the more general &#8220;dephd&#8221; (<a title="Darcs repository for 'dephd'" href="http://malde.org/~ketil/biohaskell/dephd">darcs</a>, <a title="Dephd at HackageDB" href="http://hackage.haskell.org/package/dephd">hackage</a>) sequence analysis package.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/12/14/searching-for-polya-tails/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Installing the software on Ubuntu 9.10</title>
		<link>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</link>
		<comments>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 13:26:08 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Installation]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/</guid>
		<description><![CDATA[Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread,  taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around, into the great unknown, but it [...]]]></description>
			<content:encoded><![CDATA[<p>Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread, taking the plunge from the cozy stability of bugs I&#8217;ve learned to work around into the great unknown, but it all went even smoother than the previous one.  And on the plus side, ghc is now, <em>finally</em>, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well.  Great work by Joachim Breitner and his <a href="http://lists.debian.org/debian-haskell/" title="debian-haskell mailing list">army of debianizers</a>.  So I&#8217;m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed.  I&#8217;ll make notes here as I go along, and hopefully this will be useful also for users of other Linux distributions.</p>
<p><span id="more-54"></span></p>
<h3>Installing biolib</h3>
<p>First I need to install the bioinformatics library.  I&#8217;m about to release 0.4.1, but this is also a good opportunity to check that everything works with 0.4 (which is what you&#8217;ll find on Hackage), so let&#8217;s do that first.  Using darcs, I pull the repo up to the 0.4 tag (but you can of course get the tarball from <a href="http://hackage.haskell.org/package/bio">Hackage</a>):</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
Setup.hs: At least the following dependencies are missing:<br />
QuickCheck &lt;2, binary -any<br />
</code></p>
<p>(Side note: you may notice that I run <tt>Setup.hs</tt> directly, as opposed to using <tt>runhaskell</tt>.  I prefer it this way, but you may have to do a <tt>chmod +x Setup.hs</tt> if you downloaded this from the darcs repository or similar.)</p>
<p>Since we want to use the system libraries as far as possible, these libraries are just an apt-get away:</p>
<p><code>sudo apt-get install libghc6-quickcheck1\*<br />
sudo apt-get install libghc6-binary-\*<br />
</code></p>
<p>Now, let&#8217;s try again:</p>
<p><code>% ./Setup.hs configure<br />
Configuring bio-0.4...<br />
% ./Setup.hs build<br />
Preprocessing library bio-0.4...<br />
Building bio-0.4...<br />
Binary: Int64 truncated to fit in 32 bit Int<br />
ghc: panic! (the 'impossible' happened)<br />
(GHC version 6.10.4 for i386-unknown-linux):<br />
Prelude.chr: bad argument</code></p>
<p><code>Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug</code></p>
<p>Okay, this was not what was supposed to happen.  As always, dropping to #haskell on IRC is the first thing to do, and sure enough:</p>
<p><code>&lt;sereven&gt; ketil: that's also shown up for xmonad users when .hi and .o files weren't cleaned<br />
between rebuilds mixing different versions. usually between ghc updates IIRC.<br />
</code></p>
<p>Let&#8217;s try to get rid of old cruft lying about, polluting directories:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>Sure enough, this time it worked.  For good measure, we&#8217;ll run the unit tests:</p>
<p><code>make test</code></p>
<p>After a zillion tests, we notice that everything is go, great!</p>
<h3>Applications</h3>
<p>Next, it is time to go through the list of bioinformatics applications.  Since my working directory is a mess of branches and versions, we&#8217;ll just go over the published applications and versions on Hackage.</p>
<p><strong>xsact</strong> is an application to do sequence (in particular EST sequence) clustering.  It predates and thus doesn&#8217;t actually use the bioinformatics library, but we&#8217;ll check it anyway.  So we try the familiar command line:</p>
<p><code>./Setup.hs clean &amp;&amp; ./Setup.hs configure &amp;&amp; ./Setup.hs build</code></p>
<p>And things compile.  However, the version on Hackage is outdated, so we&#8217;ll upload a new version, 1.7.1.  One test case still fails, but I can&#8217;t imagine anybody is using it to generate Newick-formatted trees &#8212; I am certainly not &#8212; and since there are many equally correct outputs (including tree rotations and rounding modes), output is likely correct anyway.  Holler if you need it!</p>
<p><strong>rbr</strong> is an application to mask repeats in sequence data.  Normally, this is done using a library of known repeats, but this application tries to do it using statistics, making the &#8212; I think justifiable &#8212; assumption that repeats are going to be more common than non-repeats. The version on Hackage is old, and only works with the library prior to 0.4, so again this is a good time to push the latest changes out in the limelight.  Compiling this works great, by the way.</p>
<p><strong>cluster_tools</strong> is a package that contains a bunch of binaries, useful for working with the results of sequence clustering, including extracting various information from ACE files.  This uses another library, called <em>simpleargs</em>, that simplifies command line argument parsing for simple cases.  Again, the Hackage version is for bio&lt;0.4, so a new version will be pushed.  At the same time, we make a mental note to push version 0.2 of simpleargs to Hackage as well, instead of keeping age-old modifications buried forever.</p>
<p><strong>dephd</strong> is my Swiss-army-knife of sequence analysis, and lets you do various things like converting between formats, plotting and trimming by quality.  This is a more live project than most of the others (I&#8217;m currently working on improved quality trimming and automatic generation of files for submission to GenBank), but the currently available version also compiles without incident.</p>
<p><strong>estreps</strong> is a couple of programs I needed for repeat analysis, perhaps not tremendously interesting, but at least <tt>rselect</tt>, which lets you select randomized subsets from Fasta files, might be of interest to some?  We try the usual invocation to compile, and get:</p>
<p><code>src/Unigene.hs:24:23:<br />
Couldn't match expected type `a' against inferred type `Unknown'<br />
`a' is a rigid type variable bound by<br />
the type signature for `clusters' at src/Unigene.hs:22:41<br />
</code></p>
<p>This error arises from the phantom types for identifying sequences that were introduced in bio 0.4.  Unfortunately, this version of <em>estreps</em> already contains some adaptation to this model, so it won&#8217;t compile against older versions either.  So it looks like yet another sdist for Hackage.  Look for version 0.3.1.</p>
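<p>For readers who haven&#8217;t met phantom types, here is a minimal sketch of how such an error comes about.  The names below (<tt>Sequence</tt>, <tt>Nuc</tt>, <tt>Unknown</tt>, <tt>readUnknown</tt>, <tt>asNuc</tt>) are made up for illustration and should not be taken as the library&#8217;s actual definitions:</p>
<pre><tt>
{-# LANGUAGE EmptyDataDecls #-}
module Main where

-- Empty types, used only as tags; they have no values.
data Nuc
data Unknown

-- The parameter t never appears on the right hand side: a phantom type.
newtype Sequence t = Sequence String deriving Show

-- A reader that cannot know what kind of sequence it was given.
readUnknown :: String -&gt; Sequence Unknown
readUnknown = Sequence

-- This signature promises a result for *any* a, so GHC rejects it
-- with a "rigid type variable" error much like the one above:
--
--   clusters :: String -&gt; Sequence a
--   clusters = readUnknown

-- Either say Unknown in the signature, or retag explicitly.
asNuc :: Sequence Unknown -&gt; Sequence Nuc
asNuc (Sequence s) = Sequence s

main :: IO ()
main = print (asNuc (readUnknown "acgt"))
</tt></pre>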
<p><strong>flower</strong> is a utility for extracting information from SFF files (containing sequences from Roche&#8217;s 454 machines).  Although a new version is around the corner, the old 0.2 just works.</p>
<p><strong>xml2x</strong> is a utility for converting BLAST results in XML format into CSVs, that somehow is more compatible with biologists.  Trying to compile it fails, with the following error:<br />
<code><br />
src/Xml2X.hs:152:49:<br />
Couldn't match expected type `[b]'<br />
against inferred type `Maybe Bio.Sequence.KEGG.KO'<br />
In the first argument of `concatMap', namely `(flip M.lookup ks)'<br />
In the first argument of `($)', namely<br />
`concatMap (flip M.lookup ks)'<br />
In the second argument of `($)', namely<br />
`concatMap (flip M.lookup ks) $ map chop $ map subject fs'<br />
</code></p>
<p>It turns out that somewhere along the way, the <tt>lookup</tt> function from <tt>Data.Map</tt> was <a href="http://hackage.haskell.org/trac/ghc/ticket/2309" title="GHC ticket">de-generalized</a> from working on arbitrary monads to just returning a <tt>Maybe</tt>.  I was using this to return a list, using the empty list to signal an unsuccessful lookup.  This is easily remedied, but that means yet another sdist for Hackage.</p>
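<p>In case it helps others hitting the same error, the remedy can be sketched roughly as follows.  The map contents and the name <tt>lookupAll</tt> are invented for the example; this is not the actual xml2x code:</p>
<pre><tt>
import qualified Data.Map as M
import Data.Maybe (maybeToList)

-- A stand-in for the real map.
ks :: M.Map String Int
ks = M.fromList [("K00001", 1), ("K00002", 2)]

-- Old style, relying on lookup in the list monad (no longer type checks):
--   concatMap (flip M.lookup ks) keys
-- New style: turn the Maybe into a zero- or one-element list explicitly.
lookupAll :: [String] -&gt; [Int]
lookupAll = concatMap (maybeToList . flip M.lookup ks)

main :: IO ()
main = print (lookupAll ["K00001", "missing", "K00002"])  -- prints [1,2]
</tt></pre>
<p><tt>Data.Maybe.mapMaybe</tt> would do the same job in a single step.</p>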
<p><strong>korfu</strong> is a utility for identifying open reading frames in sequence data.  It hasn&#8217;t yet been ported to version 0.4 of the library, but works if you install 0.3.5.  I updated this too, since it didn&#8217;t have a category.  Now it too resides in the bioinformatics section.</p>
<h3>Summing up</h3>
<p>In retrospect, it seems like giving old code a thorough spring cleaning once in a while is a good idea.  Although nothing really critical or difficult happened, a good number of small annoyances were discovered, and a bunch of new sdists are now ready to be uploaded to Hackage.   Next will be converting all this into debian packages.</p>
<p>The important question is of course, how do we avoid this in the future?  During development, it is important to be able to modify libraries and applications, but installing a new version of the biolib, say, overwrites the old one, and suddenly I&#8217;m compiling and testing everything against a different library than Joe Random Hackage User is going to find.  I have some thoughts on how to avoid this, but if you have a method that works nicely, I&#8217;m all ears.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/10/07/installing-the-software-on-ubuntu-910/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A (too) brief Biohaskell presentation</title>
		<link>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</link>
		<comments>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 12:27:00 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Examples]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/</guid>
		<description><![CDATA[I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, and especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that I managed to communicate some of the ideas, and submit the slides and other material here for posterity.  (I&#8217;m happy to receive comments, too, just in case I&#8217;ll do a revised version of the talk later on).</p>
<ul>
<li> <a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.pdf" title="biohaskell slides">My biohaskell slides</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/biohaskell.tex" title="slides’ LaTeX source, using beamer.cls">slides’ LaTeX source, using beamer.cls</a></li>
<li><a href="http://blog.malde.org/wp-content/uploads/2009/09/lpssm.ps" title="Lazy PSSM  paper">Lazy PSSM JFP-paper</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/09/15/a-too-brief-biohaskell-presentation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing ints</title>
		<link>http://blog.malde.org/index.php/2009/08/31/parsing-ints/</link>
		<comments>http://blog.malde.org/index.php/2009/08/31/parsing-ints/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 13:06:35 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/08/31/parsing-ints/</guid>
		<description><![CDATA[A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/f/ff/Cempedak_opened1.JPG" alt="Artocarpus integer" align="right" width="376" height="250" />A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the <em>quality</em> data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with <strong>readFasta</strong> from <strong>Bio.Sequence.Fasta</strong>) processes at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using <strong>readFastaQual</strong>) was much slower, about 2-3 MB/s.   After some investigation and a few rewrites, it&#8217;s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on them even further.</p>
<p><span id="more-48"></span></p>
<h3>The original version</h3>
<p>I&#8217;ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn&#8217;t necessary.  Basically, the task is to take a list of ByteString input lines consisting of whitespace separated decimal integers, and build a ByteString consisting of single byte quality values corresponding to those integers.</p>
<p>Below is how the naive quality parsing function might look, unpacking each word to <strong>String</strong> in order to use <strong>read</strong>.  Note that <strong>B</strong> is lazy ByteString.Char8, <strong>BB</strong> is lazy ByteString, that is, the <strong>Word8</strong> version.</p>
<p><tt> </tt></p>
<pre><tt>BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
</tt></pre>
<p><tt></tt></p>
<p>We don&#8217;t expect this to do terribly well, and I guess it&#8217;s no surprise when this parses my test file of about 10MB in 24 seconds.</p>
<h3>Improved versions</h3>
<p>The key to improved performance is first and foremost to avoid the unpacking and parsing of strings.  Thankfully, the ByteString library provides a <strong>readInt</strong> function.</p>
<p><tt> </tt></p>
<pre><tt>
BB.pack [lookup x | x &lt;- concatMap B.words ls]
    where
    lookup x = case B.readInt x of Just (v,_) -&gt; fromIntegral v
                                   Nothing -&gt; error "Unparsable qual value"
</tt></pre>
<p>This isn&#8217;t a lot more complicated than our initial attempt, but the speed increase is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement.  The ByteString implementation will share the storage of the separate words with the original strings, but since <strong>readInt</strong> gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:</p>
<pre><tt>
BB.pack $ readInts $ B.unlines ls
    where readInts xs = case B.readInt xs of </tt></pre>
<pre><tt>                          Just (i,rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
                          Nothing -&gt; []
</tt></pre>
<p>This turns out to be a bit faster, time is now 1.6 seconds.  Another 20% shaved off.</p>
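<p>For anybody who wants to play with this version outside the library, here is a self-contained sketch with the imports spelled out.  The wrapper name <tt>parseQual</tt> and the sample input are made up for the example:</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char (isSpace)

-- Parse whitespace separated decimal integers into single byte values.
-- Note that readInt does not skip leading whitespace, so a line that
-- starts with a blank would end the parse early.
parseQual :: [B.ByteString] -&gt; BB.ByteString
parseQual ls = BB.pack $ readInts $ B.unlines ls
  where
    readInts xs = case B.readInt xs of
      Just (i, rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
      Nothing        -&gt; []

main :: IO ()
main = print (BB.unpack (parseQual [B.pack "10 20 30", B.pack "40 5"]))
-- prints [10,20,30,40,5]
</tt></pre>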
<h3>Final version</h3>
<p>We&#8217;re really only interested in <strong>Word8</strong> values, since quality values always are small, and since that&#8217;s what gets encoded in the result anyway.  The previous versions take a detour by reading <strong>Int</strong>s and using <strong>fromIntegral</strong> to convert them to the desired size.  It bears noting that there is no error checking involved, <strong>fromIntegral</strong> will happily and silently truncate any number beyond its target range.  So let&#8217;s do things explicitly, using <strong>Word8</strong>s throughout the computation:<br />
<tt> </tt></p>
<pre><tt>
BB.pack $ go 0 ls
    where
    isDigit x = x &lt;= 57 &amp;&amp; x &gt;= 48
    go i (s:ss) = case BB.uncons s of </tt></pre>
<pre><tt>                    Just (c,rs) -&gt; if isDigit c then go (c - 48 + 10*i) (rs:ss)
                                   else let rs' = BB.dropWhile (not . isDigit) rs
                                        in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
                    Nothing -&gt; i : go 0 ss
    go _ [] = []
</tt></pre>
<p>This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best <strong>readInt</strong> version, and about 25 times faster than the naive version.  Still, 10MB/s is well below the average hard disk.</p>
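<p>The silent truncation mentioned above is easy to demonstrate, and the <strong>Word8</strong> arithmetic in the final version wraps around in the same way.  If you would rather get an error than a wrong quality value, a checked conversion could look something like this (<tt>toQual</tt> is a hypothetical helper, not part of the library):</p>
<pre><tt>
import Data.Word (Word8)

-- Returns Nothing instead of silently wrapping around.
toQual :: Int -&gt; Maybe Word8
toQual n
  | n &gt;= 0 &amp;&amp; n &lt;= 255 = Just (fromIntegral n)
  | otherwise          = Nothing

main :: IO ()
main = do
  print (fromIntegral (300 :: Int) :: Word8)  -- prints 44, i.e. 300 mod 256
  print (toQual 300)                          -- prints Nothing
  print (toQual 40)                           -- prints Just 40
</tt></pre>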
<p>So is there more room for improvement?  The most obvious wart to me is the rather artificial splitting into lines.  This is mostly an artifact of some early design decisions, and it should be possible to eliminate the splitting earlier on and save even more time by simplifying this function quite a bit.</p>
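<p>To make the idea a bit more concrete, a version that works on one undivided ByteString might be sketched like this.  It is untimed, so whether it is actually faster remains to be seen, and the name <tt>parseQuals</tt> is just for the example:</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB

-- Treat every run of ASCII digits as one number and everything else
-- (spaces, newlines) as a separator.  Values above 255 wrap around.
parseQuals :: BB.ByteString -&gt; BB.ByteString
parseQuals = BB.pack . go . BB.dropWhile (not . isDigit)
  where
    isDigit x = x &gt;= 48 &amp;&amp; x &lt;= 57
    go s
      | BB.null s = []
      | otherwise =
          let (ds, rest) = BB.span isDigit s
              v          = BB.foldl' (\acc d -&gt; 10 * acc + (d - 48)) 0 ds
          in v : go (BB.dropWhile (not . isDigit) rest)

main :: IO ()
main = print (BB.unpack (parseQuals (BB.pack [49, 48, 32, 50, 48, 10, 51, 48, 10])))
-- the input bytes spell "10 20\n30\n"; prints [10,20,30]
</tt></pre>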
<p>If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.</p>
<p><strong>Edit:</strong> Since some people have asked, I&#8217;ve wrapped up a simple test program along with some test files at <a href="http://malde.org/~ketil/biohaskell/qualparsetest">http://malde.org/~ketil/biohaskell/qualparsetest</a>.  This is a simplified version; if you want to be <em>really</em> helpful, you could always look at <strong>Bio.Sequence.Fasta</strong> in the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Bioinformatics library</a> and see if you can speed up e.g. <em>dephd -i input.fasta input.qual -F /dev/null</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/08/31/parsing-ints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A set of tools for working with 454 sequences</title>
		<link>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</link>
		<comments>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 21:32:10 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[sequence analysis]]></category>
		<category><![CDATA[sff]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</guid>
		<description><![CDATA[Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so are the type of data that results from it, and while it is possible [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Dhalia_flower.jpg/800px-Dhalia_flower.jpg" alt="Random flower courtesy of wikimedia" width="212" height="158" align="left" /><a title="Pyrosequencing from Wikipedia" href="http://en.wikipedia.org/wiki/Pyrosequencing">Pyrosequencing</a> is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional <a title="Old school sequencing, Wikipedia" href="http://en.wikipedia.org/wiki/DNA_sequencing">Sanger sequencing</a> as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so are the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.</p>
<p>The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Haskell bioinformatics library</a> has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche&#8217;s 454 sequencing.  Once the library functionality is in place, it is easy to develop small tools for doing the various chores.   After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort I put into the library, I&#8217;ve instead written a few programs myself, including tools for information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and recovering reads from broken SFF files (frecover).  This is their story.<span id="more-45"></span></p>
<p>The bioinformatics library includes functions and data structures to read and write and extract data from SFF files.  It provides reasonable performance (meaning that in most cases, my disk is the limiting factor).  It is also a clean room implementation, built from <a title="454 manual (see page 528++)" href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf">official documentation</a>, but not based on Roche&#8217;s (or anybody else&#8217;s, for that matter) code.  It comes with an LGPL license, which I hope will make it useful while encouraging back contributions.</p>
<h3>Flower</h3>
<p><a title="Previous article on flower" href="http://blog.malde.org/index.php/flower/">This program</a> can extract various information from SFF files.  To some extent, it overlaps with &#8216;sffinfo&#8217; and &#8216;sfffile&#8217; from 454, but in addition to generating Fasta sequences (optionally with quality), it can also extract directly to the more compact FastQ format.  It also can output a (huge) table of flow values, the histogram of flow values (useful for estimating the flow distributions, and thus probabilities for the base calls), or a one-line summary of each read that includes some statistics on lengths and quality.  Here&#8217;s the usage info:</p>
<blockquote>
<pre>flower: Usage: flower -[f|q|r|R] &lt;file.sff&gt; [&lt;file2.sff&gt; ..]
  -r  output reads in Fasta format
  -R  output reads in Fasta format with associated .qual
      (generates files instead of writing to &lt;stdout&gt;)
  -q  output in FastQ format
  -f  output the flowgram in tabular format
  -h  output a histogram table of flow values
  -s  output a summary of each read</pre>
</blockquote>
<h3>FSelect</h3>
<p>FSelect takes an SFF file and produces a new SFF file containing a subset of the sequences, using the same statistics as Flower can output.  It has a small expression language built in, so that you can build more complex logical queries.  For instance, if you want to extract the sequences with length between 300 and 400, and with a K² score greater than 0.7, you could do</p>
<blockquote>
<pre>fselect "And (Func GT k2 0.7) (And (Func GT len 300) (Func LT len 400))" FL61AHU01.sff</pre>
</blockquote>
<p>Okay, it&#8217;s a bit clunky, but the syntax should be reasonably straightforward: logical operators are <tt>And</tt>, <tt>Not</tt>, and <tt>Or</tt>, while <tt>Func op f v</tt> defines a function that uses <tt>op</tt> (either <tt>LT</tt> or <tt>GT</tt>) to compare <tt>f</tt> (one of <tt>k2, len, tlen, ncount</tt>) against the value <tt>v</tt> for each read.  Output is generated in a file named &#8220;selected.sff&#8221;.<br />
FSelect can also select random reads, using the select expression <tt>"Rand p"</tt>, where p is the probability for selection.  (I.e. <tt>fselect "Rand 0.2" FL61AHU01.sff</tt> will select each read with a probability of 0.20, giving you approximately 20% of the reads.)  Random selection cannot be combined with other criteria at this point; if you want this, you&#8217;ll have to run <tt>fselect</tt> multiple times.</p>
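<p>If the expression language seems opaque, it may help to think of it as a small data type with an evaluator.  The following is only a sketch of that idea, not the actual FSelect source: the type and field names are invented, only two of the statistics are modelled, and <tt>Rand</tt> is left out.</p>
<blockquote>
<pre>
import Prelude hiding (Ordering(..))  -- frees up the names LT and GT

data Op    = LT | GT
data Field = K2 | Len
data Expr  = And Expr Expr | Or Expr Expr | Not Expr
           | Func Op Field Double

-- Per-read statistics (hypothetical record).
data Stats = Stats { k2 :: Double, len :: Double }

-- Decide whether a read with the given statistics is selected.
eval :: Expr -&gt; Stats -&gt; Bool
eval (And a b) r = eval a r &amp;&amp; eval b r
eval (Or a b)  r = eval a r || eval b r
eval (Not a)   r = not (eval a r)
eval (Func op f v) r = compareWith op (field f) v
  where
    field K2  = k2 r
    field Len = len r
    compareWith LT = (&lt;)
    compareWith GT = (&gt;)

-- Length between 300 and 400, and K² greater than 0.7.
query :: Expr
query = And (Func GT K2 0.7) (And (Func GT Len 300) (Func LT Len 400))

main :: IO ()
main = print (map (eval query) [Stats 0.9 350, Stats 0.9 450, Stats 0.5 350])
-- prints [True,False,False]
</pre>
</blockquote>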
<h3>FRecover</h3>
<p>We had some issues with broken SFF files, specifically there were blocks of zero bytes at random places. Both <tt>flower</tt> and <tt>sffinfo</tt> just terminate on encountering a broken read, so I implemented a simple utility that attempts to skip the broken block and continue extracting good reads beyond the trouble.</p>
<h3>PyroSim</h3>
<p>This attempts to simulate pyrosequencing.  For now it only does the GS20 generation of 454 sequencing, but the other generations should be easy to add.  The main problem is that the algorithm for quality calling is insufficiently documented.  GS20 has the advantage that quality for a homopolymer is uniquely derived from the flow value (modulo rounding), so reverse engineering it is fairly straightforward.</p>
<p>Anyway, <tt>pyrosim</tt> takes a &#8216;generation&#8217;  (at the moment, this is only GS20) and a Fasta file as input parameters, picks random points in the Fasta file, and produces the corresponding flowgram, including a suitable perturbation of the values to introduce the expected measure of noise, calls the bases and quality, and outputs an SFF file.</p>
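<p>The noise-free part of that step is simple enough to sketch: each flow reports the length of the homopolymer run it extends.  This is an illustration only, not pyrosim&#8217;s code; the function name is invented, the TACG flow order is an assumption, and the perturbation and quality calling are left out entirely.</p>
<blockquote>
<pre>
-- Ideal flow values: for each flowed nucleotide, the length of the
-- homopolymer run it consumes from the front of the sequence.
flowgram :: String -&gt; String -&gt; [Int]
flowgram (f:fs) sq@(_:_) =
  let (run, rest) = span (== f) sq
  in length run : flowgram fs rest
flowgram _ _ = []

main :: IO ()
main = do
  let flowOrder = cycle "TACG"             -- assumed flow order
      -- Letters outside the flow order would never be consumed,
      -- so drop them to make sure the recursion terminates.
      sq = filter (`elem` "ACGT") "TTAGG"
  print (flowgram flowOrder sq)            -- prints [2,1,0,2]
</pre>
</blockquote>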
<h3>Availability</h3>
<p>Flower, FSelect and FRecover are part of the &#8220;flower&#8221; package, available from the <a title="Darcs revision control system" href="http://darcs.net/">darcs</a> archive at <a title="flower darcs repo" href="http://malde.org/~ketil/biohaskell/flower">http://malde.org/~ketil/biohaskell/flower</a></p>
<p>PyroSim is available separately (for now, at least) from the darcs archive at  <a title="pyrosim darcs repo" href="http://malde.org/~ketil/biohaskell/pyrosim">http://malde.org/~ketil/biohaskell/pyrosim</a></p>
<p>I try to upload what I consider stable versions to HackageDB, please check the <a title="Bioinformatics at Hackage" href="http://hackage.haskell.org/packages/archive/pkg-list.html#cat:bioinformatics">Bioinformatics</a> category there.  Currently, these programs are in a bit of flux, so going with the darcs repo is probably a good idea at this point.</p>
<p>I&#8217;d really like to have packages for the most common Linux distributions as well (i.e. .debs and .rpms), but I don&#8217;t know the details of how to produce them, and while I&#8217;ve made half-hearted attempts in the past, I guess I just don&#8217;t really desire it enough.  I&#8217;d be happy to see somebody package it up, so if you know how, please go ahead.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dephd updates</title>
		<link>http://blog.malde.org/index.php/2009/06/16/dephd-updates/</link>
		<comments>http://blog.malde.org/index.php/2009/06/16/dephd-updates/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:14:02 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/06/16/dephd-updates/</guid>
		<description><![CDATA[Dephd is a small application for performing various analysis of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB, this is just a quick note describing new features.
Filtering out empty sequences.
Phred often produces [...]]]></description>
			<content:encoded><![CDATA[<p>Dephd is a small application for performing various analysis of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the <a href="http://en.wikipedia.org/wiki/Phred_base_calling" title="Phred base calling on Wikipedia">basecaller <tt>phred</tt></a>, but it has since grown a bit beyond that.  A new update was just pushed onto <a href="http://hackage.haskell.org/package/dephd" title="dephd on Hackage">HackageDB</a>, this is just a quick note describing new features.<span id="more-44"></span></p>
<h3>Filtering out empty sequences.</h3>
<p><tt>Phred</tt> often produces zero-length sequences, and this confuses other programs.  While <tt>BLAST</tt> will just output a warning, <tt>SeqClean</tt> &#8212; or to be precise, <tt>cln2qual</tt> &#8212; will break down.   (My own code using the Bioinformatics library treats all sequences the same regardless of length, so zero-length sequences are perfectly okay). Anyway, you can now use <tt>dephd -z</tt> to eliminate them from the output.</p>
<h3>Sequence Clipping</h3>
<p>Sequence trimming or clipping is often necessary to remove contamination like vector sequence, or simply low quality sequence parts.  Typically, both of these occur at the ends of the sequences.  Many programs (including dephd, but also phred, lucy, seqclean and others) add trimming information to the sequence header.  Dephd is now able to act on this information and clip the sequences.  The trimming information is then obsolete as the coordinates have changed, so it is replaced with the clipping coordinates.   Dephd also provides its own quality assessment, and with the -q option, sequence ends where the sliding window average quality is below 15 will be clipped.  This is pretty heavy-handed, but it seems I get better EST clustering with this enabled.</p>
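<p>To give an idea of what sliding window clipping means, here is a rough sketch.  It is not dephd&#8217;s actual code: the window size is a made-up parameter (it must be positive), the function names are invented, and only the threshold of 15 comes from the description above.</p>
<pre><tt>
import Data.List (tails)

-- Average quality of every full window of w consecutive positions.
windowMeans :: Int -&gt; [Int] -&gt; [Double]
windowMeans w qs =
  [ fromIntegral (sum win) / fromIntegral w
  | t &lt;- tails qs
  , let win = take w t
  , length win == w ]

-- Half-open range [start, end) spanning the first to the last window
-- whose average reaches the threshold; Nothing if no window does.
clipRange :: Int -&gt; Double -&gt; [Int] -&gt; Maybe (Int, Int)
clipRange w threshold qs =
  case [ i | (i, m) &lt;- zip [0 ..] (windowMeans w qs), m &gt;= threshold ] of
    []   -&gt; Nothing
    good -&gt; Just (head good, last good + w)

main :: IO ()
main = print (clipRange 2 15 [5, 5, 20, 20, 20, 20, 5, 5])  -- prints Just (2,6)
</tt></pre>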
<h3>Old features</h3>
<p>Of course we retain the old features: reading PHD and Fasta/Qual files, masking (to lower case/N, without clipping) by quality, generating quality plots, writing Fasta/Qual output, and ranking sequences by quality.</p>
<p><em>Edit:</em> In the latest release, there&#8217;s now also a fix for a problem with drawing quality graphs with gnuplot.  It turns out that my shiny new Ubuntu ships with gnuplot 4.2, but your crappy old distribution ships with an older version, and that there are some incompatibilities in the input formats.  I&#8217;ve now reverted this to use old-style format only, so hopefully it should work with gnuplots back to 3.7 or so.   And for those SLS or MCC interim die-hards out there, I&#8217;ll even add an option to dump the gnuplot file itself, so that you can copy it to a floppy and generate the plots on a computer with a modern color display. How&#8217;s that for user friendly?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/06/16/dephd-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
