<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biohaskell &#187; Uncategorized</title>
	<atom:link href="http://blog.malde.org/index.php/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.malde.org</link>
	<description>bioinformatics and haskell</description>
	<lastBuildDate>Tue, 20 Jul 2010 15:04:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Snagged!</title>
		<link>http://blog.malde.org/index.php/2010/05/22/snagged/</link>
		<comments>http://blog.malde.org/index.php/2010/05/22/snagged/#comments</comments>
		<pubDate>Sat, 22 May 2010 16:45:57 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[debugging]]></category>

		<guid isPermaLink="false">http://blog.malde.org/?p=145</guid>
		<description><![CDATA[Recently, I&#8217;ve been burned by a couple of, eh, issues.  Not exactly bugs, but some hidden surprises that have taken some work to iron out.  Below I&#8217;ll make a quick writeup of symptoms, diagnoses, and remedies, in the hope that other people running into the same problems will find it useful.
Static binaries relying on iconv
Symptoms
I [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" title="Snagged (wetcanvas.com)" src="http://www.wetcanvas.com/Community/images/05-Mar-2007/22016-10._Snagged.jpg" alt="" width="204" height="245" />Recently, I&#8217;ve been burned by a couple of, eh, <em>issues</em>.  Not exactly bugs, but some hidden surprises that have taken some work to iron out.  Below I&#8217;ll make a quick writeup of symptoms, diagnoses, and remedies, in the hope that other people running into the same problems will find it useful.</p>
<h2><span id="more-145"></span>Static binaries relying on iconv</h2>
<h3>Symptoms</h3>
<p>I often build static binaries (you know, <tt>ghc -optl-static -optl-pthread</tt>) when I need to run some program on a system that is different from my build system.  Somehow, this seems easier than installing a full build setup.  When I recently tried to run such an executable on a different system, the system responded with <strong>mkTextEncoding: invalid argument</strong>.  <a title="Try it yourself" href="http://www.google.no/search?q=mkTextEncoding%3A+invalid+argument">Googling</a> didn&#8217;t exactly help, except pointing me to something iconv-related as the culprit.</p>
<h3>Diagnosis</h3>
<p>It turns out that Haskell programs built with GHC now (version 6.12?) rely on iconv to &#8230;well, do some Unicode stuff.  Static linking bundles iconv in the binary, but apparently iconv goes off and reads some dynamic libraries anyway, embedding their paths (as they are on the build system) inside the executable.</p>
<p>Needless to say, this breaks when the target system decides to put these library bits elsewhere.  Specifically, Ubuntu puts these files in<br />
<tt>/usr/lib/gconv/</tt>, while Red Hat puts 32-bit versions in that directory, and 64-bit versions in <tt>/usr/lib64/gconv/</tt>.  A 64-bit binary built on an Ubuntu system thus tries to load 32-bit library code, and I guess we&#8217;re lucky we even get an error message.</p>
<h3>Remedy</h3>
<p>Far be it from me to question the wisdom of embedding paths to local library code in static executables; instead, let me just commend the developers for providing the <tt>GCONV_PATH</tt> environment variable, which, when set to point to the lib64 directory, made it possible to run my executable.</p>
<p><em><strong>Update</strong>:</em> Apparently, the exact error produced can vary, and I recently got <strong>openFile: invalid argument (Invalid argument)</strong> instead.  Using <tt>strace</tt> it was clear that the executable was loading the wrong library again, and setting <tt>GCONV_PATH</tt> solved the problem.</p>
<h2>Laziness change in <em>binary</em></h2>
<h3>Symptoms</h3>
<p>I have written a small <a title="Flower - analysis and extraction from 454 SFF files" href="http://blog.malde.org/index.php/flower/">program</a> to analyze and extract information from 454 sequencing files.  In order to efficiently process these files, which can be fairly large, I use the excellent <a title="Binary Haskell library" href="http://code.haskell.org/binary/"><strong>binary</strong> library</a> to decode SFF files at disk speeds.  On <a title="Woe is me" href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg62878.html">one occasion</a>, this program started to use a lot of memory, but with a little help from the <a title="John Lato to the rescue!" href="http://osdir.com/ml/haskell-cafe@haskell.org/2009-07/msg01062.html">community</a>, this got sorted out by some inlining.  Recently, the same thing appeared to happen again, but in spite of all my tampering with the previously offending code, I was unable to build any version without this behavior.</p>
<h3>Diagnosis</h3>
<p>Some quick testing showed that Data.Binary.decode was, contrary to earlier behavior, strictly evaluating its input before returning anything.  A quick mail to the binary developers confirmed that this behavior had been changed, as it sped up GHC itself.</p>
<p>Edit: I didn&#8217;t find it previously, but dons has <a title="Changing binary to make it stricter" href="http://donsbot.wordpress.com/2009/09/16/data-binary-performance-improvments-for-haskell-binary-parsing/">a nice writeup of the change</a>.</p>
<h3>Remedy</h3>
<p>Flower depends on being able to lazily process the list of sequences in its input, and the new version of binary not only caused it to use excessive memory, but also slowed it down.  I don&#8217;t really have any better solution to this (and my use case is probably not important enough to sacrifice GHC performance for <img src='http://blog.malde.org/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> ) than to require binary version less than 0.5.</p>
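<p>For comparison, here is a minimal sketch of one way to get a lazily produced list of records back with newer releases of binary.  This is not what Flower does; it assumes a release of binary that provides <tt>runGetOrFail</tt> (later than the versions discussed above), and <tt>lazyMany</tt> is a made-up name used only for illustration.</p>
<pre>
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16be, runGetOrFail)
import Data.Word (Word16)

-- Run the same decoder over and over, handing out each value as soon as
-- it has been decoded rather than after the whole input is consumed.
-- (Hypothetical helper, not part of binary or Flower.)
lazyMany :: Get a -&gt; BL.ByteString -&gt; [a]
lazyMany g bs
  | BL.null bs = []
  | otherwise  = case runGetOrFail g bs of
      Left (_, _, err)   -&gt; error ("decode failed: " ++ err)
      Right (rest, _, x) -&gt; x : lazyMany g rest

main :: IO ()
main = print (take 2 (lazyMany getWord16be (BL.pack [0,1,0,2,0,3])) :: [Word16])
</pre>
<p>Only as much of the input as is needed for the requested elements gets decoded, which is the property the older <tt>decode</tt> provided for free.</p>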
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2010/05/22/snagged/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing ints</title>
		<link>http://blog.malde.org/index.php/2009/08/31/parsing-ints/</link>
		<comments>http://blog.malde.org/index.php/2009/08/31/parsing-ints/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 13:06:35 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/08/31/parsing-ints/</guid>
		<description><![CDATA[A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/f/ff/Cempedak_opened1.JPG" alt="Artocarpus integer" align="right" width="376" height="250" />A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the <em>quality</em> data that often accompanies Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with <strong>readFasta</strong> from <strong>Bio.Sequence.Fasta</strong>) processes at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using <strong>readFastaQual</strong>) was much slower, about 2-3 MB/s.   After some investigation and a few rewrites, it&#8217;s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on them even further.</p>
<p><span id="more-48"></span></p>
<h3>The original version</h3>
<p>I&#8217;ve taken the liberty of cleaning things up a bit, and removing some context that I sure hope isn&#8217;t necessary.  Basically, the task is to take a list of ByteString input lines consisting of whitespace separated decimal integers, and build a ByteString consisting of single byte quality values corresponding to those integers.</p>
<p>Below is how the naive quality parsing function might look, unpacking each word to <strong>String</strong> in order to use <strong>read</strong>.  Note that <strong>B</strong> is lazy ByteString.Char8, <strong>BB</strong> is lazy ByteString, that is, the <strong>Word8</strong> version.</p>
<p><tt> </tt></p>
<pre><tt>BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls
</tt></pre>
<p><tt></tt></p>
<p>We don&#8217;t expect this to do terribly well, and I guess it&#8217;s no surprise when this parses my test file of about 10MB in 24 seconds.</p>
<h3>Improved versions</h3>
<p>The key to improved performance is first and foremost to avoid the unpacking and parsing of strings.  Thankfully, the ByteString library provides a <strong>readInt</strong> function.</p>
<p><tt> </tt></p>
<pre><tt>
BB.pack [lookup x | x &lt;- concatMap B.words ls]
    where
    lookup x = case B.readInt x of Just (v,_) -&gt; fromIntegral v
                                   Nothing -&gt; error "Unparsable qual value"
</tt></pre>
<p>This isn&#8217;t a lot more complicated than our initial attempt, but the speed increase is considerable: slightly less than 2 seconds for the test file, more than a tenfold improvement.  The ByteString implementation will share the storage of the separate words with the original strings, but since <strong>readInt</strong> gives us back the rest of the string in addition to the parsed integer, we might as well make use of it:</p>
<pre><tt>
BB.pack $ readInts $ B.unlines ls
    where readInts xs = case B.readInt xs of
                          Just (i,rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
                          Nothing -&gt; []
</tt></pre>
<p>This turns out to be a bit faster, time is now 1.6 seconds.  Another 20% shaved off.</p>
<h3>Final version</h3>
<p>We&#8217;re really only interested in <strong>Word8</strong> values, since quality values are always small, and since that&#8217;s what gets encoded in the result anyway.  The previous versions take a detour by reading <strong>Int</strong>s and using <strong>fromIntegral</strong> to convert them to the desired size.  It bears noting that there is no error checking involved: <strong>fromIntegral</strong> will happily and silently truncate any number beyond its target range.  So let&#8217;s do things explicitly, using <strong>Word8</strong>s throughout the computation:<br />
<tt> </tt></p>
<pre><tt>
BB.pack $ go 0 ls
    where
    -- ASCII '0'..'9' are the byte values 48..57
    isDigit x = x &lt;= 57 &amp;&amp; x &gt;= 48
    -- i accumulates the value of the number currently being read
    go i (s:ss) = case BB.uncons s of
                    Just (c,rs) -&gt; if isDigit c then go (c - 48 + 10*i) (rs:ss)
                                   else let rs' = BB.dropWhile (not . isDigit) rs
                                        in if BB.null rs' then i : go 0 ss else i : go 0 (rs':ss)
                    Nothing -&gt; i : go 0 ss
    go _ [] = []
</tt></pre>
<p>This is the fastest one so far, clocking in at 0.94 seconds, over 40% faster than the best <strong>readInt</strong> version, and about 25 times faster than the naive version.  Still, 10MB/s is well below the average hard disk.</p>
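<p>For anyone who wants to experiment, the three fragments can be wrapped up roughly as follows.  This is only a sketch: the function names <tt>naive</tt>, <tt>viaReadInt</tt> and <tt>viaWord8</tt> and the two sample lines are invented here, and the check only confirms that the variants agree on well-formed input, it says nothing about speed.</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char (isSpace)
import Data.Word (Word8)

naive, viaReadInt, viaWord8 :: [B.ByteString] -&gt; BB.ByteString

-- Unpack every word to a String and use read.
naive ls = BB.pack $ map (read . B.unpack) $ B.words $ B.unlines ls

-- Use readInt and keep going on the remainder it returns.
viaReadInt ls = BB.pack $ readInts $ B.unlines ls
  where
    readInts xs = case B.readInt xs of
      Just (i, rest) -&gt; fromIntegral i : readInts (B.dropWhile isSpace rest)
      Nothing        -&gt; []

-- Accumulate Word8 values directly, byte by byte.
viaWord8 ls = BB.pack $ go 0 ls
  where
    isDigit x = x &gt;= 48 &amp;&amp; x &lt;= 57
    go :: Word8 -&gt; [BB.ByteString] -&gt; [Word8]
    go i (s:ss) = case BB.uncons s of
      Just (c, rs)
        | isDigit c -&gt; go (c - 48 + 10 * i) (rs : ss)
        | otherwise -&gt;
            let rs' = BB.dropWhile (not . isDigit) rs
            in if BB.null rs' then i : go 0 ss else i : go 0 (rs' : ss)
      Nothing -&gt; i : go 0 ss
    go _ [] = []

main :: IO ()
main = do
  let ls = map B.pack ["40 40 38", "7 12 0"]
  -- True when both faster variants reproduce the naive result
  print (all (== naive ls) [viaReadInt ls, viaWord8 ls])
</tt></pre>
<p>Note that the <strong>Word8</strong> variant assumes each line starts with a digit and is not empty; blank lines or leading whitespace would produce spurious zero values.</p>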
<p>So is there more room for improvement?  The most obvious wart to me is the rather artificial splitting into lines.  This is mostly an artifact of some early design decisions, and it should be possible to eliminate the splitting earlier on, saving even more time by simplifying this function quite a bit.</p>
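<p>As a rough illustration of what dropping the line splitting could look like, here is a sketch that works on one unsplit ByteString.  The name <tt>parseQuals</tt> is made up, the code has not been benchmarked against the versions above, and like them it lets values above 255 wrap around silently.</p>
<pre><tt>
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Word (Word8)

-- Treat every run of ASCII digits as one value and skip everything else,
-- so newlines need no special handling.
parseQuals :: BB.ByteString -&gt; BB.ByteString
parseQuals = BB.pack . go
  where
    isDigit x = x &gt;= 48 &amp;&amp; x &lt;= 57
    step acc d = 10 * acc + (d - 48)
    go :: BB.ByteString -&gt; [Word8]
    go s
      | BB.null s' = []
      | otherwise  = BB.foldl' step 0 ds : go rest
      where
        s'         = BB.dropWhile (not . isDigit) s
        (ds, rest) = BB.span isDigit s'

main :: IO ()
main = print (BB.unpack (parseQuals (B.pack "40 40 38\n7 12 0\n")))
</tt></pre>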
<p>If you spot anything else, or have suggestions, I (and my darcs repo) am all ears.</p>
<p><strong>Edit:</strong> Since some people have asked, I&#8217;ve wrapped up a simple test program along with some test files at <a href="http://malde.org/~ketil/biohaskell/qualparsetest">http://malde.org/~ketil/biohaskell/qualparsetest</a>.  This is a simplified version; if you want to be <em>really</em> helpful, you could always look at <strong>Bio.Sequence.Fasta</strong> in the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Bioinformatics library</a> and see if you can speed up e.g. <em>dephd -i input.fasta input.qual -F /dev/null</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/08/31/parsing-ints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A set of tools for working with 454 sequences</title>
		<link>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</link>
		<comments>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 21:32:10 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[sequence analysis]]></category>
		<category><![CDATA[sff]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/</guid>
		<description><![CDATA[Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so is the type of data that results from it, and while it is possible [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Dhalia_flower.jpg/800px-Dhalia_flower.jpg" alt="Random flower courtesy of wikimedia" width="212" height="158" align="left" /><a title="Pyrosequencing from Wikipedia" href="http://en.wikipedia.org/wiki/Pyrosequencing">Pyrosequencing</a> is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional <a title="Old school sequencing, Wikipedia" href="http://en.wikipedia.org/wiki/DNA_sequencing">Sanger sequencing</a> as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so is the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.</p>
<p>The <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/">Haskell bioinformatics library </a>has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche&#8217;s 454 sequencing.  Once the library functionality is in place, it is easy to develop small tools for doing the various chores.   After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort I put down in the library, I&#8217;ve instead written a few programs myself, including tools for information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frecover).  This is their story.<span id="more-45"></span></p>
<p>The bioinformatics library includes functions and data structures to read and write and extract data from SFF files.  It provides reasonable performance (meaning that in most cases, my disk is the limiting factor).  It is also a clean room implementation, built from <a title="454 manual (see page 528++)" href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf">official documentation</a>, but not based on Roche&#8217;s (or anybody else&#8217;s, for that matter) code.  It comes with an LGPL license, which I hope will make it useful while encouraging back contributions.</p>
<h3>Flower</h3>
<p><a title="Previous article on flower" href="http://blog.malde.org/index.php/flower/">This program</a> can extract  various information from SFF files.  To some extent, it overlaps with &#8216;sffinfo&#8217; and &#8216;sfffile&#8217; from 454, but in addition to generating Fasta sequences (optionally with quality), it can also extract directly to the more compact FastQ format.  It also can output a (huge) table of flow values, the histogram of flow values (useful for estimating the flow distributions, and thus probabilities for the base calls), or a one-line summary of each read that includes some statistics on lengths and quality.  Here&#8217;s the usage info:</p>
<blockquote>
<pre>flower: Usage: flower -[f|q|r|R] &lt;file.sff&gt; [&lt;file2.sff&gt; ..]
  -r  output reads in Fasta format
  -R  output reads in Fasta format with associated .qual
      (generates files instead of writing to &lt;stdout&gt;)
  -q  output in FastQ format
  -f  output the flowgram in tabular format
  -h  output a histogram table of flow values
  -s  output a summary of each read</pre>
</blockquote>
<h3>FSelect</h3>
<p>FSelect takes an SFF file and produces a new SFF file containing a subset of the sequences, using the same statistics as Flower can output.  It has a small expression language built in, so that you can build more complex logical queries.  For instance, if you want to extract the sequences with length between 200 and 400, and with a K² score less than 0.7, you could do</p>
<blockquote>
<pre>fselect "And (Func LT k2 0.7) (And (Func GT len 200) (Func LT len 400))" FL61AHU01.sff</pre>
</blockquote>
<p>Okay, it&#8217;s a bit clunky, but the syntax should be reasonably straightforward: Logical operators are <tt>And</tt>, <tt>Not</tt>, and <tt>Or</tt>, while <tt>Func op f v</tt> defines a function using <tt>op</tt> (either <tt>LT</tt> or <tt>GT</tt>) to compare <tt>f</tt> (one of <tt>k2, len, tlen, ncount</tt>), computed for each read, against the value <tt>v</tt>.  Output is generated in a file named &#8220;selected.sff&#8221;.<br />
FSelect can also select random reads, using the select expression <tt>"Rand p"</tt>, where p is the probability for selection.  (I.e. <tt>fselect "Rand 0.2" FL61AHU01.sff </tt> will select each read with a probability of 0.20, giving you approximately 20% of the reads.)  Random selection can not be combined with other criteria at this point; if you want this, you&#8217;ll have to run <tt>fselect</tt> multiple times.</p>
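<p>To make the semantics of the expression language concrete, here is a small sketch of how such expressions could be represented and evaluated.  It is not taken from the fselect source: the types, the <tt>Stats</tt> record and the capitalised field names are invented for illustration, <tt>Rand</tt> is left out, and it assumes that <tt>Func LT f v</tt> means &#8220;f is less than v&#8221;.</p>
<pre>
import Prelude hiding (Ordering (..))  -- frees the names LT and GT

data Op    = LT | GT
data Field = K2 | Len | TLen | NCount
data Expr  = And Expr Expr | Or Expr Expr | Not Expr | Func Op Field Double

-- Hypothetical per-read statistics; the real program computes these from the SFF data.
data Stats = Stats { k2, len, tlen, ncount :: Double }

field :: Field -&gt; Stats -&gt; Double
field K2     = k2
field Len    = len
field TLen   = tlen
field NCount = ncount

-- Does a read with the given statistics satisfy the expression?
matches :: Expr -&gt; Stats -&gt; Bool
matches (And a b)     r = matches a r &amp;&amp; matches b r
matches (Or a b)      r = matches a r || matches b r
matches (Not a)       r = not (matches a r)
matches (Func LT f v) r = field f r &lt; v
matches (Func GT f v) r = field f r &gt; v

main :: IO ()
main = do
  let q = And (Func LT K2 0.7) (And (Func GT Len 200) (Func LT Len 400))
  print (matches q (Stats 0.5 350 300 0))
  print (matches q (Stats 0.5 150 120 0))
</pre>
<p>With these assumptions the first read is selected and the second, being too short, is not.</p>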
<h3>FRecover</h3>
<p>We had some issues with broken SFF files, specifically there were blocks of zero bytes at random places. Both <tt>flower</tt> and <tt>sffinfo</tt> just terminate on encountering a broken read, so I implemented a simple utility that attempts to skip the broken block and continue extracting good reads beyond the trouble.</p>
<h3>PyroSim</h3>
<p>This attempts to simulate pyrosequencing.  For now it only does the GS20 generation of 454 sequencing, but the other generations should be easy to add.  The main problem is that the algorithm for quality calling is insufficiently documented.  GS20 has the advantage that quality for a homopolymer is uniquely derived from the flow value (modulo rounding), so reverse engineering it is fairly straightforward.</p>
<p>Anyway, <tt>pyrosim</tt> takes a &#8216;generation&#8217;  (at the moment, this is only GS20) and a Fasta file as input parameters, picks random points in the Fasta file, and produces the corresponding flowgram, including a suitable perturbation of the values to introduce the expected measure of noise, calls the bases and quality, and outputs an SFF file.</p>
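<p>The core idea of turning a sequence into a flowgram can be sketched in a few lines.  This is an illustration only, not code from pyrosim: it produces the noise-free homopolymer length for each flow, leaves out the perturbation, base calling and quality steps entirely, and both the name <tt>idealFlowgram</tt> and the flow order passed in the example are assumptions.</p>
<pre>
-- For each flow (one nucleotide at a time, cycling through the flow order),
-- record how many bases of that kind are incorporated at the current position.
idealFlowgram :: String -&gt; String -&gt; [Int]
idealFlowgram order sq = go (cycle order) (filter (`elem` order) sq)
  where
    go _ [] = []          -- whole sequence consumed
    go [] _ = []          -- not reached: the cycled flow order is infinite
    go (f:fs) s =
      let (run, rest) = span (== f) s
      in length run : go fs rest

main :: IO ()
main = print (idealFlowgram "TACG" "TTAG")
</pre>
<p>Characters that are not part of the flow order (such as N) are dropped up front, since no flow would ever consume them.</p>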
<h3>Availability</h3>
<p>Flower, FSelect and FRecover are part of the &#8220;flower&#8221; package, available from the <a title="Darcs revision control system" href="http://darcs.net/">darcs</a> archive at <a title="flower darcs repo" href="http://malde.org/~ketil/biohaskell/flower">http://malde.org/~ketil/biohaskell/flower</a></p>
<p>PyroSim is available separately (for now, at least) from the darcs archive at  <a title="pyrosim darcs repo" href="http://malde.org/~ketil/biohaskell/pyrosim">http://malde.org/~ketil/biohaskell/pyrosim</a></p>
<p>I try to upload what I consider stable versions to HackageDB, please check the <a title="Bioinformatics at Hackage" href="http://hackage.haskell.org/packages/archive/pkg-list.html#cat:bioinformatics">Bioinformatics</a> category there.  Currently, these programs are in a bit of flux, so going with the darcs repo is probably a good idea at this point.</p>
<p>I&#8217;d really like to have packages for the most common Linux distributions as well (i.e. .debs and .rpms), but I don&#8217;t know the details of how to produce them, and while I&#8217;ve made half-hearted attempts in the past, I guess I just don&#8217;t really desire it enough.  I&#8217;d be happy to see somebody package it up, so if you know how, please go ahead.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/07/03/a-set-of-tools-for-working-with-454-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dephd updates</title>
		<link>http://blog.malde.org/index.php/2009/06/16/dephd-updates/</link>
		<comments>http://blog.malde.org/index.php/2009/06/16/dephd-updates/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:14:02 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[EST analysis]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/06/16/dephd-updates/</guid>
		<description><![CDATA[Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB; this is just a quick note describing new features.
Filtering out empty sequences.
Phred often produces [...]]]></description>
			<content:encoded><![CDATA[<p>Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the <a href="http://en.wikipedia.org/wiki/Phred_base_calling" title="Phred base calling on Wikipedia">basecaller <tt>phred</tt></a>, but it has since grown a bit beyond that.  A new update was just pushed onto <a href="http://hackage.haskell.org/package/dephd" title="dephd on Hackage">HackageDB</a>; this is just a quick note describing new features.<span id="more-44"></span></p>
<h3>Filtering out empty sequences.</h3>
<p><tt>Phred</tt> often produces zero-length sequences, and this confuses other programs.  While <tt>BLAST</tt> will just output a warning, <tt>SeqClean</tt> &#8212; or to be precise, <tt>cln2qual</tt> &#8212; will break down.   (My own code using the Bioinformatics library treats all sequences the same regardless of length, so zero-length sequences are perfectly okay). Anyway, you can now use <tt>dephd -z</tt> to eliminate them from the output.</p>
<h3>Sequence Clipping</h3>
<p>Sequence trimming or clipping is often necessary to remove contamination like vector sequence, or simply low quality sequence parts.  Typically, both of these occur at the ends of the sequences.  Many programs (including dephd, but also phred, lucy, seqclean and others) add trimming information to the sequence header.  Dephd is now able to act on this information and clip the sequences.  The trimming information is now obsolete as the coordinates have changed, so it is replaced with the clipping coordinates.   Dephd also provides its own quality assessment, and with the -q option, sequence ends where the sliding window average quality is below 15 will be clipped.  This is pretty heavy-handed, but it seems I get better EST clustering with this enabled.</p>
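<p>A sliding-window clip of this kind can be sketched as follows.  This is not dephd&#8217;s actual implementation: the window width is a free parameter here (the width dephd uses is not stated above), and <tt>windowMeans</tt> and <tt>clipRange</tt> are hypothetical names.</p>
<pre>
import Data.List (tails)

-- Mean quality of every full window of width w.
windowMeans :: Int -&gt; [Int] -&gt; [Double]
windowMeans w qs =
  [ fromIntegral (sum win) / fromIntegral w
  | win &lt;- map (take w) (tails qs), length win == w ]

-- Half-open range (start, end) spanning the first through the last window
-- whose mean quality reaches the threshold t; Nothing if no window does.
clipRange :: Int -&gt; Double -&gt; [Int] -&gt; Maybe (Int, Int)
clipRange w t qs =
  case [ i | (i, m) &lt;- zip [0 ..] (windowMeans w qs), m &gt;= t ] of
    []         -&gt; Nothing
    is@(i : _) -&gt; Just (i, last is + w)

main :: IO ()
main = print (clipRange 2 15 [5,5,20,20,20,5,5])
</pre>
<p>In the example the low-quality ends are dropped and positions 2 up to (but not including) 5 are kept.</p>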
<h3>Old features</h3>
<p>Of course we retain the old features: reading PHD and Fasta/Qual files, masking (to lower case/N, but without clipping) by quality, generating quality plots, writing Fasta/Qual output, and ranking sequences by quality.</p>
<p><em>Edit:</em> In the latest release, there&#8217;s now also a fix for a problem with drawing quality graphs with gnuplot.  It turns out that my shiny new Ubuntu ships with gnuplot 4.2, but your crappy old distribution ships with an older version, and that there are some incompatibilities in the input formats.  I&#8217;ve now reverted this to use old-style format only, so hopefully it should work with gnuplots back to 3.7 or so.   And for those SLS or MCC Interim die-hards out there, I&#8217;ll even add an option to dump the gnuplot file itself, so that you can copy it to a floppy and generate the plots on a computer with a modern color display. How&#8217;s that for user friendly?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/06/16/dephd-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using a phantom type to label different kinds of sequences</title>
		<link>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/</link>
		<comments>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/#comments</comments>
		<pubDate>Thu, 14 May 2009 11:00:14 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[phantom types]]></category>
		<category><![CDATA[sequences]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/</guid>
		<description><![CDATA[Until now (version 0.3.5 of the bioinformatics library), the Sequence data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure.  This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you [...]]]></description>
			<content:encoded><![CDATA[<p>Until now (version 0.3.5 of the bioinformatics library), the Sequence data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure.  This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you from comparing a nucleotide sequence to a protein letter by letter.  We&#8217;d like some more safety without sacrificing flexibility, and by using phantom types we can get this.  Below is my attempt at implementing this.</p>
<p><span id="more-42"></span> The safety that we seek can be exemplified by sequence alignment or similarity scores, as calculated by e.g. the BLAST suite of programs.  Here we have e.g. blastn for comparing nucleotide sequences, blastp for comparing amino acid sequences, and blastx for comparing nucleotides to proteins by first translating the nucleotides to the possible corresponding amino acid sequences. We&#8217;ll aim for an align function that does the right thing for the various combinations of sequence types.</p>
<h3>The old approach</h3>
<p>Previously, there was only one Sequence data type, defined as:</p>
<pre>
data Sequence = Seq SeqLabel SeqData (Maybe QualData)</pre>
<p>(where <tt>SeqLabel</tt>, <tt>SeqData</tt> and <tt>QualData</tt> are represented as various <tt>ByteStrings,</tt> but you can think of them as synonyms for <tt>String</tt>)</p>
<p>As mentioned, this isn&#8217;t really leveraging the type system much, and at the very least, it&#8217;d be a good thing to be able to differentiate between nucleotide and  amino acid sequence (a.k.a. peptides or proteins).</p>
<h3>The algebraic data type approach</h3>
<p>The default functional programming approach would be to use an algebraic (sum) type, what the imperative programmers would call a union.</p>
<pre>
data Sequence = Nucleotide ... | Protein ...</pre>
<p>This is a good solution when the two branches have different structure, but here you&#8217;d essentially just repeat the structure.  Moreover, all functions will need to do run-time checks for each case, imposing a cost in function complexity, running time, and type safety.  Note that performing the alignment usually requires a score matrix describing the penalty for replacing any given character with any other.</p>
<pre>
align :: Matrix -&gt; Sequence -&gt; Sequence -&gt; Alignment

align mx (Nucleotide ...) (Nucleotide ...) =
align mx (Nucleotide ...) (Protein ...) =
align mx (Protein ...) (Nucleotide ...) =
align mx (Protein ...) (Protein ...) =</pre>
<p>This is not even complete, since a similar restriction applies to the score matrices &#8211; sometimes it will contain replacement penalties for the nucleotide alphabet, and sometimes it will contain penalties for the amino alphabet, and the appropriate matrix must be used in each branch of the align function.</p>
<p>Another problem with this is that this requires you to over-specify the type.  Sometimes you don&#8217;t know or care what kind of sequence you have.  Say you are selecting sequences by name from a Fasta file.  Since the file format is the same you don&#8217;t really care what kind of sequence it is, and forcing it to be one or the other is&#8230;.immoral.</p>
<h3>The third way: phantom types</h3>
<p>The chosen approach is instead to tag the Sequence type with a phantom type parameter, phantom meaning it does not affect the actual representation of the data.  It looks like this:</p>
<pre>data Sequence t = Seq ....  -- but no data member of type 't'!</pre>
<p>Now, we can write our alignment functions, and make them safer to use:</p>
<pre>align :: Matrix t -&gt; Sequence t -&gt; Sequence t -&gt; Alignment
alignX :: Matrix Amino -&gt; Sequence Nuc -&gt; Sequence Amino -&gt; Alignment</pre>
<p>Note that we also phantom-typed the Matrix type.  With this approach,  incorrect usage like comparing sequences of different type with the generic align will be flagged by the compiler, so run time checks are no longer necessary.</p>
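<p>A stripped-down, self-contained sketch of the idea follows.  It is not the library&#8217;s actual definition: <tt>Sequence</tt> is reduced to a wrapper around a <tt>String</tt>, the &#8220;alignment&#8221; is just an ungapped score, all tags are empty types, and the matrix <tt>identity</tt> is made up for the example.</p>
<pre>
-- Tag types; they have no values and exist only at the type level.
data Nuc
data Amino

newtype Sequence t = Seq String
newtype Matrix t   = Matrix (Char -&gt; Char -&gt; Int)

-- Both sequences and the matrix must carry the same tag.
align :: Matrix t -&gt; Sequence t -&gt; Sequence t -&gt; Int
align (Matrix score) (Seq a) (Seq b) = sum (zipWith score a b)

-- A toy matrix usable with any tag: +1 for a match, -1 for a mismatch.
identity :: Matrix t
identity = Matrix (\x y -&gt; if x == y then 1 else -1)

main :: IO ()
main = do
  let dna1 = Seq "ACGT" :: Sequence Nuc
      dna2 = Seq "ACGA" :: Sequence Nuc
      prot = Seq "MKVL" :: Sequence Amino
  print (align identity dna1 dna2)
  print (align identity prot prot)
  -- print (align identity dna1 prot)  -- rejected: Nuc does not match Amino
</pre>
<p>Uncommenting the last line gives a compile-time type error, which is exactly the check that the sum type version would have to perform at run time.</p>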
<p>On the other hand, readFasta can now be given the type:</p>
<pre>readFasta :: FilePath -&gt; IO (Sequence a)</pre>
<p>or:</p>
<pre>readFasta :: FilePath -&gt; IO (Sequence Unknown)</pre>
<p>depending on how much you trust the programmer.  Since I&#8217;m writing most of the programs using this library, I know how much you can trust application programmers, so the second and safer method is chosen.</p>
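<p>With the second signature, application code needs some explicit step to turn an <tt>Unknown</tt> sequence into a tagged one.  One possible shape for such a step is a checked conversion, sketched below.  The function <tt>asNuc</tt> is hypothetical and not part of the library, <tt>Sequence</tt> is again reduced to a <tt>String</tt> wrapper, and the alphabet test is a simple heuristic: a peptide that happens to consist only of the letters A, C, G, T and N would slip through.</p>
<pre>
import Data.Char (toUpper)

data Nuc
data Unknown

newtype Sequence t = Seq String

-- Re-tag a sequence as nucleotide only if every letter fits the alphabet.
asNuc :: Sequence Unknown -&gt; Maybe (Sequence Nuc)
asNuc (Seq s)
  | all ((`elem` "ACGTN") . toUpper) s = Just (Seq s)
  | otherwise                          = Nothing

main :: IO ()
main = do
  let shown = fmap (\(Seq s) -&gt; s)
  print (shown (asNuc (Seq "acgtn")))
  print (shown (asNuc (Seq "MKVL")))
</pre>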
<h3> In practice</h3>
<p>The old biolib repository has now been replaced by two new ones: biolib-stable, currently containing version 0.3.5, and biolib-unstable at 0.4.0. Perhaps unsurprisingly, the latter version contains the phantomly typed Sequence definition.</p>
<p>Currently, three types are used for tags:  Amino, which is the type for the amino acid alphabet, and Nuc and Unknown, which are dummy types without any data constructors, and used solely for this purpose.</p>
<p>So what are the experiences so far?  Well, it does complicate things.   For some file types the sequence type is known, for instance the output of nucleotide sequencing machines in the form of ABI, SCF, or SFF files.  But often, file formats are agnostic with respect to sequence types, the most ubiquitous offender being the Fasta format. Currently, this is handled by the Unknown type tag, but I&#8217;m not entirely convinced this is the optimal solution.</p>
<p>Code needs to be updated, but mostly this is relatively easy.  The type system will spot the difficulties, and often you can get by simply by replacing Sequence in the type signatures with Sequence t.  Just make sure that t isn&#8217;t already used in the signature &#8211; I stumbled into this one.</p>
<p>Of course, this is only one step towards making better use of the type system.  One could go further, and tag different types of nucleotide sequences based on sequencing technology: Sanger sequencing has quite different error characteristics from 454 sequencing, and Solexa has a completely different interpretation of the quality values than either of those.  Also, one might want to include the presence or absence of quality data in the type as well.  There are also different amino acid alphabets &#8211; should they have different types?  This is a difficult design space &#8211; how much information should be encoded in the type?</p>
<p>While I think this is an improvement, I&#8217;m very curious how this works out in practice, or if there are other options I should consider.  Please comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/05/14/using-a-phantom-type-to-label-different-kinds-of-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Current developments&#8230;</title>
		<link>http://blog.malde.org/index.php/2009/03/13/current-developments/</link>
		<comments>http://blog.malde.org/index.php/2009/03/13/current-developments/#comments</comments>
		<pubDate>Fri, 13 Mar 2009 15:59:45 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/03/13/current-developments/</guid>
		<description><![CDATA[During my vacation, I experimented with phantom types for the Sequence data type.  Basically, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence.  A more detailed writeup is [...]]]></description>
			<content:encoded><![CDATA[<p>During my vacation, I experimented with phantom types for the Sequence data type.  Basically, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence.  A more detailed writeup is in the works, but currently, I&#8217;ve pushed the darcs repo to <a href="http://malde.org/~ketil/biohaskell/biolib-phantom/">http://malde.org/~ketil/biohaskell/biolib-phantom/</a> so if everything works out, this will be the next release (0.4).  (Note to self: we now have a stable and a development branch.  Almost like a serious and all grown up software project. Professionality &#8211; Yay!)</p>
<p>Also, since short reads are all the rage, and my flower program appears to be used a bit, I&#8217;ve done a quick writeup of its features and use as a <a href="http://blog.malde.org/index.php/flower/" title="Flower page">static page</a>.  I&#8217;ll try to keep it updated as things progress.  Popularity &#8211; Yay!</p>
<p>Finally, I got some help compiling everything on some less mainstream operating systems (&#8220;Windows&#8221;, I think it is called).  Mostly, things appear to work, and some improvements &#8211; albeit portability-neutral ones &#8211; were made.  Portability &#8211; Yay!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/03/13/current-developments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from PADL</title>
		<link>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/</link>
		<comments>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 09:10:06 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2009/02/11/notes-from-padl/</guid>
		<description><![CDATA[Time flees.  It&#8217;s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks on topics I mostly managed to follow and meet interesting and interested people.  Thanks to the organizers and committees for making it all possible.  I presented a paper on Bloom filters that Bryan O&#8217;Sullivan and I wrote, [...]]]></description>
			<content:encoded><![CDATA[<p>Time flees.  It&#8217;s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks on topics I mostly managed to follow and meet interesting and interested people.  Thanks to the organizers and committees for making it all possible.  I presented a paper on Bloom filters that Bryan O&#8217;Sullivan and I wrote, and thought I&#8217;d make the paper available along with the slides (expanded somewhat), and a couple of ideas for extending Bloom filters that I think are original (or &#8220;novel&#8221;, as they say in science).</p>
<p><span id="more-31"></span>First things first, here are the files:</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/padl.pdf" title="Bloom filters for bioinformatics">Bloom filters for bioinformatics</a> &#8211; paper</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloomfilter.pdf" title="Bloom filters for bioinformatics">Bloom filters for bioinformatics</a> &#8211; slides</p>
<p>The new additions to the slides are the counting Bloom filters and a brief overview of how to locate matches in a Bloom filter.  The standard counting Bloom filter replaces the array of bits with an array of bit buckets (of size <em>b</em>, say).  This lets the filter count up to <em>2^b - 1</em> occurrences of each element, typically using saturating counts.  If you want to retain the false positive rate, this means you&#8217;ll expand the size of the Bloom filter by a factor of <em>b</em>, which can be quite significant. In the example below, <em>b</em> is set to 3, and the filter saturates at a count of 7.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting.png" title="counting bloom filter"><img src="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting.png" alt="counting bloom filter" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
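<p>For readers who prefer code to pictures, here is a rough sketch of a standard counting Bloom filter with saturating buckets.  It is a simplified illustration and not the implementation from the paper: the buckets live in an IntMap instead of a packed bit array, and hashN is a toy family of hash functions that I made up for this example.</p>
<pre>module Main where

import qualified Data.IntMap.Strict as IM
import Data.Char (ord)
import Data.List (foldl', nub)

-- A counting Bloom filter with saturating b-bit buckets.
data CountingBloom = CountingBloom
  { cbSize    :: Int            -- number of buckets
  , cbBits    :: Int            -- b, bits per bucket
  , cbHashes  :: Int            -- k, hash functions per element
  , cbBuckets :: IM.IntMap Int  -- bucket index to count; missing means 0
  }

emptyCB :: Int -&gt; Int -&gt; Int -&gt; CountingBloom
emptyCB size b k = CountingBloom size b k IM.empty

-- Toy family of hash functions indexed by i (an assumption, not a good hash).
hashN :: Int -&gt; Int -&gt; String -&gt; Int
hashN size i = foldl' step (i + 1)
  where step h c = (h * 31 + ord c + i) `mod` size

-- The distinct buckets an element maps to.
buckets :: CountingBloom -&gt; String -&gt; [Int]
buckets cb x = nub [ hashN (cbSize cb) i x | i &lt;- [0 .. cbHashes cb - 1] ]

-- Increment every bucket of x, saturating at 2^b - 1.
insertCB :: String -&gt; CountingBloom -&gt; CountingBloom
insertCB x cb = cb { cbBuckets = foldl' bump (cbBuckets cb) (buckets cb x) }
  where
    maxCount = 2 ^ cbBits cb - 1
    bump m i = IM.insertWith (\_ old -&gt; min maxCount (old + 1)) i 1 m

-- The estimated count is the smallest bucket value among x's buckets.
countCB :: String -&gt; CountingBloom -&gt; Int
countCB x cb = minimum [ IM.findWithDefault 0 i (cbBuckets cb) | i &lt;- buckets cb x ]

main :: IO ()
main = do
  let cb = iterate (insertCB "ACGT") (emptyCB 1024 3 3) !! 10
  print (countCB "ACGT" cb)  -- 7: ten insertions, saturated at 2^3 - 1</pre>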
<p>Therefore, I propose a different kind of counting Bloom filter, using a distributed count.  Keep in mind that we have an infinite sequence of hash functions.  When inserting a new value, we can check the first <em>k</em> hash functions to see if it&#8217;s already there.  If so, we calculate some <em>additional</em> hash functions to represent the count of two for this value.  And so on, until we find a count (and corresponding set of hashes) that has at least one zero bit; we then set those bits to one, and move on to the next value.  It may make sense here to decrease the number of hash values as the count goes up.  In the example below, we see that x is already inserted, and two new hash values are calculated to increase the count.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting2.png" title="distributed counting bloom filter"><img src="http://blog.malde.org/wp-content/uploads/2009/02/bloom_filter_counting2.png" alt="distributed counting bloom filter" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
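<p>The distributed count can be sketched in a few lines as well.  Again, this is only an illustration under assumptions of my own: the bits are kept in an IntSet, hashN is a made-up toy hash family, every count level uses the same number of hashes (the variant with fewer hashes at higher counts is left out), and the count is capped at the number of bits so that a completely full filter cannot send the lookup into an endless loop.</p>
<pre>module Main where

import qualified Data.IntSet as IS
import Data.Char (ord)
import Data.List (foldl')

data DistBloom = DistBloom
  { dbSize   :: Int        -- number of bits in the filter
  , dbHashes :: Int        -- k, hash functions per count level
  , dbBits   :: IS.IntSet  -- indices of the bits set to one
  }

-- Toy family of hash functions indexed by i (an assumption, not a good hash).
hashN :: Int -&gt; Int -&gt; String -&gt; Int
hashN size i = foldl' step (i + 1)
  where step h c = (h * 31 + ord c + i) `mod` size

-- The bits that encode "x has been seen at least (level + 1) times".
levelBits :: DistBloom -&gt; String -&gt; Int -&gt; [Int]
levelBits db x level =
  [ hashN (dbSize db) i x | i &lt;- [level * k .. level * k + k - 1] ]
  where k = dbHashes db

-- Are all bits of the given level set?
levelSet :: DistBloom -&gt; String -&gt; Int -&gt; Bool
levelSet db x level = all (`IS.member` dbBits db) (levelBits db x level)

-- Count the consecutive levels that are fully set, starting from the first.
countDB :: String -&gt; DistBloom -&gt; Int
countDB x db = length (takeWhile (levelSet db x) [0 .. dbSize db - 1])

-- Set the bits of the first level that still has a zero bit.
insertDB :: String -&gt; DistBloom -&gt; DistBloom
insertDB x db =
  db { dbBits = foldr IS.insert (dbBits db) (levelBits db x (countDB x db)) }

main :: IO ()
main = do
  let db = iterate (insertDB "ACGT") (DistBloom 1024 2 IS.empty) !! 3
  print (countDB "ACGT" db)  -- 3 here; collisions can only inflate a count</pre>
<p>As with any Bloom filter, errors are one-sided: bits set by other values can make a count look higher than it is, but never lower.</p>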
<p>There&#8217;s a trade-off, of course.  Insertion and lookup are no longer constant time, but take time proportional to the count of the value. However, the Bloom filter is likely to be more compact &#8211; given <em>n</em> values in total, of which <em>u</em> are unique, the standard counting filter needs a size proportional to <em>bu</em>, while the distributed counting filter needs a size proportional to <em>n</em>.  If you want to count accurately, <em>b</em> needs to be at least the base-2 logarithm of the maximum count of any value.</p>
<p>Finally, it&#8217;s also possible to use Bloom filters for searching &#8211; for instance, locating unique words in a genome. A lookup in a Bloom filter basically gives one bit of information &#8211; present or not present.  We use this with a series of <em>m</em> Bloom filters, each indexing the regions of the genome corresponding to one bit of location information.  Looking up a unique word in the set of filters reveals the location with <em>m</em> bits of precision.</p>
<p><a href="http://blog.malde.org/wp-content/uploads/2009/02/locating.png" title="locating bloom filters"><img src="http://blog.malde.org/wp-content/uploads/2009/02/locating.png" alt="locating bloom filters" style="background: white none repeat scroll 0% 0%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial" /></a></p>
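<p>Here is a small sketch of the locating scheme, assuming a toy genome, words of length four, and locations numbered from one (so that an all-zero answer can mean &#8220;not found&#8221;).  The Bloom filter itself is a bare-bones IntSet of bit indices with a made-up hash family; all names are invented for the example.</p>
<pre>module Main where

import qualified Data.IntSet as IS
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

type Bloom = IS.IntSet  -- indices of the bits set to one

bloomSize, numHashes :: Int
bloomSize = 4096
numHashes = 3

-- Toy family of hash functions indexed by i (an assumption, not a good hash).
hashN :: Int -&gt; String -&gt; Int
hashN i = foldl' step (i + 1)
  where step h c = (h * 31 + ord c + i) `mod` bloomSize

bitsFor :: String -&gt; [Int]
bitsFor w = [ hashN i w | i &lt;- [0 .. numHashes - 1] ]

bloomInsert :: Bloom -&gt; String -&gt; Bloom
bloomInsert f w = foldr IS.insert f (bitsFor w)

bloomMember :: Bloom -&gt; String -&gt; Bool
bloomMember f w = all (`IS.member` f) (bitsFor w)

-- Filter j indexes every word whose location has bit j set.
buildIndex :: Int -&gt; [(Int, String)] -&gt; [Bloom]
buildIndex m located =
  [ foldl' bloomInsert IS.empty [ w | (loc, w) &lt;- located, testBit loc j ]
  | j &lt;- [0 .. m - 1] ]

-- One lookup per filter recovers one bit of the location.
locate :: [Bloom] -&gt; String -&gt; Int
locate filters w =
  foldl' setBit 0 [ j | (j, f) &lt;- zip [0 ..] filters, bloomMember f w ]

main :: IO ()
main = do
  let genome  = "ACGTTGCAAGGCTA"
      wordLen = 4
      located = zip [1 ..] [ take wordLen (drop i genome)
                           | i &lt;- [0 .. length genome - wordLen] ]
      filters = buildIndex 4 located
  print (locate filters "CAAG")</pre>
<p>The word CAAG starts at the seventh position of the toy genome, so the lookup should report 7, unless a false positive in the fourth filter adds a spurious high bit.  That is the price of the scheme: a false positive in any single filter corrupts the reported location.</p>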
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2009/02/11/notes-from-padl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The wee beginnings of a biohaskell tutorial? &#8212; and some thoughts on programming productivity</title>
		<link>http://blog.malde.org/index.php/2008/08/14/the-wee-beginnings-of-a-biohaskell-tutorial-and-some-thoughts-on-programming-productivity/</link>
		<comments>http://blog.malde.org/index.php/2008/08/14/the-wee-beginnings-of-a-biohaskell-tutorial-and-some-thoughts-on-programming-productivity/#comments</comments>
		<pubDate>Thu, 14 Aug 2008 12:35:16 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/08/14/the-wee-beginnings-of-a-biohaskell-tutorial-and-some-thoughts-on-programming-productivity/</guid>
		<description><![CDATA[In my copious spare time, I&#8217;ve started putting together a tutorial for using the biohaskell library.  It&#8217;ll probably take some time &#8212; anything from a while to an eternity &#8212; until it&#8217;s complete, but I thought I&#8217;d follow the adage of &#8220;release early, release often&#8221; in the hope that the intermediate product may prove useful [...]]]></description>
			<content:encoded><![CDATA[<p>In my copious spare time, I&#8217;ve started putting together a tutorial for using the <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/" title="BioLib page">biohaskell library</a>.  It&#8217;ll probably take some time &#8212; anything from a while to an eternity &#8212; until it&#8217;s complete, but I thought I&#8217;d follow the adage of &#8220;release early, release often&#8221; in the hope that the intermediate product may prove useful to somebody, somewhere.  It shouldn&#8217;t really require any prior knowledge of Haskell nor of biology.  Anyway, it&#8217;s <a href="http://blog.malde.org/index.php/biohaskell-tutorial-part-1-sequence-data/" title="BioHaskell tutorial, part 1">here</a>, please take a look, and tell me what you think!</p>
<p>In other news, inspired by an article on programmer productivity at <a href="http://lwn.net/Articles/293037/" title="Val Henson reviews ">LWN</a>, I ran <a href="http://www.cse.unsw.edu.au/~dons/darcs-graph.html" title="The darcs-graph home page">darcs-graph</a> on my code to <a href="http://malde.org/~ketil/biohaskell/activity.html" title="darcs-graph output">see how I do</a>.  I guess I consider myself an about average programmer, and it looks like I can average about five commits a day when I&#8217;m working on a project.  I&#8217;m occasionally touching 20 commits, but that&#8217;s probably a built-up backlog of patches.  Let&#8217;s see where this puts me:</p>
<p><span id="more-24"></span></p>
<p>If you accept the proposition that the best programmers outperform the poor ones by a factor of 50, an average programmer should outperform a poor one by a factor of about seven (the square root of 50), and also be outperformed by a top-notch one by the same amount.  So if I&#8217;m about average, the best ones should be able to beat me by a factor of about seven, which means 35 commits a day, or about ten minutes per commit throughout the day.  Scaling up my peak of 20 commits a day would give 140 commits, about 20 per hour, or on average 3 minutes between each.  Conversely, a lousy programmer should be struggling hard to make a commit at all during his working day.  Both ends of the spectrum seem a bit incredible, but I guess the lousy end is slightly more so, and I have to accept the fact that I&#8217;m officially a below-average programmer.  But before I get all depressed about it, I must point out that on the <a href="http://www.cse.unsw.edu.au/~dons/images/commits/community/" title="darcs-graph for various projects">Haskell community overview</a>, no project gets anywhere near 35 commits a day on average.  The busiest is GHC, which touches 15 commits.  The most likely explanation is that the 50:1 ratio is way off the mark, and that the old 10:1 ratio is closer to the truth.</p>
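<p>For the record, the back-of-the-envelope numbers can be reproduced with a few lines of Haskell.  The sketch assumes that the average programmer sits halfway between worst and best on a logarithmic scale (hence the square root), rounds the factor to seven as in the text, and assumes a seven-hour working day, which is my own guess.</p>
<pre>module Main where

-- Assumption: a working day of seven hours.
dayMinutes :: Double
dayMinutes = 7 * 60

main :: IO ()
main = do
  let exact  = sqrt 50 :: Double                        -- halfway on a log scale
      factor = fromIntegral (round exact :: Int) :: Double  -- rounded, as in the text
      perDay commits     = commits * factor             -- a top programmer's commits per day
      minutesPer commits = dayMinutes / perDay commits  -- minutes between those commits
  putStrLn ("square root of 50: " ++ show exact)
  putStrLn ("scaling my average of 5 a day: " ++ show (perDay 5) ++ " commits, one every "
            ++ show (minutesPer 5) ++ " minutes")
  putStrLn ("scaling my peak of 20 a day: " ++ show (perDay 20) ++ " commits, one every "
            ++ show (minutesPer 20) ++ " minutes")</pre>
<p>With a seven-hour day the first figure comes out a little above the ten minutes quoted, which is close enough for this kind of estimate.</p>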
<p>As an aside: Productivity is often measured in source lines of code (<a href="http://en.wikipedia.org/wiki/Source_lines_of_code" title="Wikipedia definition, incl. criticism">SLOC</a>), which is duly criticized for being imprecise.  For instance, some of the most important and beneficial changes <em>remove</em> lines; what&#8217;s that, <em>negative</em> productivity?  In contrast, I find that I really like the number of commits approach.  A commit, at least for me, is a small piece of modification, usually one to maybe ten lines of effective changes.  As such, it represents sort of a minimal, atomic modification to the code, and encapsulates the smallest coherent unit of brain sweat.  In the <a href="http://en.wikipedia.org/wiki/Source_lines_of_code" title="Wikipedia definition of SLOC.  Didn't you already check this?">Wikipedia page linked to above</a>, Bill Gates is quoted as saying &#8220;<em>Measuring programming progress by lines of code is like measuring aircraft building progress by weight.</em>&#8221;  In contrast, measuring programming progress by commits is like measuring aircraft building progress by parts, which I think is much more sensible.  Of course, other people may have different notions on the granularity of patches.  Most of what I&#8217;ve seen seems to agree with mine, though.</p>
<h2>(Even more) random notes</h2>
<p>Here&#8217;s a rather <a href="http://www.joelonsoftware.com/articles/HighNotes.html" title="Joel on software: Hitting the high notes">entertaining read</a> on productivity, although it uses time and quality as metrics, and the standard error for the listed projects is about 0.5 (i.e. time use is about 20 hours ± 10 hours), with a factor of 10 between the worst and best programmer average.  And, disregarding the span from worst to best, the <a href="http://www.webfoot.com/blog/2008/02/06/vandev-talk-summary/" title="Talk summary by Kaitlin Duck Sherwood.">shape of the distribution</a> has certain ramifications, too.</p>
<p>One of the things I remember from Brooks is that he ascribed one order of magnitude productivity improvement to moving from assembly to a high level language.  Further, he thought that no such improvement was possible again from moving to yet higher level languages.  I tend to agree with <a href="http://t-a-w.blogspot.com/2007/02/yanniss-law-programmer-productivity.html" title="Yannis's law on programmer productivity increase">Yannis&#8217;s law</a> that this is too pessimistic, although I also think a doubling every six years is too optimistic.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/08/14/the-wee-beginnings-of-a-biohaskell-tutorial-and-some-thoughts-on-programming-productivity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Updates and other trivialities</title>
		<link>http://blog.malde.org/index.php/2008/07/31/updates-and-other-trivialities/</link>
		<comments>http://blog.malde.org/index.php/2008/07/31/updates-and-other-trivialities/#comments</comments>
		<pubDate>Thu, 31 Jul 2008 15:05:41 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/index.php/2008/07/31/updates-and-other-trivialities/</guid>
		<description><![CDATA[Just some quick notes:
Hackage submissions updated
There seem to have been problems with some of the bioinformatics applications on Hackage, thanks to Don S. for pointing it out.  That should be fixed now by new uploads, but I&#8217;m still waiting for the automatic builds to register results.  And, since you ask, it was all [...]]]></description>
			<content:encoded><![CDATA[<p>Just some quick notes:</p>
<h2>Hackage submissions updated</h2>
<p>There seem to have been problems with some of the bioinformatics applications on Hackage, thanks to Don S. for pointing it out.  That should be fixed now by new uploads, but I&#8217;m still waiting for the automatic builds to register results.  And, since you ask, it was all my fault for being sloppy with version dependencies.  It&#8217;d all work with a recent biolib.  Speaking of which,</p>
<h2>A home page for the bioinformatics library</h2>
<p>I&#8217;ve finally updated the static home page for the library; it can be admired (especially if you remember what Oscar Wilde had to say about that) <a href="http://blog.malde.org/index.php/the-haskell-bioinformatics-library/" title="biolib home page">here</a>.</p>
<h2>I&#8217;ve discovered HPC</h2>
<p>No, not high-performance computing (<a href="http://www.cs.toronto.edu/~gvwilson/articles/hpc-considered-harmful-2008.pdf" title="HPC considered harmful">which sucks anyway</a>), but GHC&#8217;s new <a href="http://www.haskell.org/hpc" title="Haskell Program Coverage">ability to do coverage profiling</a>.  Adding it to the default testing procedure took just a couple of extra options; the results can be admired by browsing the <a href="http://malde.org/~ketil/biohaskell/biolib/hpc_index.html" title="HPC information for testing in the biolib">HTML files in the darcs repository</a>.  This is very cool, almost too easy, and looks pretty good. The only downside is that it exposes my sloppy attitude to testing.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/07/31/updates-and-other-trivialities/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Welcome to the biohaskell blog!</title>
		<link>http://blog.malde.org/index.php/2008/04/25/welcome-to-the-biohaskell-blog/</link>
		<comments>http://blog.malde.org/index.php/2008/04/25/welcome-to-the-biohaskell-blog/#comments</comments>
		<pubDate>Fri, 25 Apr 2008 18:46:21 +0000</pubDate>
		<dc:creator>ketil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malde.org/2008/04/25/welcome-to-the-biohaskell-blog/</guid>
		<description><![CDATA[
The last few years, I&#8217;ve been working in bioinformatics, the science of taking the results of hard-working computer scientists and applying them to the results of equally hard-working biologists.  You probably think this looks like an easy route to fame and fortune, and you&#8217;d be right.  In fact, most bioinformaticists just [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.malde.org/wp-content/uploads/2008/04/biohaskell-white1.png" title="biohaskell-white1.png"><img src="http://blog.malde.org/wp-content/uploads/2008/04/biohaskell-white1.png" alt="biohaskell-white1.png" width="139" height="132" /></a></p>
<p><span id="more-3"></span>The last few years, I&#8217;ve been working in bioinformatics, the science of taking the results of hard-working computer scientists and applying them to the results of equally hard-working biologists.  You probably think this looks like an easy route to fame and fortune, and you&#8217;d be right.  In fact, most bioinformaticists just string together some crusty old command line tools with Perl scripts, or even a bit of Java code, and call it a day.</p>
<p>To make things a bit more interesting &#8211; interesting to me, that is; the biologists won&#8217;t know the difference, and most computer scientists are too busy teaching Java to web programmers to notice &#8211; I&#8217;m using Haskell instead.  I don&#8217;t know why <em>you</em> are using Haskell, but <em>I</em>&#8217;m using it because it put the fun back in programming for me.  (I used to program in C++, so it&#8217;s not like I need to do this for a living.)</p>
<p>Anyway &#8211; the intent is that I will document code here &#8211; especially when it is particularly neat or cool in some way.  I&#8217;ll also discuss experiences with the competition; when I say crusty command line tools, I do have particular tools and particular crustiness in mind.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malde.org/index.php/2008/04/25/welcome-to-the-biohaskell-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
