<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Ketil's blog</title>
        <link>http://blog.malde.org</link>
        <description><![CDATA[Various stuff.]]></description>
        <atom:link href="http://blog.malde.org/rss.xml" rel="self"
                   type="application/rss+xml" />
        <lastBuildDate>Mon, 23 Apr 2012 21:00:00 UT</lastBuildDate>
        <item>
    <title>Constructing An Assembly Evaluation Pipeline</title>
    <link>http://blog.malde.org/posts/assembly-evaluation.html</link>
    <description><![CDATA[<p><em>Warning: Although this software works well for me, it is in current development, and hasn’t seen a lot of testing. Thus, I’d be surprised if it installs smoothly and works without a hitch. If you try anyway, please let me know how it goes.</em></p>
<p>Part of my day job has recently been working on a de novo sequencing project. As we are exploring many methods for genome assembly, it is important to be able to evaluate the resulting assemblies in order to eliminate poor methods, improve on the more promising ones, and of course in the end select the best assembly. I have implemented a pipeline that performs this evaluation, and this is an attempt to briefly describe its goals and implementation.</p>
<h2 id="measures-and-input-data">Measures and input data</h2>
<p>In order to assess quality, we rely on both internal and external measures. For instance, we can calculate the N50 contig size from the genome assembly directly, making it an internal measure. Alternatively, we can count the number of ESTs we are able to align to the assembly, making it an external measure.</p>
<p>For external measures, we will make use of many different types of data:</p>
<ul>
<li>High-throughput DNA-sequencing reads</li>
<li>Paired-end and mate-pair reads</li>
<li>EST reads</li>
<li>Known DNA fragments, including the mitochondrion genome</li>
</ul>
<h2 id="defining-qualities">Defining qualities</h2>
<p>The pipeline addresses four aspects of assembly quality that ideally are independent of each other:</p>
<ol>
<li><p><em>Completeness</em>: does the genome assembly contain every bit of the genome, or is a part of it missing?</p></li>
<li><p><em>Fragmentation</em>: does the genome come as a few, long, sequences, or as a zillion tiny fragments? And if the former, are the sequences correctly put together?</p></li>
<li><p><em>Accuracy</em>: does the assembled genome sequence contain small-scale errors, from e.g. base calling errors?</p></li>
<li><p><em>Redundancy</em>: is there a one-to-one correspondence between locations in the assembly and the actual genome, or does the assembly contain multiple copies of parts of the genome?</p></li>
</ol>
<p>Note that there need not be a fixed, optimal value for each measure that indicates a perfect assembly; it is only important that we can compare the scores for different assemblies in order to determine which assembly is superior for that particular measure.</p>
<h3 id="completeness">Completeness</h3>
<p>Genomic (DNA) reads are aligned to the genome, and the proportion having matches is reported. We use BWA to align Illumina and 454 reads, and rely on ‘samtools flagstat’ to report the matches.</p>
<p>Also, various other sequences are mapped using BLAT. Below, we have a set of transcripts assembled from ESTs and RNASeq data, the mitochondrion genome, and fosmid ends. Here we report the average size of each match (as a percentage of the length of the query sequence), as well as the total number of sequences matching.</p>
<pre><code>COMPLETENESS:
  Illumina mapped:      77.0411%
  PSL hits:
    all_transcripts_165.rna.psl.uniq:   92.0361%, 28398 matching sequences
    l.salmonis-mtDNA.dna.psl.uniq:      90.1952%, 3 matching sequences
    fosmids_fwd.dna.psl.uniq:   66.002%, 9411 matching sequences
    fosmids_rev.dna.psl.uniq:   63.7143%, 5886 matching sequences</code></pre>
<h3 id="fragmentation">Fragmentation</h3>
<p>Fragmentation is measured by the standard N50 contig size, as well as N25 and N75. In order for it to be comparable, we use a fixed estimate for the genome size, not the size of the assembly itself.</p>
<p>Also, we count the proportion of Illumina reads that have the ‘proper pair’ flag set by BWA. Note that in order to make fragmentation independent of completeness, this is measured as a fraction of the matching reads, not all reads.</p>
<pre><code>FRAGMENTATION:
  N25/50/75:    76308   46116   24010   using 600000000 as total size.
  Illumina pairs proportion:    67.7091%</code></pre>
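<p>To make the definition concrete, here is a minimal Haskell sketch of an N25/N50/N75 calculation against a fixed genome size estimate. It is not the pipeline’s actual code; the function name <code>nStat</code> and the toy numbers are made up for illustration.</p>
<pre><code>import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- | N-statistic relative to a fixed genome size estimate: walk the
--   contigs from longest to shortest, and return the length of the
--   contig at which the running total reaches the given fraction of
--   the genome size. Nothing if the assembly never gets that far.
nStat :: Double -&gt; Integer -&gt; [Integer] -&gt; Maybe Integer
nStat frac genomeSize contigs = go 0 (sortBy (comparing Down) contigs)
  where
    threshold = frac * fromIntegral genomeSize
    go _ [] = Nothing
    go acc (c:cs)
      | fromIntegral acc&#39; &gt;= threshold = Just c
      | otherwise                      = go acc&#39; cs
      where acc&#39; = acc + c

main :: IO ()
main = do
  let contigs = [30, 25, 20, 15, 10]  -- made-up contig lengths
      size    = 100                   -- made-up genome size estimate
  -- prints [Just 30,Just 25,Just 20]
  print (map (\f -&gt; nStat f size contigs) [0.25, 0.5, 0.75])</code></pre>
<p>Returning <code>Nothing</code> when the assembly is smaller than the requested fraction of the estimate is one possible design choice; it keeps a too-small assembly from silently reporting its shortest contig.</p>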
<h3 id="accuracy">Accuracy</h3>
<p>We measure accuracy by looking at the scores of individual alignments. In this case, genomic reads are aligned with BWA, and the alignment scores are averaged.</p>
<pre><code>ACCURACY:
  SFF average match:    148.565
  Illum average match:  46.7718</code></pre>
<h3 id="redundancy">Redundancy</h3>
<p>Redundancy looks at the total size of the assembly, comparing it to the estimated size. Then, we count the number of alignments of 454 reads and compare it to the number of query sequences.</p>
<p>Finally, we make use of the BLAT alignments, summing up the alignment scores as a proportion of the best scores for each sequence.</p>
<pre><code>REDUNDANCY:
  Total size: 675758373, 112.626% of estimated size.
  SFF reads mapped:     117.235%
  PSL score, total vs unique:
    all_transcripts_165.rna.psl:            233.564%    67439222/28873929
    l.salmonis-mtDNA.dna.psl:       179.406%    76115/42426
    fosmids_fwd.dna.psl:            1046.08%    67459110/6448777
    fosmids_rev.dna.psl:            963.377%    37459791/3888385</code></pre>
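<p>The “total vs unique” ratio can be sketched in Haskell as follows. This is only my reading of the measure (the sum of all alignment scores divided by the sum of each query’s best score), not the pipeline’s implementation; the name <code>redundancy</code> and the input representation are assumptions.</p>
<pre><code>import qualified Data.Map.Strict as M

-- | Total alignment score as a percentage of the summed best score
--   per query sequence. 100% means every query has exactly one hit.
--   Input: (query name, alignment score) for every hit.
redundancy :: [(String, Int)] -&gt; Double
redundancy hits = 100 * fromIntegral total / fromIntegral unique
  where
    total  = sum (map snd hits)
    -- keep only the best score for each query name
    unique = sum (M.elems (M.fromListWith max hits))

main :: IO ()
main = print (redundancy [(&quot;est1&quot;, 90), (&quot;est1&quot;, 60), (&quot;est2&quot;, 50)])</code></pre>
<p>In the toy input, the total score is 200 and the best hits sum to 140, so the result is roughly 143%. An empty input gives NaN, which a real tool would want to handle.</p>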
<h2 id="implementation">Implementation</h2>
<p>The pipeline is built as a makefile, which uses a mix of shell, AWK, and the occasional Haskell executable to do most of the heavy lifting.</p>
<p>Make is the build system traditionally used in the Unix world. Essentially, the build process is described using a declarative language, and the <code>make</code> tool works out the steps needed to construct what’s desired.</p>
<p>Some features that make it especially suitable are that:</p>
<ul>
<li>it avoids redoing work when it’s not necessary</li>
<li>it can parallelize the build process automatically</li>
<li>it is ubiquitously available</li>
</ul>
<h3 id="make">Make</h3>
<p>The basic syntax of <code>make</code> is pretty simple (but never fear, it quickly gets complicated enough!). Here’s an example rule:</p>
<pre><code>foo: bar
	process bar &gt; foo</code></pre>
<p>This tells <code>make</code> that, in order to build the file <code>foo</code>, it first needs the file <code>bar</code>, and when that is in order – either because it exists already, or because <code>make</code> can use some other rule to construct it – <code>foo</code> will be built by running the command <code>process bar</code> and writing its output to <code>foo</code>.</p>
<p>Make supports pattern-based rules; the archetypal example is probably something like:</p>
<pre><code>%.o: %.c
	$(CC) $&lt;</code></pre>
<p>Which means that we can build any file ending in .o from a prerequisite having the same name, except ending with .c, by running the command line from the variable <code>CC</code> (presumably the C compiler) on the prerequisite (from the built-in variable intuitively named <code>$&lt;</code>).</p>
<p>While this makes it easy to set up rules for building stuff from other stuff, the caveat here is that it’s all pattern based, typically using a wildcard (<code>%</code>) and a suffix. This means that, for instance, paired sequence data in FASTQ <em>must</em> be named <em>something</em><code>.1.txt</code> and <em>something</em><code>.2.txt</code>.<sup><a href="#fn1" class="footnoteRef" id="fnref1">1</a></sup></p>
<p>In addition, <code>make</code> has a set of built-in string manipulation functions; they typically look somewhat like:</p>
<pre><code>STATS    := $(patsubst %.bam,%.stats,$(BAMS))</code></pre>
<p>This populates the variable <code>$(STATS)</code> with the contents of the variable <code>$(BAMS)</code>, except that it substitutes .stats for .bam for each element.</p>
<h3 id="shell">Shell</h3>
<p>The commands in make files are executed by the shell, so it’s not too difficult to do things like loops. For instance, you can do:</p>
<pre><code>all-fstats: $(FSTATS)
	for a in $(FSTATS); do echo &quot;$$a	`grep % $$a | tr &#39;\n&#39; &#39;\t&#39;`&quot;; done &gt; $@</code></pre>
<p>I.e., to build the file <code>all-fstats</code>, we first make sure we have everything in the <code>$(FSTATS)</code> variable, then we loop over them, doing some <code>grep</code> and <code>tr</code> stuff, and directing output into <code>$@</code>, yet another intuitively named variable, this time referring to the target of the current rule, or in this case <code>all-fstats</code>. Note that shell variables must be prefixed with an extra <code>$</code>; a single dollar refers to makefile variables.</p>
<h3 id="awk">AWK</h3>
<p>AWK is useful to collect data and perform arithmetic. For instance, we construct a table summarizing the interesting stats from <code>all-fstats</code> like so:</p>
<pre><code>all-fstats.tab: all-fstats
	sed -e &#39;s/\t[^\t]*(\([0-9][0-9]*\.[0-9]*%\))/\t\1/g&#39; &lt; $&lt; | \
		awk &#39; BEGIN { OFS=&quot;\t&quot; } { print; mapped += $$2; paired += $$3; single += $$4 } \
		END { print &quot;totals:    &quot;,mapped/NR &quot;%&quot;,paired/NR &quot;%&quot;,single/NR &quot;%&quot; }&#39; &gt; $@</code></pre>
<p>AWK is used here to accumulate sums of the various variables, and to calculate and print the averages in the end.</p>
<h3 id="haskell">Haskell</h3>
<p>There are also some tools written in Haskell that perform various processing. These include <code>psluniq</code>, which extracts the best hit from PSL files, and <code>atcounts</code>, which calculates the GC/AT ratio for each contig.</p>
<p>The whole thing is distributed as a Cabal library, so the required Haskell tools will be compiled or pulled in as dependencies on installation. Note that you’ll probably have to add <code>$HOME/.cabal/bin</code> to your <code>$PATH</code>.</p>
<h3 id="some-thoughts">Some thoughts</h3>
<p>Make is Unix’s answer to democracy: it’s clearly the worst build tool out there, except for the others. It is often criticised for not being a full programming language, but I think this is a <em>Good Thing</em>: a more restrictive language is easier to get right. In fact, most of the problems of make stem from it having all kinds of extra functionality bolted on top.</p>
<p>One difficulty is the need to juggle between different paradigms. For instance, the various string processing functions in make go a long way in spite of all their messy syntax, but you still can’t escape some munging with AWK or Perl or similar tools. Wouldn’t it be nice if this were unified? Similarly, you have different layers with different sets of variables. Perhaps the ideal tool is make rules with Perl implementations?</p>
<p>An interesting alternative is Neil Mitchell’s <em>shake</em>, which does all of this in Haskell. In addition, it traces dependencies indirectly, so no more problems with forgotten dependencies. If or when I get around to trying out the alternatives, that’s clearly at the top of my list.</p>
<h2 id="in-practice">In practice</h2>
<h3 id="getting-it">Getting it</h3>
<p>The pipeline is available as a <a href="http://darcs.net">darcs</a> repository; you should be able to get it simply by doing</p>
<pre><code>darcs get http://malde.org/~ketil/asmeval</code></pre>
<p>Of course, you’ll also need a bunch of tools, including the <a href="http://haskell.org/ghc">GHC</a> compiler, the <a href="http://biohaskell.org/Libraries/biopsl">biopsl</a> library, aligners BWA and BLAT, samtools, etc. Hopefully, you already have standard Unix tools like <code>sed</code> and <code>awk</code>.</p>
<p>On Debian-derived Linux distributions, you should be able to get at least some of the way by doing:</p>
<pre><code>sudo apt-get install cabal-install darcs bwa
cabal update
darcs get http://malde.org/~ketil/asmeval
cd asmeval
cabal install</code></pre>
<p>It looks like you need to install BLAT separately, and I may be missing some stuff - do let me know and I’ll update this post.</p>
<h3 id="configuring-it">Configuring it</h3>
<p>The configuration essentially consists of setting a bunch of make variables. This can be done on the command line, e.g.:</p>
<pre><code>asmeval GENOME=mygenome.fasta ESTS=transcripts.fasta</code></pre>
<p>However, <em>asmeval</em> will also look for a file called <code>CONFIG</code> in its current directory, which also can set these variables. It needs to be in makefile format, and is just included by the main makefile. The above translates to a <code>CONFIG</code> file containing:</p>
<pre><code>GENOME := mygenome.fasta
ESTS   := transcripts.fasta</code></pre>
<p>Note that this lets you use make’s string processing facilities in all their byzantine glory. There is an example (<code>CONFIG.example</code>) provided in the repository that can be used as a template.</p>
<h3 id="running-it">Running it</h3>
<p>After configuration, it should be possible to run everything by executing <code>asmeval</code>, which invokes <code>make</code> with some default options. These include <code>-r</code> to avoid the built-in rules (which only slow things down), and turning on warnings for uninitialized variables, in case you forgot something in the <code>CONFIG</code> file. You can add further options, for instance you may feel inclined to add <code>-j</code> to get parallel processing.</p>
<p>In order to run in the background and collect various information, I often type:</p>
<pre><code>nice nohup time asmeval -j8 &amp;</code></pre>
<p>Note that <code>-j</code> allows <code>make</code> to spawn multiple processes in parallel, but does nothing to take advantage of multi-threaded executables like BWA. There is a configuration variable, <code>THREADS</code>, you can set for this, so an alternative to the above could be</p>
<pre><code>nice nohup time asmeval THREADS=4 -j2 &amp;</code></pre>
<p>This will speed up some of the sub-processes, but run fewer of them in parallel.</p>
<h3 id="the-report">The report</h3>
<p>The default summary is output in a file called <code>report</code>, which contains the main statistics collected about the assembly. Hopefully it is self-explanatory enough to be useful.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>If they aren’t, one solution is to add symbolic links to fix this.<a href="#fnref1">↩</a></p></li>
</ol>
</div>]]></description>
    <pubDate>Mon, 23 Apr 2012 21:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/assembly-evaluation.html</guid>
</item>
<item>
    <title>The type system, safety, sequence alignments, and you</title>
    <link>http://blog.malde.org/posts/polymorphic-types-are-safer.html</link>
    <description><![CDATA[<p>Haskell programmers will often make claims that type systems make our programs safer, and especially that careful use of the type system can improve safety further. The first claim tends to be met with counter-claims that the safety isn’t all that important; after all, if you inadvertently try to add a string to a number, tests will quickly catch this. However, static types go a bit beyond this, but it can be hard to communicate how to leverage the type system to somebody who doesn’t already have some hands-on experience with the issues involved.</p>
<p>So I’ll try to illustrate with an example I ran across.</p>
<p>In bioinformatics, sequence alignment (or edit distance) is part of our staple diet. For instance, an unknown protein can be aligned to known proteins to determine its function and structure. Sets of proteins from different species can be aligned against each other to determine phylogenetic lineage. And so on. So let’s look at alignments.</p>
<p>Definitions: an <em>alignment</em> of a pair of sequences is a mapping (<code>f</code>, say) from positions in one sequence to positions in the other. In addition, we require two invariants, first that the mapping is <em>one-to-one</em>:</p>
<pre><code>f x == f y &lt;=&gt; x == y</code></pre>
<p>This makes the alignment invertible, so there exists a mapping <code>f'</code> back from the second sequence to the first one. Makes sense, no? Second, the alignment is <em>monotonic</em>, satisfying:<sup><a href="#fn1" class="footnoteRef" id="fnref1">1</a></sup></p>
<pre><code>f x &gt; f y &lt;=&gt; x &gt; y</code></pre>
<p>In other words, the positions make up <em>subsequences</em> of equal lengths. So you can skip bits in either sequence, but you’re not allowed to flip things around.</p>
<p>In order to model this in Haskell, we can define a position to simply be an <code>Int</code>, and use a list of pairs of positions. So we can define a type alias for this:</p>
<pre><code>type Alignment = [(Int,Int)]</code></pre>
<p>Although this is very specific, it does feel a bit unsatisfactory from a typing perspective: the type doesn’t enforce either of the two invariants. E.g. it allows definitions like:</p>
<pre><code>wrong1 = [(1,1),(2,1)]  -- maps both a[1] and a[2] to b[1]
wrong2 = [(1,2),(2,1)]  -- maps the wrong way around</code></pre>
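<p>The invariants can at least be checked at run time. Here is a minimal sketch, assuming the pairs are listed in order and that the alignment is monotonically <em>increasing</em>; the name <code>isValid</code> is made up for this example.</p>
<pre><code>type Alignment = [(Int,Int)]

-- | True if positions strictly increase in both sequences from one
--   pair to the next, which implies that the mapping is both
--   one-to-one and monotonic.
isValid :: Alignment -&gt; Bool
isValid al = and (zipWith ok al (drop 1 al))
  where ok (x1,y1) (x2,y2) = x1 &lt; x2 &amp;&amp; y1 &lt; y2

main :: IO ()
main = mapM_ (print . isValid)
  [ [(1,1),(2,2),(4,3)]   -- fine: prints True
  , [(1,1),(2,1)]         -- wrong1, not one-to-one: prints False
  , [(1,2),(2,1)]         -- wrong2, not monotonic: prints False
  ]</code></pre>
<p>A check like this only catches mistakes when it is actually run, which is quite different from having the compiler rule them out.</p>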
<p>It’s hard to come up with a good data structure that enforces the invariants (suggestions welcome!), but at least, it’s very <em>concrete</em>, and the compiler will now protect you from accidentally substituting an alignment with a <code>String</code>, or any other type. Right?</p>
<p>Again, if you’re a dynamic typing enthusiast, you might argue that this enforcement is not such a big deal, you’re not very likely to pass strings by accident here anyway – and I agree. In fact, I think it is much worse that all the <code>Int</code>s are interchangeable; it is all too easy to, say, compare positions from different sequences, or perform arithmetic on them (what do you get when you multiply two positions?).</p>
<p>For example, we can align a sequence against another <em>via</em> a third sequence. Basically, given sequences <code>A</code>, <code>B</code>, and <code>C</code>, we have alignments <code>A</code> to <code>B</code> and <code>B</code> to <code>C</code>, and would like to derive the alignment of <code>A</code> to <code>C</code>. Now this isn’t too complicated:</p>
<pre><code>trans_align :: Alignment -&gt; Alignment -&gt; Alignment  -- type signature
trans_align ((p1,p2):ps) ((q1,q2):qs)
	| p2 == q1 = (p1,q2) : trans_align ps qs        -- return match and continue
	| p2 &lt; q1 = trans_align ps ((q1,q2):qs)         -- skip first &#39;p&#39;
	| q1 &lt; p2 = trans_align ((p1,p2):ps) qs         -- skip first &#39;q&#39;
trans_align _ _ = []                                -- one input is empty</code></pre>
<p>The first line declares the type of the function; it takes two alignments and returns a third. Then, the input alignments are <em>pattern matched</em> to extract the first pair of each alignment, and their positions in sequence <code>B</code> are compared. If they are the same, we have found a position in <code>A</code> that maps to a position in <code>C</code>; if not, we skip the one which is smallest, and in any case we continue. The final line is for when the pattern match fails, i.e. one of the alignments is empty. In that case, no more mappings can be constructed, and we return the empty list.</p>
<p>But what if we make our <code>Alignment</code> type <em>less</em> specific:</p>
<pre><code>type Alignment a b = [(a,b)]</code></pre>
<p>Here, <code>a</code> and <code>b</code> are type variables, they represent any type. So e.g.</p>
<pre><code>[(&quot;foo&quot;,42),(&quot;bar&quot;,1337)]</code></pre>
<p>is a valid alignment of type <code>Alignment String Int</code>, for instance. In other words, we no longer require indices to different sequences to have the same type. We can of course give <code>trans_align</code> the type:</p>
<pre><code>trans_align :: Alignment Int Int -&gt; Alignment Int Int -&gt; Alignment Int Int
-- index into:            A   B                B   C                A   C</code></pre>
<p>But we’re not actually using any <code>Int</code> operations here, and clearly, if the returned alignment is from <code>A</code> to <code>C</code>, we can enforce this by instead giving <code>trans_align</code> a type of:</p>
<pre><code>trans_align :: Alignment x y -&gt; Alignment y z -&gt; Alignment x z</code></pre>
<p>(In reality, we also need an <code>Ord</code> constraint on <code>y</code>, without this, we couldn’t compare positions in <code>B</code> in our implementation.<sup><a href="#fn2" class="footnoteRef" id="fnref2">2</a></sup>)</p>
<p>See what we have done? In the end we’ll probably just use <code>Int</code> for all the type variables (<code>x</code>, <code>y</code>, and <code>z</code>), but the implementation of <code>trans_align</code> doesn’t <em>know</em> this. Thus the type ensures that indices in the different sequences are kept separate, and things like comparing indices from different sequences, or returning indices from the wrong sequence, are no longer just bad ideas: they’re illegal, and this is enforced by the compiler. How cool is that? This function is even prohibited from comparing indices in <code>A</code> or <code>C</code>; all it can do with them is return them. And: unlike passing strings off as ints, mixing up indexes is a mistake that is easy to make.</p>
<p>In retrospect, this is perhaps obvious: The more general a type is - that is, the less you know about its specifics - the less you are allowed to do with it. A general type <em>constrains</em> its implementation. And the less you can do with it, the fewer <em>wrong</em> things are possible. <em>Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away.</em> And perfect use of the type system is when all the wrong programs are eliminated, and only correct implementations are legal.</p>
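<p>For completeness, here is a self-contained sketch of the polymorphic version, including the <code>Ord</code> constraint. The example values and the choice of <code>Int</code>, <code>Char</code> and <code>String</code> as index types are purely illustrative; they are picked to be different so that the compiler can tell the three sequences apart.</p>
<pre><code>type Alignment a b = [(a,b)]

-- Only positions in the middle sequence are ever compared, so only
-- &#39;y&#39; needs an Ord instance.
trans_align :: Ord y =&gt; Alignment x y -&gt; Alignment y z -&gt; Alignment x z
trans_align ((p1,p2):ps) ((q1,q2):qs)
  | p2 == q1  = (p1,q2) : trans_align ps qs        -- return match and continue
  | p2 &lt; q1   = trans_align ps ((q1,q2):qs)        -- skip first &#39;p&#39;
  | otherwise = trans_align ((p1,p2):ps) qs        -- skip first &#39;q&#39;
trans_align _ _ = []                               -- one input is empty

main :: IO ()
main = do
  let ab = [(1,&#39;a&#39;),(2,&#39;b&#39;),(4,&#39;d&#39;)] :: Alignment Int Char
      bc = [(&#39;b&#39;,&quot;two&quot;),(&#39;c&#39;,&quot;three&quot;),(&#39;d&#39;,&quot;four&quot;)] :: Alignment Char String
  -- prints [(2,&quot;two&quot;),(4,&quot;four&quot;)]
  print (trans_align ab bc)</code></pre>
<p>Replacing a guard with something like <code>p1 == q2</code> is rejected by the type checker, since <code>x</code> and <code>z</code> are different type variables and carry no <code>Eq</code> constraint.</p>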
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>For some alignments, typically protein to protein, the alignment usually has to be monotonically increasing; for other cases, it can be either way.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>In fact, if we wrote <code>trans_align</code> without specifying the type, this is, modulo the <code>Alignment</code> alias, the type Haskell will assign to it:</p>
<pre><code>*Main&gt; :t trans_align
trans_align :: Ord a =&gt; [(t, a)] -&gt; [(a, t1)] -&gt; [(t, t1)]</code></pre>
<p>So we can either design types up front, or we can let the compiler work out the most general type - where “general type” is another way of saying the least constraint that must be satisfied. Isn’t it neat?<a href="#fnref2">↩</a></p></li>
</ol>
</div>]]></description>
    <pubDate>Tue, 03 Apr 2012 15:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/polymorphic-types-are-safer.html</guid>
</item>
<item>
    <title><em>My</em> blog software</title>
    <link>http://blog.malde.org/posts/hakyll-mods.html</link>
    <description><![CDATA[<p>Twan van Laarhoven recently <a href="http://twanvl.nl/blog/2012-03-28-blog-software">wrote</a> about his home-grown blogging software, implemented in PHP. He stands tall, making no excuses, and I must admit the site looks nice. As posted <a href="/posts/wordpress-to-hakyll.html">previously</a>, I have chosen to use <a href="http://jaspervdj.be/hakyll/">Hakyll</a> to implement this site, and I thought I’d share some recent experiences.</p>
<p>Hakyll is a templating system, so it consists of a set of HTML templates and a Haskell program that expands them and builds a set of static pages constituting the site. There is an assortment of auxiliary files, like images and CSS, that are just copied. Post content is written in <em>markdown</em><sup><a href="#fn1" class="footnoteRef" id="fnref1">1</a></sup>, and metadata specified at the top is available to the template processing.</p>
<p>This setup has some advantages: I can develop code or textual content on my laptop, build, and check the site by accessing a <code>file://</code> URL. I use <a href="http://darcs.net">darcs</a> version control, and when I’m happy with the result, I simply <code>darcs push</code> the changes to the server, and run <code>hakyll.hs</code> to rebuild the site. Also, the site being static, there are few holes to exploit (which was my main reason for moving off Wordpress).</p>
<p>However, although there’s a blog example included with Hakyll, it’s not entirely ready to use. Plenty of batteries are included, but you need to fit them correctly, so to speak.</p>
<h2 id="fixing-the-rss-feed">Fixing the RSS feed</h2>
<p>In a fit of hubris, I emailed the maintainer of Planet Haskell to ask to be included. The quick and brief reply stated that my site, and in particular, its RSS feed, was not deemed adequate. So I started to implement the list of required fixes:</p>
<h3 id="dates-were-missing.">Dates were missing.</h3>
<p>By default, Hakyll assumes that posts are named according to date, and sets data variables from the file name. I don’t care much for that, so to get a reasonable date field in the RSS, I added a <code>published</code> field to each post’s metadata:</p>
<pre><code>published: 2012-01-06T15:00:00Z</code></pre>
<p>The default RSS rendering function looked like this:</p>
<pre><code>-- Render RSS feed
match  &quot;rss.xml&quot; $ route idRoute
create &quot;rss.xml&quot; $
     requireAll_ &quot;posts/*&quot; &gt;&gt;&gt; renderRss feedConfiguration</code></pre>
<p>In order to sort on the <code>published</code> field, I added a custom sort:</p>
<pre><code>match  &quot;rss.xml&quot; $ route idRoute
create &quot;rss.xml&quot; $ requireAll_ &quot;posts/*&quot;
   &gt;&gt;&gt; arr (reverse . chronological)
   &gt;&gt;&gt; renderRss feedConfiguration

chronological = sortBy $ comparing $ getField &quot;published&quot;</code></pre>
<h3 id="the-rss-file-was-missing-post-contents.">The RSS file was missing post contents.</h3>
<p>It’s a bit unclear to me what the <code>description</code> field is supposed to contain in RSS; Wordpress has this <em>“more”</em> thing, and includes stuff above it, typically an introduction, but most people seem to just include the post contents.</p>
<p>The easy way out is to include the post body, and the easy way to do that, is to just copy it into a variable. Here’s the diff:</p>
<pre><code>      match &quot;posts/*&quot; $ do
          route   $ setExtension &quot;.html&quot;
          compile $ pageCompiler
 +            &gt;&gt;&gt; arr (copyBodyToField &quot;description&quot;)
              &gt;&gt;&gt; applyTemplateCompiler &quot;templates/post.html&quot;
              &gt;&gt;&gt; applyTemplateCompiler &quot;templates/default.html&quot;
              &gt;&gt;&gt; relativizeUrlsCompiler</code></pre>
<p>Hopefully, my RSS feed will pop up in Planet H. real soon now; in the meantime, why, you’ll just have to subscribe directly, won’t you?</p>
<h2 id="other-tweaks">Other tweaks</h2>
<p>Now, since the <code>description</code> field holds the whole post contents, I changed the metadata field to <code>summary</code> (since that’s what it contained). To actually make use of this, I added this variable to <code>templates/postitem.html</code>, the template responsible for rendering postitems, that is, items when posts are listed.</p>
<pre><code> &lt;li&gt;
     &lt;a href=&quot;$url$&quot;&gt;$title$&lt;/a&gt;
     - &lt;em&gt;$date$&lt;/em&gt; - by &lt;em&gt;$author$&lt;/em&gt;
+    &lt;p class=&quot;noindent&quot;&gt;$summary$&lt;/p&gt;
 &lt;/li&gt;</code></pre>
<p>One important problem is that I’m sometimes sloppy, and I want site compilation to fail if there’s something wrong - if, say, I forgot to specify an important field. For this, I can use <code>trySetField</code>, which only evaluates its argument if the field isn’t already present. Having the value be <code>error</code> thus will halt compilation:</p>
<pre><code>match  &quot;rss.xml&quot; $ route idRoute
create &quot;rss.xml&quot; $ requireAll_ &quot;posts/*&quot;
   &gt;&gt;&gt; arr (reverse . chronological)
   &gt;&gt;&gt; arr (map (trySetField &quot;summary&quot; $ error &quot;Missing field: summary&quot;))
   &gt;&gt;&gt; arr (map (trySetField &quot;published&quot; $ error &quot;Missing field: published&quot;))
   &gt;&gt;&gt; renderRss feedConfiguration</code></pre>
<p>In addition, I’ve made minor edits to template and CSS, but although I use cooler software than Twan, I think it’s evident that he has greater CSS-fu, so I’ll spare you the details…</p>
<h2 id="shoulders-of-giants-or-forever-voyaging-through-dark-seas">Shoulders of giants, or forever voyaging through dark seas?</h2>
<p>As you may have noticed, there is no forum included. My WP forum drowned in spam, so this is at least 50% intentional. I have included my email address, so if you have something interesting to say, I’ll update the post, taking it into account. Otherwise, discuss away on G+ or Reddit - that’s what they’re for.</p>
<p>All in all, I’m pretty happy with this. There <em>is</em> a certain amount of yak-shaving involved, but on the other hand: infinite flexibility and reuse of awesome code like pandoc, as well as excellent support (hint: Jasper spends all his spare time hanging around on <code>#hakyll</code>, just <em>itching</em> for you to bring forth your complaints.)</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Hakyll uses <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>, so it supports a variety of formats automatically, but I’m only using markdown.<a href="#fnref1">↩</a></p></li>
</ol>
</div>]]></description>
    <pubDate>Fri, 30 Mar 2012 14:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/hakyll-mods.html</guid>
</item>
<item>
    <title>From alignments to sequence clustering</title>
    <link>http://blog.malde.org/posts/gact.html</link>
    <description><![CDATA[<p>Sequence clustering is often (usually?) based on alignments, and for instance, my <a href="http://biohaskell.org/Applications/xsact">xsact</a> clusterer does this, and the EST clustering pipeline TGICL incorporates BLAST into its pipeline. However, it strikes me as useful to be able to run the aligner separately, both because it is usually the most time consuming part of the pipeline, and because it makes it easier to tweak the parameters.</p>
<h2 id="gact---the-general-alignment-clustering-tool">GACT - The General Alignment Clustering Tool</h2>
<p>Yes, I too can haz a nucleotide acronym! <a href="http://biohaskell.org/Applications/gact">GACT</a> reads a set of sequence alignments (currently only BLAT’s PSL format is supported, but I intend to add BLAST tabular and possibly others), and clusters the sequences that are matched to each other.</p>
<p>Currently, it supports brief output (where each line represents a cluster, and consists of the labels of the sequences in the cluster) and log output, (where the cluster lines are followed by the set of alignments forming the cluster).</p>
<h2 id="supporting-tools">Supporting tools</h2>
<p>Most aligners are rather “blind”, and in many cases only a subset of the produced alignments are useful. PSL support is implemented in the <a href="http://biohaskell.org/Libraries/biopsl">biopsl</a> library, and it comes with a couple of tools to help extract what you want.</p>
<p>For instance, for EST clustering, you want the alignments to represent <em>overlaps</em>: if the end of sequence A matches the beginning of sequence B, that’s good, but if they have a short match in the middle, you’re probably not that interested. The <a href="http://biohaskell.org/Libraries/biopsl#pslfilter">pslfilter</a> command lets you select reads based on percent identity of match, percent of query or target sequence covered, and on overhang (i.e. the shortest distance from the match to the end of a sequence).</p>
<p>In other cases, you just want to cluster by best match, there’s an app…uh, a tool for that, too, namely <a href="http://biohaskell.org/Libraries/biopsl#psluniq">psluniq</a>.</p>
<h2 id="availability">Availability</h2>
<p>Available on <a href="http://hackage.haskell.org/package/gact-0.2">Hackage</a>, or from the <a href="http://malde.org/~ketil/biohaskell/">darcs repositories</a>.</p>]]></description>
    <pubDate>Tue, 27 Mar 2012 14:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/gact.html</guid>
</item>
<item>
    <title>Recent developments: Flower and Bamstats</title>
    <link>http://blog.malde.org/posts/flower_and_bamstats.html</link>
    <description><![CDATA[<h2 id="extracting-statistics-from-paired-alignments">Extracting statistics from paired alignments</h2>
<p>If you do Illumina sequencing, you’re probably getting paired reads, i.e. you get two reads from each molecule, one from each end. This is usually pretty reliable, but for more complicated mate pair protocols, there are often chimeric sequences where the two members of a pair actually come from different molecules, and sometimes the protocol fails for other reasons. In any case, it might be useful to look closer at the result when you map the reads to their reference.</p>
<p>To simplify this, I’ve uploaded a small tool, <a href="http://biohaskell.org/Applications/bamstats">bamstats</a>, to <a href="http://hackage.haskell.org/package/bamstats">Hackage</a>. It uses Nick Ingolia’s <code>samtools</code> wrapper to extract relevant information from BAM files, and then summarizes it. Here are some examples:</p>
<p>Insert size statistics:</p>
<pre><code>% bam stats -n 1000000 test.bam
Alignment                count     prop    mean   stdev    skew    kurt
innies                  484874  96.97%    365.3    24.1    -3.9    34.9
outies                     141   0.03%    112.8   217.3     8.0    62.8
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN
Total reads:    1000000</code></pre>
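<p>The mean, stdev, skew and kurt columns are moment statistics of the insert sizes. As a rough illustration, they could be computed along these lines in Python. This is a sketch under assumptions: I use population moments and excess kurtosis here, which need not be the estimators bamstats uses, and the insert sizes are made up.</p>
<pre><code>import math

# Sketch only: mean, standard deviation, skewness and excess kurtosis of a
# list of insert sizes, using population moments (dividing by n). The choice
# of estimators is an assumption, not necessarily what bamstats does.

def moments(sizes):
    n = len(sizes)
    mean = sum(sizes) / n
    # Central moments of order 2, 3 and 4.
    m2, m3, m4 = (sum((x - mean) ** k for x in sizes) / n for k in (2, 3, 4))
    stdev = math.sqrt(m2)
    return mean, stdev, m3 / stdev ** 3, m4 / m2 ** 2 - 3.0

inserts = [350, 360, 365, 365, 370, 380]   # made-up insert sizes
print('%8.1f %7.1f %7.1f %7.1f' % moments(inserts))</code></pre>
<p>Note that the sketch divides by the variance, so it needs at least two distinct insert sizes.</p>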
<p>Histogram of insert sizes:</p>
<pre><code>% bam hist -n 1000000 -b 6 -m 600 test.bam
Alignment                count    prop      100     200     300     400     500     &gt;
innies                  484874  96.97%        0    2204    5136  471596    5847      91
outies                     141   0.03%      138       1       0       0       0       2
lefties                      0   0.00%        0       0       0       0       0       0
righties                     0   0.00%        0       0       0       0       0       0
Total reads:    1000000</code></pre>
<p>As usual, the <code>--help</code> option will tell you about the available tunables.</p>
<h2 id="flower-update">Flower update</h2>
<p>For 454 sequences (and, I guess, for Ion Torrent?), it is often informative to look at the raw data (or flow values) in the SFF files. <a href="http://biohaskell.org/Applications/Flower">Flower</a>, which now ships as a part of the <a href="http://biohaskell.org/Libraries/biosff">biosff</a> library, is one option for doing this. The output option <code>-h</code> produces a table of flow values by nucleotide. The latest release adds <code>-H</code>, which produces a table of flow values by flow position. This can e.g. be used to build empirical distributions for use with <a href="http://biohaskell.org/Applications/FlowSim">flowsim</a>.</p>
<p>One way to plot the output is to run</p>
<pre><code>flower -h flows.dat input.sff</code></pre>
<p>And then, with gnuplot do:</p>
<pre><code>gnuplot&gt; set view map
gnuplot&gt; set logscale cb
gnuplot&gt; splot &quot;flows.dat&quot; matrix with image</code></pre>]]></description>
    <pubDate>Fri, 06 Jan 2012 15:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/flower_and_bamstats.html</guid>
</item>
<item>
    <title>The return of the blog</title>
    <link>http://blog.malde.org/posts/wordpress-to-hakyll.html</link>
    <description><![CDATA[<h2 id="the-end-of-wordpress">The end of WordPress</h2>
<p>I hope somebody, somewhere, noticed that my blog has been offline for some weeks now. After upgrading the server it runs on, I decided not to continue running the WordPress instance. There are several reasons for that, among others:</p>
<ul>
<li>I’m not terribly happy or comfortable running PHP at all,</li>
<li>I’m not happy about a system that is based on downloading large code dumps and dropping them in specific directories.</li>
<li>I was never very comfortable with the web-editing widgets.</li>
<li>My installation was hacked.</li>
</ul>
<p>I think the last point illustrates the problems with the first two.</p>
<p>Anyway, having made the decision to bid farewell, I looked for alternatives. Somebody suggested that I might not need a full content management system, but should look into static site generating systems, like Hakyll. This is it.</p>
<h2 id="using-hakyll">Using Hakyll</h2>
<p><a href="http://jaspervdj.be/hakyll/">Hakyll</a> is really rather simple to use: there’s a blog example included, and I just snarfed that with some modifications. There’s a main file (hakyll.hs) which directs how everything is to be constructed, a template directory, a css directory, and a posts directory. Adding posts means writing a markdown file, and stuffing it in “posts”. All under version control, of course.</p>
<h2 id="plans-for-the-future">Plans for the future</h2>
<p>I can’t, of course, simply discard all my old posts, irreplaceable treasures of our cultural heritage as they are. Currently, they’re all 503’ed, but you can find some of them in the <a href="http://web.archive.org/web/20090628183919/http://blog.malde.org/">Wayback Machine</a>. When time allows, I’ll have to bring the old MySQL instance up, and somehow extract the text in a useful format.</p>
<p>Since this site is now static, there isn’t, and won’t be, a comment field. With WP, it was all spam anyway, and we can all discuss stuff on Google+, Reddit, or wherever. What I do want to do, is to provide a “comment section” with backlinks to any mentions of my posts. Probably, I’ll mine the Apache logs for referring sites, and possibly also use Google search. We’ll see what I can come up with.</p>]]></description>
    <pubDate>Thu, 05 Jan 2012 15:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/wordpress-to-hakyll.html</guid>
</item>
<item>
    <title>A plan for Bloom filters</title>
    <link>http://blog.malde.org/posts/a-plan-for-bloom-filters.html</link>
    <description><![CDATA[<p>Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow escaped my radar until <a href="http://www.serpentine.com/blog/">Bryan O’Sullivan</a> posted a <a href="http://www.mail-archive.com/haskell-cafe@haskell.org/msg41876.html">message</a> to the Haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a <a href="http://book.realworldhaskell.org/beta/bloomfilter.html">chapter in the upcoming book</a>. You can read all about Bloom filters on <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a>, but the executive summary is that a Bloom filter is a structure similar to Data.Set, except that it is probabilistic, and may occasionally claim a value is a member when it’s not. On the positive side, the Bloom filter is very fast, and its speed is independent of the size: lookup and insert are <em>O</em>(1) where Data.Set is <em>O</em>(log <em>n</em>).</p>
<p>Comparing sequences to find similarity is a common occurrence in bioinformatics. For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index the sequences in question as overlapping substrings (<em>k</em>-tuples, <em>q</em>-grams). This index seems like an obvious target for Bloom filters – large data, time critical, some false positives anyway – but for some reason, there are <a href="http://hpcb.wustl.edu/pubs/mercuryblastn.pdf">almost</a> no such applications that use them. Until now.</p>
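<p>For readers who prefer code to prose, here is a toy Bloom filter in Python. It is only a sketch of the general idea and has nothing to do with the Haskell implementation mentioned above: the class name, the sizes, and the trick of deriving several hash functions by salting SHA-256 are all my own illustrative choices, and a real implementation would pack the slots into bits.</p>
<pre><code>import hashlib

# Sketch only: a toy Bloom filter. Sizes and hashing scheme are illustrative
# assumptions; a real implementation would use a packed bit array and much
# cheaper hash functions.

class BloomFilter:
    def __init__(self, slots=1024, hashes=3):
        self.hashes = hashes
        self.slots = bytearray(slots)   # one byte per slot, for simplicity

    def _positions(self, item):
        # Derive several hash values by salting a single hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(('%d:%s' % (i, item)).encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % len(self.slots)

    def insert(self, item):
        for p in self._positions(item):
            self.slots[p] = 1

    def member(self, item):
        # True may be a false positive; False is always correct.
        return all(self.slots[p] for p in self._positions(item))

bf = BloomFilter()
for word in ['ACGT', 'CGTA', 'GTAC']:
    bf.insert(word)
print(bf.member('CGTA'))   # True: inserted words are always found
print(bf.member('TTTT'))   # most likely False, but false positives can happen</code></pre>
<p>Both operations touch a fixed number of slots, regardless of how many items have been inserted, which is where the constant-time behavior comes from.</p>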
<h2 id="sequence-clustering">Sequence clustering</h2>
<p>Sequence clustering is a commonly used technique which is usually based on sequence similarity. I’ve written one sequence clusterer, <em>xsact</em>, which is based on blocks of exact matches. There are many others, and another example is <a href="http://bioinformatics.oxfordjournals.org/cgi/reprint/19/3/421.pdf">d2cluster</a>, which is based on occurrence of fixed length words – which is right up Bloom alley, right?</p>
<p>So a straightforward way to build a Bloom filter based sequence clusterer is to represent each cluster as a set of words – stored as a Bloom filter. Now, adding a new sequence to the clusters is a simple matter of extracting the words from the sequence, identifying the cluster(s) containing a sufficient number of these words, and adding the remaining words to that cluster (or the union of the clusters, in the case multiple clusters match).</p>
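<p>A rough Python sketch of this scheme follows. To keep it short, plain Python sets stand in for the per-cluster Bloom filters, and the word size and the threshold for “a sufficient number” of shared words are arbitrary values picked for the example; the function names are made up as well.</p>
<pre><code># Sketch only: word-based clustering. Python sets stand in for the Bloom
# filters, and k and min_shared are arbitrary illustrative values.

def words(seq, k):
    # All overlapping substrings of length k.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def add_sequence(clusters, name, seq, k=4, min_shared=3):
    ws = words(seq, k)
    merged = {'names': [name], 'words': ws}
    rest = []
    for c in clusters:
        if len(ws.intersection(c['words'])) &gt;= min_shared:
            # Matching cluster: fold it into the union.
            merged['names'].extend(c['names'])
            merged['words'] = merged['words'].union(c['words'])
        else:
            rest.append(c)
    return rest + [merged]

clusters = []
for name, seq in [('a', 'ACGTACGTAC'), ('b', 'TTTTGGGGCC'), ('c', 'GTACGTACGG')]:
    clusters = add_sequence(clusters, name, seq)
print([sorted(c['names']) for c in clusters])</code></pre>
<p>With these made-up sequences, a and c share four words and end up in one cluster, while b stays on its own.</p>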
<p>The interesting thing about this approach is that the whole thing becomes <em>O</em>(<em>kn</em>), for <em>k</em> clusters and data size <em>n</em>. I think all other clustering algorithms are based on sequence pairs, which makes them <em>O</em>(<em>n²</em>) – in the worst case, you need to check all pairs. (However, a straightforward similarity-based clustering will have worst-case behavior when no sequences match each other, while suffix-based methods like <em>xsact</em> will have their worst case when all sequences match – so perhaps there is room for a better middle ground?)</p>
<p>Anyway – while the above looks promising, there is one snag: EST sequences can occur as a copy of the gene or, due to various properties of the <a href="http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/">manufacturing process</a>, as the gene’s <em>reverse complement</em>. Thus, we need to be able to compare sequences in both directions simultaneously. This could be achieved with a slightly creative hashing function, but to keep things simple, we’ll stick with the ossified mental sweat in the provided implementation.</p>
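<p>For the curious, one common way to get such a direction-agnostic hash is to map each word to a canonical form before hashing it, for instance the lexicographically smaller of the word and its reverse complement. A small Python sketch of that idea (my illustration, not part of the provided implementation):</p>
<pre><code># Sketch only: make a word and its reverse complement look the same by
# reducing both to a canonical form before hashing.

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def canonical(word):
    # The lexicographically smaller of the word and its reverse complement.
    return min(word, reverse_complement(word))

print(canonical('AAGC'), canonical('GCTT'))   # prints: AAGC AAGC</code></pre>
<p>Any hash of <code>canonical(word)</code> is then the same for both orientations, at the cost of computing the reverse complement for every word.</p>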
<h2 id="index-and-search">Index and search</h2>
<p>A somewhat related area is indexing and search. Let’s say you have a bunch of DNA sequences (one for each chromosome, perhaps), and a set of ESTs, which you’ll remember are gene fragments in an unpredictable mix of forward and reverse-complement directions. Here’s the plan:</p>
<ol>
<li><p>index by building a Bloom filter containing the <em>q</em>-tuples for each chromosome</p></li>
<li><p>for each EST, look up each <em>q</em>-tuple (first forward, then rev.comp.) against the filters, and assign it to the chromosome containing the most <em>q</em>-tuples</p></li>
<li><p>align each EST against the designated chromosome using traditional methods</p></li>
</ol>
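<p>The first two steps of the plan can be sketched in a few lines of Python. As before, this is only an illustration under assumptions: plain sets stand in for the Bloom filters, the sequences are tiny made-up examples, the function names are hypothetical, and the final alignment step is left out.</p>
<pre><code># Sketch only: steps 1 and 2 of the plan, with Python sets standing in for
# the Bloom filters. Step 3 (the actual alignment) is not shown.

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def qtuples(seq, q):
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_index(chromosomes, q):
    # Step 1: one filter of q-tuples per chromosome.
    return {name: qtuples(seq, q) for name, seq in chromosomes.items()}

def assign(est, index, q):
    # Step 2: count hits in both orientations, keep the best chromosome.
    scores = {}
    for name, filt in index.items():
        fwd = sum(1 for w in qtuples(est, q) if w in filt)
        rev = sum(1 for w in qtuples(revcomp(est), q) if w in filt)
        scores[name] = max(fwd, rev)
    return max(scores, key=scores.get)

chromosomes = {'chr1': 'ACGTTGCAAGGCTTAACC', 'chr2': 'GGGATATCCCGAGAGTTT'}
index = build_index(chromosomes, 4)
print(assign('TTGCAAGG', index, 4))      # forward fragment of chr1
print(assign('AAACTCTCGG', index, 4))    # reverse complement of a chr2 fragment</code></pre>
<p>The first made-up EST is assigned to chr1 and the second, which only matches in the reverse-complement direction, to chr2.</p>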
<p>As far as I can tell at this point, for word size <em>q</em>, <em>c</em> chromosomes of length <em>m</em> and <em>e</em> ESTs of length <em>n</em>, this should run in something like <em>O</em>(<em>qcm</em>) + <em>O</em>(<em>qenc</em>) + <em>O</em>(<em>emn</em>), compared to just aligning directly at <em>O</em>(<em>ecmn</em>). Note that <em>mn</em> is the big factor here, so on a large scale, we’re reducing the total work by a factor of <em>c</em>. (Of course, in real life, <em>mn</em> would be too large, and you’d use a heuristic, subquadratic alignment step, but this is for illustration purposes. Try to be a bit generous, will you?)</p>
<h2 id="further-plans">Further plans</h2>
<p>That concludes the current plan, but there are certainly improvements that can be made. Two obvious ones are:</p>
<ul>
<li><p>a hash function that hashes a word and its reverse complement to the same value</p></li>
<li><p>breaking long sequences into partially (1/3) overlapping regions, to speed things up even more as well as give more accurate placements</p></li>
</ul>
<p>Another thing that struck me is that in my <a href="http://malde.org/~ketil/biohaskell/xml2x/">annotation tool</a>, I could probably use this as a faster way to store the set of matching proteins before extracting GO terms. Currently, the performance is limited by XML parsing, so it’s probably not worth the bother at the moment.</p>
<p>More is likely to come up, but now it’s time to implement something!</p>]]></description>
    <pubDate>Thu, 31 Jul 2008 20:00:00 UT</pubDate>
    <guid>http://blog.malde.org/posts/a-plan-for-bloom-filters.html</guid>
</item>
<item>
    <title>Updates and other trivialities</title>
    <link>http://blog.malde.org/posts/updates-and-other-trivialities.html</link>
    <description><![CDATA[<p>Just some quick notes:</p>
<h2 id="hackage-submissions-updated">Hackage submissions updated</h2>
<p>There seem to have been problems with some of the bioinformatics applications on Hackage; thanks to Don S. for pointing it out. That should be fixed now by new uploads, but I’m still waiting for the automatic builds to register results. And, since you ask, it was all my fault for being sloppy with version dependencies. It’d all work with a recent biolib. Speaking of which,</p>
<h2 id="a-home-page-for-the-bioinformatics-library">A home page for the bioinformatics library</h2>
<p>I’ve finally updated the static home page for the library; it can be admired (especially if you remember what Oscar Wilde had to say about that) <a href="http://biohaskell.org/Libraries/Bio">here</a>.</p>
<h2 id="ive-discovered-hpc">I’ve discovered HPC</h2>
<p>No, not high-performance computing <a href="http://itc.conversationsnetwork.org/shows/detail3682.html">which sucks anyway</a>, but GHC’s new <a href="http://www.haskell.org/haskellwiki/Hpc">ability to do coverage profiling</a>. Adding it to the default testing procedure was just a couple of extra options. This is very cool, almost too easy, and looks pretty good. The only downside is that it exposes my sloppy attitude to testing.</p>]]></description>
    <pubDate>Thu, 31 Jul 2008 15:05:41 UT</pubDate>
    <guid>http://blog.malde.org/posts/updates-and-other-trivialities.html</guid>
</item>

    </channel> 
</rss>

