BHLog

bioinformatics and haskell

09 2008

The FastQ file format for sequences

It was just brought to my attention that people have started to use a new file format for sequences.  This format, called ‘FastQ’, combines the sequence data itself and the quality data in one file.  That’s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs fast, too.  Basically, the format is a sequence of records, each one similar to this:
Read more…


14  08 2008

The wee beginnings of a biohaskell tutorial? — and some thoughts on programming productivity

In my copious spare time, I’ve started putting together a tutorial for using the biohaskell library.  It’ll probably take some time — anything from a while to an eternity — until it’s complete, but I thought I’d follow the adage of “release early, release often” in the hope that the intermediate product may prove useful to somebody, somewhere.  It shouldn’t really require any prior knowledge of Haskell nor of biology.  Anyway, it’s here, please take a look, and tell me what you think!

In other news, inspired by an article on programmer productivity at LWN, I ran darcs-graph on my code to see how I do.  I guess I consider myself a roughly average programmer, and it looks like I average about five commits a day when I’m working on a project.  I occasionally touch 20 commits, but that’s probably a built-up backlog of patches.  Let’s see where this puts me:

Read more…


31  07 2008

A plan for Bloom filters

Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they had somehow stayed under my radar until Bryan O’Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on Wikipedia, but the executive summary is that a Bloom filter is a structure similar to Data.Set, except that it is probabilistic, and may occasionally claim a value is a member when it’s not.  On the positive side, the Bloom filter is very fast, and its speed is independent of the size: lookup and insert are O(1) where Data.Set is O(log n).
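To make the summary above concrete, here is a minimal sketch of the idea using only the base library. It is not the API of the implementation announced on the mailing list: the names `Bloom`, `empty`, `insert`, `member` and `hashWith` are made up for illustration, the hash function is a toy, and the bit array is stored in an `Integer` for brevity. That last shortcut means `insert` copies the whole array, so this version does not actually achieve O(1) updates. A serious implementation would presumably use a mutable unboxed array.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- | A toy Bloom filter: 'size' bits stored in an Integer,
-- probed by 'hashes' differently salted hash functions.
data Bloom = Bloom
  { size   :: Int      -- number of bits in the filter
  , hashes :: Int      -- number of hash functions
  , bits   :: Integer  -- the bit array itself
  }

-- | An empty filter with m bits and k hash functions (m must be positive).
empty :: Int -> Int -> Bloom
empty m k = Bloom m k 0

-- | A simple salted polynomial string hash (illustrative, not high quality).
hashWith :: Int -> String -> Int
hashWith salt = foldl' step (salt * 7919 + 17)
  where step h c = (h * 31 + ord c) `mod` 1000000007

-- | The bit positions probed for a given key.
positions :: Bloom -> String -> [Int]
positions b key = [ hashWith s key `mod` size b | s <- [1 .. hashes b] ]

-- | Set every bit the key hashes to.
insert :: String -> Bloom -> Bloom
insert key b = b { bits = foldl' setBit (bits b) (positions b key) }

-- | True means "possibly present", False means "definitely absent".
member :: String -> Bloom -> Bool
member key b = all (testBit (bits b)) (positions b key)
```

Loaded into GHCi, `member "ACGT" (insert "ACGT" (empty 1024 3))` is always `True`, since an inserted key sets all of its own bits. A key that was never inserted usually gives `False`, but may give `True` if its bits happen to have been set by other keys. That is the false positive mentioned above.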

Comparing sequences to find similarity is a common occurrence in bioinformatics.  For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index the sequences in question as overlapping substrings (k-tuples, q-grams).  This index seems like an obvious target for Bloom filters (large data, time critical, some false positives anyway), but for some reason, there are almost no applications that use them. Until now.
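As an illustration of what indexing by overlapping substrings means, here is a small sketch using only the base library. The function names `kmers` and `index` are invented for this example and are not taken from the bio library. The index is an exact, naive association list, which is the kind of structure a Bloom filter could approximate when only membership matters.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn, tails)

-- | All overlapping substrings of length k, each paired with its
-- 0-based offset in the sequence.
kmers :: Int -> String -> [(String, Int)]
kmers k s
  | k <= 0    = []
  | otherwise = zip (takeWhile ((== k) . length) (map (take k) (tails s))) [0 ..]

-- | A naive exact index: each distinct k-mer with all offsets where it occurs.
index :: Int -> String -> [(String, [Int])]
index k s =
  [ (w, map snd grp)
  | grp@((w, _) : _) <- groupBy ((==) `on` fst) (sortOn fst (kmers k s)) ]
```

For example, `index 2 "ACACG"` evaluates to `[("AC",[0,2]),("CA",[1]),("CG",[3])]`, so `lookup "AC" (index 2 "ACACG")` finds both occurrences of the 2-tuple.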

Read more…


31  07 2008

Updates and other trivialities

Just some quick notes:

Hackage submissions updated

There seem to have been problems with some of the bioinformatics applications on Hackage; thanks to Don S. for pointing it out. That should be fixed now by new uploads, but I’m still waiting for the automatic builds to register results. And, since you ask, it was all my fault for being sloppy with version dependencies. It’d all work with a recent biolib. Speaking of which,

A home page for the bioinformatics library

I’ve finally updated the static home page for the library; it can be admired (especially if you remember what Oscar Wilde had to say about that) here.

I’ve discovered HPC

No, not high-performance computing (which sucks anyway), but GHC’s new ability to do coverage profiling.  Adding it to the default testing procedure took just a couple of extra options, and the results can be admired by browsing the HTML files in the darcs repository.  This is very cool, almost too easy, and looks pretty good. The only downside is that it exposes my sloppy attitude to testing.


11  07 2008

Functional bash: bracketing

Functional programming expands your mind

My current development project is an EST pipeline.  For various reasons, it is implemented in shell: bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.

As in any program, there are many occasions where you want to effect some particular change during some part of the program.  The archetypical example is allocation of local variables. After allocation, the variables are available to the program until they go out of scope, at which point they are deallocated automatically.  The technique can be generalized beyond this.  For instance, you (or rather I) may want to set a $STAGE variable that indicates the current processing stage, and which should be unset when the stage has finished executing.  Or you may want to run some processing in a different directory, in which case you really want to remember to return to the previous directory when you finish.  The purpose of bracketing is to wrap a section of code with an initial part to be run in advance, and a final part to be run afterwards.
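For comparison, Haskell’s base library provides this pattern as `bracket` and `bracket_` in Control.Exception. The sketch below is not the bash implementation discussed in this post. It mirrors the $STAGE example in Haskell, with a hypothetical helper `withStage` and an `IORef` standing in for the shell variable.

```haskell
import Control.Exception (bracket_)
import Data.IORef (IORef, readIORef, newIORef, writeIORef)

-- | Run an action with the stage reference set to the given name,
-- and unset it again afterwards, even if the action throws an exception.
withStage :: IORef (Maybe String) -> String -> IO a -> IO a
withStage ref name =
  bracket_ (writeIORef ref (Just name))  -- initial part, run in advance
           (writeIORef ref Nothing)      -- final part, run afterwards

-- | A new stage reference, initially unset.
newStage :: IO (IORef (Maybe String))
newStage = newIORef Nothing

-- | Read the current stage, if any.
currentStage :: IORef (Maybe String) -> IO (Maybe String)
currentStage = readIORef
```

With `ref <- newStage`, the call `withStage ref "masking" (currentStage ref)` returns `Just "masking"`, and a later `currentStage ref` returns `Nothing`. Note that this version unsets the stage rather than restoring a previous value, so it does not nest. Restoring the old value would need `bracket` with the saved state passed to the cleanup.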

Read more…


27  05 2008

Giving users what they want: Haskell scripts on the web with CGI

Minke whale (source: Wikimedia)

As a consequence of IWC policies, the Institute of Marine Research is required to store genetic identification of each minke whale that is hunted. This of course means that people will come to me for help in bridging the gap between test tubes and the databases by providing some analysis tools that can extract the information from the data.

Well, I’m not complaining, work security and all that, but it does leave me with a slight problem. I’m a Linux user, and my experience with non-Unix OSes after I passed on my Amiga 500 to my little brother is rather limited. (It’s not that I won’t fix your PC, rather, it’s quite likely that I can’t.) But the users I support in this particular case are largely confined by the walls of Redmond, and not terribly enthusiastic about broadening their horizons in that regard. Here’s one way to solve it. Read more…


18  05 2008

Optimization week: making Haskell go faster

It seems to be optimization week on the Haskell Café mailing list. Inspired by a combination of Don Stewart’s blog post about how to optimize for speed and the sorry performance of my xml2x program, I thought this would be a good time to see if things could be improved. In its current state, xml2x takes a few hours to process a few days’ worth of BLASTX output, so it’s far from critical, but faster is always better, and reading Don’s post, the intermediate output from the compiler, a.k.a. GHC Core, didn’t really look so scary. Read more…


05 2008

Can you spare five minutes?

…to write a simple, but useful and efficient bioinformatics program? Here’s how to build a simple tool to extract a clustering from an ACE file, using functionality in the bioinformatics library. It is about three lines of actual code, it is fast, and it is efficient.

Read more…


05 2008

Cleaning up sequences

The first challenge when dealing with sequence data is removing vector and contaminants and other undesirable stuff. I’ve been somewhat unhappy with the current state of my EST pipeline, and investigated more closely what is going on. Read more…


25  04 2008

Welcome to the biohaskell blog!