Biohaskell

bioinformatics and haskell

Searching for poly(A) tails

I’m currently involved in a project where we study, among other things, the 3′UTR and poly-A tails of certain genes. For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn’t find any program or tool to do just that. Presumably the task is considered too [...]

Dephd updates

Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing and converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB; this is a quick note describing the new features.

Optimization again: befuddled by bytestrings

I’ve spent the last couple of weeks working on an indexing scheme for sequences, using Bryan O’Sullivan’s Bloom filters.  When Bryan tested the code, he found a curious problem: apparently, the indexing stage scaled quadratically with sequence length.  This wouldn’t have been so strange, were it not for [...]

The FastQ file format for sequences

It was just brought to my attention that people have started to use a new file format for sequences.  This format, called ‘FastQ’, combines both the sequence data itself and the quality data in one file.  That’s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs [...]

A plan for Bloom filters

Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow stayed off my radar until Bryan O’Sullivan posted a message to the Haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on [...]

Functional bash: bracketing

My current development project is an EST pipeline.  For various reasons, it is implemented in shell — bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.
As in any program, there are many occasions where [...]

Cleaning up sequences

The first challenge when dealing with sequence data is removing vector sequence, contaminants, and other undesirable stuff. I’ve been somewhat unhappy with the current state of my EST pipeline, so I investigated more closely what is going on.