BHLog

bioinformatics and haskell

Optimization again: befuddled by bytestrings

I’ve spent the last couple of weeks working on an indexing scheme for sequences, using Bryan O’Sullivan’s Bloom filters.  When Bryan tested the code, he found a curious problem:  the indexing stage apparently scaled quadratically with sequence length.  This wouldn’t have been so strange, were it not for […]

The FastQ file format for sequences

It was just brought to my attention that people have started to use a new file format for sequences.  This format, called ‘FastQ’, combines the sequence data itself and the quality data in one file.  That’s a nice idea, and I implemented support for it, tests, docs and all, in the bio library.  Runs […]

A plan for Bloom filters

Bloom filters are apparently a relatively old technology, dating from the 1970s or so, but they somehow stayed off my radar until Bryan O’Sullivan posted a message to the Haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book.  You can read all about Bloom filters on […]

Functional bash: bracketing

My current development project is an EST pipeline.  For various reasons, it is implemented in shell — bash, to be exact.  In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.
As in any program, there are many occasions where […]

Cleaning up sequences

The first challenge when dealing with sequence data is removing vector, contaminants, and other undesirable material. I’ve been somewhat unhappy with the current state of my EST pipeline, so I investigated more closely what is going on.