19
02
2010
I recently gave a brief presentation of the set of tools I’ve developed for analyzing pyrosequences (the Roche 454 variety). Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell. For lack of a better place to put it, I’ll drop my slides below.
flowers
14
12
2009
I’m currently involved in a project where we study, among other things, the 3′UTR and poly-A tails of certain genes. For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn’t find any program or tool to do just that. Presumably the task is considered too trivial? So, like many other “trivial” tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.
Here’s a better method that identifies poly-A tails by finding an optimal, quality-adjusted alignment in linear time.
Read more…
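To give a flavour of what a linear-time, quality-adjusted tail search can look like, here is a minimal sketch. It is not necessarily the method from the full post: the scoring scheme (+1 for an A, a penalty of up to 4 for any other base, scaled by how confident the base call is) and the names `baseScore` and `polyATailLength` are assumptions made purely for illustration. The idea is to walk the read once from the 3′ end, keep a running score, and remember the suffix length at which the score peaked.

```haskell
import Data.List (foldl')

-- | Hypothetical per-base score (an assumption, not a published scheme):
-- an 'A' scores +1; any other base is penalised by up to 4, scaled by the
-- probability that the base call is correct. A phred quality q
-- corresponds to an error probability of 10^(-q/10).
baseScore :: Char -> Int -> Double
baseScore base q
  | base `elem` "aA" = 1
  | otherwise        = -4 * (1 - pErr)
  where
    pErr = 10 ** (negate (fromIntegral q) / 10)

-- | Length of the best-scoring suffix of the read, or 0 if no suffix
-- scores above zero. A single pass over the read, so linear time.
polyATailLength :: String -> [Int] -> Int
polyATailLength bases quals = bestLen
  where
    (_, _, _, bestLen) =
      foldl' step (0, 0, 0, 0) (reverse (zip bases quals))
    -- State: (suffix length so far, running score, best score, best length)
    step (len, score, best, bestL) (b, q) =
      let len'   = len + 1
          score' = score + baseScore b q
      in if score' > best
           then (len', score', score', len')
           else (len', score', best, bestL)
```

Loaded into GHCi, `polyATailLength "ACGTAAAAAA" (replicate 10 40)` picks out the six trailing As. A confidently called non-A base cuts the tail short, while the same base with quality 0 costs nothing, so the tail extends through it.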
7
10
2009
Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta. Upgrading always generates a slight feeling of dread, taking the plunge from the cozy stability of bugs I’ve learned to work around into the great unknown, but it all went even more smoothly than the previous upgrade. And on the plus side, ghc is now, finally, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well. Great work by Joachim Breitner and his army of debianizers. So I’m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed. I’ll make notes here as I go along, and hopefully this will also be useful for users of other Linux distributions.
Read more…
15
09
2009
I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians. Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, especially when I also wanted to talk a bit about the application to bioinformatics. I left in the belief that I had managed to communicate some of the ideas, and submit the slides and other material here for posterity. (I’m happy to receive comments, too, in case I do a revised version of the talk later on.)
31
08
2009
A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers. Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompanies Fasta sequence files. When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with readFasta from Bio.Sequence.Fasta) proceeds at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using readFastaQual) was much slower, about 2-3 MB/s. After some investigation and a few rewrites, it’s up to about 15MB/s, but still pretty far from plain sequence. Below are three (and a half) different versions, in the hope that somebody can improve on it even further.
Read more…
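For readers who just want the general shape of the problem, here is a minimal sketch of one common approach: repeatedly calling `readInt` from `Data.ByteString.Char8` on the remaining input. This is not one of the versions from the full post, nor the actual implementation of readFastaQual, and the name `parseInts` is my own for illustration. I make no claim about how fast it runs compared with the numbers above.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)
import Data.List (unfoldr)

-- | Parse whitespace-separated decimal integers from a ByteString.
-- Parsing stops at the end of input or at the first token that is
-- not an integer; anything after that point is silently ignored.
parseInts :: B.ByteString -> [Int]
parseInts = unfoldr step
  where
    -- readInt does not skip leading whitespace, so drop it first.
    step s = B.readInt (B.dropWhile isSpace s)
```

Applied to a line of quality values, such as `parseInts (B.pack "40 40 35\n12 7")`, this yields the five integers in order. The appeal of this style is that it never converts the input to a `String`, which is usually where the time goes.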
3
07
2009
Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs. As the technology is radically different, so is the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.
The Haskell bioinformatics library has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche’s 454 sequencing. Once the library functionality is in place, it is easy to develop small tools for doing the various chores. After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort I put down in the library, I’ve instead written a few programs myself, including tools for information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frepair). This is their story. Read more…
16
06
2009
Dephd is a small application for performing various analyses of nucleotide sequences. Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that. A new update was just pushed onto HackageDB; this is just a quick note describing the new features. Read more…
14
05
2009
Until now (version 0.3.5 of the bioinformatics library), the Sequence data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure. This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you from comparing a nucleotide sequence to a protein letter by letter. We’d like some more safety without sacrificing flexibility, and by using phantom types we can get this. Below is my attempt at implementing this.
Read more…
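As a quick illustration of the phantom type idea in general, here is a minimal sketch. It is not the library's actual definitions: the names `Seq`, `Nuc`, `Amino`, `nucleotides`, `protein`, `seqLength` and `revComplement` are all made up for this example. The type parameter never appears on the right-hand side, so both kinds of sequence share one representation, yet the compiler still keeps them apart.

```haskell
import Data.Char (toUpper)

-- Empty marker types; they exist only at the type level.
data Nuc
data Amino

-- | One representation for every kind of sequence. The parameter t is
-- a phantom: it does not occur in the constructor's fields.
newtype Seq t = Seq String
  deriving (Eq, Show)

nucleotides :: String -> Seq Nuc
nucleotides = Seq

protein :: String -> Seq Amino
protein = Seq

-- | Generic functions leave the phantom parameter polymorphic.
seqLength :: Seq t -> Int
seqLength (Seq s) = length s

-- | Functions that only make sense for nucleotides say so in their type.
revComplement :: Seq Nuc -> Seq Nuc
revComplement (Seq s) = Seq (reverse (map complement s))
  where
    complement c = case toUpper c of
      'A'   -> 'T'
      'T'   -> 'A'
      'C'   -> 'G'
      'G'   -> 'C'
      other -> other
```

With these definitions, `seqLength` works on both `nucleotides "AACG"` and `protein "MKV"`, while `revComplement (protein "MKV")` and `nucleotides "AACG" == protein "AACG"` are both rejected at compile time. That is the safety we are after, and no data is copied or converted to get it.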
13
03
2009
During my vacation, I experimented with phantom types for the Sequence data type. Basically, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence. A more detailed writeup is in the works, but currently, I’ve pushed the darcs repo to http://malde.org/~ketil/biohaskell/biolib-phantom/ so if everything works out, this will be the next release (0.4). (Note to self: we now have a stable and a development branch. Almost like a serious and all grown up software project. Professionality – Yay!)
Also, since short reads are all the rage, and my flower program appears to be used a bit, I’ve done a quick writeup of its features and use as a static page. I’ll try to keep it updated as things progress. Popularity – Yay!
Finally, I got some help compiling everything on some less mainstream operating systems (“Windows”, I think it is called). Mostly, things appear to work, and some improvements – albeit portability-neutral ones – were made. Portability – Yay!
11
02
2009
Time flees. It’s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks on topics I mostly managed to follow and to meet interesting and interested people. Thanks to the organizers and committees for making it all possible. I presented a paper on Bloom filters that Bryan O’Sullivan and I wrote, and thought I’d make the paper available along with the slides (expanded somewhat), and a couple of ideas for extending Bloom filters that I think are original (or “novel”, as they say in science).
Read more…