3
07
2009
Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs. As the technology is radically different, so are the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.
The Haskell bioinformatics library has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche’s 454 sequencing. Once the library functionality is in place, it is easy to develop small tools for doing the various chores. After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort put I down in the library, I’ve instead written a few programs myself, including tools information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frepair). This is their story. Read more…
16
06
2009
Dephd is a small application for performing various analysis of nucleotide sequences. Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that. A new update was just pushed onto HackageDB, this is just a quick note describing new features. Read more…
14
05
2009
Until now (version 0.3.5) of the bioinformatics library, the Sequnce data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure. This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you from comparing a nucleotide sequence to a protein letter by letter. We’d like some more safety without sacrificing flexibility, and by using phantom types we can get this. Below is my attempt at implementing this.
Read more…
13
03
2009
In my vacation, I experimented with phantom types for the Sequence data type. Bascially, we want nucleotide and protein sequences to have the same representation, and mostly use the same algorithms, but sometimes we need to distinguish them, so as not to inadvertently treat a protein as a nucleotide sequence. A more detailed writeup is in the works, but currently, I’ve pushed the darcs repo to http://malde.org/~ketil/biohaskell/biolib-phantom/ so if everything works out, this will be the next release (0.4). (Note to self: we now have a stable and a development branch. Almost like a serious and all grown up software project. Professionality - Yay!)
Also, since short reads are all the rage, and my flower program appears to be used a bit, I’ve done a quick writeup of its features and use as a static page. I’ll try to keep it updated as things progress. Popularity - Yay!
Finally, I got some help compiling everything on some less mainstream operating systems (”Windows”, I think it is called). Mostly, things appear to work, and some improvements - albeit portability-neutral ones - were made. Portability - Yay!.
11
02
2009
Time flees. It’s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks in topics I mostly managed to follow and meet interesting and interested people. Thanks to the organizers and committees making it all possible. I presented a paper on Bloom filters that Bryan O’Sullivan and I wrote, and thought I’d make the paper available along with the slides (expanded somewhat), and a couple of ideas for extending Bloom filters that I think are original (or “novel”, as they say in science).
Read more…
14
11
2008
Roche’s 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude. Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify — and maybe also reduce the number or severity of — errors. This means reading the binary SFF format. Below, we’ll dissect the SFF format, and describe a Haskell implementation.
Read more…
24
10
2008
I’ve been spending the last couple of weeks working on an indexing scheme for sequences, using Bryan O’Sullivan’s Bloom filters. Now, it turned out that when Bryan tested out the code, he found a curious problem: Apparently, the indexing stage scaled quadratically with sequence length. This wouldn’t have been so strange, were it not for the fact that I saw the expected linear time usage when I ran the code. Some more digging about revealed that my laptop also showed quadratic scaling. Profiling showed the culprit to be a simple pipeline-style function:
swords s = take (fromIntegral (seqlength s)+1-k) . map (B.take k) . B.tails . B.map toUpper $ seqdata s
Here, seqdata returns a lazy bytestring, which is also what’s hiding behind the B. qualifier. Basically, this just builds the list of all lenght-k substrings. This should, in my opinion, stream nicely, and run in constant space and linear time. What on earth makes this quadratic? On only some systems? Time to dissect the systems in question:
Read more…
9
09
2008
It was just brought to my attention that people have started to use a new file format for sequences. This format, called ‘FastQ’ combines both the sequence data itself and the quality data in one file. That’s a nice idea, and I implemented support for it, tests, docs and all, in the bio library. Runs fast, too. Basically, the format is a sequence of records, each one similar to this:
Read more…
14
08
2008
In my copious spare time, I’ve started putting together a tutorial for using the biohaskell library. It’ll probably take some time — anything from a while to an eternity — until it’s complete, but I thought I’d follow the adage of “release early, release often” in the hope that the intermediate product may prove useful to somebody, somewhere. It shouldn’t really require any prior knowledge of Haskell nor of biology. Anyway, it’s here, please take a look, and tell me what you think!
In other news, inspired by an article on programmer productivity at LWN, I ran darcs-graph on my code to see how I do. I guess I consider myself an about average programmer, and it looks like I can average about five commits a day when I’m working on a project. I’m occasionally touching 20 commits, but that’s probably a built-up backlog of patches. Let’s see where this puts me:
Read more…
31
07
2008
Bloom filters is apparently a relatively old technology, dating from the 1970s or so, but it has somehow escaped my radar until Bryan O’Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book. You can read all about Bloom filters on Wikipedia, but the executive summary of it is that it is a structure similar to Data.Set. Except that it is probabilistic, and may occasionally claim a value is a member when it’s not. On the positive side, the Bloom filter is very fast, and speed is independent on the size — in other words, lookup and insert is O(1) where Data.Set is O(log n).
Comparing sequences to find similarity is a common occurrence in bioinformatics. For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index sequences in questions as overlapping, substrings (k-tuples, q-grams). This index seems like an obvious target for Bloom filters — large data, time critical, some false positives anyway — but for some reason, there is almost no such applications that use them. Until now.
Read more…