Biohaskell

bioinformatics and haskell

20  07 2010

Updated software versions available

I’ve just uploaded new versions of various software to HackageDB.  If you have cabal-install on your system, it should now be possible to do e.g. cabal install flowsim to automatically download and compile the program and its dependencies.

bio-0.4.6: A bioinformatics library

The library provides functions and data structures to work with various kinds of bioinformatics data.  It has its own page here.  New features include support for BLAT’s PSL format, fixes to SFF handling, and a restriction of the binary dependency to <0.5 to maintain necessary laziness.

flowsim-0.2.6: A simulator for 454 pyrosequencing data

Flowsim is new on Hackage, but also has its own page. The current version is split into two parts: clonesim, which simulates fragmenting of a genome, and flowsim proper, which simulates flowgrams from the sequences and does base and quality calling in order to output SFF files similar to those generated from the real thing.  Flowsim will be presented at ECCB’10 in Ghent at the end of September.

flower-0.3: A set of tools to work with pyrosequencing data

This is a (dare I say) bouquet of various tools for working on and with 454 data.  Flower itself extracts various information from SFF files, and is documented here, but the package includes other tools as well, namely: the quite renamable flowt, which attempts to remove duplicate clones, an artifact that appears to occur frequently in these data; the more appropriately named frename, which relabels reads uniquely (useful if downstream software requires it); and frecover, which recovers corrupted SFF files, a problem that happened to us once, but so far hasn’t happened again.

They should all now be a cabal install away, so if you use these, please let me know how you fare, and whether you find them useful!  I also try to provide Linux binaries for recent versions, and hope to provide proper Linux distribution packages (you know, .debs and .rpms) in the future. If you want to help out, there are also the darcs repositories for biolib, flower, and flowsim.


07 2010

A quick count of popular Haskell libraries in Debian and Ubuntu

Don Stewart recently posted a summary of Hackage downloads, which can be seen as a metric of a package’s popularity.  Of course, there are many other ways to acquire libraries and applications: some may prefer to get the bleeding edge right from the source repository, while others prefer the comfort of their distribution’s repositories.  For the latter, Debian (and by extension, Ubuntu) has a popularity contest package that inspects the system and reports back the list of installed packages and their status.  The data are readily available, so I downloaded the summaries from Debian and Ubuntu. Grepping for libghc6-.*-dev gives this:
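For those who would rather stay in Haskell than reach for grep, the same filtering can be sketched as below. This is only an illustration: it assumes each line of the popularity contest summary is whitespace-separated with the rank first, the package name second and the installation count third, and the function names (isHaskellDevPackage, parseLine, haskellLibs) are made up for this example.

```haskell
import Data.List (isPrefixOf, isSuffixOf, sortBy)
import Data.Maybe (mapMaybe)
import Data.Ord (comparing)

-- | True for package names of the form libghc6-<something>-dev.
isHaskellDevPackage :: String -> Bool
isHaskellDevPackage name =
  prefix `isPrefixOf` name
    && suffix `isSuffixOf` name
    && length name >= length prefix + length suffix
  where
    prefix = "libghc6-"
    suffix = "-dev"

-- | Parse one summary line into (package name, installation count).
-- Assumed layout: rank, name, installations, then further columns.
-- Header and comment lines fail to parse and yield Nothing.
parseLine :: String -> Maybe (String, Int)
parseLine l = case words l of
  (_rank : name : inst : _)
    | [(n, "")] <- reads inst -> Just (name, n)
  _ -> Nothing

-- | All Haskell library packages in a summary, most installed first.
haskellLibs :: String -> [(String, Int)]
haskellLibs =
  sortBy (flip (comparing snd))
    . filter (isHaskellDevPackage . fst)
    . mapMaybe parseLine
    . lines
```

Feeding it the downloaded summary, e.g. with readFile and then haskellLibs, gives the list of (package, installations) pairs; unlike the unanchored grep, this version only looks at the package name column.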

Read more…


22  05 2010

Snagged!

Recently, I’ve been burned by a couple of, eh, issues.  Not exactly bugs, but some hidden surprises that have taken some work to iron out.  Below I’ll make a quick writeup of symptoms, diagnoses, and remedies, in the hope that other people running into the same problems will find it useful.

Read more…


19  02 2010

Tools for pyrosequencing analysis

I recently did a brief presentation of the set of tools I’ve developed for analyzing pyrosequences (the Roche 454 variety).  Nothing spectacular, just an overview of various ways of slicing and dicing SFF files using tools written in Haskell.  For lack of a better place to put it, I’ll drop my slides below.

flowers


14  12 2009

Searching for poly(A) tails

I’m currently involved in a project where we study, among other things, the 3′UTR and poly-A tails of certain genes. For this, it is of course important to accurately identify the poly-A tail in each transcript, but I couldn’t find any program or tool to do just that. Presumably the task is considered too trivial? So, like many other “trivial” tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.

Here’s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.
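To give a flavour of what a linear-time, quality-adjusted scan can look like, here is a minimal sketch. It is not necessarily the method described in the full post: the scoring scheme (reward an A, penalize anything else, both weighted by the probability that the base call is correct), the penalty parameter, and the names baseScore and polyATailLength are all assumptions made for illustration.

```haskell
import Data.List (foldl')

-- | Score of one base call, given its Phred quality q.
-- The error probability is 10^(-q/10); a confident 'A' scores close
-- to +1, a confident non-'A' close to -penalty, and low-quality calls
-- count for less either way. (Hypothetical scoring scheme.)
baseScore :: Double -> Char -> Int -> Double
baseScore penalty base q
  | base `elem` "Aa" = confidence
  | otherwise        = negate (penalty * confidence)
  where
    confidence = 1 - 10 ** (negate (fromIntegral q) / 10)

-- | Length of the best-scoring suffix of a read, given as a list of
-- (base, quality) pairs. The read is walked once from the 3' end while
-- tracking the running score and the position where it peaked, so the
-- work is linear in the read length.
polyATailLength :: Double -> [(Char, Int)] -> Int
polyATailLength penalty calls = tailLen
  where
    (_, _, _, tailLen) = foldl' step (0, 0, 0, 0) (reverse calls)
    step (len, total, best, bestLen) (b, q) =
      let len'   = len + 1 :: Int
          total' = total + baseScore penalty b q
      in if total' > best
           then (len', total', total', len')
           else (len', total', best, bestLen)
```

With a penalty of 2, a single high-quality mismatch inside a long run of A’s is tolerated, while a read that does not end in A’s gets a tail length of 0.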

Read more…


10 2009

Installing the software on Ubuntu 9.10

Ubuntu 9.10, nicknamed Karmic Koala, is about to be released, and in a moment of idleness, I upgraded my old 9.04 install to the latest beta.  Upgrading always generates a slight feeling of dread, taking the plunge from the cozy stability of bugs I’ve learned to work around into the great unknown, but it all went even smoother than the previous one.  And on the plus side, ghc is now, finally, upgraded to an almost modern release (6.10.4), and lots of libraries are included as well.  Great work by Joachim Breitner and his army of debianizers.  So I’m all ready to take advantage of my new compiler and its improvements, but first I need to bring all my software up to speed.  I’ll make notes here as I go along, and hopefully this will be useful also for users of other Linux distributions.

Read more…


15  09 2009

A (too) brief Biohaskell presentation

I was recently in Trondheim, and got an opportunity to present Haskell to an audience of bioinformaticians.  Alas, it is hard to describe Haskell in all its glory to the uninitiated in forty-five minutes, especially when I also wanted to talk a bit about the application to bioinformatics.  I left in the belief that I managed to communicate some of the ideas, and submit the slides and other material here for posterity.  (I’m happy to receive comments, too, just in case I do a revised version of the talk later on.)


31  08 2009

Parsing ints

A recurring theme on the Haskell mailing lists is how to quickly parse a file consisting of integers.  Often, this comes up in the context of benchmarking, but a real example of integer-filled files is the quality data that often accompany Fasta sequence files.  When investigating one of my programs that seemed a bit on the slow side, I discovered that although parsing Fasta sequence files (with readFasta from Bio.Sequence.Fasta) runs at a rate in excess of 200MB/s on my laptop, reading sequences with quality data (using readFastaQual) was much slower, about 2-3 MB/s.   After some investigation and a few rewrites, it’s up to about 15MB/s, but still pretty far from plain sequence.  Below are three (and a half) different versions, in the hope that somebody can improve on them even further.
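As a point of reference, one common idiom for this kind of task uses readInt from Data.ByteString.Char8. The sketch below is not one of the versions from the full post and makes no claim about their speed; the name parseInts is made up for illustration.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)
import Data.List (unfoldr)

-- | Parse whitespace-separated integers from a strict ByteString.
-- Each step skips leading whitespace and lets readInt consume one
-- number; parsing stops at the end of input or at the first token
-- that is not an integer.
parseInts :: B.ByteString -> [Int]
parseInts = unfoldr (B.readInt . B.dropWhile isSpace)
```

For a line of quality values such as "40 40 32", parseInts (B.pack "40 40 32") yields [40,40,32]. Because the result is produced lazily by unfoldr, the caller can consume the numbers as they are parsed.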

Read more…


07 2009

A set of tools for working with 454 sequences

Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs.  As the technology is radically different, so is the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.

The Haskell bioinformatics library has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche’s 454 sequencing.  Once the library functionality is in place, it is easy to develop small tools for doing the various chores.   After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort I put down in the library, I’ve instead written a few programs myself, including tools for information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frepair).  This is their story. Read more…


16  06 2009

Dephd updates

Dephd is a small application for performing various analyses of nucleotide sequences.  Originally, it was used for analyzing/converting PHD-file output from the basecaller phred, but it has since grown a bit beyond that.  A new update was just pushed onto HackageDB; this is just a quick note describing new features. Read more…

