BHLog

bioinformatics and haskell

31  07 2008

The Haskell Bioinformatics Library

This is a collection of data structures and algorithms that most of the other software on this site depends on. Much of the functionality is stable, robust, and even well documented; some is less so. I largely apply the itch-scratching software development process (ISSDP), so the current feature set and its level of completeness depend on what I need in my other work. If you don’t like that, I am happy to accept your patches :-)

Acquiring it

Either:

or:

  • Download the tarball from Hackage

Building it

To install, you need a working GHC (other Haskell systems may also work). You also need the following external libraries:

  • QuickCheck — for unit tests
  • binary — mainly for dealing with the TwoBit sequence format
  • tagsoup — for parsing XML output from Blast
  • Parsec — for dealing with various file formats

You should be able to get what you need from http://hackage.haskell.org/.

You can then build with ‘make’, followed by either ‘make install’ if you can sudo, or ‘make user_install’ if you cannot. Of course, the Makefile just proxies for the regular Cabal routine, which will work just as well:

runhaskell Setup configure
runhaskell Setup build
sudo runhaskell Setup install

(Pass --prefix=$HOME to the configure step, and remove the sudo from the install step, if you don’t want to install as root.)

Using it

The best tutorial is probably my other code, although there is not much of it, I know. In particular, the cluster_tools package contains a bunch of small, self-contained utilities that can provide a starting point. Apart from that, there’s the code itself, and the Haddock documentation. And of course I’ll try to answer any questions, and help out any way I can.

The current list of features includes:

Sequence data

Support for protein and nucleotide sequences and conversion between them, quality data, reading and writing FASTA and FastQ formatted files, and reading TwoBit and PHD formats.
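To give a feel for what FASTA reading and nucleotide-to-protein conversion involve, here is a minimal, self-contained sketch using only base. It is not the library's API: the names FastaRecord, parseFasta and translate are hypothetical, chosen for this illustration, and the translation assumes the standard genetic code on DNA letters (no U, no ambiguity codes).

```haskell
module Main where

import Data.Char (toUpper)
import Data.List (elemIndex)

-- | Hypothetical record for one FASTA entry; not the library's own type.
data FastaRecord = FastaRecord
  { recLabel :: String  -- header line without the leading '>'
  , recData  :: String  -- residues with line breaks removed
  } deriving (Show, Eq)

-- | Split FASTA text into records. Empty lines and anything before the
-- first header line are ignored.
parseFasta :: String -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . filter (not . null) . lines
  where
    isHeader l = take 1 l == ">"
    go [] = []
    go (h : rest) =
      let (body, more) = break isHeader rest
      in FastaRecord (drop 1 h) (concat body) : go more

-- | Standard genetic code, with codons ordered by base in the order TCAG.
codonTable :: String
codonTable = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

-- | Translate DNA in reading frame 0. Codons containing a letter other
-- than A, C, G or T become 'X'; a trailing partial codon is dropped.
translate :: String -> String
translate (a : b : c : rest) = codon : translate rest
  where
    codon = case mapM baseIndex [a, b, c] of
      Just [i, j, k] -> codonTable !! (16 * i + 4 * j + k)
      _              -> 'X'
    baseIndex x = elemIndex (toUpper x) "TCAG"
translate _ = []

main :: IO ()
main = do
  let recs = parseFasta ">seq1 example\nATGGCC\nTAA\n>seq2\nGGCC\n"
  -- Print each label followed by the translation of its sequence.
  mapM_ (\r -> putStrLn (recLabel r ++ ": " ++ translate (recData r))) recs
```

The sketch keeps sequences as plain Strings for brevity; a real library would more likely use a packed representation to cope with large files.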

Alignments

Rudimentary support for doing alignments (including dynamic adjustment of scores based on sequence quality) and for parsing BLAST output.
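As a rough illustration of the kind of computation an alignment involves, the following sketch scores a global alignment with a linear gap penalty, building the dynamic programming matrix one row at a time. It is a textbook-style example and not the library's implementation: the Scoring type, the alignScore function and the score values are assumptions made for this example, and the quality-based score adjustment mentioned above is not shown.

```haskell
module Main where

import Data.List (foldl')

-- | Hypothetical scoring parameters; the values in 'main' are illustrative.
data Scoring = Scoring
  { matchScore    :: Int
  , mismatchScore :: Int
  , gapScore      :: Int  -- penalty per gap position (normally negative)
  }

-- | Score of the best global alignment of two sequences.
alignScore :: Scoring -> String -> String -> Int
alignScore sc xs ys = last (foldl' nextRow firstRow xs)
  where
    g = gapScore sc
    -- Row 0: the empty prefix of xs aligned against each prefix of ys.
    firstRow = take (length ys + 1) (iterate (+ g) 0)
    -- Build the next row from the previous one; the row refers to itself
    -- lazily to obtain the cell to the left.
    nextRow [] _ = []
    nextRow prev@(p0 : ps) x = row
      where
        row = (p0 + g) : zipWith3 cell ys (zip prev ps) row
        cell y (diag, up) left =
          maximum [diag + sub x y, up + g, left + g]
    sub a b
      | a == b    = matchScore sc
      | otherwise = mismatchScore sc

main :: IO ()
main = do
  let sc = Scoring { matchScore = 1, mismatchScore = -1, gapScore = -2 }
  print (alignScore sc "GATTACA" "GATTACA")
  print (alignScore sc "ACGT" "AGT")
```

Only the score is computed here; recovering the alignment itself would require keeping the whole matrix or traceback pointers.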

Support for reading ACE files, as output by e.g. CAP3 and Phrap.

Partly implemented single linkage clustering, and multiple alignment.
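For readers unfamiliar with single linkage clustering: given pairs of sequences judged similar enough (for example by an alignment score above some threshold), two sequences end up in the same cluster whenever a chain of such pairs connects them. The sketch below shows that idea in its simplest form. It is not the library's code; addLink and singleLinkage are hypothetical names, and the naive list of sets is chosen for clarity rather than speed.

```haskell
module Main where

import Data.List (foldl', partition)
import qualified Data.Set as Set

-- | Add one similarity link: every existing cluster that contains either
-- end point is merged, together with the two end points, into one cluster.
addLink :: Ord a => [Set.Set a] -> (a, a) -> [Set.Set a]
addLink clusters (a, b) = Set.unions (Set.fromList [a, b] : touching) : others
  where
    (touching, others) =
      partition (\c -> Set.member a c || Set.member b c) clusters

-- | Single linkage clusters induced by a list of similarity links.
singleLinkage :: Ord a => [(a, a)] -> [Set.Set a]
singleLinkage = foldl' addLink []

main :: IO ()
main = mapM_ (print . Set.toList) (singleLinkage links)
  where
    -- Illustrative links between made-up sequence identifiers.
    links = [("a", "b"), ("c", "d"), ("b", "c"), ("e", "f")]
```

With many sequences, a union-find structure would be the usual replacement for the list of sets.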

Annotation information

In addition to BLASTX output, there is support for Gene Ontology (GO) data and KEGG.


