The Haskell Bioinformatics Library
This is a collection of data structures and algorithms that most of the other stuff on this site depends on. Much of the functionality is stable, robust, and even well documented; some is less so. I largely apply the itch-scratching software development process (ISSDP), so the current feature set and its level of completeness depend on what I need in my other work. If you don’t like that, I am happy to accept your patches.
Acquiring it
Either:
- Get it from — or browse — the darcs repository: darcs get http://malde.org/~ketil/biohaskell/biolib/
or:
- Download the tarball from Hackage
Building it
To install, you need a working GHC (another Haskell system may also work). You also need the following external libraries:
- QuickCheck — for unit tests
- binary — mainly for dealing with the TwoBit sequence format
- tagsoup — for parsing XML output from Blast
- Parsec — for dealing with various file formats
You should be able to get what you need from http://hackage.haskell.org/.
You can then build with ‘make’, doing either ‘make install’ if you can sudo, or ‘make user_install’ if you cannot. Of course, the Makefile just proxies for the regular Cabal routine, which will work just as well:
runhaskell Setup configure
runhaskell Setup build
sudo runhaskell Setup install
(Add --prefix=$HOME to the configure step, and remove the sudo from the install step, if you don’t want to install as root.)
Using it
The best tutorial is probably my other code (not much, I know). In particular, the cluster_tools package contains a bunch of small, self-contained utilities that can provide a starting point. Apart from that, there’s the code itself, and the Haddock documentation. And of course I’ll try to answer any questions, and help out any way I can. The current list of features includes:
Sequence data
Supporting protein and nucleotide sequences and conversion between them, quality data, reading and writing FASTA and FastQ formatted files, reading TwoBit and PHD formats.
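To give a feel for what reading FASTA involves, here is a minimal sketch that uses only the Prelude. It is not the library's API: the names FastaRecord, parseFasta and readFastaFile are hypothetical and exist only for this illustration. It assumes plain FASTA input where each record starts with a '>' header line followed by one or more sequence lines.

```haskell
-- Minimal FASTA reader using only the Prelude.
-- NOTE: FastaRecord, parseFasta and readFastaFile are hypothetical names
-- for illustration; they are not part of the library.

-- | One FASTA entry: the header (without the leading '>') and the residues.
data FastaRecord = FastaRecord
  { recHeader :: String
  , recSeq    :: String
  } deriving (Eq, Show)

-- | Parse the contents of a FASTA file into records.
-- Blank lines are skipped, anything before the first header is ignored,
-- and the sequence lines of each record are concatenated.
parseFasta :: String -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . filter (not . null) . map stripCR . lines
  where
    isHeader ('>':_) = True
    isHeader _       = False

    -- Tolerate Windows line endings.
    stripCR = filter (/= '\r')

    go []       = []
    go (h:rest) =
      let (body, more) = break isHeader rest
      in FastaRecord (drop 1 h) (concat body) : go more

-- | Read and parse a FASTA file (lazily, via readFile).
readFastaFile :: FilePath -> IO [FastaRecord]
readFastaFile path = fmap parseFasta (readFile path)
```

Loaded into GHCi, something like `fmap (map (length . recSeq)) (readFastaFile "my.fasta")` would list the sequence lengths of a (hypothetical) file my.fasta. A real reader also has to deal with quality data and large inputs, which this sketch ignores.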
Alignments
Rudimentary support for doing alignments, including dynamic adjustment of scores based on sequence quality, and BLAST output parsing.
Support for reading ACE files, as output from e.g., CAP3 and Phrap.
Partly implemented single linkage clustering, and multiple alignment.
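For readers unfamiliar with single linkage clustering, the idea can be sketched as follows: two items end up in the same cluster whenever a chain of sufficiently similar pairs connects them. The code below is a naive illustration and not the library's implementation; the names addLink, singleLinkage and clusterAbove are hypothetical, and the input is assumed to be a list of pairwise links, optionally with a similarity score.

```haskell
import Data.List (foldl', nub, partition, sort)

-- NOTE: illustrative sketch only; these names are hypothetical and
-- are not taken from the library.

-- | Merge one link (a, b) into the current clusters: every cluster that
-- contains a or b is fused into a single cluster holding both.
addLink :: Eq a => [[a]] -> (a, a) -> [[a]]
addLink clusters (a, b) =
  let (touched, untouched) = partition (\c -> a `elem` c || b `elem` c) clusters
  in nub (a : b : concat touched) : untouched

-- | Single linkage clusters from a list of links, returned in sorted
-- order. Items that appear in no link do not show up in the result.
singleLinkage :: Ord a => [(a, a)] -> [[a]]
singleLinkage = sort . map sort . foldl' addLink []

-- | Keep only the links scoring at or above the threshold, then cluster.
clusterAbove :: Ord a => Double -> [(a, a, Double)] -> [[a]]
clusterAbove threshold scored =
  singleLinkage [ (a, b) | (a, b, s) <- scored, s >= threshold ]
```

This version is quadratic in the worst case, which is fine for small inputs; for large data sets a union-find structure would be the usual choice.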
Annotation information
In addition to BLASTX output, there is support for Gene Ontology (GO) data, and KEGG.