The Haskell Bioinformatics Library
This is a collection of data structures and algorithms that most of the other stuff on this site depends on. Much of the functionality is stable, robust, and even well-documented; some is less so. I largely apply the itch-scratching software development process (ISSDP), so the current feature set and its level of completeness depend on what I need in my other work. If you don’t like that, I am happy to accept your patches.
Acquiring it
Either:
- Get it from — or browse — the darcs repository: darcs get http://malde.org/~ketil/biohaskell/biolib/
or:
- Download the tarball from Hackage
Building it
The easiest way to acquire this, or any other, Haskell library is probably to use cabal-install. Get it up and working, and a simple ‘cabal install bio’ should get the latest version from Hackage, including all dependencies, compile everything, and install it in your home directory (under ~/.cabal/lib, I think).
To install manually, you need to acquire a working GHC (or possibly another Haskell system). You also need the following external libraries:
- QuickCheck — for unit tests
- binary — mainly for dealing with the TwoBit sequence format
- tagsoup — for parsing XML output from Blast
- Parsec — for dealing with various file formats
You should be able to get what you need from http://hackage.haskell.org/.
You can then build with ‘make’, doing either ‘make install’ if you can sudo, or ‘make user_install’ if you cannot. Of course, the Makefile just proxies for the regular Cabal routine, which will work just as well:
runhaskell Setup configure
runhaskell Setup build
sudo runhaskell Setup install
(Add --prefix=$HOME to the configure step, and remove the sudo from the install step, if you don’t want to install as root.)
Using it
The best tutorial is probably looking at my other code (not much, I know). In particular, the cluster_tools package contains a bunch of small, self-contained utilities that can provide a starting point. Apart from that, there’s the code itself, and the Haddock documentation. And of course I’ll try to answer any questions, and help out any way I can. The current list of features includes:
Sequence data
Supporting protein and nucleotide sequences and conversion between them, quality data, reading and writing FASTA and FastQ formatted files, reading TwoBit and PHD formats.
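To give a feel for what reading and writing FASTA involves, here is a minimal sketch in plain Haskell. It is not the bio library's API: the names FastaRecord, parseFasta and toFasta are made up for illustration, and it uses String rather than the ByteString types a real implementation would want for large files.

```haskell
import Data.Char (isSpace)

-- | A FASTA record: the header line (without the leading '>') and the
-- sequence data. Hypothetical type, not taken from the bio library.
data FastaRecord = FastaRecord
  { header   :: String
  , residues :: String
  } deriving (Eq, Show)

-- | Parse FASTA text into records. Lines before the first '>' are ignored,
-- and the sequence lines of each record are joined with whitespace removed.
parseFasta :: String -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . lines
  where
    isHeader ('>':_) = True
    isHeader _       = False
    go []       = []
    go (h:rest) =
      let (body, more) = break isHeader rest
      in  FastaRecord (drop 1 h) (filter (not . isSpace) (concat body)) : go more

-- | Render records back to FASTA text, wrapping sequence lines at 60 columns.
toFasta :: [FastaRecord] -> String
toFasta = concatMap render
  where
    render (FastaRecord h s) = unlines (('>' : h) : chunks 60 s)
    chunks _ [] = []
    chunks n xs = let (a, b) = splitAt n xs in a : chunks n b
```

Loaded into GHCi, `parseFasta <$> readFile "some.fasta"` yields the records of a file (the file name is a placeholder), and `putStr (toFasta rs)` writes them back out.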
Alignments
Rudimentary support for doing alignments – including dynamic adjustment of scores based on sequence quality – and BLAST, Bowtie and BLAT output parsing.
Support for reading ACE files, as output from e.g., CAP3 and Phrap.
Partly implemented single linkage clustering, and multiple alignment.
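As an illustration of the kind of dynamic programming an alignment routine rests on, here is a small sketch of a global (Needleman-Wunsch) alignment score. It is not the library's implementation: the function name and the scoring values are assumptions chosen for the example, and it leaves out the quality-based score adjustment mentioned above.

```haskell
import Data.List (foldl')

-- | Hypothetical scoring scheme; the values are illustrative only.
matchScore, mismatchScore, gapPenalty :: Int
matchScore    = 1
mismatchScore = -1
gapPenalty    = -2

-- | Global alignment score (Needleman-Wunsch), computed row by row so that
-- only one row of the dynamic programming table is needed at a time.
globalScore :: String -> String -> Int
globalScore xs ys = last (foldl' nextRow firstRow xs)
  where
    -- Row 0: aligning a prefix of ys against nothing costs one gap per symbol.
    firstRow = scanl (\acc _ -> acc + gapPenalty) 0 ys

    -- Build row i from row i-1. In 'step', 'left' is the cell just computed
    -- in the current row, 'diag' and 'up' come from the previous row.
    nextRow []             _ = []
    nextRow prev@(p0 : ps) x = scanl step (p0 + gapPenalty) (zip3 ys prev ps)
      where
        step left (y, diag, up) =
          maximum [diag + subst x y, up + gapPenalty, left + gapPenalty]

    subst a b
      | a == b    = matchScore
      | otherwise = mismatchScore
```

With these scores, `globalScore "ACGT" "AGT"` evaluates to 1: three matches and one gap. Recovering the alignment itself, rather than just its score, would require keeping the full table or traceback pointers.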
Annotation information
In addition to BLASTX output, there is support for Gene Ontology (GO) data, and KEGG.
[...] to be different. Could it be bloomfilter not being a good consumer for the generated words? The bio library messing up FASTA parsing? Something else [...]
While trying to install your bio library I ran into the following circular dependency problem:
cabal install bio
Resolving dependencies…
cabal: cannot configure tagsoup-0.8. It requires QuickCheck ==2.1.*
For the dependency on QuickCheck ==2.1.* there are these packages:
QuickCheck-2.1, QuickCheck-2.1.0.1, QuickCheck-2.1.0.2 and QuickCheck-2.1.0.3.
However none of them are available.
QuickCheck-2.1 was excluded because bio-0.4 requires QuickCheck <2
Is there anything to be done to fix this?
j131
The second way of installing it, i.e. getting the source through darcs and manually installing the dependencies, does not work either:
Bio/Alignment/BlastXML.hs:41:12:
`Tag' is not applied to enough type arguments
Expected kind `*', but `Tag' has kind `* -> *'
In the type signature for `getFrom':
getFrom :: [Tag] -> String -> String
Could you please state clearly which combination of compiler and libraries is recommended?
I’ll look into this, but I think something changed in tagsoup between 0.6 and 0.8, and sloppiness on dependency versioning on my part makes this break. This is how ‘./Setup.hs configure -v’ on my system sees the world:
Configuring bio-0.4.4...
Dependency QuickCheck >=2: using QuickCheck-2.1.0.2
Dependency array -any: using array-0.3.0.0
Dependency base ==4.*: using base-4.2.0.0
Dependency binary -any: using binary-0.5.0.2
Dependency bytestring >=0.9.1: using bytestring-0.9.1.5
Dependency containers -any: using containers-0.3.0.0
Dependency mtl -any: using mtl-1.1.0.2
Dependency old-time -any: using old-time-1.0.0.3
Dependency parallel -any: using parallel-1.1.0.1
Dependency parsec -any: using parsec-2.1.0.1
Dependency random -any: using random-1.0.0.2
Dependency tagsoup >=0.4: using tagsoup-0.6
Hi j131,
you need to download and install tagsoup 0.4.
Download it from here:
http://hackage.haskell.org/package/tagsoup-0.4
Then install it using:
runhaskell Setup configure
(or: runhaskell Setup configure --user)
runhaskell Setup build
runhaskell Setup install
Next, download the bio-0.4 source tarball and unpack it.
Back up bio.cabal, then edit the following line:
Build-Depends: base>=3 && <[...], QuickCheck>=1.2.0.0, binary, tagsoup<=0.4, bytestring >= 0.9.1, containers, array,
parallel, parsec, random, old-time, mtl
I changed QuickCheck to >=1.2 and tagsoup to <=0.4 [...]
Dependency QuickCheck >=1.2.0.0: using QuickCheck-2.1.0.3
Dependency tagsoup <=0.4: using tagsoup-0.4
OK, good. Then build the biohaskell library using:
runhaskell Setup configure --user
runhaskell Setup build
Oops, you will see an error:
[18 of 43] Compiling Bio.Sequence.TwoBit ( Bio/Sequence/TwoBit.hs, dist/build/Bio/Sequence/TwoBit.o )
Bio/Sequence/TwoBit.hs:37:31:
Module `Test.QuickCheck' does not export `check'
To fix the error, open the Haskell source file Bio/Sequence/TwoBit.hs and go to line 37 in vi, emacs, or your editor of choice.
By default you will see:
import Test.QuickCheck hiding (check) -- QC 1.0
--import Test.QuickCheck hiding ((.&.)) -- QC 2.0
Please change it like this:
-- import Test.QuickCheck hiding (check) -- QC 1.0
import Test.QuickCheck hiding ((.&.)) -- QC 2.0
This is because I am using QuickCheck 2.
OK, good. Try to build once again:
runhaskell Setup build
Oops, another error:
[35 of 43] Compiling Bio.Util.TestBase ( Bio/Util/TestBase.hs, dist/build/Bio/Util/TestBase.o )
Bio/Util/TestBase.hs:81:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:85:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:90:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:98:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:105:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:109:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:117:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:125:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
Bio/Util/TestBase.hs:132:4:
`coarbitrary' is not a (visible) method of class `Arbitrary'
You will need to edit bio.cabal: find Bio.Util.TestBase, delete it, then save the file.
runhaskell Setup install
Installing library in $HOME/.cabal/lib/bio-0.4/ghc-6.12.1
Registering bio-0.4…
ls $HOME/.cabal/lib/bio-0.4/ghc-6.12.1/
Bio HSbio-0.4.o libHSbio-0.4.a
Ready, that's all.
ketil, what's your opinion?
Ciao
Hi hackob, and thanks for the extensive walk-through.
There are two issues you are running into:
1) the interface changed between tagsoup 0.6 and 0.8
2) QuickCheck 2 is different from QC 1
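To make the first point concrete: the kind error quoted earlier (`Tag' has kind `* -> *') means that Tag now takes a type argument, so a signature written as [Tag] has to name the string type. The sketch below uses a local stand-in type with a simplified, hypothetical subset of constructors instead of importing tagsoup, and textAfter is an invented helper, not the getFrom function from BlastXML.hs.

```haskell
-- Stand-in for a tag type parameterised over its string type.
-- tagsoup itself is not imported; this is a simplified assumption.
data Tag str
  = TagOpen str [(str, str)]   -- tag name and attributes
  | TagClose str
  | TagText str
  deriving (Eq, Show)

-- A signature in the old style is rejected once Tag takes an argument:
--   textAfter :: [Tag] -> String -> String
-- Applying Tag to a concrete string type gives a well-kinded signature.

-- | Text immediately following the first opening tag with the given name,
-- or the empty string if there is none. Hypothetical helper.
textAfter :: [Tag String] -> String -> String
textAfter tags name =
  case dropWhile (not . opens) tags of
    (_ : TagText t : _) -> t
    _                   -> ""
  where
    opens (TagOpen n _) = n == name
    opens _             = False
```

The same one-word change, [Tag] to [Tag String], is the sort of edit needed at each affected signature; supporting both old and new tagsoup from one source tree takes more than that.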
In the current biolib (i.e. my darcs repo), the dependencies look like:
| Build-Depends: base>=4 && <5, QuickCheck>=2, binary==0.4.*, tagsoup>=0.4 && <0.8, bytestring >= 0.9.1,
| containers, array, parallel, parsec, random, old-time, mtl
I have also received a patch for biolib to work with both old and new tagsoup, which would obviate the need for the upper limit on tagsoup versions.
I think the biggest problem here is that I haven’t kept up on Hackage; the 0.4 version is quite old. I’ve rolled out a new tarball (bio-0.4.4 at Hackage). You can usually get updated sources from “darcs get http://malde.org/biohaskell/biolib”
-k