Can you spare five minutes?

by Ketil Malde; May 9, 2008

…to write a simple, but useful and efficient bioinformatics program? Here’s how to build a simple tool to extract a clustering from an ACE file, using functionality in the bioinformatics library. It is about three lines of actual code, it is fast, and it is efficient.

Background

Sequences are often just fragments of the real sequence (gene transcript or chromosome or what have you), so they need to be assembled. Assembly usually involves creating an overlap graph and calculating a Hamiltonian path through it - i.e. it is an NP-complete problem. In short, fast or accurate, pick one. In order to reduce the complexity, this is done in a two-stage process, first a clustering just grouping sequences with enough similarity, and then the real assembly, which merges sequences into assembled contigs. This allows you to use the more accurate, but slower CAP3, instead of the fast-and-loose Phrap (Liang et al, 2000).

Now, TGICL which I’ve used a bit, does both clustering and assembly, and reports the clustering in a separate file. This tells you which cluster each sequence belongs to, but often the assembler decides that two sequences are similar, but yet belong in different assemblies. I would therefore like to know which sequences comprise each assembly; in other words, I want the more fine-grained clustering provided by the more accurate assembler.

CAP3 produces a more or less industry-standard ACE file as part of its output. This file contains a lot of information about the assembly, including contigs, contig quality, sequences (called reads) comprising each contig, which pieces of each read is being used, and which piece is discarded - and so on. Crucial to us at this point is only the connection between each contig and its set of reads.

The Program

…looks like this:

module Main where
import Bio.Alignment.ACE
import Bio.Sequence
import System.SimpleArgs (getArgs)

The heavy lifting is done by the Bio.Alignment.ACE module, and I’m using my SimpleArgs module to, well, simplify argument handling and improve error reporting. Now for the meat of the program:

main :: IO ()
main = getArgs >>= readACE >>= putStrLn . unlines . map (unlines . map decluster)
decluster :: Assembly -> String
decluster a = ">"++(label $ fst $ contig a)++"n"
        ++ unwords (map (label . thrd) $ fragments a)
    where
        thrd (_,_,s,_) = s
	    label = toStr . seqlabel
	

The main function is implemented as a simple pipeline (the >>= corresponds roughly to the Unix pipe), and the slightly ugly repeated maps and unlines is due to readACE returning [[Assembly]] — since an ACE file may contain multiple assemblies. The decluster function just formats output, each contig name is prefixed with a ‘>’, and the next line contains the name of sequences comprising the contig.

Edit: The use of the two unlines cause blank lines between the set of contigs from each cluster. While there is something to be said for this format aesthetically, it wreaks havoc with clusterlibs, another program in the cluster-tools package. Therefore, the current code in the repository now has replaced the unliness with concats and instead adds a “\n” to the second output line as well as the first. Also, the putStrLn is replaced by putStr to avoid an empty line at the end of the output.

Testing it in practice

As a brief test, I ran it on the latest assembly of our Gadus morhua sequences, also known as cod. Ignoring singletons, about 16 000 sequences are assembled into about 30Mbytes worth of ACE files. Processing this takes almost 3 seconds on a 2.3GHz Core2 machine. Compared to the rest of the processing pipeline, which takes about 300 CPU-hours, I can live with this.

The ACE parser (in the library) uses Parsec, but reads the data as Lazy ByteStrings, and uses a separate tokenizer. I believe this is a lot faster than letting Parsec do the tokenizing on the inefficient Strings (list-of-Chars), and while the ACE format may be considered non-trivial, it’s not very complicated either. Can you spare five minutes?

comments powered by Disqus
Feedback? Please email ketil@malde.org.