A set of tools for working with 454 sequences
Pyrosequencing is often referred to as next-generation sequencing (although it would be increasingly more accurate to refer to traditional Sanger sequencing as previous-generation sequencing) as it produces large amounts of sequences at lower costs. As the technology is radically different, so are the type of data that results from it, and while it is possible to use many of the same software tools for working with new sequences, there is a clear need for specific ones as well.
The Haskell bioinformatics library has for some time now supported reading and writing the SFF format, which is used by the oldest (previousest?) of the next generation technologies, namely Roche’s 454 sequencing. Once the library functionality is in place, it is easy to develop small tools for doing the various chores. After spending some time in anticipation of the hordes of programmers no doubt rushing to exploit the monumental effort put I down in the library, I’ve instead written a few programs myself, including tools information/statistics extraction (flower), extracting sequences by various criteria (fselect), simulating sequencing (pyrosim), and repairing broken SFF files (frepair). This is their story.
The bioinformatics library includes functions and data structures to read and write and extract data from SFF files. It provides reasonable performance (meaning that in most cases, my disk is the limiting factor). It is also a clean room implementation, built from official documentation, but not based on Roche’s (or anybody else, for that matter) code. It comes with an LGPL license, which I hope will make it useful while encouraging back contributions.
Flower
This program can extract various information from SFF files. To some extent, it overlaps with ’sffinfo’ and ’sfffile’ from 454, but in addition to generating Fasta sequences (optionally with quality), it can also extract directly to the more compact FastQ format. It also can output a (huge) table of flow values, the histogram of flow values (useful for estimating the flow distributions, and thus probabilities for the base calls), or a one-line summary of each read that includes some statistics on lenghts and quality. Here’s the usage info:
flower: Usage: flower -[f|q|r|R] <file.sff> [<file2.sff> ..]
-r output reads in Fasta format
-R output reads in Fasta format with associated .qual
(generates files instead of writing to <stdout>)
-q output in FastQ format
-f output the flowgram in tabular format
-h output a histogram table of flow values
-s output a summary of each read
FSelect
FSelect takes an SFF file and produces a new SFF file containing a subset of the sequences, using the same statistics as Flower can output. It has a small expression language built in, so that you can build more complex logical queries. For instance, if you want to extract the sequences with lenght between 300 and 400, and with a K² score greater than 0.7, you could do
fselect "And (Func LT k2 0.7) (And (Func GT len 200) (Func LT len 400))" FL61AHU01.sff
Okay, it’s a bit clunky, but the syntax should be reasonably straightforward: Logical operators are And, Not, and Or, while Func op f v defines a function using op (either LT or GT) to compare f (one of k2, len, tlen, ncount) to each read. Output is generated in a file named “selected.sff”.
FSelect can also select random reads, using the select expression "Rand p", where p is the probability for selection. (I.e. fselect "Rand 0.2" FL61AHU01.sff will select each read with a probability of 0.20, giving you approximately 20% of the reads. Random selection can not be combined with other criteria at this point, if you want this, you’ll have to run fselect multiple times.
FRecover
We had some issues with broken SFF files, specifically there were block of zero bytes at random places. Both flower and sffinfo just terminate on encountering a broken read, so I implemented a simple utility that attempts to skip the broken block and continue extracting good reads beyond the trouble.
PyroSim
This attempts to simulate pyrosequencing, for now it only does the GS20 generation of 454 sequencing, but the other generations should be easy to add. The main problem is that the algorithm for quality calling is insufficiently documented. GS20 has the advantage that quality for a homopolymer is uniquely derived from the flow value (modulo rounding), so reverse engineering it is fairly straightforward.
Anyway, pyrosim takes a ‘generation’ (at the moment, this is only GS20) and a Fasta file as input parameters, picks random points in the Fasta file, and produces the correspondig flowgram, including a suitable perturbation of the values to introduce the expected measure of noise, calls the bases and quality, and outputs an SFF file.
Availability
Flower, FSelect and FRecover are part of the “flower” package, available from the darcs archive at http://malde.org/~ketil/biohaskell/flower
PyroSim is available separately (for now, at least) from the darcs archive at http://malde.org/~ketil/biohaskell/pyrosim
I try to upload what I consider stable versions to HackageDB, please check the Bioinformatics category there. Currently, these programs are in a bit of flux, so going with the darcs repo is probably a good idea at this point.
I’d really like to have packages for the most common Linux distributions as well (i.e. .debs and .rpms), but I don’t know the details of how to produce them, and while I’ve made half-hearted attempts in the past, I guess I just don’t really desire it enough. I’d be happy to see somebody package it up, so if you know how, please go ahead.
Leave a Reply
You must be logged in to post a comment.