Flower – analysing 454 flowgram files
Flower is a simple tool for working with 454 sequences. As you may know, 454 is (still) new and fashionable technology, able to produce much larger amounts of sequence data than the old and venerable Sanger sequencing we all know and love. The core of a 454 sequence is the flowgram, really a series of light intensities caused by the pyrosequencing reactions. The Bio library can read these SFF files, and flower is just a simple interface to the library, able to produce a variety of outputs. It’s in continuous development, so any copy you got from somewhere else than the darcs repository is likely outdate.
Here’s the usage, as printed by flower itself:
flower: Usage: flower -[f|q|r|R] <file.sff> [<file2.sff> ..]
-r output reads in Fasta format
-R output reads in Fasta format with associated .qual
(generates files instead of writing to <stdout>)
-q output in FastQ format
-f output the flowgram in tabular format
-h output a histogram table of flow values
-s output a summary of each read
Let’s look at the various options in detail.
The -r option simply outputs, to stdout, the sequences in the traditional Fasta format. This actually ignores the flowgram data, using the sequence data stored in the SFF file. Nice if you want to BLAST or do other analysis on the sequences.
With -R, you get the Fasta sequence, and also the associated quality information. The -R option is the only one that doesn’t output to stdout, since it needs to generate two files for each input file. If the input is “foo.sff“, the files generated are “foo.sff.fasta” and “foo.sff.qual“. I’ll leave it to you to find out which contains what data.
The -q option outputs the FastQ format, which contains both sequence and quality information in the same file. This is often more convenient than working with separate files, and I suspect it will be a more popular approach in the future.
The -f option outputs the flowgram, one flow value per line, so expect the output to be sizable. This is actually the raison d’être for flower, since we use this to analyze the quality of the reads.
With -h, you get a histogram of flow values. If you read up on 454 sequencing, you’ve probably seen the curves with peaks around each integral value, and valleys between. If not, connect this to gnuplot, and you can make your own.
Finally, -s gives a summary of each read, with name, length, and an overall quality metric. Here’s an example output:
# name........ date...... time.... reg trim_l trim_r x_loc y_loc len K2 trimK2 ncount avgQ travgQ f:sgnt f:q20 FL61AHU01A9CLR 2008-12-18 17:24:07 01 5 48 399 1661 340 0.63 0.74 0 18.92 33.25 341 319 FL61AHU01AO2VZ 2008-12-18 17:24:07 01 5 192 168 2125 321 0.56 0.66 0 22.09 34.19 231 193 FL61AHU01DK0AY 2008-12-18 17:24:07 01 5 96 1352 1336 357 0.62 0.73 0 16.52 35.13 263 225 FL61AHU01DF4AV 2008-12-18 17:24:07 01 5 59 1296 2613 255 0.63 0.76 0 16.83 32.58 234 64 FL61AHU01BC4F2 2008-12-18 17:24:07 01 5 125 442 1584 249 0.65 0.62 0 21.25 27.85 249 150 FL61AHU01DAJ6X 2008-12-18 17:24:07 01 5 195 1233 1319 298 0.69 0.67 1 24.98 38.25 196 199
The columns are, in sequence: the name of each read, then date, time, region as extracted from the read name (or question marks if this extraction fails), the trimming positions as encoded in the SFF file, location of the read on the sequencing plate (also encoded in the name), full lenght of the read. Then follows the “K-squared” measure for untrimmed and trimmed read, the count of N base calls, and average called quality, again for both the untrimmed and the trimmed read. The two final columns are produced by my current development version (not published yet), and are the values of two of the 454 trimming filters.
More options
There’s also a “secret” option, -v, that outputs the number of reads as they are processed. Since SFFs are quite large — typically containg hundreds of thousands reads, and processing (on my 1.3GHz laptop) at about 800 reads/second — running time can be a while, and it can be nice to see that something is happening. Also, this makes it easier to pinpoint any troublesome sequence that causes flower to abort execution. (I’ve seen some SFF files containing blocks of zeros, for instance).
[...] program appears to be used a bit, I’ve done a quick writeup of its features and use as a static page. I’ll try to keep it updated as things progress. Popularity – [...]