Biohaskell

bioinformatics and haskell

23  02 2009

Flower – analysing 454 flowgram files

Flower is a simple tool for working with 454 sequences.  As you may know, 454 is (still) new and fashionable  technology, able to produce much larger amounts of sequence data than the old and venerable Sanger sequencing we all know and love.  The core of a 454 sequence is the flowgram, really a series of light intensities caused by the pyrosequencing reactions.  The Bio library can read these SFF files, and flower is just a simple interface to the library, able to produce a variety of outputs.  It’s in continuous development, so any copy you got from somewhere else than the darcs repository is likely outdate.

Here’s the usage, as printed by flower itself:

 flower: Usage: flower -[f|q|r|R] <file.sff> [<file2.sff> ..]
        -r  output reads in Fasta format
        -R  output reads in Fasta format with associated .qual
            (generates files instead of writing to <stdout>)
        -q  output in FastQ format
        -f  output the flowgram in tabular format
        -h  output a histogram table of flow values
        -s  output a summary of each read

Let’s look at the various options in detail.

The -r option simply outputs, to stdout, the sequences in the traditional Fasta format. This actually ignores the flowgram data, using the sequence data stored in the SFF file.  Nice if you want to BLAST or do other analysis on the sequences.

With -R, you get the Fasta sequence, and also the associated quality information.  The -R option is the only one that doesn’t output to stdout, since it needs to generate two files for each input file.  If the input is “foo.sff“, the files generated are “foo.sff.fasta” and “foo.sff.qual“.  I’ll leave it to you to find out which contains what data.

The -q option outputs the FastQ format, which contains both sequence and quality information in the same file.  This is often more convenient than working with separate files, and I suspect it will be a more popular approach in the future.

The -f option outputs the flowgram, one flow value per line, so expect the output to be sizable.  This is actually the raison d’être for flower, since we use this to analyze the quality of the reads.

With -h, you get a histogram of flow values.  If you read up on 454 sequencing, you’ve probably seen the curves with peaks around each integral value, and valleys between.  If not, connect this to gnuplot, and you can make your own.

Finally, -s gives a summary of each read, with name, length, and an overall quality metric.  Here’s an example output:

# name........  date......      time....        reg     trim_l  trim_r  x_loc   y_loc   len     K2      trimK2  ncount  avgQ    travgQ  f:sgnt  f:q20
FL61AHU01A9CLR  2008-12-18      17:24:07        01      5       48      399     1661    340     0.63    0.74    0       18.92   33.25   341     319
FL61AHU01AO2VZ  2008-12-18      17:24:07        01      5       192     168     2125    321     0.56    0.66    0       22.09   34.19   231     193
FL61AHU01DK0AY  2008-12-18      17:24:07        01      5       96      1352    1336    357     0.62    0.73    0       16.52   35.13   263     225
FL61AHU01DF4AV  2008-12-18      17:24:07        01      5       59      1296    2613    255     0.63    0.76    0       16.83   32.58   234     64
FL61AHU01BC4F2  2008-12-18      17:24:07        01      5       125     442     1584    249     0.65    0.62    0       21.25   27.85   249     150
FL61AHU01DAJ6X  2008-12-18      17:24:07        01      5       195     1233    1319    298     0.69    0.67    1       24.98   38.25   196     199

The columns are, in sequence: the name of each read, then date, time, region as extracted from the read name (or question marks if this extraction fails), the trimming positions as encoded in the SFF file, location of the read on the sequencing plate (also encoded in the name), full lenght of the read.  Then follows the “K-squared” measure for untrimmed and trimmed read, the count of N base calls, and average called quality, again for both the untrimmed and the trimmed read.  The two final columns are produced by my current development version (not published yet), and are the values of two of the 454 trimming filters.

More options

There’s also a “secret” option, -v, that outputs the number of reads as they are processed.  Since SFFs are quite large — typically containg hundreds of thousands reads, and processing (on my 1.3GHz laptop) at about 800 reads/second — running time can be a while, and it can be nice to see that something is happening.  Also, this makes it easier to pinpoint any troublesome sequence that causes flower to abort execution.  (I’ve seen some SFF files containing blocks of zeros, for instance).

One Response to “Flower – analysing 454 flowgram files”

  1. [...] program appears to be used a bit, I’ve done a quick writeup of its features and use as a static page.  I’ll try to keep it updated as things progress.  Popularity – [...]