Recent developments - Flower and Bamstats

by Ketil; January 6, 2012

Extracting statistics from paired alignments

If you do Illumina sequencing, you’re probably getting paired reads, i.e. you get two reads from each molecule, one from each end. This is usually pretty reliable, but for more complicated mate pair protocols, there’s often chimeric sequences where the two members of a pair actually come from different molecules, and sometimes the protocol fails for other reasons. In any case, it might be useful to look closer at the result when you map the reads to their reference.

To simplify this, I’ve uploaded a small tool, bamstats to Hackage. It uses Nick Ingolia’s samtools wrapper to extract relevant information from BAM files, and then summarizes it. Here are the examples:

Insert size statistics:

  % bam stats -n 1000000 test.bam
  Alignment                count     prop    mean   stdev    skew    kurt
  innies                  484874  96.97%    365.3    24.1    -3.9    34.9
  outies                     141   0.03%    112.8   217.3     8.0    62.8
  lefties                      0   0.00%      NaN     NaN     NaN     NaN
  righties                     0   0.00%      NaN     NaN     NaN     NaN
  Total reads:    1000000

Histogram of insert sizes:

  % bam hist -n 1000000 -b 6 -m 600 test.bam
  Alignment                count    prop      100     200     300     400     500     >
  innies                  484874  96.97%        0    2204    5136  471596    5847      91
  outies                     141   0.03%      138       1       0       0       0       2
  lefties                      0   0.00%        0       0       0       0       0       0
  righties                     0   0.00%        0       0       0       0       0       0
  Total reads:    1000000

As usual, the --help option will tell you about the available tunables.

Flower update

For 454 sequences (and, I guess, for Ion torrent?), it is often informative to look at the raw data (or flow values) in the SFF files. Flower, which now ships as a part of the biosff library, is one option for doing this. The output option -h produces a table of flow values by nucleotide. The latest release adds -H, which produces a table of flow values by flow position. This can e.g. be used to build empirical distributions for use with flowsim.

One way to plot the output is to run

flower -h flows.dat input.sff

And then, with gnuplot do:

gnuplot> set view map
gnuplot> set logscale cb
gnuplot> splot "flows.dat" matrix with image

comments powered by Disqus

Feedback? Please email ketil@malde.org.