From alignments to sequence clustering

by Ketil; March 27, 2012

Sequence clustering is often (usually?) based on alignments. My xsact clusterer works this way, for instance, and the EST clustering pipeline TGICL incorporates BLAST. However, it strikes me as useful to be able to run the aligner separately, both because it is usually the most time-consuming part of the pipeline, and because it makes it easier to tweak the parameters.

GACT - The General Alignment Clustering Tool

Yes, I too can haz a nucleotide acronym! GACT reads a set of sequence alignments (currently only BLAT’s PSL format is supported, but I intend to add BLAST tabular and possibly others), and clusters the sequences that are matched to each other.

Currently, it supports brief output (where each line represents a cluster, and consists of the labels of the sequences in the cluster) and log output (where each cluster line is followed by the set of alignments forming the cluster).
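To make the clustering step concrete, here is a minimal sketch in Python. It is not GACT's actual code (GACT is a Haskell program), and it assumes that "matched to each other" means single-linkage clustering: any alignment between two sequences puts them in the same cluster. The function names are made up for illustration; only the PSL column positions (qName in column 10, tName in column 14) come from the format itself.

```python
from collections import defaultdict

def psl_pairs(lines):
    """Yield (qName, tName) from PSL data lines, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A PSL data line has 21 tab-separated columns, the first numeric.
        if len(fields) == 21 and fields[0].isdigit():
            yield fields[9], fields[13]

def cluster_pairs(pairs):
    """Single-linkage clustering with union-find (an assumption about
    how clusters are formed, not a description of GACT's internals)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for q, t in pairs:
        root_q, root_t = find(q), find(t)
        if root_q != root_t:
            parent[root_q] = root_t

    clusters = defaultdict(list)
    for name in list(parent):
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())

def brief_output(clusters):
    """One line per cluster: the labels of its sequences."""
    return [" ".join(members) for members in clusters]
```

Feeding the lines of a PSL file through psl_pairs, cluster_pairs and brief_output in turn gives something shaped like the brief output described above.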

Supporting tools

Most aligners are rather “blind”, and in many cases only a subset of the alignments they produce is useful. PSL support is implemented in the biopsl library, which comes with a couple of tools to help extract what you want.

For instance, for EST clustering, you want the alignments to represent overlaps: if the end of sequence A matches the beginning of sequence B, that’s good, but if they have a short match in the middle, you’re probably not that interested. The pslfilter command lets you select alignments based on percent identity of the match, percent of the query or target sequence covered, and overhang (i.e. the shortest distance from the match to the end of a sequence).
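The three criteria can be sketched roughly as follows in Python. This is an illustration of the idea, not pslfilter's implementation: the exact formulas, the way the per-sequence overhangs are combined (here, the larger of the two), and the thresholds in keep are all assumptions, and the names are hypothetical.

```python
from typing import NamedTuple

class Psl(NamedTuple):
    """The subset of PSL columns needed for filtering."""
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_size: int
    t_start: int
    t_end: int

def parse_psl(line):
    """Parse one 21-column PSL data line into a Psl record."""
    f = line.rstrip("\n").split("\t")
    return Psl(int(f[0]), int(f[1]), int(f[2]),
               f[9], int(f[10]), int(f[11]), int(f[12]),
               f[13], int(f[14]), int(f[15]), int(f[16]))

def identity(p):
    """Percent identity over aligned bases (assumed formula)."""
    aligned = p.matches + p.mismatches + p.rep_matches
    return 100.0 * (p.matches + p.rep_matches) / aligned if aligned else 0.0

def query_coverage(p):
    """Percent of the query spanned by the alignment."""
    return 100.0 * (p.q_end - p.q_start) / p.q_size

def overhang(p):
    """Shortest distance from the match to a sequence end, taking the
    worse (larger) of the query and target values (an assumption)."""
    q = min(p.q_start, p.q_size - p.q_end)
    t = min(p.t_start, p.t_size - p.t_end)
    return max(q, t)

def keep(p, min_identity=95.0, max_overhang=10):
    """Hypothetical thresholds for overlap-style alignments."""
    return identity(p) >= min_identity and overhang(p) <= max_overhang
```

With these definitions, an alignment between the last 100 bases of A and the first 100 bases of B has overhang 0 and is kept, while a 100-base match in the middle of both sequences is rejected. Note that combining the two overhangs with max also rejects a short sequence contained entirely within a longer one, which may or may not be what you want.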

In other cases, you just want to cluster by best match. There’s an app…uh, a tool for that, too: psluniq.
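A best-match selection could look something like the Python sketch below. How psluniq actually defines "best" is not stated here, so this assumes the alignment with the most matching bases (the first PSL column) wins for each query, and that a sequence aligned to itself should be ignored. The function name is hypothetical.

```python
def best_hits(lines):
    """Keep one PSL line per query: the one with the highest match count.

    Assumptions: 'best' means most matching bases, ties go to the first
    alignment seen, and self-alignments are skipped.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Skip headers and anything that is not a 21-column data line.
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        query, target, score = fields[9], fields[13], int(fields[0])
        if query == target:
            continue
        if query not in best or score > best[query][0]:
            best[query] = (score, line)
    return [line for _, line in best.values()]
```

The surviving lines are still valid PSL, so they can be passed on to a clusterer unchanged.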

Availability

Available on Hackage, or from the darcs repositories.

Feedback? Please email ketil@malde.org.