Cleaning up sequences

by Ketil Malde; May 8, 2008

The first challenge when dealing with sequence data is removing vector, contaminants, and other undesirable stuff. I’ve been somewhat unhappy with the current state of my EST pipeline, so I investigated more closely what is going on. First, let’s review:

The EST sequencing process

  1. mRNA is extracted from a sample, and a primer-linker is attached. Since mRNAs contain a poly-A tail (a string of As at the end), the primer contains a string of Ts that will hybridize to the poly-A tail, and the short double-stranded end is ready to initiate the duplication to cDNA.

  2. Reverse transcriptase makes a reverse-complemented cDNA copy of the mRNA. This process will go on for some length, but not necessarily to the end of the mRNA transcript - the resulting cDNA strand will thus often start some distance from the beginning of the mRNA.

  3. The RNA is removed, and the cDNA methylated to make it more stable.

  4. Polymerase duplicates the cDNA strand, making it double stranded.

  5. Adapters are attached to each end of the double-stranded DNA segment. These are short sequences that are designed to match one of the vector’s (see below) cloning sites, fixing the DNA segment as an insert in the vector.

  6. The primer-linker contains a target for a restriction enzyme. This enzyme now chops off part of the primer-linker, discarding the adapter at that end, and revealing the sequence that will ligate (bind) to the other cloning site in the vector.

  7. The vector is a short, circular DNA sequence (plasmid or phagemid) that will work like a small chromosome when inserted into a suitable host, typically a bacterium like E. coli. Our DNA segment is now ligated to the vector at both ends, and the whole thing is duplicated by growing a bacterial colony.

At this point, there have been a number of opportunities for mishaps. The primer-linker can ligate to something other than a poly-A tail (1), reverse transcription can be (and often is) cut short (2), cDNA can be incompletely methylated, making it a possible target for restriction enzymes that will chop it up (3), adapters can ligate with each other, forming chimeric sequences (5), and the primer-linker can avoid being cut, so that it is retained in the sequence (6). The bacteria can assimilate two vectors, or none, the insert can end up in the wrong direction, and a vector can possibly acquire an insert from bacterial DNA (7). And there’s probably more.
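One of these mishaps, an insert ending up in the wrong direction, can sometimes be spotted from the read alone: a read of the sense strand should contain a run of As from the poly-A tail, while a reverse-complemented read contains a run of Ts instead. The following is only a minimal sketch of that idea. The function name and the threshold of 12 bases are my own assumptions, not something taken from any of the tools discussed here.

```python
import re

def guess_orientation(seq, min_run=12):
    """Guess insert orientation from a poly-A or poly-T run.

    Hypothetical helper: min_run is an arbitrary threshold, not a
    value used by any particular tool.
    """
    s = seq.upper()
    has_a = re.search("A{%d,}" % min_run, s) is not None
    has_t = re.search("T{%d,}" % min_run, s) is not None
    if has_a and has_t:
        return "ambiguous"   # both runs present: possibly chimeric
    if has_a:
        return "forward"     # poly-A tail seen as As
    if has_t:
        return "reverse"     # poly-A tail seen as its complement
    return "unknown"         # read did not reach the tail
```

A read that stops before the tail gives "unknown", so this can only ever be a hint, not a verdict.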

Sequences and consequences

What this means is that in addition to actual mRNA sequence, we will usually get sequences containing vector, and quite often primer, linker, adapter, or E. coli sequence as well. This needs to be identified and discarded (masked or trimmed off) before further analysis.
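The difference between masking and trimming matters later on, so here is a small sketch of the two operations. It assumes the unwanted regions are already known as half-open (start, end) intervals; how those hits are found is not shown, and the function names are made up for this illustration.

```python
def mask(seq, hits):
    """Replace every hit interval [start, end) with Xs; length is preserved."""
    chars = list(seq)
    for start, end in hits:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "X"
    return "".join(chars)

def trim(seq, hits):
    """Cut away hits that touch either end of the sequence.

    Hits in the interior are left alone, since cutting them out would
    join two pieces that were never adjacent.
    """
    left, right = 0, len(seq)
    changed = True
    while changed and left < right:
        changed = False
        for start, end in hits:
            if start <= left < end:       # hit covers the current left edge
                left, changed = min(end, right), True
            if start < right <= end:      # hit covers the current right edge
                right, changed = max(start, left), True
    return seq[left:right]
```

Masking keeps the coordinates of the original read, which makes results from different programs easy to compare. Trimming loses them unless the cut points are recorded separately.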

I’ve been experimenting with the following programs: SeqClean, by TIGR, and Lucy, by Chou and Holmes (2001).


SeqClean

Pros: Masks against a database of vectors and a database of contaminants. Masks low complexity.

Cons: Does not take into account sequence quality. Inherent model expects vector in the first and second third of the sequence, sometimes failing to trim vector, and sometimes trimming valid sequence. Trims sequences instead of marking them, making it harder to compare results.


Lucy

Pros: Uses knowledge of vector splice sites, giving it the opportunity for more accurate masking and differentiation between vector and spurious matches. Takes quality into account.

Cons: Very picky about vector input, and in particular splice site sequences must be in the correct orientation and sequence. A bug makes it fail on long FASTA headers (although I wrote a fix for that). Can only deal with a single vector at a time, meaning you need to know for each sequence the vector used.

I’ve noticed that even after masking with either program, some vector remains. And, it turns out, I’m not the only one: my buddies at CBU have had the same problem, and there’s also a paper (Chen et al., 2006). Unfortunately, the remedy suggested there didn’t work for our sequences, and in fact it only seems to help for one particular vector type, one that we don’t use.

Currently, I use a rather ugly hack: I mask with RBR, first against E. coli contamination with high stringency, and then against vector with more lenient settings. Finally, I use SeqClean to chop off the masked-out bits, as RBR only Xes out the offending parts but doesn’t trim, and TGICL doesn’t deal gracefully with masked but untrimmed sequences… but that’s another bug.
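For illustration, the final chopping step could look roughly like the sketch below, which keeps the longest stretch of a read that contains no mask character. This is not what SeqClean actually does internally; it is just one simple policy for turning an X-masked read into a trimmed one, and the function name is hypothetical.

```python
import re

def longest_unmasked(seq, mask_char="X"):
    """Return (start, end, subsequence) for the longest run without mask_char.

    Coordinates are half-open and refer to the masked input, so the
    trimmed piece can still be located in the original read.
    Note that only the exact mask_char is recognised (no lowercase x).
    """
    best = (0, 0, "")
    for m in re.finditer("[^%s]+" % re.escape(mask_char), seq):
        if m.end() - m.start() > best[1] - best[0]:
            best = (m.start(), m.end(), m.group())
    return best
```

For a read that is masked end to end, the result is an empty subsequence, which a pipeline would presumably want to drop rather than pass on.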

The plan?

It should be relatively easy to write a tool that combines the strengths of Lucy and SeqClean, while avoiding the weaknesses. A useful tool could use BLAST to find all vectors, know about vector-adapter-linker-primer combinations, and identify them. It could pinpoint sequences that do not follow the typical vector-linker-sequence-poly-A-linker-vector pattern, and perhaps identify chimeric sequences and other artifacts.
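To make the pattern check a bit more concrete, here is a rough sketch. It assumes some earlier step (BLAST hits, say) has already labelled the regions of a read in left-to-right order. The labels, the one-letter codes, and the function name are all invented for this example, and the pattern is deliberately forgiving, since a read may not span the whole insert.

```python
import re

# Hypothetical one-letter codes for labelled regions of a read.
CODES = {"vector": "V", "linker": "L", "insert": "S", "polyA": "A"}

# Typical layout: vector-linker-sequence-poly-A-linker-vector, where
# everything except the insert itself may be missing from the read.
TYPICAL = re.compile(r"V?L?SA?L?V?")

def classify_layout(features):
    """Classify an ordered list of region labels from one read."""
    layout = "".join(CODES[f] for f in features)
    if TYPICAL.fullmatch(layout):
        return "typical"
    if layout.count("S") > 1:
        return "possible chimera"   # two inserts separated by something else
    return "atypical"
```

Used on a read where a linker shows up between two stretches of insert, it flags a possible chimera:

```python
classify_layout(["vector", "linker", "insert", "linker", "insert", "vector"])
```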

It could do a lot of things. Unfortunately, at the moment the margin of my calendar is too small to fit it in…
