Searching for poly(A) tails

by Ketil Malde; December 14, 2009

I’m currently involved in a project where we study, among other things, the 3’UTR and poly-A tails of certain genes. For this, is of course important to accurately identify the poly-A tail in each transcript, but I couldn’t find any program or tool to do just that. Presumably the task is considered too trivial? So, like many other “trivial” tasks, it is performed by ad-hoc solutions that are likely to be suboptimal.

Here’s a better method that identifies poly-A tails by finding an optimal, quality adjusted alignment in linear time.

A quick introduction

Although the definitions of what constitutes a gene vary considerably, we’ll use the term to refer to a region of DNA that get transcribed, that is, copied from DNA into an mRNA molecule, which in turn will be used as a blueprint for assembling a protein. After transcription, the mRNA molecule then undergoes polyadenylation, a process where a string of adenines (the ‘A’ of the nucleotide alphabet) gets appended to an mRNA molecule before it is exported from the nucleus.

Identifying poly-A tails are important for several reasons.

It positively identifies the end of the transcript. If you don’t have a poly-A in your sequence, you have no way to know how far the molecule extends beyond the end of the sequence. You can also find alternatively terminated transcripts this way.
It positively identifies the end of the transcript. Anything after the poly-A tail is linker or vector sequence, and can safely be trimmed off. Even if, as is often the case, it is too low quality to be recognized by your average vector masking software.
It provides useful information about the transcript, as the poly-A tail is important for things like protecting the mRNA from degradation.

Unfortunately utilities often trim poly-A tails by default (e.g. SeqClean), or just ignore it (e.g. BLAST’s low-complexity filter).

Quality based alignment

When a molecule is sequenced, the analog output from the sequencing machine is stored as a chromatogram. In order to be useful, the sequence is called, that is, translated to a string of letters from the familiar {A,C,G,T} nucleotides alphabet. In addition, the base caller will associate each letter with a quality value. This is derived from an estimate of the probability of the call being incorrect, and for quality value Q, the error probability estimate is

ϵ = 10^-Q/10

Traditionally sequence alignment simply aligns the string of characters using a fixed positive score (reward) for aligning similar characters, and fixed negative scores (penalties) for either substituting a different character, opening a gap, or extending a previous gap.

However, taking into account the quality value, we can do better, and instead of fixed scores, we can adjust the scores dynamically according to quality.

Using this method, the penalty for e.g. aligning two different characters will depend on the quality of the characters: high quality means a high penalty, low quality – lower penalty (since there’s a greater chance one of them was incorrectly called).

Scoring of alignments

When calculating the score of an alignment, we really want to answer the question: how likely is this sequence to be a real poly-A sequence, as opposed to just a random string? In other words, we are comparing our sequence against two models: the poly-A model, and the background model. Our score will use the ratio of probabilities of the string being produced by the two models.

For the poly-A model, only As are allowed, so the probability of a character occurring is 1 for As and 0 for the others. For the background model, we’ll just take a uniform distribution of nucleotides, each getting a probability of 0.25.

Using this scheme, the score for a string s is simply 1/0.25 = 4 for each A, and 0/0.25 = 0 for all others. We usually work with the logarithm of these numbers to make them more manageable.

The optimal alignment is then simply the longest run of As, since as soon as you multiply with a zero (or add -infinity, if you use log-scores), you lose the whole score.

Adding quality to the mix

Of course, the actual sequence isn’t perfect, and even the poly-A tail is likely to contain the odd G, C, or T. To determine exactly how likely is where the quality value enters the picture. Using the formula above, we can calculate the error estimate and decide what the penalty for a mismatch and reward for a match should be.

For the poly-A model, the probability for a match (that is, an actual ‘A’ in the sequence) is 1-ϵ, the probability of a mismatch (a non-A) is ϵ/3 (since only one of the three possible substititutions is an A, and for simplicity, we give them equal probability). Using the formula for ϵ as a function of Q (and hopefully not introducing any errors), I get the scores to be:

match q = log (4*(1-1/10**(q/10)))
mismatch q = log 4 - log 3 - q/10*log 10

Now, we can use this to do a standard Smith-Waterman alignment, calculating a dynamic programming matrix, and searching for an optimal local alignment.

However, since we’re aligning against a repeated nucleotide, there’s no real need for a second dimension, and we can use the following recurrence to calculate the “polyA-score” M for each position i:

M_i = max(0, S_i + M_i − 1)

To implement this, we first define the list of scores by applying match and mismatch to the list of (nucleotide,quality) pairs. We also define a scanl-based function to calculate a list of cumulative scores:

scores = map (\(c,q) -> if toUpper c=='A' then match q else mismatch q) qd
cumulative = scanl (\a b -> let r = a + b in max 0 r)

The only remaining thing is to identify the maximal value which marks the end of the poly-A tail, and the corresponding 0 value that indicates the start. I wrote a recursive function called findmax for this, but a better programmer will probably be able to do this with a fold.

Including the parts discussed briefly above, the whole thing looks like this:

findPolyA :: Sequence Nuc -&gt; Maybe (Int,Int)
findPolyA (Seq _ d mq) =
    let qd = zip (B.unpack d) (maybe (repeat 15) BB.unpack mq)
        scores = map (\(c,q) -> if toUpper c=='A' then match q else mismatch q) qd
        match x' = let x = fromIntegral x' in log (4*(1-1/10**(x/10)))
        mismatch x' = let x = fromIntegral x' in log 4 - log 3 - x/10*log 10
        cumulative = scanl (\a b -> let r = a + b in max 0 r) 0
        (zi,mi,maxscore) = findmax $ cumulative scores
    in if maxscore > 12 then Just (zi+1,mi) else Nothing  -- arbitrary constant alert!

findmax :: [Double] -&gt; (Int,Int,Double)
findmax = go 0 (0,0,0) . zip [0..]
    where go _ cm [] = cm
          go _ cm ((i,0):rest) = go i cm rest
          go last_z (cmz,cmi,cmx) ((i,x):rest) = if x &gt; cmx then go last_z (last_z,i,x) rest
                                                 else go last_z (cmz,cmi,cmx) rest

Availability

This method is implemented in a simple tool called “trimpolya” (darcs repo), and also in the more general “dephd” (darcs, hackage) sequence analysis package.

comments powered by Disqus

Feedback? Please email ketil@malde.org.