Illumina RNAseq and bias

by Ketil; May 6, 2013

After doing mostly genome sequences for some time, we recently looked at some transcript sequences, specifically RNASeq data using Illumina sequencing platform. One of the first questions will be whether the data look okay, and one of the first things to check to answer it is the distribution of nucleotides, using the FastX toolkit. It looks like this (ripped off the ’net, from University of Exeter):

RNAseq bias

Does this looks scary to you? Since the sequence fragment should be randomly distributed in the transcripts, there should be no (or at least very little) bias, but the distribution of the first twelve nucleotides is clearly skewed.

We checked this for 454 RNASeq data as well, and yes, the same thing happens there. On the other hand, genome data does not have anything of the sort, here everything aligns nicely to the expected AT/GC ratio.

After some searching around, we found the explanation in a paper

The protocol used for RNA sequencing is to use reverse transcriptase to convert it back into DNA, which can then be sequenced as usual. The RNA is single stranded, and the transcriptase, like polymerase, needs a small double-stranded segment which it then extends. This is achieved by mixing in random hexamers - small single stranded DNA stretches six nucleotides long - which will then bind to the RNA at random places. As far as the theory goes, at least.

But it turns out, priming by random hexamers doesn’t amount to randomly located transcripts, and there is a bias. The main message of the paper is solid enough, and it seems clear that this bias is not caused by read errors, but rather by a location bias for the hexamers.

The article makes a convincing argument, and I’m almost entirely, but not quite, satisfied with it. What still makes me uneasy about the whole thing is:

The pattern stretches up to position 12, or perhaps 13. Conceivably, this could be caused by two biased hexamers ligating after each other, but then the pattern should be the same in position 7-12, and it isn’t really. (Pearson correlation between the first and second hexamer: 0.11.) Why would there be different biases here?
Also, if there is a double hexamer effect, I’d be much happier if I observed a weak triple hexamer effect - but from position 13 onwards, it’s almost dead flat; no discernible bias.
~~The same pattern appear for the second reads in paired end sequencing. I can accept a hexamer bias in the beginning of the reads, but how could this possibly affect the end of the cloned DNA?~~ Edit: I guess we use random hexamer primers for the next round of amplification (in the reverse direction), and then sort clones by length (on a gel or similar). Doh. Oh: if there is just a tiny bias, wouldn’t that be amplified with each round of PCR? Would that explain it?
I can’t figure out the hexamers involved. The common hexamers are, unsurprisingly, repeats. You would think that such a distinct and consistent pattern would be caused by a few very common hexamers, but that doesn’t appear to be the case.

A brief tour of the evidence

So I extracted the initial 12-mer from the reads in a file 28pre12.seq, as well as the following 12-mer in 28inf12.seq, to be used as a control. First, counting what should be the most common three-letter pattern:

% grep -c AT...A 28???12.seq 
28inf12.seq:15649
28pre12.seq:14544

No dice - this is slightly more common in the infix sequences than in the prefix ones.

Counting reads by nucleotide in position 7:

% for a in A C G T; do echo $a `grep -c "^......$a" 28pre12.seq`       
for> done
A 16887
C 19624
G 15239
T 48250

Here, T is vastly overrepresented. The control:

% for a in A C G T; do echo $a `grep -c "^......$a" 28inf12.seq`; done                                                                                   
A 29518
C 20374
G 20894
T 29206

Looking at the 12-mers with a T in position 7 at the beginning of reads, and in using the middle of reads as the control, we get:

% grep '^......T' 28pre12.seq | sort | uniq -c | sort -n | tail
 27 C A C G A A T A T C T T
 27 C A G C A C T A T A T T
 30 T G A A T A T A T A C A
 31 C A G G A A T C A A A C
 31 C G C T G T T A T C C C
 34 C A A A A T T G T A T A
 39 T T T T G C T A C A C A
 50 C G C T G A T A T A C T
 57 C G A A T A T C T T T T
130 C T G A G A T A G A A A

Or, as a position-specific frequency matrix:

A    - 4 3 4 4 6   7 1 7 2 4
C    8 - 3 1 - 2   2 1 2 4 2
G    - 4 3 2 3 -   1 1 - - - 
T    2 2 1 3 2 2 X - 7 1 4 2
     * *     * *   * * * * *

And again, no such pattern in the infixes:

% grep '^......T' 28inf12.seq | sort | uniq -c | sort -n | tail 
 23 TTATTCTTATCT
 24 ATCTCTTGAATC
 25 CGTCGATCTTAA
 26 AGATATTCGTGA
 27 GAGCTCTAAAAG
 28 AAGTATTACGCG
 28 ATACTTTTAGAG
 30 ACTTTTTTACTT
 34 TCTTATTTAAAA
 39 GTCTTATGTATC

So, there are a few over-represented strings at the start of the reads, contributing to this, but the “worst” offender only has 130 occurrences, thus contributing 0.1% extra to the T in position 7.

Conclusion?

There is a clear bias, but it seems to be pretty weakly connected to specific hexamers, instead, it looks like e.g. any prefix with a T in position seven is more common. To me, this looks more like a binding site motif than skewed hexamer distribution. For instance, could there be a bias in the reverse transcriptase binding?

Or could it be some other effect? I can’t help but notice that position seven is the first poymerase-extended position - is it somehow easier for the polymerase to start with a T?

I don’t get it.

Acknowledgments

Thanks to my colleagues Tomasz Furmanek and Susanne Balzer for looking into this with me.

comments powered by Disqus

Feedback? Please email ketil@malde.org.