# Expected site information from SNPs

by *Ketil Malde*; **July 2, 2014**

Lately, I’ve been working on selecting SNPs, the main goal is often to classify individuals as belonging to some specific population. For instance, we might like to genotype a salmon to see if it is from the local population or an escapee from a sea farm, or perhaps a migrant from a neighboring river? And if it’s an escapee, we might want to know which farm it escaped from. In short, we want to find SNPs that are *diagnostic*.

Typically, this is done by sequening pools of individuals, mapping the reads to the reference genome, identifying variant positions, and ranking them - typically using *F*_{ST}, sometimes also using p-values for the confidence in an actual allele difference, and maybe filtering on sequencing coverage and base- or mapping quality. However, *F*_{ST} really isn’t a suitable tool for this purpose. I’m therefore proposing the following. Let me know if it makes sense or not.

# Expected Site Information

For diagnostic SNP, what we really would like to know is the amount of *information* observing each site contributes. Using Bayes theorem, observing an allele *a* in some individual *N*, gives us the following posterior probability for *N* belonging to some population *A*, where the allele frequency, *P*(*a*∣*A*), is known:

`P(A|a) = P(a|A)P(A)/P(a)`

Here, P(A) is our prior probability of *N* belonging to *A*, which after observing `a`

is modified by a factor of

`P(a|A)/P(a)`

In order to assign *N* to one of several populations (either (A) or *B*, say), we are interested in the *relative* probabilities for the two hypotheses. In other words, we would like the *odds* for *N* belonging to one population or the other. Given the probabilities of *P*(*a*∣*A*) and (P(a|B)), and initial odds (P(A)/P(B)), we get

`P(A|a)/P(B|a) = [P(a|A)P(A)/P(a)]/[P(a|B)P(B)/P(a)]`

Canceling out *P*(*a*), we find that the prior odds are modified by:

`P(a|A)/P(a|B)`

That is, the ratio of this allele’s frequencies in each of the populations. For practical reasons, it is common to take the logarithm of the odds. This gives us scores that are additive and symmetric (so that switching the two populations gives us the same score with the opposite sign). Specifically, base two logarithms will give us the score in bits.

When observing a site, we may of course also encounter the alternative allele. By the same reasoning as above, we find that this allele modifies the odds by

`[1-P(a|A)]/[1-P(a|B)]`

Lacking any prior information, we can consider each population equally likely, and the likelihood of observing a particular allele is the average of the likelihood in each population. The information gain from each possible allele is then averaged, weighted by this average likelihood. For a biallelic site with major allele frequencies *p* and (q) (and consequentially, minor allele frequencies of 1 − *p* and (1-q)) in the two populations, the expected added information from the site then becomes:

`I(p,q) = |(p+q)/2 log_2(p/q)| + |(1-(p+q)/2)log_2((1-p)/(1-q)) |`

Note that we are here only interested in the amount of information gained, regardless of which hypothesis it favors, and thus we take the absolute values. For a site with multiple alleles enumerated by *i* and with frequency vectors **p** and **q** in the two populations, this generalizes to the weighted sum of *l**o**g*_{2}(*p*_{i} / *q*_{i}).

Unlike measures like *F*_{ST}, measures of *I* is *additive* (assuming independence between sites), so the information gained from observing mulitple sites is readily calculated. From observing the information gained from observing each site, we will also be able to compare different sets of sites, and e.g., compare the value of a single site with minor allele frequencies (MAF) of, say, 0.1 and 0.3 to two sites with MAF of 0.2 and 0.3.

It may also be instructive to compare this procedure to sequence alignment and position specific score matrices (PSSMs). In sequence alignment, a sequence of nucleotides or amino acids are scored by comparing its match to a target sequence to its match to some base model using log odds scores. The base model to compare against is often implicit (typically using sequences of random composition), but more elaborate models are also possible Similarly, position specific frequency matrices are often converted to position specific score matrices using log odds. Calculating the information value from a set of observed alleles is then analogous to scoring an “alignment” of the set of observed alleles to two different sets of allele frequencies.

# Allele frequency confidence intervals

In order to apply the above method in practice, we need to measure the allele frequencies in the population. This is problematic for two reasons. First, we do not have precise knowledge of the allele frequencies, we can only estimate them from our sequenced sample. This introduces sampling bias. Second, the sequencing process introduces additional artifacts. For instance, sequencing errors often result in substitutions, which are observed as apparent alleles. In addition, sequences can be incorrectly mapped, contain contamination, the reference genome can contain collapsed repeats, and the chemistry of the sequencing process is usually also biased – for instance, coverage is often biased by GC content. These artifacts often give the false appearance of variant positions.

One challenge with calculating site information from sequencing data (as opposed to using allele frequencies directly), is that such errors in the data can vastly overestimate the information content. For instance, an allele that appears to be fixed in one population means that any other observed allele will assign the individual to the alternative population - regardless of any other alleles. It is easy to see that an allele frequency of zero results in the odds going either to zero or infinity, and thus the log odds will go to either positive or negative infinity.

For diagnostic SNP discovery, it is more important to ensure that identified SNPs are informative, than to precisely estimate the information content. Thus, we take a conservative approach and use upper and lower limits for the allele frequencies by calculating confidence intervals using the method by Agresti-Coull. In addition, the limits are also adjusted by a factor *ε*, corresponding to sequencing error rate.

# Software implementation

I’ve implemented this (that is, the conservative measure) as described above in a software tool called varan. It parses sequence alignments in the standard “mpileup” format as output by the **samtools mpileup** command. It can currently output several different statistics and estimators, including expected site information. This is a work in progress, so please get in touch if you wish to try it out.