Some not-so-grand challenges in bioinformatics

by Ketil Malde; April 8, 2014

Bioinformatics is a field at the intersection of biology, statistics, and computer science. Broadly speaking, biologists plan an experiment, bioinformaticians run the analysis, and the biologists interpret the results. Sometimes it can be rather frustrating to be the bioinformatician squeezed into the middle, realizing that the results are far less robust than they should be.

Here, I try to list a number of areas in bioinformatics that I think are problematic. They are not in any way “grand challenges”, like computing protein structure or identifying non-trivial factors from GWAS analysis. These are more your everyday challenges that bioinformaticians need to be aware of, but are too polite to mention.

Importance of bioinformatics

High throughput sequencing (HTS) has rapidly become an indispensable tool in medicine, biology, and ecology. As a technology, it is crucial in addressing many of our most important challenges: health and disease, food supply and food production, and ecology. Scientifically, it has allowed us to reveal the mechanisms of living organisms in unprecedented detail and at unprecedented scale.

New technology invariably poses new challenges for analysis, and in addition to a thorough understanding of the relevant biology, robust study design and analysis based on HTS require a solid understanding of statistics and bioinformatics. A lack of this background often results in study designs that are overly optimistic and ambitious, and in analyses that fail to take the many sources of noise and error into account.

Specific examples

The typical bioinformatics pipeline is (as implied by the name) structured as a chain of operations. The tools used are of course not perfect; they introduce errors, artifacts, and biases. Unfortunately, the next stage in the pipeline usually works from the assumption that the results from the previous stage are correct, so errors propagate and accumulate, and in the end it is difficult to say whether results represent real biological phenomena or are simply artifacts of the analysis. What makes matters worse is that the pipeline is often executed by scientists with only a superficial knowledge of the errors that can occur. (A computer said it, so it must be true.)
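
To get a feel for how quickly such errors compound, here is a toy back-of-the-envelope sketch in Python (not a model of any real pipeline): if each stage silently keeps only some fraction of its output correct, and stages err independently, the trustworthy fraction shrinks multiplicatively. The per-stage accuracies below are invented purely for illustration.

    # Toy illustration: errors compound multiplicatively through a pipeline.
    # The per-stage accuracies are made-up numbers, not measurements.
    stages = {
        "base calling":          0.99,
        "read mapping":          0.95,
        "assembly / prediction": 0.90,
        "annotation":            0.90,
        "downstream statistics": 0.95,
    }

    correct = 1.0
    for name, accuracy in stages.items():
        correct *= accuracy
        print(f"after {name:<22} fraction still correct: {correct:.2f}")

With these made-up numbers, more than a quarter of the final results would be wrong, and nothing in the output flags which ones.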

Gene prediction and transcript assembly

One important task in genomics is identifying the genes of newly sequenced organisms. This can be done with a variety of tools, and in one project we used both ab initio gene prediction (MAKER and Augustus) and RNAseq assembly (CLC, ABySS). The results varied substantially, with predicted gene counts ranging from 16,000 to 45,000. To make things worse, clustering revealed that there was no simple one-to-one (or even many-to-one) relationship between the predicted gene sets. Clearly, results from analyses of e.g. GO enrichment or expansion and contraction of gene families will depend heavily on which set of transcripts one uses.
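
As an illustration of the kind of comparison meant here, the Python sketch below clusters predicted genes from two hypothetical tools by coordinate overlap and checks whether the clusters pair up one-to-one. The tool names, chromosomes, and intervals are all invented; real input would come from the GFF files produced by the predictors.

    # Minimal sketch: single-linkage clustering of predicted genes by overlap,
    # then a check of how the two tools' predictions relate within each cluster.
    from collections import defaultdict

    # (tool, chromosome, start, end) - invented coordinates for illustration
    predictions = [
        ("toolA", "chr1", 100, 500), ("toolA", "chr1", 800, 1200),
        ("toolA", "chr1", 3000, 3500),
        ("toolB", "chr1", 120, 480), ("toolB", "chr1", 700, 900),
        ("toolB", "chr1", 950, 1250), ("toolB", "chr2", 10, 200),
    ]

    def overlaps(a, b):
        return a[1] == b[1] and a[2] < b[3] and b[2] < a[3]

    # Sweep along each chromosome, merging genes that overlap the current cluster.
    clusters = []
    for gene in sorted(predictions, key=lambda g: (g[1], g[2])):
        if clusters and any(overlaps(gene, other) for other in clusters[-1]):
            clusters[-1].append(gene)
        else:
            clusters.append([gene])

    for cluster in clusters:
        counts = defaultdict(int)
        for tool, *_ in cluster:
            counts[tool] += 1
        kind = "one-to-one" if counts == {"toolA": 1, "toolB": 1} else "not one-to-one"
        print(kind, dict(counts))

Clusters that land in the "not one-to-one" category are exactly what makes downstream comparisons between transcript sets so awkward.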

Expression analysis

Quantifying the expression of specific genes is an important tool in many studies. In addition to the challenges of acquiring a good set of transcripts, expression analysis based on RNAseq introduces several other problems.

The most commonly used measure, reads (or fragments) per kilobase per million (RPKM or FPKM), has known statistical problems; among other things, the scaling by total library size means that a gene's value depends on the expression of everything else in the sample, which makes comparisons between samples unreliable.
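
For reference, the measure itself is simple: the read count scaled by gene length in kilobases and by library size in millions of reads. The small Python sketch below uses made-up numbers to show the composition problem: the same gene with the same read count looks less expressed simply because something else inflated the library size in the second sample.

    # RPKM = reads / (gene length in kb * total mapped reads in millions)
    def rpkm(reads, gene_length_bp, total_reads):
        return reads / ((gene_length_bp / 1e3) * (total_reads / 1e6))

    # Same gene, same 500 reads, 2 kb long - but the second library is four
    # times larger because some other transcript dominates it.
    print(rpkm(500, 2000, 10_000_000))   # 25.0
    print(rpkm(500, 2000, 40_000_000))   # 6.25 - apparently "down-regulated"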

In addition, mapping is challenging; in particular, mapping reads from transcripts to a genome will struggle with intron/exon boundaries. Of course, you can map to the transcriptome instead, but as we have seen, the reliability of reference transcriptomes is not always what we would like.

Reference genomes

The main goal of a de novo genome project is to produce a correct reference sequence for the genome. Although in many ways a simpler task than the corresponding transcriptome assembly, there are many challenges to overcome. In reality, most genome sequences are fragmented and contain errors. But even in the optimal case, a reference genome may not be representative.

For instance, a reference genome from a single individual is clearly unlikely to accurately model the sex chromosomes of the other sex (although, on the other hand, similarities between the sex chromosomes will likely lead to a lot of incorrectly mapped reads). But a single individual is often not representative even of other individuals of the same sex. For example, an obvious method for identifying a sex chromosome is to compare sequence data from one sex to the draft reference (which in our case happened to come from an individual of the other sex). Unfortunately, the marker we found to be absent from the reference, and thus inferred to be specific to the non-reference sex, turned out to also occur in 20% of individuals of the reference sex. Just not in the individual that was sequenced.

So while a reference genome may work fairly well for some species, it is less useful for species with high genomic variation (or with B-chromosomes), and simply applying methods developed for mammals to other species will likely give misleading and/or incorrect results.

Genome annotation

I recently attended an interesting talk about pseudogenes in various species. One thing that stuck in my mind was that the number of pseudogenes reported for a species correlates strongly with the institution responsible for the gene prediction. Although this is just an observation and not a scientific result, it seems very likely that different institutes use in-house pipelines with different trade-offs between sensitivity and specificity. Thus, what one pipeline would identify as a real gene, a more conservative pipeline would dismiss as a pseudogene. So here, too, our precious curated resources may be less valuable than you think.

Population genomics

In population genomics there are many different measures of diversity, but the most commonly used is FST. Unfortunately, the accuracy of an FST estimate for a single SNP depends heavily on coverage and error rate, and a simulation study I did showed that at the coverage commonly used in such projects, the estimated FST has a large variance.
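
The following is not that simulation study, just a minimal Python sketch of the coverage effect alone: allele frequencies in two populations are estimated from reads drawn binomially at a given depth (sequencing errors are ignored entirely), and a Nei-style FST, (Ht - Hs) / Ht, is computed from each pair of estimates. The true frequencies and coverages are arbitrary choices.

    # Sketch: spread of single-SNP FST estimates as a function of read depth,
    # assuming reads sample alleles binomially and ignoring sequencing error.
    import random
    import statistics

    def fst(p1, p2):
        """Nei-style FST from two allele frequencies: (Ht - Hs) / Ht."""
        p_bar = (p1 + p2) / 2
        ht = 2 * p_bar * (1 - p_bar)
        hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        return 0.0 if ht == 0 else (ht - hs) / ht

    def simulate(true_p1, true_p2, coverage, replicates=2000):
        estimates = []
        for _ in range(replicates):
            p1 = sum(random.random() < true_p1 for _ in range(coverage)) / coverage
            p2 = sum(random.random() < true_p2 for _ in range(coverage)) / coverage
            estimates.append(fst(p1, p2))
        return statistics.mean(estimates), statistics.stdev(estimates)

    print("true FST:", round(fst(0.3, 0.6), 3))
    for coverage in (5, 10, 30, 100):
        mean, sd = simulate(0.3, 0.6, coverage)
        print(f"coverage {coverage:>3}: mean {mean:.3f}  sd {sd:.3f}")

At low depth the estimates are not only noisy but also biased upwards, since sampling noise itself looks like population differentiation.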

In addition, the usual problems apply: the genome may not be complete, correct, or representative. And even if it were, reads could still be mapped incorrectly, and mapping is more likely to fail in variable regions, meaning you get the least accurate results exactly in the places that matter most.

Variant analysis

Variant analysis suffers from many of the same challenges as population genomics, genome assembly and mapping being what they are. Again, estimated allele frequencies tend to be wildly inaccurate.

Structural variants (non-trivial indels, transpositions, copy number variation, etc.) are very common, but difficult to detect accurately from mapped reads. In many cases, such features will simply result in lower mapping coverage, further reducing your chances of identifying them.
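
As a deliberately simplistic illustration of that last point, the Python sketch below flags windows whose mapped-read depth falls well below the chromosome-wide median as candidate regions. Real structural-variant callers also use split reads, read-pair orientation, and insert sizes; the depths, window size, and threshold here are all invented.

    # Naive scan: windows with mean depth below half the median are reported
    # as candidate structural-variant regions. Input depths are invented.
    import statistics

    depths = [30, 31, 29, 28, 30, 12, 4, 3, 5, 11, 29, 30, 32, 30, 28, 31]
    window = 4
    median = statistics.median(depths)

    for start in range(0, len(depths) - window + 1, window):
        mean_depth = sum(depths[start:start + window]) / window
        if mean_depth < 0.5 * median:
            print(f"candidate region: positions {start}-{start + window - 1}, "
                  f"mean depth {mean_depth:.1f} vs median {median}")

Of course, the same coverage drop could just as easily reflect mapping difficulties, GC bias, or an assembly problem, which is rather the point.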

Curation is expensive

Analysis is challenging even under the best of conditions – with high quality data and a good reference genome, assembled, annotated, and curated by a reputable institution or heavyweight consortium. But in many cases, data quality is poor and we don’t even have the luxury of good references.

A de novo genome project takes years and costs millions. This is acceptable for species crucial to medicine, like human or mouse, and for food production, like cow or wheat. These are sequenced by large multi-national consortia at tremendous expense. But while we will eventually get there for the more important species, the large majority of organisms – the less commercially or scientifically interesting ones – will never have anything close to this.

Similarly, the databases of unannotated sequences (e.g. UniProt/TrEMBL) grow at a much quicker pace than the curated databases (e.g. Swiss-Prot), simply because identifying the function of a protein in the lab takes approximately one post-doc, and nobody has the incentive or manpower to do this.

Conclusions

So, what is the take-home message from all of this? Let’s take a moment to sum up:

Curation is less valuable than you think

Everybody loves and anticipates finished genomes, complete with annotated transcripts and proteins. But there’s a world of difference between the finished genomes of widely studied species and the draft genomes of more obscure ones. Often, the reference resources are incomplete and partially incorrect, and sometimes the best that can be said for them is that as long as we all use them, our results will be comparable, because we all make the same errors.

Our methods aren’t very robust

Genome assembly, gene annotation, transcriptome assembly, and functional annotation are difficult. Anybody can run an assembler or gene predictor and get a result, but if you run two of them, you will get two different results.

Sequence mapping is not very complicated (at a conference, somebody claimed to have counted over ninety programs for short-read mapping), and it works well if you have a good reference genome, if you don’t have too many repeats, and if you don’t have too much variation in your genome. In other words, it works well everywhere you don’t really need it.

Dealing with the limitations

These being everyday challenges, I don’t think any of them are insurmountable, but I’ll save that discussion for a later post. Or a grant application. In the meantime, do let me know what you think.

Feedback? Please email ketil@malde.org.