A big data approach to bioinformatics

by Ketil Malde; September 10, 2013

Early this year, I was at BIOSTEC in Barcelona, where I gave a presentation on using Software Transactional Memory for a bioinformatics task - in this case, genome scaffolding. As usual, the conference was fun and interesting, and very well organized.

One of the sessions was on Big Data, a topic that comes up regularly. This is extremely relevant to bioinformatics, since sequencing cost is decreasing at a rate much faster than Moore's law. Even if computer performance had stayed on its old trajectory, doubling every 18 months, it would still be unable to keep up. According to one of the figures presented, in 2017 it will cost about one dollar to sequence a genome.

I don't think anybody has any real answers to this. And in some sense it's moot, because the precious resource - the limiting factor - hasn't been sequencing for some time already. Nor is it compute power, nor storage. It is analysis. It simply takes too many hours to do the work - read up on methods, install software, develop pipelines, debug pipelines, and so on. The heterogeneity of tasks doesn't help either: unlike, say, meteorology or seismics, no two biologists seem to be looking for the same thing.

Curated genomes and wishful thinking

Now I am at GIA2013 - also a great experience - this time presenting an abstract on more effective selection of diagnostic SNPs; that is, finding working genetic markers to identify which population an individual belongs to. (Of course, simulating the sequencing using FlowSim.) In the round-table discussion, many panelists emphasized the need for completed genomes: we need the genome to identify isoforms, to analyze promoter regions, and so on.

But again - sequencing is cheap; analysis - in this case sequence assembly, scaffolding, gene prediction, functional annotation, and so on - is the expensive part. For example, the salmon genome has cost many millions of dollars so far, and I don't know how much it will cost in the end, or how good it will be. But - and I think this is important - I'm certain that it still will not give you all the answers. You will, practically speaking, never know all transcription factor binding sites, microRNAs, or splice variants. And it is quite unlikely that the sequence will be free of misassemblies and artifacts. In short, the data will never be perfect.

And it is important to realize that there is a diminishing-returns effect here. A draft genome can be had for a few thousand dollars these days. It will of course be very fragmented, but you can probably find most genes, especially if you're willing to do some puzzling together and exon hunting yourself. But a complete, annotated genome is still going to be expensive and take a long time - so much so that for most species, it is simply not a realistic option.


I think the solution is to accept as a simple fact of life that our reference data is not - and never will be - complete and correct. And to develop tools that do not make any such assumption; that are robust to errors and miscuration; that rely more on raw data than on curation.

If we take a step backwards and examine the situation, we realize that a curated genome (or transcriptome, or whatever) is not a goal in itself. It is a means to an end. And that end is answering scientific questions. You don’t necessarily need all the possible isoforms, but perhaps you need the isoforms present in your data. Perhaps you only want to examine a specific metabolic pathway - and you may be able to do without the exact number of paralogs, as long as you can identify their total expression. And so on. So we need tools that a) can use lower quality or raw data more directly, instead of relying on (manual, expensive) curation, and b) can derive the necessary structures on the fly - and only the necessary ones.

So instead of, say, spending $20M on getting a genome 98% correct, I would spend $10M to get it to 95% correct, and spend the remaining $10M on tools that take the incorrectness into account. This would have several benefits, as the following sections try to illustrate.

CPU vs analysis

So, to revisit the original point - Moore's law. Applying the big data approach, and trying to pull the answers to scientific questions directly out of the raw data, will probably be very computationally heavy.

But curation (including, e.g., genome assembly and annotation) is also computationally heavy - and, more importantly, quite analysis-heavy. And analysis is the scarce resource.

Some examples

Okay, let's look at some problems more constructively. Below are some quick ideas; I'm sure you can come up with some of your own.

Piecing together fragmented genes

One common problem with incomplete reference genomes is that genes are often distributed across multiple contigs. Biologists then have to piece them together by hand from multiple matches. It should be straightforward to write a tool - a small wrapper around BLAST - that does this automatically.
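As a rough sketch of what such a wrapper could look like (the function name is mine, and I assume tabular BLAST output, i.e. -outfmt 6): parse the hit table and collect, per query gene, the contigs and coordinates it matches, so that the fragments of a gene show up together in one report.

```python
from collections import defaultdict

def group_hits_by_gene(blast_tabular):
    """Group tabular BLAST hits (-outfmt 6) by query sequence.

    Returns, for each query, a list of (contig, start, end) regions
    it matches, so a gene scattered over several contigs can be
    pieced together from a single report."""
    hits = defaultdict(list)
    for line in blast_tabular.strip().splitlines():
        f = line.split("\t")
        query, contig = f[0], f[1]
        s_start, s_end = int(f[8]), int(f[9])  # subject (contig) coordinates
        hits[query].append((contig, min(s_start, s_end), max(s_start, s_end)))
    return dict(hits)
```

Feed it the output of, say, blastn run with -outfmt 6; a gene whose exons are spread over several contigs then appears as one entry listing all the matching regions, instead of scattered hits to be collated by hand.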

Transitive alignments

Sensitivity in sequence alignment used to be thought of as the holy grail. Then all the rage went into building (curated!) stochastic models of available and identified sequences instead, and searching with them.

But another approach would be to use those other sequences directly, omitting the stochastic modelling step. That is, blast against some huge database containing all kinds of unidentified sequences (e.g., NR), and combine the results with hits from that database (NR) against a curated database (e.g., SwissProt). This improves sensitivity, since you are initially searching against all known sequences, and specificity, since most curated sequences will have good hits from some of the known sequences.

(This is basically what transalign does; I don't know of anything else that identifies such long-range relationships as well.)
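A minimal sketch of the chaining idea (this is not transalign itself - the function name is mine, and scoring a transitive hit by the weaker of its two bit scores is a deliberate simplification of the real score propagation):

```python
def transitive_hits(query_to_nr, nr_to_sprot):
    """Chain query->NR hits with NR->SwissProt hits.

    Each input maps a sequence id to a list of (hit_id, bit_score)
    pairs.  A transitive hit query->sprot is scored by the weaker
    (minimum) of the two scores along the chain, and only the best
    chain for each (query, sprot) pair is kept."""
    best = {}
    for query, nr_hits in query_to_nr.items():
        for nr_id, s1 in nr_hits:
            for sprot_id, s2 in nr_to_sprot.get(nr_id, []):
                score = min(s1, s2)            # chain is only as good as its weakest link
                key = (query, sprot_id)
                if score > best.get(key, 0.0):
                    best[key] = score
    return best
```

The point of the structure: the query may have no direct hit in SwissProt at all, yet still reach it through an intermediate NR sequence that has good hits in both directions.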

Gene expression

Gene expression analysis can be a challenging task and, I think, an inexhaustible source of inaccurate results. Generally, one maps RNAseq data against a reference transcriptome (or possibly genome), and tries to derive the expression of individual genes from the number of reads that map to the putative transcripts. The problem here is mainly the reference transcriptome: for many species it will not be very accurate, and the actual number of different transcripts is usually highly dubious. So people naturally ask for more resources to go into building better genomes and transcriptomes.

But often, we are not really all that interested in the exact transcripts. Perhaps we just want to know whether a certain metabolic pathway is up- or downregulated, or whether genes associated with specific functions or processes are affected. In that case, why not cluster the putative transcripts against each other and against KEGG orthologs, SwissProt reference proteins, or GO terms - and simply analyze the read counts per cluster? True, this will not tell you the number of genes involved, but the counter-argument is that you don't know this anyway. So it will be less precise, but much more accurate.

And it has the advantage that you can combine information from many different putative transcriptomes (ab initio prediction, EST assemblies, RNAseq assemblies), mitigating their individual weaknesses.
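The counting step of this approach is almost trivial once the clustering is done. A sketch, assuming each putative transcript has already been assigned a cluster label (all names here are made up for illustration):

```python
from collections import Counter

def counts_per_cluster(read_counts, transcript_to_cluster):
    """Aggregate per-transcript read counts into per-cluster counts.

    read_counts: transcript id -> number of mapped reads
    transcript_to_cluster: transcript id -> cluster label
        (e.g. a KEGG ortholog, a SwissProt protein, or a GO term)
    Transcripts without a cluster assignment are binned as
    'unclustered' rather than silently dropped."""
    totals = Counter()
    for transcript, count in read_counts.items():
        totals[transcript_to_cluster.get(transcript, "unclustered")] += count
    return dict(totals)
```

Because the aggregation only needs a transcript-to-cluster mapping, transcripts from different sources (ab initio predictions, EST assemblies, RNAseq assemblies) can all feed into the same cluster counts.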

Anyway, I’ve written a sequence clustering program with this in mind, and hope to find time to move forward with this.

Feedback? Please email ketil@malde.org.