Big data revisited

by Ketil Malde; May 5, 2014

Big Data - again

I’ve recently (and much belatedly) started to get involved in the functional programming community in Munich.

Monday before Easter, I went to the MHM, which included an interesting talk by Rene Brunner about Big Data, and it made me think a bit. There are many definitions of “big data” (Rene provided several of them); Wikipedia suggests:

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

I will define “big data” as the point where a quantitative increase in problem size forces a qualitative change in how we approach it. In other words, problem sizes grow continuously, but at some point we can no longer keep up, and we have to rethink our methods and tools. Think of it as a paradigm shift in the way we do analysis.

A brief history of…me

I started out working in bioinformatics some years ago, at the Data Centre of a research institute. This department used traditional data centre technology, meaning SQL databases, sometimes hooked up to automatic data collection equipment, sometimes with a web-based wrapping written in PHP or Java, and often both.

When I arrived, bioinformatics was something new, and there was a strong inclination to shoehorn these new data into the same kind of infrastructure. My sentiment was that this was, perhaps, not the best way to go about it. In contrast to most other data, the bioinformatics data simply didn’t fit that mold.

In other words, bioinformatics represented a “big data” paradigm shift for the data centre. SQL databases were no longer very useful as the primary storage (though they were sometimes used as an application optimization).

The next paradigm shift

Now, we’ve all seen the graphs showing how the capacity of sequencing technology is growing exponentially, and at a much faster rate than Moore’s law. In other words, the ratio of compute power to data size is shrinking, and even for algorithms that scale linearly with input size (which is pretty optimistic for an algorithm), compute power is becoming a scarce resource. Again, we can’t keep up.
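To make the ratio argument concrete, here is a toy calculation in Haskell. The doubling times are illustrative assumptions, not measured figures: say compute capacity doubles every 18 months while sequencing capacity doubles every 9.

    -- Toy model of the compute-to-data ratio over time.
    -- The 18- and 9-month doubling times are illustrative assumptions.
    computeCapacity, dataVolume, ratio :: Double -> Double
    computeCapacity months = 2 ** (months / 18)  -- assumed compute growth
    dataVolume      months = 2 ** (months / 9)   -- assumed sequencing growth
    ratio months = computeCapacity months / dataVolume months

    main :: IO ()
    main = mapM_ report [0, 12, 24, 36]
      where report m = putStrLn (show m ++ " months: " ++ show (ratio m))

Under these assumptions the ratio halves every 18 months: whatever head start the compute side has erodes exponentially.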

And although people have known and shown this for years, somehow there’s always money for equipping labs with more sequencers, and grants for new sequencing projects. All too often, the analysis is added as an afterthought.

And if you’ll allow me a digression, this makes me wonder: what if the tables were turned? What if I got to tell my biologist colleagues that, hey, I just got money to buy yet another supercomputer with the capacity to analyze the expression data for every human gene, and now I need you to go to the lab and do twenty-five thousand qPCR runs for each of these twenty samples. Could you have it done by Monday? And by the way, we’re hiring some more computer scientists.

Scaling

I’ve previously argued that our methods in bioinformatics are not quite where we would want them to be, and that in many cases they tend to produce outright incorrect results. But in addition, I think they no longer scale.

Take a de novo genome project, for instance. For anything bigger (genome-wise, of course) than a bacterium, de novo assembly of a genome is a major undertaking. Sequencing is only a small fraction of the cost; you really need a team of experienced scientists and engineers to produce something useful and reliable. And lots and lots of computation.

And although producing curated “resources” (like genome assemblies) can be useful and get you published, often there are more specific goals we want to achieve. Perhaps we are only interested in some specific aspect of a species: maybe we want to design a vaccine, study the population structure, or find a genetic sex marker.

Lazy vs strict analysis

Depending on a curated resource for analysis is an example of an eager approach. This is fine if you work on one of the handful of species where a large and expensive consortium has already created the curated resource for you. And, of course, if you have someone nearby who can afford a supercomputing facility.

But for the remaining species, this isn’t going to work. Even if somebody decided to start prioritizing the computational and analytical aspects of biology, it would still consume a lot of valuable resources. Instead of just pouring money into compute infrastructure, we need to become more efficient. That means new algorithms and, I think, new approaches to how we answer scientific questions.

I don’t have all the answers to this, but rethinking how we do analysis could be a start. As programmers often forget, the best optimization is to avoid doing the unnecessary bits. A starting point would be to exchange the eager approach for a lazy (or at least non-strict) one. Instead of starting with the data and asking what we can do with it, we need to start with the question, and then ask how we can answer it from the data.
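As a minimal sketch of the difference, in Haskell (which is non-strict by default; the annotate function and the GATTACA motif below are hypothetical stand-ins): define the full analysis over all the reads, and let the question decide which parts ever get computed.

    import Data.List (isInfixOf)

    -- Hypothetical stand-in for some expensive per-read computation.
    annotate :: String -> (String, Int)
    annotate r = (r, length r)

    -- The "whole analysis", defined over every read, but lazily:
    -- reads that are irrelevant to the question are never annotated.
    answer :: [String] -> [(String, Int)]
    answer rs = [ annotate r | r <- rs, "GATTACA" `isInfixOf` r ]

    -- Demanding only the first result terminates even on an endless
    -- stream of reads; an eager pipeline would never get there.
    main :: IO ()
    main = print (take 1 (answer (cycle ["AACCGGTT", "AAGATTACA"])))

The point is not Haskell as such, but the evaluation order: the question drives the computation, instead of the data driving it.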

Why it’s not going to happen

Writing this, I realize I have said it all before. It is easy to blame the lack of progress on backwards grant agencies, biologist-trained management, stubborn old professors trapped in paradigms of the past, or other forms of force majeure. But there is a fundamental problem with this line of thought.

Although compute resources are quickly becoming scarce, another resource – the bioinformaticians – is even scarcer. And instead of churning through a slow, expensive, and error-prone stack of tools to generate a sequence of intermediate products (genome assembly, gene prediction, functional annotation, etc.), and then, finally, looking up the desired result in a table, I believe a lazy approach will need to tailor the methods to the question. In other words, scientists will have to write programs.
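As a hypothetical sketch of what that could look like (the file names and marker sequence are made up, and naive substring counting stands in for a real statistical test): a question like “is this candidate sequence a sex marker?” can be put directly to the raw reads, with no assembly, gene prediction, or annotation in between.

    import Data.List (isInfixOf)

    -- Count the reads containing a candidate marker sequence.
    countHits :: String -> [String] -> Int
    countHits marker = length . filter (marker `isInfixOf`)

    main :: IO ()
    main = do
      ms <- readFile "male_reads.txt"    -- assumed: one read per line
      fs <- readFile "female_reads.txt"  -- (made-up file names)
      let marker = "ACGTACGTACGT"        -- made-up candidate marker
      putStrLn ("male hits:   " ++ show (countHits marker (lines ms)))
      putStrLn ("female hits: " ++ show (countHits marker (lines fs)))

Thanks to lazy I/O, even this naive version streams through the reads rather than loading them into memory; the program is shaped by the question, not by a stack of intermediate products.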

Feedback? Please email ketil@malde.org.