Ketil's blog

The cost of fixing climate change

Sat, 24 Dec 2016 11:00:00 UT

We need to find alternative sources of energy, to replace our dependency on fossil fuels. There is a lot of talk about solar, about how economically profitable it has become, how all this free renewable energy is waiting to be harvested, and how the prices keep plummeting.

Being, I hope, of a somewhat economic mindset, I don't believe in the existence of hundred dollar bills on the pavement. The large solar plants I can find numbers for (e.g., Topaz) seem to cost about USD 2.4 billion to deliver slightly more than 1 TWh/year. And solar panels drop in cost, but obviously a construction site many square miles in size is going to be expensive, no matter the cost of materials.

Olkiluoto nuclear power station (photo from Wikimedia by Hannu Huovila)

Another issue with solar - recent price drops seem to be caused as much by moving production to the Far East as by anything else. This means leveraging cheaper labor, but perhaps even more so, leveraging cheap, subsidized power, mostly from coal. Currently, I suspect the solar industry consumes more power than it produces, meaning it currently accelerates climate change. (This will change when growth rates drop, but at 0.2% of our global energy coming from solar, this could be some years off.)

And large installations in the Californian desert are one thing, but up north where I live? There aren't many real installations, and thus not much in the way of numbers. There is a Swedish installation in Västerås claiming it will be able to produce power at about the same cost as Topaz. This would indeed be remarkable. Neuhardenberg in Germany cost €300 million (15% of Topaz), but only produces 20 GWh (2%), which seems remarkably low again. I find it hard to trust these numbers. Another comparison: the UK invests around US$5 billion per year in photovoltaics, and generated electricity seems to rise by 2-3 TWh per year. This points to $2/kWh/year, a Topaz-level ROI, which is actually rather impressive for a rainy country far to the north.

All of this ignores necessary infrastructure; we need a way to store energy from sunny days, and in the north, from the warm summer to freezing winters, where energy use for heating quadruples my electric bill. Yes, we can pump water uphill for hydroelectric backup, but this has a construction cost, an efficiency loss in the pumps and turbines, and, I think, an opportunity cost, since if you have the opportunity of building a hydroelectric plant, you might consider just doing that, and generate clean power without an expensive solar installation. The alternative backup power for those who aren't blessed with rainy, tall mountains, seems to be natural gas, which of course is a fossil fuel, and where the quick single-cycle plants are considerably less efficient.

There was an interesting, but perhaps overly pessimistic paper by Ferroni and Hopkirk calculating that taking everything into account, solar panels at latitudes like northern Europe would never return the energy invested in them. Others disagree, and in the end, it is mostly about what to include in "everything".

If you look at many total cost of power estimates, nuclear is often comparable to solar. I find this difficult to swallow. Again looking at actual construction costs, Olkiluoto may cost as much as €8 billion. But the contract price was €3 billion, and it is unclear to me if it was the contractor who messed up the estimates, or the later construction process - so it is hard to say what it would cost to build the next one. And in any case, the reactor will produce 15 TWh/year, so you get maybe thirteen Topaz'es worth of energy at two to four times the investment.
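The comparison above can be put into numbers with a small sketch. It only uses the figures quoted in this post; the Topaz output of 1.15 TWh/year is my assumption for "slightly more than 1 TWh/year", chosen so that one reactor matches about thirteen Topazes. Currencies are left unconverted, so treat the result as an order of magnitude rather than an exact ratio.

```python
# Numbers quoted in the post. TOPAZ_TWH_PER_YEAR is an assumption:
# "slightly more than 1 TWh/year", set so that 15 TWh is about thirteen Topazes.
TOPAZ_COST_BN_USD = 2.4
TOPAZ_TWH_PER_YEAR = 1.15
OLKILUOTO_COST_BN_EUR = (3.0, 8.0)  # contract price, possible final cost
OLKILUOTO_TWH_PER_YEAR = 15.0


def cost_per_annual_twh(cost_bn, twh_per_year):
    """Construction cost (in billions) per TWh of yearly production."""
    return cost_bn / twh_per_year


topaz = cost_per_annual_twh(TOPAZ_COST_BN_USD, TOPAZ_TWH_PER_YEAR)
olk_low, olk_high = (
    cost_per_annual_twh(cost, OLKILUOTO_TWH_PER_YEAR)
    for cost in OLKILUOTO_COST_BN_EUR
)
topaz_equivalents = OLKILUOTO_TWH_PER_YEAR / TOPAZ_TWH_PER_YEAR

print(f"Topaz:     {topaz:.2f} bn USD per TWh/year")
print(f"Olkiluoto: {olk_low:.2f} to {olk_high:.2f} bn EUR per TWh/year")
print(f"One reactor is worth about {topaz_equivalents:.0f} Topazes of energy")
```

Even at the pessimistic €8 billion, the investment per yearly TWh comes out several times lower for the reactor than for the solar plant.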

I think solar and nuclear make for a good comparison: both technologies are extremely expensive to build, but are then quite cheap to run. For nuclear, much of the variation in total cost estimates depends on discounting rates and decommissioning cost. Financing costs should be the same for solar, and although nobody talks about decommissioning, Topaz is 25km² of solar panels that must be disposed of, and that's not going to be free either. We can also consider that reactors can run for maybe fifty to seventy years, while solar panels are expected to last for 25 to 30.

In the end, both solar and nuclear can supply energy, but nuclear appears to be a lot cheaper.

* * *

If you look at what's happening in the world, the answer is: not much. Sure, we're building a couple of solar power plants, a few reactors, a handful of windmills. Is it simply too expensive to replace fossils?

I did some calculations. Using Olkiluoto as a baseline, a country like Poland could replace its 150TWh coal-fired electricity production for €30-80 billion. A lot of money, to be sure, but not outrageously so. (Of course, coal-fired electricity is just a fraction of total fossil use, think of transportation - but it's a start, and it's the low hanging fruit).

Another comparison: on the 'net, I find US total consumption of oil to be equivalent to something like 35 quadrillion BTUs. My calculations make this out to be 10 000 TWh, which somewhat fewer than 700 reactors would cover. If we assume we get better at this, and that after building the first hundred or so, we can do it for USD 3 billion each (about the original price tag), that comes to something over two trillion dollars.
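For those who want to check the arithmetic, here is a minimal sketch of the conversion. The BTU-to-joule factor is the standard one; the 15 TWh/year per reactor and the USD 3 billion price are the figures used in this post, and like the post, the sketch simply treats the energy in the oil as equal to the electricity needed to replace it.

```python
import math

JOULES_PER_BTU = 1055.06   # International Table BTU
JOULES_PER_TWH = 3.6e15    # 1 TWh = 10^12 Wh * 3600 J/Wh

US_OIL_BTU = 35e15             # "35 quadrillion BTUs", from the post
REACTOR_TWH_PER_YEAR = 15.0    # Olkiluoto-sized reactor, from the post
REACTOR_COST_BN_USD = 3.0      # optimistic price per reactor, from the post

# Convert the yearly oil consumption to TWh.
oil_twh = US_OIL_BTU * JOULES_PER_BTU / JOULES_PER_TWH

# Number of reactors needed to deliver that much energy per year.
reactors = math.ceil(oil_twh / REACTOR_TWH_PER_YEAR)

# Total construction cost in trillions of dollars.
total_cost_trillion_usd = reactors * REACTOR_COST_BN_USD / 1000

print(f"{oil_twh:.0f} TWh/year, {reactors} reactors, "
      f"USD {total_cost_trillion_usd:.2f} trillion")
```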

Which happens to be the current estimate for the cost of the Iraq war. Isn't it ironic? In order to attempt to secure a supply of oil (and by the way, how did that work out for you?), the US probably spent the same amount of money it would have taken to eliminate the dependence on oil, entirely and permanently. And which, as a byproduct, would have avoided thousands of deaths, several civil wars, millions of refugees, and -- almost forgot: helped to stop global warming. (The war in Afghanistan seems to have been slightly cheaper, but that was a war to eradicate extremism and terror, and although it seems hard to believe, it was even less successful than the war in Iraq. I can only look forward to our coming intervention in Syria. But I digress.)

* * *

Back to climate change. It isn't sufficient to produce enough alternative energy, of course; what we want is to stop the production of fossil fuels. In a global and free market economy, oil (like any other product) will be produced as long as it is profitable to do so. In other words, we must supply the alternative energy, not to satisfy our current consumption, but to drive the price of energy low enough that oil extraction no longer turns a profit.

Some oil fields are incredibly inexpensive to run, and it's highly unlikely that the price will ever drop so low that Saudis are going to stop scooping oil out of the dunes. But with recent oil prices above $100/barrel, many newer fields are expensive. Tar sands in Canada, fracking and shale oil, and recent exploration in the Arctic - I suspect these projects are not profitable, and they certainly have a high degree of financial risk.

Yet, people go ahead, and we just got some new licenses issued for the Barents sea. I'm not an analyst, but it's difficult for me to imagine that these will ever be profitable, and yet they go ahead. But similar to solar and nuclear, oil extraction requires a huge initial investment, and when the field is in production, keeping it producing will still be profitable. In Norway, the oil business is a mixture of government and private sector initiatives, and my bet is that companies are racing to get the investments in place. And either the public sector guarantees the investment, or the companies gamble on later bailouts - in either case, the people involved get to keep their jobs.

HP EliteBook Folio G1

Mon, 12 Sep 2016 10:00:00 UT

I got a new laptop the other day. My aging Thinkpad X220 was beginning to wear out, keys were falling off and the case was starting to crack here and there. Sometimes the system log would contain ominous messages about temperatures and such. So it was finally replaced with a new laptop, an HP EliteBook Folio G1.

EliteBook Folio G1 as presented by HP.

This is a Mac-a-like thin little thing, very shiny and sleek compared to the Thinkpad's rough and rugged exterior. They are both "business models", where the Thinkpad has a look that clearly means business, the HP has more the "executive" feel, and there is probably a matching tie pin and cuff links somewhere near the bottom of the very elaborate cardboard package.

I took care to order the model with the non-touch, full-HD display, and not the one with a 4K touch display. I don't really see the point of a touch display on a PC, and I'd rather have the lighter weight, less reflective screen, and longer battery life. And it's only 12 inches or so, how small do pixels need to get, really? Full HD is something like 180 dpi, I'm pretty sure it suffices.

Did I mention shiny? I was surprised, and somewhat disappointed to unpack it and discover that the display is indeed very reflective. It's nice and bright (especially compared to the totally mediocre display on the old Lenovo), but you still need to dress like Steve Jobs (i.e. black sweater) to see anything else than your own reflection. My suspicions were further aroused when I discovered the screen actually was a touch screen, after all. I then weighed the thing, and true enough, it is ten percent heavier than it should be.

After going back and checking the model number (which was correct) and the description at the seller (which was clearly wrong), it was made clear that although they had given the wrong specifications, a return was not possible. So that's that, I guess, I'm stuck with a heavier laptop with a less usable screen than I thought I would get. So be warned: this model has a reflective touch display, even though the specs don't say so.

The rest of the hardware is okay, for the most part. It's still fairly light at just a bit over one kg, and apart from its two USB-C ports, the only other connector is an audio jack. The chiclet keyboard is...not very good, compared to the Lenovo's, but I guess I'll learn to accept it. At least it's better than Apple's very shallow version. Oh, and I really like the fanless design. It's not a quiet PC - it's a silent one, and the only moving parts are the keyboard keys.

Apple may have more fanatic users, but they too don't much care for the Mac built-in keyboard.

Installing Linux

As an old-time Linux user, of course I wanted to install my favorite distribution. My previous computer had Debian installed, but I thought I'd try the latest Ubuntu for a (very small) change. Since I am particular in what software I use, and have little use for complex "desktop environments", my usual modus operandi is to install a minimal system -- meaning a Ubuntu server install -- and then pull whatever software I need.

So I copied the server install image to a USB stick (it took me ages of meddling with specialized and buggy USB stick creator software before I realized you can simply dump an ISO directly onto the stick), and it happily booted, installed, and rebooted. Only to discover that networking is not supported. Not the built-in wifi, nor the USB-brick's ethernet. Really, Ubuntu?

Microsoft Windows tries to "repair" my Linux install, but gave up after a long period of trying.

Disappointed, I decided to try Debian, which provides a 'netinst' image. Unfortunately, the computer's UEFI copy protection fascist control skynet terminator system kicked in, and Debian images are non grata. So I booted into the BIOS and meddled around with UEFI settings and "secure boot" to get this to work. For some reason, I needed to do this several times before it would take, and although I'm pretty sure the BIOS initially identified my USB stick by manufacturer and model, it now gave some generic description. It also wouldn't boot it directly, but there was a "boot from file" option I could use. But of course, installation failed here as well: network configuration didn't work.

Back to Ubuntu, the desktop image this time. And although I still needed to "boot from file" and dig around for an EFI image (or some such), it now managed to configure the network properly, and the install went pretty smoothly from there. I am now running Ubuntu 16.04, and wondering why the server image doesn't come with the same drivers as the desktop image. As long as the network is up, everything else can be fixed - and unless you are installing from a stack of DVDs, something I vaguely remember doing back in a previous century, without networking, you can't get anywhere.

Software

Not being a fan of desktop environments, I prefer to run the Xmonad window manager. After installing it, I can select it at login, and things work the way they should.

That is: almost.

Since I don't have a graphical tool to inspect or configure networking, I previously used a background daemon for this. Running in the background (as root, of course), it reads a config file where all my networks are configured. When it connects and authenticates to a network, I can then run a DHCP client to obtain IP and things like DNS server.

My initial attempt at this config failed, and I tried NetworkManager instead. That didn't work too well last time I tried, which I think is the primary reason I used the manual setup. The nmcli command connects with the NetworkManager daemon, and allows me to connect to networks with something like:

      nmcli dev wifi connect my_ssid password my_psk_key

NetworkManager stores this information itself, not unlike the old config file, and after adding things once, it switches automatically - and unlike my old setup, also between other network interfaces. It also proxies DNS queries, in a smart manner (as far as I can tell).

A new install is also a good time to revisit configurations. I had a setup where XMonad's terminal command would bring up an

    uxterm +sb -font '-*-fixed-medium-r-*-*-14-*-*-*-*-*-*-*'

but that font is way too small, and doesn't look very good. After some googling, I ended up with XTerm and specifically:

    xterm -fn 7x13 -fa 'Liberation Mono:size=12:antialias=false' +rv

which I think looks pretty good. (Not sure why I have the antialias option, changing it to 'true' doesn't appear to change things, and fonts look to be antialiased regardless)

Docking

I got a docking station as well. Unlike the old Lenovo, where you press the whole PC down in a specialized dock, the new one is just a smallish brick with a bunch of connectors. It hooks up to the PC with a single USB-C cable, which supplies both power, and...well, USB. Listing the USB devices shows several of them, including a virtual disk, an ethernet interface, and something called a DisplayLink.

    Bus 004 Device 003: ID 17e9:4354 DisplayLink

Ethernet and USB appear to work out of the box, but the external monitor doesn't show up, and 'xrandr' only displays the built-in display. Googling "displaylink" quickly brings one to http://www.displaylink.com/ which also has downloadable Linux drivers for Ubuntu. They are distributed as a self-extracting shell script, haven't seen one of those for a while. First I needed to install support for building kernel modules, and after that, the display link driver made the external monitor available to xrandr configuration - and rescaled my screen and cloned it to the external display. So far so good.

There is a downside, however. The external display introduces a noticeable lag, and the DisplayLink daemon often hogs the CPU quite a bit. I also think it incurs a cost to graphically intensive programs, but so far I only noticed that chromium appears more CPU-greedy, so I could be wrong. There's a bunch of "WARNING" dumps in syslog, in addition to messages like this:

    [107073.458870] evdi: [W] collapse_dirty_rects:102 Not enough space for clip rects! Rects will be collapsed

so I find it a bit hard to entirely trust this system. But at least it works.

Battery

Battery was supposed to be reasonable. Much of this depends on the CPU being able to enter sleep states. The obvious thing to do is to use 'powertop', go to the rightmost tab, and enable all kinds of hardware and device power saving. But userspace is also important, and I find that my web browser - typically with tens of tabs open on sites peppered with javascript and flash and whatnot - tends to keep the system very busy. Chromium consists of a giant cluster of threads, but as I start it from a terminal, I just bring it to the foreground and hit ctrl-Z.

Having done that, I recheck powertop, and apparently the wifi interface and the bluetooth keep interrupting the system. I don't think the function button for disabling wireless works, but unloading the iwlwifi module brings the power use down from 6W (about five hours) to 5.8W (five and a half)¹. Pausing the DisplayLink process brought it down to 4.4W (over seven hours), but then it bounced back to 5.4W (6 hours). Dimming the backlight gets it down to 4.2W - but, okay, a dark screen and a PC doing nothing - how useful is that? In the end, what matters is how long a flight I can take and still expect to be able to use my computer, and only experience will tell.

And one nice thing about USB-C is that with a universal standard, chances are someone nearby will have a compatible charger - much like for mobile phones these days. And another nice thing is that it is already possible to buy external battery packs with USB-C that will keep your computer juiced up even for really long haul flights.

In the end?

Some things are still not working (e.g. special keys to adjust display brightness or switch off wireless - strangely, the key to dim keyboard backlight works), and other things are funky or unpredictable (e.g. that you must use the "fn" key to use function keys F1 to F12, or the direction two-finger scroll works). But by and large, things work pretty well, really.

¹ And reinserting the iwlwifi module, the device pops back into existence, and after a few seconds, NetworkManager has me reconnected. Nice!

CAS-based generic data store

Wed, 03 Aug 2016 10:00:00 UT

Bioinformatics projects routinely generate terabytes of sequencing data, and the inevitable analysis that follows can easily increase this by an order of magnitude or more. Not everything is worth keeping, but in order to ensure reproducibility and to be able to reuse data in new projects, it is important to store what needs to be kept in a structured way.

I have previously described and implemented a generic data store, called medusa. Following the eXtreme Programming principle of always starting with the simplest implementation that could possibly work, the system was designed around a storage based on files and directories. This has worked reasonably well, and makes data discoverable and accessible both directly in the file system, and through web-based services providing browsing, metadata search (with free text and keyword based indexes), BLAST search, and so forth.

Here, I explore the concept of content addressable storage (CAS), which derives unique names for data objects from their content.

The CAS principle

The essence of any storage system is being able to store objects with some kind of key (or label, or ID), and being able to retrieve them based on the same key. What distinguishes a content addressable storage from other storage systems is that the key is generated from the entire data object, typically using a cryptographic hash function like MD5 or SHA1.

This means that a given object will always be stored under the same key, and that modifications to an object will also change its key, essentially creating a new object.

A layered model

Using CAS more clearly separates the storage model from the semantics of data sets. This gives us a layered architecture for the complete system, and services are implemented on top of these layers as independent and modular programs.

The object store

The object store is conceptually simple. It provides a simple interface that consists of the following primitive operations:

put: a data object into the store
list: the keys that refer to data objects
get: a data object using its key

The storage itself is completely oblivious to the actual contents of data objects, and it has no concept of hierarchy or other relationships between objects.
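To make the three primitives concrete, here is a minimal sketch of a directory-backed object store. This is not the actual medusa implementation (which is mostly shell scripts); the `ObjectStore` class, its layout of one file per key, and the choice of SHA1 are assumptions made for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


class ObjectStore:
    """Hypothetical content addressable store: one file per object, named by its SHA1."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data):
        """Store a data object and return its key (the hex SHA1 of the content)."""
        key = hashlib.sha1(data).hexdigest()
        path = self.root / key
        if path.exists():
            # Same key already present: either a duplicate or a hash collision.
            if path.read_bytes() != data:
                raise ValueError(f"hash collision on key {key}")
        else:
            path.write_bytes(data)
        return key

    def list(self):
        """Return the keys of all stored objects."""
        return sorted(p.name for p in self.root.iterdir())

    def get(self, key):
        """Retrieve a data object by its key."""
        return (self.root / key).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    store = ObjectStore(tmp)
    key = store.put(b">seq1\nACGT\n")
    same_key = store.put(b">seq1\nACGT\n")   # identical content, same key
    other_key = store.put(b">seq1\nACGA\n")  # modified content, new key
    keys = store.list()
    data = store.get(key)
```

Storing the same content twice yields the same key and a single file, while changing a single base gives a new key and a new object. The store never looks inside the data.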

Metadata semantics

When we organize data, we do of course want to include relationships between objects, and also between data objects and external entities and concepts. This is the province of metadata. Metadata semantics are provided by special metadata objects which live in the object store like any other objects. Each metadata object defines and describes a specific data set. As in former incarnations of the system, metadata is structured as XML documents, and provides information about (and the identity of) the data objects constituting the data set. It also describes the relationship between data sets, for instance allowing new versions to obsolete older ones.

The metadata objects are primarily free-form text objects, allowing users to include whatever information they deem relevant and important. The purpose of using XML is to make specific parts of the information computationally accessible, unambiguous, and standardized. For instance, structured references (i.e. specific XML elements) to data objects with their key allow automatic retrieval of the complete dataset. In addition to referencing objects in the object store, similar structures allow unambiguous references to external entities, for instance species, citation of scientific works, and uniform formatting of dates and geographic locations.

A command line interface to the metadata is provided through the `mdz` command, which allows a variety of operations on data sets, including listing, importing, exporting, and synchronizing with other repositories. In addition, the system implements a web-based front end to the data, as well as metadata indexing via xapian.

Data objects and services

As shown in the previous sections, the system can be conceptually divided in three levels: the object store, the metadata level, and the data semantic level. A service typically accesses data on one or more of these levels. For instance, a (hypothetical) service to ensure distributed redundancy may only need to access the object store, oblivious to the contents of the objects. Other services, like the (existing) functionality to import data sets, or transfer data sets between different servers, need to understand the metadata format. And even more specific services may also need to understand the format of data objects - e.g. the BLAST service scans metadata to find FASTA-formatted sequence data, and integrates them into its own database. The important principles that services adhere to are: 1) a service can ignore anything that is irrelevant to it, and 2) a service can reconstruct its entire state from the contents of the object store.

Discussion

CAS Advantages

Perhaps the primary advantage of using the hash value as the ID for data objects, is that it allows the system to be entirely distributed. The crucial advantage is that keys (by definition) are unique to the data. With user-selected keys, the user must somehow ensure the uniqueness of the key, and this requires a central authority or at the very least an agreed-upon naming scheme. In contrast, names for objects in CAS depend only on the contents, and the system can be implemented with no central oversight.

That keys depend on contents further means that data are immutable - storing a modified data object results in a different key. Immutability is central to reproducibility (you won't get the same results if you run your analysis with different data), and previously this was maintained by keeping a separate registry of metadata checksums, and also including checksums for data objects in the metadata. This made it possible to verify correctness (as long as the registry was available and correct). With CAS, this becomes even easier, since the checksum is the same as the name you use to retrieve the data object.

Another benefit is deduplication of data objects. Objects with the same contents will always be stored under the same key, so this is automatic. This also makes it easier to track files across renames (analyses tend to produce output files with generic names like "contigs.fasta", and it is often useful to give these files a more descriptive name); with CAS it becomes trivial to check whether a given file already exists in the storage.

Decoupling the data from a fixed filesystem layout introduces another level of abstraction, and this makes it easier to change the underlying storage model. In later years, key-value storage models have replaced relational databases in many applications, in particular where high scalability is more important than structured data. Consequently, we have seen a plethora of so-called "NoSQL" databases emerge, including CouchDB, Cassandra, and many others, which could be plugged in as an alternative back-end storage. Storage "in the cloud", like Amazon's S3 or Google's Cloud Storage are also good alternatives.

The added opacity makes it less likely (but still technically possible) for users with sufficient privileges to perform "illegal" operations on data (for instance, modification or removal).

Disadvantages

The implicit assumption for CAS is that different data objects hash to different hash values. In an absolute sense, this is trivially false (since there only exist 2¹⁶⁰ possible hash values, and an infinity of possible data objects). But it is true in a probabilistic sense, and we can calculate the probability of collisions from the birthday paradox. For practical purposes, any collision is extremely unlikely, and like the revision control system git (which also is CAS-based), collisions are checked for by the system, and can be dealt with manually if they should occur.
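To get a feeling for how unlikely, here is a small sketch of the birthday-paradox estimate, using the usual approximation p ≈ 1 - exp(-n(n-1)/2N) for n objects and N possible hash values. The figure of 10¹² stored objects is an arbitrary assumption, far beyond what any sequencing project is likely to store.

```python
import math

HASH_BITS = 160                 # SHA1 output size
HASH_SPACE = 2 ** HASH_BITS     # number of possible hash values


def collision_probability(n_objects, space=HASH_SPACE):
    """Birthday-paradox approximation of at least one collision among n objects."""
    exponent = n_objects * (n_objects - 1) / (2 * space)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


n = 10 ** 12  # assumed number of stored objects, for illustration only
p = collision_probability(n)
print(f"{n:.0e} objects: collision probability about {p:.2e}")
```

Even with a trillion objects, the estimate stays around 10⁻²⁵.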

Abstracting out the storage layer can be an advantage, but it also makes the system more opaque. And although the ability of humans to select misleading or confusing names can hardly be overestimated, even a poorly chosen name is usually more informative than the hexadecimal key representing a hash value.

Mixed blessings

Previous versions used a fixed directory structure, where each data set included a metadata file, and an arbitrary set of data files. Using a content addressable object store is more flexible, and there is nothing preventing the implementation of a parallel metadata scheme sharing the same data store, and even referring to the same data objects. One could also create metadata objects that refer to other metadata objects. As always, fewer restrictions also mean more opportunities for confusion and increased complexity.

Perhaps the most drastic change is how datasets can have their status changed - e.g. be marked as obsolete or invalid. Previously, metadata was versioned, meaning there could exist a (linear) sequence of metadata for the same dataset. This was enforced by convention only, and also required a central synchronization of metadata updates to avoid name and version collisions. Since the object store only allows the addition of new objects, and in particular, not modification, status updates can only be achieved by adding new objects. Metadata objects can refer to other datasets, and specify a context, for instance, a data set containing analysis results can specify being based on a data set containing input data for the analysis. Status changes are now implemented using this mechanism, and datasets can refer to other data sets as "invalidated" or "obsoleted".

Current Status and Availability

The system is currently working on my internal systems. It is based on standard components (mostly shell scripts), and although one might expect some rough edges, it should be fairly easy to deploy.

Do let me know if you are interested.

Why we should stop talking, and start to prepare for climate change

Mon, 23 May 2016 20:00:00 UT

The other day, I attended a meeting organized by my local University. Part of a series dealing with the Horizon 2020 themes, this one dealt with energy - and specifically, how we should replace our non-sustainable dependency on fossil fuels.

Professionally led by a well-known political journalist, it started with an introductory talk by a mathematician working with geothermal energy, specifically simulating fracturing of rock. Knowledge about the structure of cracks and fractures deep below can be used in the construction of geothermal energy plants - they produce power basically by pumping cold water down, and hot water up - so exploiting rock structure can make them more effective. It was an interesting talk, with a lot of geekish enthusiasm for the subject.

Then there was a panel of three; one politician, one solar panel evangelist-salesperson, and a geographer(?). And discussion ensued, everybody was talking about their favorite stuff on clean energy, and nobody really objected or criticized anything.

Which, I think, highlights the problem.

When they opened for questions from the public, the first one to raise her voice was a tall, enthusiastic lady in a red dress. She was a bit annoyed by all the talk about economy and things, and why don't we just fix this?

And she is right - we can. It's just a question of resources. I recently looked at the numbers for Poland, which is one of the big coal-users in Europe¹, producing about 150 TWh of electricity² per year from coal.

Using the (now rather infamous) Olkiluoto reactor as a baseline, the contract price for unit 3 was €3 billion (but will probably end up at 2-3 times that in reality). Unit 1 and 2 which are in operation have about the same capacity, and deliver about 15 TWh/year. So, depending on how you want to include cost overruns, we can replace all coal-based electricity production in Poland with ten Olkiluoto-sized reactors for €30-80 billion. (I think it is reasonable to assume that if you build ten, you will eventually learn to avoid overruns and get closer to the price tag. On the other hand, the contractor might not give you as favorable quotes today as they gave Finland.)

Similarly, the Topaz solar power plant in the Californian desert cost $2.4 billion to build, and delivers something above one TWh/year. Again, scaling up, we would need maybe 130 of these, at a total cost of about € 280 billion. (Granted, there are some additional challenges here, for instance, anybody going to Poland will immediately notice the lack of Californian deserts at low latitudes.³)

So yes: we can solve this. But we don't. I can see the economic argument - we're talking about major investments. But more importantly, the debate was almost entirely focused on the small stuff. The seller of solar panels was talking at length about how the government should improve the situation for people selling solar panels. The academics were talking about how the government should invest in more research. The journalist was talking about Vandana Shiva - whom I'm not going to discuss in any detail, except to notice that she is very good at generating headlines. The politician was talking about how he would work to fund all these good causes. And the topics drifted off, until at the end somebody from the audience brought up regulations of snow scooter use, apparently a matter of great concern to him personally, but hardly very relevant.

So these people, kind-spirited and idealistic as they are, are not part of the solution. Politicians and activists happily travel to their glorious meetings in Doha and Copenhagen, but they won't discuss shutting down Norwegian coal mines producing about two million tons of coal per year, corresponding to a full 10% of Norway's entire CO2 emissions. And unlike oil, which is a major source of income, this mine runs with huge losses -- last year, it had to be subsidized with more than € 50 million. Climate is important, but it turns out the jobs for the handful of people employed by this mine are more so. And thus realpolitik trumps idealism. Sic transit gloria mundi.

Subsidized by well-meaning politicians and pushed by PR-conscious business managers, we'll get a handful of solar panels on a handful of buildings. That their contribution almost certainly is as negative for the climate as it is for the economy, doesn't matter. We'll get some academic programs, which as always will support research into whatever can be twisted into sounding policy-compliant. And everything else continues on its old trajectory.

Poland is the second largest coal consumer in Europe. Interestingly, the reason it is number two is that Germany is number one. And, ironically, the panel would often point to Germany as an illustration of successful subsidies and policies favoring renewable energy.↩
Note that electricity is only a small part of total energy. When people talk about electricity generation, it is usually to make their favorite technology look better than it is. It sounds better to say that solar power produces 1% of global electricity, than 0.2% of global energy, doesn't it?↩
As far as I can find, the largest solar park in Scandinavia is in Västerås. This is estimated to deliver 1.2GWh from 7000m² of photovoltaic panels over a 4.5 ha area. Compared to Topaz's 25 km², that's slightly less than 0.2% of the size and 0.1% of the power output. At SEK 20M, it's also about 0.1% of the cost, which is surprisingly inexpensive. But these numbers seem to be from the project itself, which at the same time claims the power suffices for "400 apartments". In my apartment, 3000kWh is just one or two winter months, which makes me a bit suspicious about the rest of the calculations. Another comparison could be Neuhardenberg, at slightly less than € 300 million and 145MWp capacity, but which apparently only translates to 20GWh(?). If that is indeed correct, Poland would need seven thousand of those, at a € 2100 billion price tag.↩

Probabilities for heterozygote genetic markers in hybrids

Thu, 10 Mar 2016 20:00:00 UT

Nature is nothing if not flexible, and when the environment changes, plants and animals will try their best to adapt to new conditions. Migration can be one such adaptation, and one which sometimes brings previously separate populations into contact with each other. For example, we have seen recent examples of Antarctic minke whales migrating all the way to the North Atlantic to breed with the whales native to that region. We don't know precisely what drives this, but one likely candidate is changes in the ecosystem, perhaps caused by global warming.

In order to monitor this migration, we would like to use genetic markers to identify migrants and their offspring. Ideally, we want markers that are fully diagnostic, that is, they always give one value (or allele) in one population, and always a different value in the other. In practice, even for quite good markers there are often some occurrences of the foreign allele, and even if the marker appears fully diagnostic, we can't say for sure, since our testing will always be limited by our sampling. Usually, the best we can do is to quantify the allele frequencies as confidence intervals.

In order to classify hybrids, we need to know the probability for the different combinations of alleles (genotypes) in the various types of hybrids. To simplify the analysis, we limit our interest to the case where a migrant enters a native population and interbreeds with it, and where the offspring (referred to as the F1 generation) continues to interbreed with the native population, resulting in new generations (F2, F3, and so on) of back-crossed hybrids.

Assuming fully diagnostic markers

As mentioned, a marker is fully diagnostic if no allele occurs in both populations. If we restrict analysis to single nucleotide polymorphisms (SNPs), where there are two possible alleles per marker, all non-hybrid individuals are homozygote.

An F1 hybrid by necessity inherits one allele from each population, and thus it is always heterozygote. An F2 back-cross inherits one allele from the native population, and one from the F1 hybrid. The probability of heterozygosity is therefore the probability of inheriting the foreign allele (a) from the hybrid, i.e. 0.5. Similarly, the probability of homozygosity is the probability of inheriting the native allele (A), also 0.5. In general, the probability of retaining the foreign allele is halved for each subsequent back-cross.

If we label the native allele A, and the foreign allele a, we can list the probabilities for the different genotypes, as seen in Table diagnostic below.

Genotype probabilities with fully diagnostic markers for increasing generations of back-crossed hybrids.
BC Gen	P(AA)	P(Aa)	P(aa)
migrant	0	0	1
native	1	0	0
F1	0	1	0
F2	0.5	0.5	0
F3	0.75	0.25	0
:	:	:	:
Fn	1 − 2^(1−n)	2^(1−n)	0
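
The general row of the table can be checked with a few lines of code. This is only a minimal sketch in Python (the post itself contains no code, and the function name `backcross_genotype_probs` is made up for illustration); it uses exact fractions so that the halving per generation is visible.

```python
from fractions import Fraction


def backcross_genotype_probs(n):
    """Return (P(AA), P(Aa), P(aa)) for an Fn back-cross hybrid (n >= 1).

    Assumes a fully diagnostic marker and repeated back-crossing into
    the native population, as in the table above.
    """
    if n < 1:
        raise ValueError("generation n must be at least 1")
    # The foreign allele is retained with probability 2^(1-n).
    p_het = Fraction(1, 2 ** (n - 1))
    return (1 - p_het, p_het, Fraction(0))


# Print the first few generations for comparison with the table.
for n in range(1, 5):
    print(f"F{n}", *backcross_genotype_probs(n))
```

The migrant and native rows are not covered by the function, since they are not back-cross generations.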

Arbitrary allele frequencies

Although fully diagnostic markers are the ideal case, in practice the foreign allele often occurs in the native population, and vice versa. In any case, with limited testing we cannot ascertain that the markers are fully diagnostic; at best, we can give a confidence interval for the minor allele frequency.

To address this we let A and a no longer represent the actual allele values, but instead the allele origin. In other words, A means an allele inherited from the native population, and a represents an allele with foreign origin, inherited from its migrant ancestor. We see that we can then use Table diagnostic to determine the probability of a back-cross having two alleles from the native population, or retaining one allele from its migrant forebear.

Definition of allele frequencies in the two populations.
	B	b
native	p_n	q_n
foreign	q_f	p_f

Now, we turn to the actual allele values. Let the alleles be labelled B and b, and allele frequencies defined as in Table freqs.

The probability of the two cases of allele heritage, and the associated probabilities for the possible genotypes.
Case	Probability	Genotype BB	Genotype Bb	Genotype bb
Two native alleles	1 − 2^(1−n)	p_n²	2p_nq_n	q_n²
One native, one foreign	2^(1−n)	p_nq_f	p_np_f + q_nq_f	q_np_f

Table probs combines the allele frequencies with Table diagnostic to calculate the probability for the different genotypes. From this, we see that the probability of a heterozygote (genotype Bb) in an Fn hybrid is:

P(Bb | Fn) = (1 − 2^(1−n)) · 2p_nq_n + 2^(1−n) · (p_np_f + q_nq_f)    (1)

Now assume that both populations have the same minor allele frequency, that is, p_n = p_f = p and q_n = q_f = q = 1 − p. From the relationship p = 1 − q, it follows that p² + q² = p² + (1 − p)² = 2p² − 2p + 1 = 1 − 2pq, and we get:

P(Bb | Fn) = (1 − 2^(1−n)) · 2pq + 2^(1−n) · (1 − 2pq) = 2pq + 2^(1−n) · (1 − 4pq)

We observe here that if p = 1, the probability of a heterozygote is 2^(1−n), as in Table diagnostic, and as n increases, the heterozygote probability converges to 2pq, the heterozygote probability in the native population. The table below gives the heterozygote probabilities for back-cross generations F1 to F10 under various minor allele frequencies.

Heterozygote probabilities by back-cross generation (rows) and minor allele frequency, MAF (columns).
Gen	0.1	0.05	0.025	0.01	0
1	0.820	0.905	0.951	0.980	1.000
2	0.500	0.500	0.500	0.500	0.500
3	0.340	0.298	0.274	0.260	0.250
4	0.260	0.196	0.162	0.140	0.125
5	0.220	0.146	0.105	0.080	0.063
6	0.200	0.120	0.077	0.050	0.031
7	0.190	0.108	0.063	0.035	0.016
8	0.185	0.101	0.056	0.027	0.008
9	0.183	0.098	0.052	0.024	0.004
10	0.181	0.097	0.051	0.022	0.002
native	0.180	0.095	0.049	0.020	0.000
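
The numbers in this table can be recomputed from the two cases of allele heritage (two native alleles, or one native and one foreign). The following is a rough Python sketch, not the code used to produce the table; the function name and argument names are my own, and `p_native` and `p_foreign` stand for p_n and p_f as defined in the table of allele frequencies.

```python
def heterozygote_prob(n, p_native, p_foreign):
    """Probability of genotype Bb in an Fn back-cross hybrid.

    p_native is the frequency of B in the native population (p_n),
    p_foreign is the frequency of b in the foreign population (p_f).
    """
    q_native = 1 - p_native
    q_foreign = 1 - p_foreign
    one_foreign = 2.0 ** (1 - n)  # probability of retaining a foreign allele
    het_two_native = 2 * p_native * q_native
    het_one_foreign = p_native * p_foreign + q_native * q_foreign
    return (1 - one_foreign) * het_two_native + one_foreign * het_one_foreign


# Same minor allele frequency in both populations, as in the table.
for maf in (0.1, 0.05, 0.025, 0.01, 0.0):
    row = [heterozygote_prob(n, 1 - maf, 1 - maf) for n in range(1, 11)]
    print(maf, " ".join(f"{x:.3f}" for x in row))
```

Printed values can differ from the table in the last digit where a probability falls exactly on a rounding boundary. Setting `p_native` to 1 gives the case where the marker is fixed in the native population.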

When q is small, we can make the following approximation by ignoring q² (so that pq = q − q² ≈ q):

P(Bb | Fn) ≈ 2q + 2^(1−n) · (1 − 4q)

Fixed markers in the native population

For the minke whale, we tested fifty markers on about 400 specimens. There are some cases of non-zero minor allele frequency in the Antarctic population, but the common minke population (Atlantic and Pacific) appears to be entirely homozygote for these markers. One possible explanation can be that the larger Antarctic population allows the maintenance of a wider genetic diversity. Since we are primarily interested in the introgression of (foreign) Antarctic minke into (native) common minke populations, we might assume that p_n = 1 and q_n = 0. In that case, we can simplify (1) and the probability of heterozygotes becomes:

P(Bb | Fn) = 2^(1−n) · p_f

Previous literature

As usual, I don't check the literature too closely before putting pen to paper. But Anderson and Thompson give an overview of various methods, and are probably a good starting point. Many of the methods mentioned depend on fully diagnostic markers, and many apply to a limited number of generations. Some methods attempt to identify hybrids without known allele frequencies in the native populations -- this is an implicit requirement in our analysis above. We have also used more-or-less standard classification methods (e.g., programs like Structure and Geneclass), but as I understand it, these only look at allele frequencies, and don't take into account the special distribution of genotypes (i.e., increased number of heterozygotes) that is particular to hybrids.

Acknowledgments

Thanks to Hans J. Skaug for helping out with the math. I am still to blame for any remaining errors, of course - if you find any, I appreciate being made aware of them.

Can you trust science?

Wed, 25 Mar 2015 20:00:00 UT

Hardly a week goes by without newspapers writing about new and exciting results from science. Perhaps scientists have discovered a new wonderful drug for cancer treatment, or maybe they have found a physiological cause for CFS. Or perhaps this time they finally proved that homeopathy works? And in spite of these bold announcements, we still don't seem to have cured cancer. Science is supposed to be the method which enables us to answer questions about how the world works, but one could be forgiven for wondering whether it, in fact, works at all.

As my latest contribution to my local journal club, I presented a paper by Ioannidis, titled Why most published research findings are false ¹. This created something of a stir when it was published in 2005, because it points out some simple mathematical reasons why science isn't as accurate as we would like to believe.

The ubiquitous p-value

Science is about finding out what is true. For instance, is there a relationship between treatment with some drug and the progress of some disease - or is there not? There are several ways to go about finding out, but in essence, it boils down to making some measurements, and doing some statistical calculations. Usually, the result will be reported along with a p-value, which is a by-product of the statistical calculations saying something about how certain we are of the results.

Specifically, if we claim there is a relationship, the associated p-value is the probability we would make such a claim even if there is no relationship in reality.

We would like this probability to be low, of course, and since we usually are free to select the p-value threshold, it is usually chosen to be 0.05 (or 0.01), meaning that if the claim is false, we will only accept it 5% (or 1%) of the time.

The positive predictive value

Now, the p-value is often interpreted as the probability of our (positive) claim being wrong. This is incorrect! There is a subtle difference here, which it is important to be aware of. What you must realize, is that the probability α relies on the assumption that the hypothesis is wrong - which may or may not be true, we don't know (which is precisely why we want to find out).

The probability of a positive claim being correct after the fact is called the positive predictive value (PPV); the probability of it being wrong is one minus the PPV. In order to say something about this, we also need to take into account the probability of claiming there exists a relationship when the claim is true. Our methods aren't perfect, and even if a claim is true, we might not have sufficient evidence to say for sure.

So, let's take one step back and look at our options. Our hypothesis (e.g., drug X works against disease Y) can be true or false. In either case, our experiment and analysis can lead us to reject or accept it with some probability. This gives us the following 2-by-2 table:

	True	False
Accept	1-β	α
Reject	β	1-α

Here, α is the probability of accepting a false relationship by accident (i.e., the p-value threshold), and β is the probability of missing a true relationship -- we reject a hypothesis, even when it is true.

To see why β matters, consider a hypothetical really really poor method, which has no chance of identifying a true relationship, in other words, β = 1. Then, every accepted hypothesis must come from the False column, as long as α is at all positive. Even if the p-value threshold only accepts 1 in 20 false relationships, that's all you will get, and as such, they constitute 100% of the accepted relationships.

But looking at β is not sufficient either. Let's say a team of researchers test hundreds of hypotheses, which all happen to be false? Then again, some of them will get accepted anyway (sneaking in under the p-value threshold α), and since there are no hypotheses in the True column, again every positive claim is false.

A β of 1 or a field of research with 100% false hypotheses are extreme cases², and in reality, things are not quite so terrible. The Economist had a good article with a nice illustration showing how this might work in practice with more reasonable numbers. It should still be clear that the ratio of true to false hypotheses being tested, as well as the power of the analysis to identify true hypotheses are important factors. And if these numbers approach their limits, things can get quite bad enough.
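
To make the interplay between α, β and the share of true hypotheses concrete, here is a small Python sketch of the PPV computed from the 2-by-2 table. The function and parameter names are mine, and the example numbers (one in ten tested hypotheses true, α = 0.05, β = 0.2) are illustrative assumptions rather than figures from any particular study.

```python
def positive_predictive_value(prior, alpha, beta):
    """Fraction of accepted hypotheses that are actually true.

    prior: fraction of tested hypotheses that are true
    alpha: probability of accepting a false hypothesis
    beta:  probability of rejecting a true hypothesis
    """
    true_positives = (1 - beta) * prior
    false_positives = alpha * (1 - prior)
    accepted = true_positives + false_positives
    if accepted == 0:
        raise ValueError("no hypothesis is ever accepted with these settings")
    return true_positives / accepted


# Illustrative numbers: 10% of hypotheses true, alpha = 0.05, beta = 0.2.
print(positive_predictive_value(prior=0.1, alpha=0.05, beta=0.2))  # roughly 0.64

# The two extreme cases from the text: a powerless method, and a field
# where every tested hypothesis is false. Both give a PPV of zero.
print(positive_predictive_value(prior=0.1, alpha=0.05, beta=1.0))
print(positive_predictive_value(prior=0.0, alpha=0.05, beta=0.2))
```

Even with a respectable α and β, about a third of the positive claims in the first example are false, simply because false hypotheses outnumber true ones nine to one.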

More elaborate models

Other factors also influence the PPV. Try as we might to be objective, scientists often try hard to find a relationship -- that's what you can publish, after all³. Perhaps in combination with a less firm grasp of statistics than one could wish for (and scientists who think they know enough statistics are few and far between - I'm certainly no exception there), this introduces bias towards acceptance.

Multiple teams pursuing the same challenges in a hot and rapidly developing field also decrease the chance of results being correct, and there's a whole cottage industry of scientists reporting spectacular and surprising results in high-ranking journals, followed by a trickle of failures to replicate.

Solving this

One option is to be stricter - this is the default when you do multiple hypothesis testing: you require a lower p-value threshold in order to reduce α. The problem with this is that if you are stricter with what you accept as true, you will also reject more actually true hypotheses. In other words, you can reduce α, but only at the cost of increasing β.

On the other hand, you can reduce β by running a larger experiment. One obvious problem with this is cost: for many problems, a cohort of a hundred thousand or more is necessary, and not everybody can afford to run that kind of study. Perhaps even worse, a large cohort means that almost any systematic difference will be found significant. Biases that normally are negligible will show up as glowing bonfires in your data.
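
The trade-off between α, β and cohort size can be illustrated numerically. The sketch below is an assumption-laden toy, not a general power calculation: it assumes a one-sided z-test with known unit variance and a standardized effect size, and the function name `type_ii_error` is made up. It uses only `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect_size, n):
    """Beta for a one-sided z-test with known unit variance.

    alpha:       significance threshold
    effect_size: true mean shift in standard deviations (an assumption)
    n:           number of samples
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold for the z statistic
    # Under the alternative, the z statistic is shifted by effect_size * sqrt(n).
    return std_normal.cdf(critical - effect_size * n ** 0.5)


# A stricter alpha gives a larger beta for the same experiment...
for alpha in (0.05, 0.01, 0.001):
    print("alpha", alpha, "beta", round(type_ii_error(alpha, 0.2, 50), 3))

# ...while a larger cohort brings beta back down.
for n in (50, 200, 800):
    print("n", n, "beta", round(type_ii_error(0.05, 0.2, n), 3))
```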

In practice?

Modern biology has changed a lot in recent years, and today we are routinely using high-throughput methods to test the expression of tens of thousands of genes, or the value of hundreds of thousands of genetic markers.

In other words, we simultaneously test an extreme number of hypotheses, where we expect a vast majority of them to be false, and in many cases, the effect size and the cohort are both small. It's often a new and exciting field, and we usually strive to use the latest version of the latest technology, always looking for new and improved analysis tools.

To put it bluntly, it is extremely unlikely that any result from this kind of study will be correct. Some people will claim these methods are still good for "hypothesis generation", but Ioannidis shows a hypothetical example where a positive result increases the likelihood that a hypothesis is correct by 50%. This doesn't sound so bad, perhaps, but in reality, the likelihood is only improved from 1 in 10000 to 1 in 7000 or so. I guess three thousand fewer trials to run in the lab is something, but you're still going to spend the rest of your life running the remaining ones.

You might expect scientists to be on guard for this kind of thing, and I think most scientists will claim they desire to publish correct results. But what counts for your career is publications and citations, and incorrect results are no less publishable than correct ones - and might even get cited more, as people fail to replicate them. And as you climb the academic ladder, publications in high-ranking journals are what count, and for that you need spectacular results. And it is much easier to get spectacular incorrect results than spectacular correct ones. So the academic system rewards and encourages bad science.

Consequences

The bottom line is to be skeptical of any reported scientific results. The ability of the experiment and analysis to discover true relationships is critical, and one should always ask what the effect size is, and what the statistical power -- the probability of detecting a real effect -- is.

In addition, the prior probability of the hypothesis being true is crucial. Apparently-solid, empirical evidence of people getting cancer from cell phone radiation, or working homeopathic treatment of disease can almost be dismissed out of hand - there simply is no probable explanation for how that would work.

A third thing to look out for, is how well studied a problem is, and how the results add up. For health effects of GMO foods, there is a large body of scientific publications, and an overwhelming majority of them find no ill effects. If this was really dangerous, wouldn't some of these investigations show it conclusively? For other things, like the decline of honey bees, or the cause of CFS, there is a large body of contradictory material. Again - if there was a simple explanation, wouldn't we know it by now?

And since you ask: No, the irony of substantiating this claim with a scientific paper is not lost on me.↩
Actually, I would suggest that research in paranormal phenomena is such a field. They still manage to publish rigorous scientific works, see this Less Wrong article for a really interesting take.↩
I think the problem is not so much that you can't publish a result claiming no effect, but that you can rarely claim it with any confidence. Most likely, you just didn't design your study well enough to tell.↩

Thoughts on phylogenetic trees

Sun, 15 Feb 2015 20:00:00 UT

An important aspect of studying evolution is the construction of phylogenetic trees, graphically representing the relationship between current and historic species. These trees are usually calculated based on similarities and differences between genetic material of current species, and one particular challenge is that the topology of the resulting trees depends on the selection of genes used to construct them. Quite often, the species tree based on one set of genes differs substantially from the tree based on another set of genes.

The phylogenetic tree is usually presented as a simple tree of species. The end points of branches at the bottom of the tree (leaves) represent current species, and branching points higher up (internal nodes) represent the most recent common ancestor, or MRCA, for the species below it.

A very simple example could look something like this:

Evolution as a tree of species

Here you have two current species, and you can trace back their lineage to a MRCA, and further back to some ancient origin. Varying colors indicate that gradual change along the branches has introduced differences, and that the current species now have diverged from each other, and their origin.

This representation has the advantage of being nice and simple, and the disadvantage of being misleading. For instance, one might get the impression that a species is a well-defined concept, and ask questions like: when the first member of a species diverged from its ancestor species, how did it find a mate?

Species are populations

But we are talking about species here - that is, not individuals but populations of individuals. So a more accurate representation might look like this:

Evolution as a tree of populations

Circles now represent individuals, and it should perhaps be clearer that there is no such thing as the “first” of anything. At the separation point, there is no difference between the two populations, and it is only after a long period of separation that differences can arise. (Of course, if there is selective pressure favoring specific properties - perhaps redness is very disadvantageous for species B, for instance - this change will be much quicker. Look at how quickly we have gotten very different breeds of dogs by keeping populations artificially separate, and selecting for specific properties.)

The so-called “speciation” is nothing more than a population being split into two separate parts. Typically, the split is geographical - a few animals being carried to Madagascar on pieces of driftwood - but anything that prevents members of one branch from mating with members of the other one will suffice. At the outset, the two branches are just indistinguishable subpopulations of the same species, but if the process goes on long enough, differences between the two populations can become large enough that they can no longer interbreed, and we can consider them different species.

In practice, such a separation is often not complete, and some individuals can stray between the groups. In that case, speciation is less likely to happen, since the property of being unable to breed with the other group represents a reproductive disadvantage, and it would therefore be selected against. In other words, if your neighbor is able to have children with more members of the population than you, his chances of having children are better than yours. Your genes get the short end of the stick. Kind of obvious, no?

Populations consist of genes

But we can also view evolution as working on populations, not of individuals, but of individual genes. This is illustrated in the next picture:

Evolution as a tree of gene populations

The colored circles now represent genes, and an individual is here just a sample from the population of genes - illustrated by circling three gene circles. (Note that by “gene”, we here mean an abstract unit of inheritance. In other fields of biology, the word might be synonymous with a genetic region that codes for a protein, or is transcribed to (possibly non-coding) RNA.)

Here, we see that although the genes themselves do not change (in reality they are subject to mutations), the availability of the different genes varies over time, and some might disappear from one of the branches entirely - like red genes from species B here. This kind of genetic drift can still cause distinct changes in individuals.

Ancestry of individual genes

Each individual typically gets half its genes from each parent, one fourth from each grandparent, and so on, so after a few generations, all genes come from essentially different ancestors. This means you can calculate the MRCA for each gene individually, and this is exactly what has been done to estimate the age of our “mitochondrial Eve” and “Y-chromosomal Adam”. Here is the example lineage for the green gene:

Lineage and MRCA for the green gene

We see that the green-gene MRCA is much older than the speciation event. In addition, each gene has its unique history. This means that when we try to compute the MRCA, different genes will give different answers, and it can be difficult to construct a sensible consensus.

For example, bits of our genome appear to come from the Neanderthal, and those bits will have a MRCA that predates the time point where Neanderthal branched from H. sapiens (possibly 1-2 million years ago). (Interestingly, “Adam” and “Eve” are both estimated to have lived in the neighborhood of 200000 years ago. This means that although 20% of the Neanderthal genome is claimed to have survived, all Neanderthal mitochondria and Y-chromosomes have been eradicated.)

Information content and allele frequency difference

Thu, 17 Jul 2014 12:00:00 UT

Just a quick note on the relationship between ESI scores and allele frequencies. Allele frequency differences are of course related to – perhaps even the definition of – diversification, but the information we gain from observing an allele also depends on the specific allele frequencies involved. The graph below shows how this is related.

Each line represents a fixed allele difference, from 0.05 at the bottom, to 0.95 at the top, and the x-axis is the average allele frequency between populations. We see that for small differences, the actual frequencies matter little, but for moderate to large allele differences, allele frequencies near the extremes have a large effect.

Note that this is information per allele, and thus not ESI (which is the expected information from observing the site, in other words a weighted average over all alleles).

Expected site information from SNPs

Wed, 02 Jul 2014 08:00:00 UT

Lately, I’ve been working on selecting SNPs, where the main goal is often to classify individuals as belonging to some specific population. For instance, we might like to genotype a salmon to see if it is from the local population or an escapee from a sea farm, or perhaps a migrant from a neighboring river? And if it’s an escapee, we might want to know which farm it escaped from. In short, we want to find SNPs that are diagnostic.

Typically, this is done by sequencing pools of individuals, mapping the reads to the reference genome, identifying variant positions, and ranking them - typically using F_ST, sometimes also using p-values for the confidence in an actual allele difference, and maybe filtering on sequencing coverage and base- or mapping quality. However, F_ST really isn’t a suitable tool for this purpose. I’m therefore proposing the following. Let me know if it makes sense or not.

Expected Site Information

For a diagnostic SNP, what we really would like to know is the amount of information observing each site contributes. Using Bayes' theorem, observing an allele a in some individual N gives us the following posterior probability for N belonging to some population A, where the allele frequency, P(a∣A), is known:

P(A|a) = P(a|A)P(A)/P(a)

Here, P(A) is our prior probability of N belonging to A, which after observing a is modified by a factor of

P(a|A)/P(a)

In order to assign N to one of several populations (either A or B, say), we are interested in the relative probabilities for the two hypotheses. In other words, we would like the odds for N belonging to one population or the other. Given the probabilities P(a∣A) and P(a∣B), and initial odds P(A)/P(B), we get

P(A|a)/P(B|a) = [P(a|A)P(A)/P(a)]/[P(a|B)P(B)/P(a)]

Canceling out P(a), we find that the prior odds are modified by:

P(a|A)/P(a|B)

That is, the ratio of this allele’s frequencies in each of the populations. For practical reasons, it is common to take the logarithm of the odds. This gives us scores that are additive and symmetric (so that switching the two populations gives us the same score with the opposite sign). Specifically, base two logarithms will give us the score in bits.
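
As a small illustration of the log odds score, here is a Python sketch; the function name `allele_score_bits` and the example frequencies are made up for the purpose. It only restates the ratio above in code and shows the two properties just mentioned.

```python
from math import log2


def allele_score_bits(freq_in_a, freq_in_b):
    """Log odds (in bits) in favor of population A after observing an allele.

    freq_in_a and freq_in_b are the allele's frequencies P(a|A) and P(a|B);
    both must be strictly positive.
    """
    return log2(freq_in_a / freq_in_b)


# An allele four times as common in A as in B is worth about two bits for A.
score = allele_score_bits(0.8, 0.2)

# Symmetry: swapping the populations flips the sign of the score.
swapped = allele_score_bits(0.2, 0.8)

# Additivity: scores from independent observations are simply summed.
total = allele_score_bits(0.8, 0.2) + allele_score_bits(0.6, 0.3)
print(score, swapped, total)
```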

When observing a site, we may of course also encounter the alternative allele. By the same reasoning as above, we find that this allele modifies the odds by

[1-P(a|A)]/[1-P(a|B)]

Lacking any prior information, we can consider each population equally likely, and the likelihood of observing a particular allele is the average of the likelihood in each population. The information gain from each possible allele is then averaged, weighted by this average likelihood. For a biallelic site with major allele frequencies p and q (and consequently, minor allele frequencies of 1 − p and 1 − q) in the two populations, the expected added information from the site then becomes:

I(p,q) = |(p+q)/2 log_2(p/q)| + |(1-(p+q)/2)log_2((1-p)/(1-q)) |

Note that we are here only interested in the amount of information gained, regardless of which hypothesis it favors, and thus we take the absolute values. For a site with multiple alleles enumerated by i and with frequency vectors p and q in the two populations, this generalizes to the weighted sum of log₂(p_i/q_i).

Unlike measures like F_ST, I is additive (assuming independence between sites), so the information gained from observing multiple sites is readily calculated. Knowing the information gained from each site, we will also be able to compare different sets of sites, e.g., compare the value of a single site with minor allele frequencies (MAF) of, say, 0.1 and 0.3 to two sites with MAF of 0.2 and 0.3.

It may also be instructive to compare this procedure to sequence alignment and position specific score matrices (PSSMs). In sequence alignment, a sequence of nucleotides or amino acids is scored by comparing its match to a target sequence to its match to some base model using log odds scores. The base model to compare against is often implicit (typically using sequences of random composition), but more elaborate models are also possible. Similarly, position specific frequency matrices are often converted to position specific score matrices using log odds. Calculating the information value from a set of observed alleles is then analogous to scoring an “alignment” of the set of observed alleles to two different sets of allele frequencies.
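
The comparison mentioned here can be tried out directly. Below is a minimal Python sketch of the biallelic I(p, q) formula; it is not taken from varan, the function name is my own, and it assumes both frequencies lie strictly between 0 and 1 and refer to the same allele in the two populations.

```python
from math import log2


def expected_site_information(p, q):
    """Expected information (bits) from observing a biallelic site.

    p and q are the frequencies of the same allele in the two populations,
    each strictly between 0 and 1.
    """
    mean = (p + q) / 2  # average likelihood of observing the first allele
    first = abs(mean * log2(p / q))
    second = abs((1 - mean) * log2((1 - p) / (1 - q)))
    return first + second


# One site with MAF 0.1 and 0.3 (major allele frequencies 0.9 and 0.7)...
single = expected_site_information(0.9, 0.7)

# ...against two independent sites, each with MAF 0.2 and 0.3.
pair = 2 * expected_site_information(0.8, 0.7)

print(round(single, 3), round(pair, 3))
```

With these particular numbers, the single site comes out slightly ahead of the pair (roughly 0.61 against 0.58 bits), which is the kind of comparison that is hard to make from F_ST values.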

Allele frequency confidence intervals

In order to apply the above method in practice, we need to measure the allele frequencies in the population. This is problematic for two reasons. First, we do not have precise knowledge of the allele frequencies, we can only estimate them from our sequenced sample. This introduces sampling bias. Second, the sequencing process introduces additional artifacts. For instance, sequencing errors often result in substitutions, which are observed as apparent alleles. In addition, sequences can be incorrectly mapped, contain contamination, the reference genome can contain collapsed repeats, and the chemistry of the sequencing process is usually also biased – for instance, coverage is often biased by GC content. These artifacts often give the false appearance of variant positions.

One challenge with calculating site information from sequencing data (as opposed to using allele frequencies directly), is that such errors in the data can vastly overestimate the information content. For instance, an allele that appears to be fixed in one population means that any other observed allele will assign the individual to the alternative population - regardless of any other alleles. It is easy to see that an allele frequency of zero results in the odds going either to zero or infinity, and thus the log odds will go to either positive or negative infinity.

For diagnostic SNP discovery, it is more important to ensure that identified SNPs are informative, than to precisely estimate the information content. Thus, we take a conservative approach and use upper and lower limits for the allele frequencies by calculating confidence intervals using the Agresti-Coull method. In addition, the limits are also adjusted by a factor ε, corresponding to sequencing error rate.
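
For reference, the Agresti-Coull interval is easy to compute. The following Python sketch shows the standard textbook form of the interval only; it is not the varan implementation, the function name is made up, and the additional ε adjustment is left out because its exact form is not given here.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_interval(successes, trials, confidence=0.95):
    """Agresti-Coull confidence interval for a binomial proportion.

    successes: number of reads (or alleles) showing the variant
    trials:    total number of reads (or alleles) observed
    Returns (lower, upper), clamped to the range [0, 1].
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = trials + z * z                      # adjusted number of trials
    p_adj = (successes + z * z / 2) / n_adj     # adjusted proportion
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# An allele never observed in 20 reads: the naive estimate is 0, which would
# give infinite log odds, but the upper limit stays well above zero.
print(agresti_coull_interval(0, 20))
```

Using the upper limit for an apparently absent allele (and the lower limit for an apparently fixed one) keeps the log odds finite, at the price of underestimating the information in truly diagnostic sites.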

Software implementation

I’ve implemented this (that is, the conservative measure) as described above in a software tool called varan. It parses sequence alignments in the standard “mpileup” format as output by the samtools mpileup command. It can currently output several different statistics and estimators, including expected site information. This is a work in progress, so please get in touch if you wish to try it out.

Big data revisited

Mon, 05 May 2014 21:00:00 UT

Big Data - again

I’ve recently (and much belatedly) started to get involved in the functional programming community in Munich. This encompasses:

The Munich Haskell Meeting, where a group of us get together once a month to socialize and chat about more or less FP related topics.
The Haskell Hackathons, a smaller get-together to learn and explore more advanced Haskell. What little I know about free monads, I owe to these guys.
Munich Lambda, a more diverse FP-oriented bunch, where I went to hear Andres Löh’s very nice talk on parallel programming.

Monday before Easter, I went to the MHM, which included an interesting talk by Rene Brunner about Big Data. Which made me think a bit. There are many definitions of “big data” (and Rene provided several of them); Wikipedia suggests:

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

I will define “big data” to be the point where a quantitative increase in problem size forces a qualitative change in how we approach the problem. In other words, problem sizes grow continuously, but at some point, we can no longer keep up, and will have to rethink our methods and tools. Think of it as a paradigm shift in the way we do analysis.

A brief history of…me

I started out working in bioinformatics some years ago, at the Data Centre of a research institute. This department was using traditional data centre technology, meaning SQL databases, sometimes hooked up to automatic data collection equipment, sometimes with web-based wrapping written in PHP or Java, often both.

When I arrived, bioinformatics was something new, and there was a strong inclination to shoehorn these new data into the same kind of infrastructure. My sentiment was that this was, perhaps, not the best way to go about it. In contrast to most other data, the bioinformatics data had:

larger data sizes
heterogeneous and non-tabular data
many specialized tools

In other words, bioinformatics represented a “big data” paradigm shift for the data centre. SQL databases were no longer very useful as the primary storage (but were sometimes used as an application optimization).

The next paradigm shift

Now, we’ve all seen the graphs showing how the capacity of sequencing technology is growing exponentially, and at a much faster rate than Moore’s law. In other words, the computational power to data size ratio is diminishing, and even for algorithms that scale linearly with input size (and that’s pretty optimistic for an algorithm), compute power is becoming a scarce resource. Again, we can’t keep up.

And although people have known and shown this for years, somehow there’s always money for equipping labs with more sequencers, and grants for new sequencing projects. All too often, the analysis is just added as an afterthought.

And if you allow me to digress, this makes me wonder: what if the tables were turned, if it were the other way around? What if I got to tell my biologist colleagues that, hey, I just got money to buy yet another new supercomputer with the capacity to analyze the expression data for every human gene - now I need you to go to the lab and do twenty-five thousand qPCR runs for each of these twenty samples. Could you have it done by Monday? And by the way, we’re hiring some more computer scientists.

Scaling

I’ve previously argued that our methods in bioinformatics are not quite where we would want them to be, and in many cases, they tend to produce outright incorrect results. But in addition, I think they no longer scale.

Take the de novo genome project, for instance. For something bigger (genome-wise, of course) than a bacterium, assembly of a de novo genome is a major undertaking. Sequencing is only a small fraction of the cost, and you really need a team of experienced scientists and engineers to produce something useful and reliable. And lots and lots of computation.

And although producing curated “resources” (like genome assemblies) can be useful and get you published, often there are more specific goals we want to achieve. Perhaps we are only interested in some specific aspect of the species, maybe we want to design a vaccine, or study the population structure, or find a genetic sex marker.

Lazy vs strict analysis

Depending on curated resources for analysis is an example of an eager approach. This is fine if you work on the handful of species where you have a large and expensive consortium that already created the curated resource for you. And of course, someone nearby who can afford a supercomputing facility.

But for the remaining species, this isn’t going to work. Even if somebody decided to start prioritizing the computational and analytical aspects of biology, it would still consume a lot of valuable resources. Instead of just pouring money over compute infrastructure, we need to be more efficient. That means new algorithms and, I think, new approaches to how we answer scientific questions.

I don’t have all the answers to this, but rethinking how we do analysis could be a start. As programmers often forget, the best optimization is to avoid doing the unnecessary bits. A starting point would be to exchange the eager approach for a lazy (or at least non-strict) one. Instead of starting with the data and asking what we can do, we need to start with the question, and then ask how we can answer it from the data.

Why it’s not going to happen

Writing this, I realize I have said it all before. It is easy to blame the non-progress on backwards grant agencies, biologist-trained management, stubborn old professors trapped in paradigms of the past, or any other forms of force majeure. But there is a fundamental problem with this line of thought.

Although compute resources are quickly becoming scarce, another resource – the bioinformaticians – is even scarcer. And instead of churning through a slow, expensive, and error-prone stack of tools to generate a sequence of intermediate products (genome assembly, gene prediction, functional annotation, etc.), and then, finally, looking up the desired result in a table, I believe a lazy approach will need to tailor the methods to the question. In other words, scientists will have to write programs.