Can you trust science?

by Ketil Malde; March 25, 2015

Hardly a week goes by without the newspapers writing about new and exciting results from science. Perhaps scientists have discovered a wonderful new drug for cancer treatment, or maybe they have found a physiological cause for CFS. Or perhaps this time they finally proved that homeopathy works? Yet in spite of these bold announcements, we still don't seem to have cured cancer. Science is supposed to be the method that enables us to answer questions about how the world works, but one could be forgiven for wondering whether it, in fact, works at all.

As my latest contribution to my local journal club, I presented a paper by Ioannidis, titled "Why Most Published Research Findings Are False" [1]. This created something of a stir when it was published in 2005, because it points out some simple mathematical reasons why science isn't as accurate as we would like to believe.

The ubiquitous p-value

Science is about finding out what is true. For instance, is there a relationship between treatment with some drug and the progress of some disease - or is there not? There are several ways to go about finding out, but in essence, it boils down to making some measurements and doing some statistical calculations. Usually, the result will be reported along with a p-value, a by-product of the statistical calculations that says something about how confident we can be in the result.

Specifically, if we claim there is a relationship, the associated p-value is the probability we would make such a claim even if there is no relationship in reality.

We would like this probability to be low, of course, and since we are generally free to select the p-value threshold, it is usually chosen to be 0.05 (or 0.01), meaning that if the claim is false, we will only accept it 5% (or 1%) of the time.
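To make this concrete, here is a small simulation sketch (my own made-up setup, not from the paper): a "drug" group and a "placebo" group are drawn from exactly the same distribution, so there is no real effect, and the fraction of "significant" t-tests should hover around the chosen threshold of 0.05.

```python
# Sketch: how often does a t-test cross p < 0.05 when there is no real effect?
# Both groups are drawn from the same distribution, so every "hit" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials = 10_000
false_positives = 0

for _ in range(trials):
    placebo = rng.normal(loc=0.0, scale=1.0, size=30)
    drug = rng.normal(loc=0.0, scale=1.0, size=30)   # same distribution: no true effect
    _, p = stats.ttest_ind(placebo, drug)
    if p < 0.05:
        false_positives += 1

print(false_positives / trials)   # prints roughly 0.05
```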

The positive predictive value

Now, the p-value is often interpreted as the probability of our (positive) claim being wrong. This is incorrect! There is a subtle but important difference here. What you must realize is that the probability α is conditional on the hypothesis being wrong - which may or may not be the case; we don't know (which is precisely why we want to find out).

The probability of a positive claim actually being right, judged after the fact, is called the positive predictive value (PPV). In order to say something about this, we also need to take into account the probability of claiming there is a relationship when one really exists. Our methods aren't perfect, and even if a relationship is real, we might not gather enough evidence to say so with confidence.

So, let's take a step back and look at our options. Our hypothesis (e.g., drug X works against disease Y) can be true or false. In either case, our experiment and analysis will lead us to accept or reject it with some probability. This gives us the following 2-by-2 table:

         True     False
Accept   1 - β    α
Reject   β        1 - α

Here, α is the probability of accepting a false relationship by accident (i.e., the p-value threshold), and β is the probability of missing a true relationship -- rejecting the hypothesis even though it is true.
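To spell out the arithmetic (in my own notation): suppose a fraction π of the hypotheses we test are actually true. Among the accepted hypotheses, true ones then show up at rate (1-β)π and false ones at rate α(1-π), so

$$ \mathrm{PPV} = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)} $$

Writing this in terms of the pre-study odds R = π/(1-π) gives the equivalent form used in the paper, PPV = (1-β)R / (R - βR + α).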

To see why β matters, consider a hypothetical, really poor method which has no chance of identifying a true relationship - in other words, β = 1. Then every accepted hypothesis must come from the False column, as long as α is at all positive. Even if the p-value threshold only lets through 1 in 20 false relationships, those are all you will get, and as such, they constitute 100% of the accepted relationships.

But looking at β alone is not sufficient either. What if a team of researchers tests hundreds of hypotheses, all of which happen to be false? Then, again, some of them will be accepted anyway (sneaking in under the p-value threshold α), and since there are no hypotheses in the True column, every positive claim is once again false.

A β of 1, or a field of research where 100% of the hypotheses are false, are extreme cases [2], and in reality things are not quite so terrible. The Economist had a good article with a nice illustration of how this might work in practice with more reasonable numbers. It should still be clear that the ratio of true to false hypotheses being tested, as well as the power of the analysis to identify true hypotheses, are important factors. And as these numbers approach their limits, things can get quite bad.
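For a rough feel, here is a small sketch with illustrative numbers of my own choosing (not the ones from The Economist piece): a field where 10% of tested hypotheses are true, a respectable power of 80%, and the usual threshold of 0.05.

```python
# Sketch: positive predictive value under assumed, illustrative numbers.
def ppv(prior_true: float, alpha: float, power: float) -> float:
    """Probability that an accepted hypothesis is actually true."""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return true_positives / (true_positives + false_positives)

print(ppv(prior_true=0.10, alpha=0.05, power=0.80))  # about 0.64
print(ppv(prior_true=0.10, alpha=0.05, power=0.20))  # about 0.31 with low power
print(ppv(prior_true=0.01, alpha=0.05, power=0.80))  # about 0.14 with few true hypotheses
```

So even with decent power and a fairly generous share of true hypotheses, roughly a third of the accepted findings are false, and it gets worse quickly as either number drops.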

More elaborate models

Other factors also influence the PPV. Try as we might to be objective, scientists often try hard to find a relationship -- that's what you can publish, after all [3]. Perhaps combined with a less firm grasp of statistics than one could wish for (and scientists who actually know enough statistics are few and far between - I'm certainly no exception there), this introduces a bias towards acceptance.

Multiple teams pursuing the same questions in a hot and rapidly developing field also decrease the chance of results being correct, and there is a whole cottage industry of scientists reporting spectacular and surprising results in high-ranking journals, followed by a trickle of failures to replicate.
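Here is a rough sketch of the multiple-teams effect, under my own simplifying assumption that n teams run independent experiments and the field treats a relationship as established as soon as any one team reports a positive result:

```python
# Sketch: PPV when a finding counts as "discovered" once any one of n
# independent teams gets a positive result (a simplifying assumption).
def ppv_n_teams(prior_true: float, alpha: float, power: float, n: int) -> float:
    p_pos_if_true = 1 - (1 - power) ** n    # at least one team detects a real effect
    p_pos_if_false = 1 - (1 - alpha) ** n   # at least one team gets a false positive
    num = p_pos_if_true * prior_true
    return num / (num + p_pos_if_false * (1 - prior_true))

for n in (1, 5, 10):
    print(n, round(ppv_n_teams(prior_true=0.10, alpha=0.05, power=0.80, n=n), 2))
# 1 team: about 0.64;  5 teams: about 0.33;  10 teams: about 0.22
```

The more teams racing for the same discovery, the more likely it is that the first reported positive is one of the false ones.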

Solving this

One option is to be stricter - this is the default when you do multiple hypothesis testing: you require a lower p-value threshold in order to reduce α. The problem is that if you are stricter about what you accept as true, you will also reject more hypotheses that are actually true. In other words, you can reduce α, but only at the cost of increasing β.
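A sketch of that trade-off, using a plain two-sample z-test approximation and made-up numbers: for a fixed study size and effect, tightening the threshold directly eats into the power.

```python
# Sketch: for a fixed sample size and effect size, a stricter threshold
# (smaller alpha) lowers the power, i.e. raises beta.
from scipy.stats import norm

def power_two_sample(effect_size: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided, two-sample z-test."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(shift - z_crit)   # ignores the tiny opposite-tail contribution

for alpha in (0.05, 0.01, 0.001):
    p = power_two_sample(effect_size=0.5, n_per_group=60, alpha=alpha)
    print(f"alpha={alpha}: power ~ {p:.2f}, beta ~ {1 - p:.2f}")
# Power drops from roughly 0.78 at alpha=0.05 to roughly 0.29 at alpha=0.001.
```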

On the other hand, you can reduce β by running a larger experiment. One obvious problem with this is cost: for many questions, a cohort of a hundred thousand or more is necessary, and not everybody can afford to run that kind of study. Perhaps even worse, a large cohort means that almost any systematic difference will come out as significant. Biases that are normally negligible will show up as glowing bonfires in your data.
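To see where numbers like "a hundred thousand" come from, here is the standard back-of-the-envelope sample-size formula for a two-sample comparison, with assumed effect sizes of my own choosing:

```python
# Sketch: roughly how many participants are needed per group to detect a
# given standardized effect size at alpha = 0.05 with 80% power.
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

for d in (0.5, 0.1, 0.02):
    print(f"effect size {d}: ~{2 * n_per_group(d):,} participants in total")
# A "medium" effect needs a bit over a hundred people; an effect size of
# 0.02 needs on the order of eighty thousand.
```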

In practice?

Modern biology has changed a lot in recent years, and today we are routinely using high-throughput methods to test the expression of tens of thousands of genes, or the value of hundreds of thousands of genetic markers.

In other words, we simultaneously test an enormous number of hypotheses, the vast majority of which we expect to be false, and in many cases both the effect sizes and the cohorts are small. It is often a new and exciting field, and we usually strive to use the latest version of the latest technology, always looking for new and improved analysis tools.

To put it bluntly, it is extremely unlikely that any result from this kind of study will be correct. Some people will claim these methods are still good for "hypothesis generation", but Ioannidis shows a hypothetical example where a positive result increases the likelihood that a hypothesis is correct by 50%. This doesn't sound so bad, perhaps, but in reality, the likelihood is only improved from 1 in 10000 to 1 in 7000 or so. I guess three thousand fewer trials to run in the lab is something, but you're still going to spend the rest of your life running the remaining ones.
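A quick back-of-the-envelope check of those numbers (just re-doing the arithmetic in the text, not Ioannidis's full example):

```python
# Sketch: a positive result that makes a hypothesis 50% more likely to be true
# moves the probability from 1 in 10,000 to roughly 1 in 6,700.
prior = 1 / 10_000
posterior = prior * 1.5
print(f"before: 1 in {1 / prior:,.0f}, after: 1 in {1 / posterior:,.0f}")
# before: 1 in 10,000, after: 1 in 6,667
```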

You might expect scientists to be on guard against this kind of thing, and I think most scientists will claim they want to publish correct results. But what counts for your career is publications and citations, and incorrect results are no less publishable than correct ones - they might even get cited more, as people fail to replicate them. And as you climb the academic ladder, what counts is publications in high-ranking journals, and for that you need spectacular results. And it is much easier to get spectacular incorrect results than spectacular correct ones. So the academic system rewards and encourages bad science.

Consequences

The bottom line is to be skeptical of any reported scientific result. The ability of the experiment and analysis to discover true relationships is critical, and one should always ask about the effect size and about the statistical power -- the probability of detecting a real effect.

In addition, the prior probability of the hypothesis being true is crucial. Apparently solid empirical evidence that people get cancer from cell phone radiation, or that homeopathic treatments cure disease, can almost be dismissed out of hand - there simply is no plausible explanation for how that would work.

A third thing to look out for is how well studied a problem is, and how the results add up. For health effects of GMO foods, there is a large body of scientific publications, and an overwhelming majority of them find no ill effects. If GMOs were really dangerous, wouldn't some of these investigations have shown it conclusively? For other things, like the decline of honey bees, or the cause of CFS, there is a large body of contradictory material. Again - if there were a simple explanation, wouldn't we know it by now?


  1. And since you ask: No, the irony of substantiating this claim with a scientific paper is not lost on me.

  2. Actually, I would suggest that research in paranormal phenomena is such a field. They still manage to publish rigorous scientific works; see this Less Wrong article for a really interesting take.

  3. I think the problem is not so much that you can't publish a result claiming no effect, but that you can rarely claim it with any confidence. Most likely, you just didn't design your study well enough to tell.

Feedback? Please email ketil@malde.org.