
The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

Rather a lot, it turns out. Fisher envisaged the sort of situation seen in the early examples in this chapter, with a single set of data, a single summary outcome measure and a single test of compatibility. But in the last few decades P-values have become the currency of research, with vast numbers appearing in the scientific literature—a study scraped around 30,000 t-statistics and their accompanying P-values from just three years of papers in eighteen psychology and neuroscience journals.8

So let’s see what we would expect to happen with, say, 1,000 studies, each designed with size 5% (α) and 80% power (1−β), noting that in practice most studies would have considerably less than 80% power. In the real world of research, although experiments are carried out in the hope of making a discovery, it is recognized that most null hypotheses are (at least approximately) true. So suppose only 10% of the null hypotheses tested were actually false: even this figure is probably rather high for new pharmaceuticals, which have a notoriously low success rate. Then, in a similar way to the screening examples in Chapter 8, Figure 10.5 shows the expected frequencies of outcomes for these 1,000 studies.

This reveals that we expect to claim 125 ‘discoveries’, but of these 45 are false positives: in other words 36%, or over a third, of the rejected null hypotheses (the ‘discoveries’) are incorrect claims. This rather gloomy picture becomes even worse when we consider what actually ends up in the scientific literature, since journals are biased towards publishing positive results. A similar analysis of scientific studies led to Stanford professor of medicine and statistics John Ioannidis’s famous claim in 2005 that ‘most published research findings are false’.9 We shall return to the reasons for his dismal conclusion in Chapter 12.

Figure 10.5

The expected frequencies of the outcomes of 1,000 hypothesis tests carried out with size 5% (Type I error, α) and 80% power (1 − Type II error, 1−β). Only 10% (100) of the null hypotheses are false, and we correctly detect 80% of them (80). Of the 900 null hypotheses that are true, we incorrectly reject 45 (5%). Overall, of 125 ‘discoveries’, 36% (45) are false discoveries.
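
The arithmetic behind these frequencies is simple enough to check directly. Here is a minimal Python sketch of the calculation, using only the numbers given in the text:

```python
# Expected outcomes of 1,000 hypothesis tests, as in Figure 10.5.
n_studies = 1000
alpha = 0.05              # size: chance of a false positive (Type I error)
power = 0.80              # chance of detecting a real effect (1 - beta)
prop_nulls_false = 0.10   # only 10% of null hypotheses are actually false

nulls_false = n_studies * prop_nulls_false   # 100 real effects
nulls_true = n_studies - nulls_false         # 900 true null hypotheses

true_discoveries = nulls_false * power       # 80 correct detections
false_discoveries = nulls_true * alpha       # 45 false positives

discoveries = true_discoveries + false_discoveries            # 125
false_discovery_proportion = false_discoveries / discoveries  # 0.36

print(f"Claimed 'discoveries': {discoveries:.0f}")
print(f"False discoveries: {false_discoveries:.0f} "
      f"({false_discovery_proportion:.0%} of all discoveries)")
```

Note that the 36% false-discovery proportion is driven by the 10% assumption: the rarer true effects are, the larger the share of ‘discoveries’ that are false.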

Since all these false discoveries were based on a P-value identifying a ‘significant’ result, P-values have been increasingly blamed for a flood of incorrect scientific conclusions. In 2015 a reputable psychology journal even announced that it would ban the use of NHST (Null Hypothesis Significance Testing). Finally, in 2016, the American Statistical Association (ASA) managed to get a group of statisticians to agree on six principles about P-values.*

The first of these principles simply points out what P-values can do:

1. P-values can indicate how incompatible the data are with a specified statistical model.

As we have repeatedly seen, P-values do this by essentially measuring how surprising the data are, given a null hypothesis that something does not exist. For example, we might ask whether the data are incompatible with the hypothesis that a drug doesn’t work. The logic can be tricky, but it is useful.

The second principle tries to remedy errors in their interpretation:

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Back in Chapter 8, we were very careful to distinguish appropriate conditional probability statements such as ‘only 10% of women without breast cancer would get a positive mammogram’ from the incorrect ‘only 10% of women with a positive mammogram do not have breast cancer’. This was the mistake known as the prosecutor’s fallacy, and we saw there are neat ways of remedying this error by thinking of what we might expect to happen to 1,000 women being tested.

Similar problems can occur with P-values, which measure the chance of such extreme data occurring, if the null hypothesis is true, and do not measure the chance that the null hypothesis is true, given that such extreme data have occurred. This is a subtle but essential difference.

When the CERN teams reported a ‘five-sigma’ result for the Higgs boson, corresponding to a P-value of around 1 in 3.5 million, the BBC reported the conclusion correctly, saying this meant ‘about a one-in-3.5 million chance that the signal they see would appear if there were no Higgs particle.’ But nearly every other outlet got the meaning of this P-value wrong. For example, Forbes Magazine reported, ‘The chances are less than 1 in a million that it is not the Higgs boson’, a clear example of the prosecutor’s fallacy. The Independent was typical in claiming that ‘there is less than a one in a million chance that their results are a statistical fluke’. This may not be as blatantly mistaken as Forbes’s claim, but it is still assigning the small probability to ‘their results are a statistical fluke’, which is logically the same as saying this is the probability of the null hypothesis being tested. That’s why the ASA try to emphasize that the P-value is not ‘the probability that the data were produced by random chance alone’.
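
As a check on the reported numbers, the conversion from ‘five sigma’ to this P-value can be sketched in a couple of lines, assuming the conventional one-sided tail area of a standard normal distribution:

```python
# Converting a 'five-sigma' signal into its P-value: the one-sided
# tail area of a standard normal distribution beyond 5 standard deviations.
from scipy.stats import norm

p_value = norm.sf(5)   # survival function: P(Z > 5)
print(p_value)         # about 2.9e-07
print(1 / p_value)     # about 3.5 million, i.e. P is roughly 1 in 3.5 million
```

The calculation delivers exactly what the BBC’s wording attaches it to: the chance of seeing such a signal if there were no Higgs particle, not the chance that there is no Higgs particle.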

The ASA’s third principle seeks to counter the obsession with statistical significance:

3. Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.

When Ronald Fisher started publishing tables showing values of statistics that would just make the results ‘P < 0.05’ or ‘P < 0.01’, he presumably had little idea of how such rather arbitrary thresholds would come to dominate scientific publications, with all results tending to be separated into ‘significant’ or ‘not significant’. From there it is a short step to consider ‘significant’ results as proven discoveries, producing an oversimplified and dangerous precedent for going from data straight to conclusions without pausing for thought on the way.

A dire consequence of this simple dichotomy is the misinterpretation of ‘not significant’. A non-significant P-value suggests the data are compatible with the null hypothesis, but this does not mean the null hypothesis is precisely true. After all, just because there’s no direct evidence that a criminal was at the scene of a crime, that does not mean he is innocent. But this mistake is surprisingly common.

Consider the major scientific dispute about whether a small amount of alcohol, say one drink a day, is good for you. A study claimed that only older women might benefit from moderate alcohol consumption, but close inspection revealed that other groups also showed a benefit, although it was not statistically significant, since the confidence intervals around the estimated benefit in these groups were very wide indeed. Although the confidence intervals included 0 and hence the effects were not statistically significant, the data were fully compatible with the 10% to 20% reduction in mortality risk that had been previously suggested. But The Times trumpeted that ‘Alcohol Has No Health Benefits After All’.10

To summarize, it is very misleading to interpret ‘not significantly different from 0’ as meaning that the true effect actually was 0, particularly in smaller studies with low power and wide confidence intervals.
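
To see how a ‘non-significant’ result can remain fully compatible with a worthwhile benefit, consider the following sketch. The point estimate and standard error are invented for illustration and are not the figures from the alcohol study:

```python
# Illustrative only: a 'non-significant' estimate that is nevertheless
# fully compatible with a 10-20% reduction in risk.
z = 1.96                    # multiplier for a 95% confidence interval
estimated_reduction = 0.15  # invented point estimate: 15% lower risk
standard_error = 0.12       # invented, large SE from a small subgroup

lower = estimated_reduction - z * standard_error   # about -8.5%
upper = estimated_reduction + z * standard_error   # about +38.5%

print(f"95% CI for risk reduction: ({lower:+.1%}, {upper:+.1%})")
# The interval crosses 0, so the result is 'not significant', yet the
# previously suggested 10-20% reduction lies comfortably inside it.
```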

The ASA’s fourth principle sounds fairly innocuous:

4. Proper inference requires full reporting and transparency.

The most obvious need is to clearly report how many tests were actually done, so if the most significant result is being emphasized, we can apply some form of adjustment such as Bonferroni. But the problems with selective reporting can be much more subtle than this, as we shall see in the next chapter. Only by knowing the plan of the study, and what was actually done, can problems with P-values be avoided.
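
As a concrete illustration of such an adjustment, here is a minimal sketch of the Bonferroni correction; the P-values below are hypothetical, invented purely for illustration:

```python
# Bonferroni: with m tests, compare each P-value to alpha/m, so that the
# chance of making *any* false-positive claim stays at most alpha.
alpha = 0.05
p_values = [0.001, 0.020, 0.040, 0.300]   # hypothetical results of 4 tests
m = len(p_values)

for p in p_values:
    verdict = "significant" if p < alpha / m else "not significant"
    print(f"P = {p:.3f}: {verdict} at the Bonferroni-adjusted "
          f"threshold {alpha / m:.4f}")
```

Here the P-values of 0.02 and 0.04 would each have passed an unadjusted 0.05 threshold, but neither survives the adjusted threshold of 0.05/4 = 0.0125.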
