

In Chapter 6 we saw that an algorithm might win a prediction competition by a very small margin. When predicting the survival of the Titanic test set, for example, the simple classification tree achieved the best Brier score (the mean squared prediction error) of 0.139, only slightly lower than the score of 0.142 from the averaged neural network (see Table 6.4). It is reasonable to ask whether this small winning margin of −0.003 is statistically significant, in the sense of whether it could be explained by chance variation alone.

This is straightforward to check, and the t-statistic turns out to be −0.54, with a two-sided P-value of 0.59.* So there is no good evidence that the classification tree is truly the best algorithm! This type of analysis is not routine in Kaggle-type competitions, but it seems important to know that the winning status depends on the chance selection of cases in the test set.
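To see how such a check works: both algorithms are scored on the same test cases, so each passenger contributes a squared prediction error to each algorithm, and the per-passenger differences can be put through a standard paired t-test. A minimal sketch in Python, using simulated errors as stand-ins for the actual Titanic predictions (the test-set size and error values here are invented purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in data: squared prediction errors for each test-set passenger
# under the two algorithms (invented values, not the real Titanic results).
n = 400  # hypothetical test-set size
errors_tree = rng.uniform(0, 0.5, size=n)                   # classification tree
errors_net = errors_tree + rng.normal(0.003, 0.15, size=n)  # averaged neural network

# Paired t-test: is the mean per-passenger difference in squared error
# distinguishable from zero?
diff = errors_tree - errors_net
t_stat, p_value = stats.ttest_1samp(diff, 0.0)
print(f"difference in Brier scores: {diff.mean():+.4f}")
print(f"t = {t_stat:.2f}, two-sided P = {p_value:.2f}")
```

The pairing matters: because the two algorithms are compared on exactly the same cases, it is the variability of the per-case differences, not of the raw scores, that determines whether the margin is significant.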

Researchers spend their lives scrutinizing the type of computer output shown in Table 10.5, hoping to see the twinkling stars indicating a significant result which they can then feature in their next scientific paper. But, as we now see, this sort of obsessive searching for statistical significance can easily lead to delusions of discovery.

The Danger of Carrying Out Many Significance Tests


The standard thresholds for declaring ‘significance’, P < 0.05 and P < 0.01, were fairly arbitrary choices by Ronald Fisher for his tables, back in the days when calculating exact P-values was not possible using the mechanical and electrical calculators available. But what happens when we run many significance tests, each time looking to see if our P-value is less than 0.05?

Suppose a drug truly does not work; that the null hypothesis is true. If we do one clinical trial, we will declare the result as statistically significant if the P-value is less than 0.05 and, since the drug is ineffective, the chance of this happening is 0.05 or 5%—that is the definition of a P-value. This would be considered a false-positive result, since we incorrectly believe the drug is effective. If we do two trials, and look at the most extreme, the chance of getting at least one significant—and hence false-positive—result is close to 0.10 or 10%.* The chance of getting at least one false-positive result increases quickly as we do more trials; if we do ten trials of useless drugs the chance of getting at least one significant at P < 0.05 gets as high as 40%. This is known as the problem of multiple testing, and occurs whenever many significance tests are carried out and then the most significant result is reported.
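The arithmetic behind those figures is worth making explicit: if each of n independent tests of a true null hypothesis has a 5% chance of a false positive, the chance that at least one is significant is 1 − 0.95^n. A quick sketch:

```python
# Chance of at least one false positive among n independent significance
# tests of a true null hypothesis, each using the P < 0.05 threshold.
def prob_at_least_one_false_positive(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

for n in [1, 2, 10]:
    print(n, round(prob_at_least_one_false_positive(n), 3))
# 1 0.05
# 2 0.098   (the 'close to 0.10' quoted above)
# 10 0.401  (the 'as high as 40%' quoted above)
```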

A particular problem occurs when researchers split data up into many subsets, do a hypothesis test in each, and then look at the most significant. A classic demonstration was an experiment carried out by reputable researchers in 2009 which involved showing a subject a series of photographs of humans expressing different emotions, and carrying out brain imaging (fMRI) to see which regions of the subject’s brain showed a significant response, taking P < 0.001.

The twist was that the ‘subject’ was a 4lb Atlantic salmon, which ‘was not alive at the time of scanning’. Out of a total of 8,064 sites in the brain of this large dead fish, 16 showed a statistically significant response to the photographs. Rather than concluding the dead salmon had miraculous skills, the team correctly identified the problem of multiple testing—over 8,000 significance tests are bound to lead to false-positive results.4 Even using a stringent criterion of P < 0.001, we would expect 8 significant results by chance alone.

One way around this problem is to demand a very low P-value at which significance is declared, and the simplest method, known as the Bonferroni correction, is to use a threshold of 0.05/n, where n is the number of tests done. So, for example, the tests at each site of the salmon’s brain could be carried out demanding a P-value below 0.05/8,000 = 0.00000625, or 1 in 160,000. This technique has become standard practice when searching the human genome for sites associated with diseases: since there are roughly 1,000,000 sites for genes, a P-value smaller than 0.05/1,000,000 = 1 in 20 million is routinely demanded before claiming a discovery.
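A minimal sketch of the correction using the figures above (note the salmon study actually involved 8,064 tests; the 8,000 is the text's round number):

```python
# Bonferroni correction: to keep the overall chance of any false positive
# at roughly alpha, each individual test must achieve P < alpha / n.
def bonferroni_threshold(n_tests, alpha=0.05):
    return alpha / n_tests

print(bonferroni_threshold(8_000))      # 6.25e-06, i.e. 1 in 160,000
print(bonferroni_threshold(1_000_000))  # 5e-08,    i.e. 1 in 20 million

# Without any correction, the expected number of false positives is
# simply n * threshold: for the salmon data, 8,064 tests at P < 0.001
# give about 8 significant sites by chance alone.
print(8_064 * 0.001)                    # 8.064
```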

So when large numbers of hypotheses are being tested at the same time, as in brain imaging or genomics, the Bonferroni method can be used to decide whether the most extreme findings are significant. Simple techniques have also been developed that slightly relax the Bonferroni criterion for the second most extreme result, the third most extreme, and so on; these are designed to control the overall proportion of ‘discoveries’ that turn out to be false claims—the so-called false discovery rate.
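The text does not name these techniques, but the best known is the Benjamini–Hochberg ‘step-up’ procedure: rank the P-values from smallest to largest, compare the ith smallest with i/n times the desired false discovery rate, and declare discoveries up to the last one that passes. A sketch, assuming that procedure is what is meant:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Mark which tests count as 'discoveries' while controlling the
    false discovery rate at level q (Benjamini-Hochberg procedure)."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # The i-th smallest P-value is compared with (i/n) * q: the full
    # Bonferroni-style threshold q/n for the smallest, relaxed step by
    # step for the second smallest, third smallest, and so on.
    passes = p[order] <= (np.arange(1, n + 1) / n) * q
    discoveries = np.zeros(n, dtype=bool)
    if passes.any():
        last = np.nonzero(passes)[0].max()     # last rank meeting its threshold
        discoveries[order[: last + 1]] = True  # declare everything up to it
    return discoveries

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# [ True  True False False False]
```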

Another way to avoid false positives is to demand replication of the original study, with the repeat experiment carried out in entirely different circumstances but with essentially the same protocol. For new pharmaceuticals to be approved by the US Food and Drug Administration, it has become standard that two independent clinical trials must have been carried out, each showing clinical benefit that is significant at P < 0.05. This means that the overall chance of approving a drug that in truth has no benefit at all is 0.05 × 0.05 = 0.0025, or 1 in 400.
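The 1-in-400 figure follows because, under the null hypothesis of no benefit, a trial's P-value is equally likely to land anywhere between 0 and 1, so each trial independently has a 5% chance of clearing the threshold. A quick simulation confirms the product:

```python
import numpy as np

rng = np.random.default_rng(42)

# Under a true null hypothesis a P-value is uniformly distributed on
# (0, 1), so each independent trial of a useless drug has a 5% chance
# of a 'significant' result at P < 0.05.
n_sims = 1_000_000
trial1_significant = rng.uniform(size=n_sims) < 0.05
trial2_significant = rng.uniform(size=n_sims) < 0.05

both = np.mean(trial1_significant & trial2_significant)
print(f"simulated: {both:.4f}, exact: {0.05 * 0.05} (1 in {1 / 0.0025:.0f})")
```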


5. Does the Higgs boson exist?

Throughout the twentieth century, physicists developed a ‘standard model’ intended to explain the forces operating at a subatomic level. One piece of the model remained an unproved theory: the ‘Higgs field’ of energy which permeates the universe, and gives mass to particles such as electrons through its own fundamental particle, the so-called Higgs boson. When researchers at CERN finally reported the discovery of the Higgs boson in 2012, it was announced as a ‘five-sigma’ result.5 But few people would have realized this was an expression of statistical significance.

When the researchers plotted the rate at which specific events occurred for different energy levels, the curve was found to have a distinct ‘hump’ just where it would be expected if the Higgs boson existed. Crucially, a form of chi-squared goodness-of-fit test revealed a P-value of less than 1 in 3.5 million, under the null hypothesis that the Higgs did not exist and the ‘hump’ was simply the result of random variation. But why was this reported as a ‘five-sigma’ discovery?

It is standard in particle physics to report claims of discoveries in terms of ‘sigmas’, where a ‘two-sigma’ result is an observation that is two standard errors away from the null hypothesis (remember that we used sigma (σ) as the Greek letter representing a population standard deviation): the ‘sigmas’ in particle physics correspond precisely to the t-value in the computer output shown in Table 10.5 for the multiple regression example. Since an observation that gave a one-sided P-value of 1 in 3.5 million—that observed from the chi-squared test—would be five standard errors from the null hypothesis, the Higgs boson was therefore said to be a five-sigma result.
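The conversion between sigmas and P-values is just a normal tail area, and is easy to check (using scipy's normal distribution; note that particle physicists conventionally quote the one-sided tail):

```python
from scipy.stats import norm

# Tail area beyond five standard errors: the one-sided P-value for a
# 'five-sigma' observation.
p_five_sigma = norm.sf(5)      # survival function, P(Z > 5)
print(p_five_sigma)            # ~2.87e-07
print(1 / p_five_sigma)        # ~3.5 million, i.e. 1 in 3.5 million

# The reverse conversion: a P-value of 1 in 3.5 million corresponds to
# about five standard errors from the null hypothesis.
print(norm.isf(1 / 3.5e6))     # ~5.0
```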

The team at CERN clearly did not want to announce their ‘discovery’ until the P-value was extremely small. First, they needed to allow for the fact that significance tests had been carried out at all energy levels, not just the one in the final chi-squared test—this adjustment for multiple testing is known as the ‘look elsewhere effect’ in physics. But mainly they wanted to be confident that any attempt at replication would come up with the same conclusion. It would simply be too embarrassing to make an incorrect claim about the laws of physics.
