
Figure 10.2 shows that the actual observed difference in proportions of right-armers (7% in favour of females) lies fairly near the centre of the distribution of observed differences that we would expect to see, if in truth there were no association at all. We need a measure to summarize how close to the centre our observed value lies, and one summary is the ‘tail-area’ to the right of the dashed line shown in Figure 10.2, which is 45% or 0.45.

This tail-area is known as a P-value, one of the most prominent concepts in statistics as practised today, and one that therefore deserves a formal definition: A P-value is the probability of getting a result at least as extreme as we did, if the null hypothesis (and all other modelling assumptions) were really true.

The issue, of course, is what do we mean by ‘extreme’? Our current P-value of 0.45 is one-tailed, since it only measures how likely it is that we would have observed such an extreme value in favour of females, were the null hypothesis really true. This P-value corresponds to what is known as a one-sided test. But an observed proportion in favour of males would also have led us to suspect the null hypothesis did not hold. We should therefore also calculate the chance of getting an observed difference of at least 7%, in either direction. This is known as a two-tailed P-value, corresponding to a two-sided test. This total tail area turns out to be 0.89, and since this value is near one it indicates that the observed value is near the centre of the null distribution. Of course this could be seen immediately from Figure 10.2, but such graphs will not always be available and we need a number to formally summarize the extremeness of our data.
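For readers who like to see the mechanics, here is a minimal sketch in Python of how such a permutation test could be run. The individual observations are not reproduced in this excerpt, so the counts below are illustrative values chosen to be consistent with the totals quoted later in the chapter (14 females and 40 males, with 22 left-armers and 32 right-armers overall); the resulting tail areas should therefore be broadly similar to, though not necessarily identical with, the 0.45 and 0.89 quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data, assumed for this sketch: 1 = right arm on top, 0 = left arm on top.
# The counts are chosen to be consistent with the marginal totals quoted in the text
# (14 females, 40 males; 22 left-armers, 32 right-armers overall).
females = np.array([1] * 9 + [0] * 5)    # 9/14, about 64% right-armers
males   = np.array([1] * 23 + [0] * 17)  # 23/40, about 58% right-armers

observed_diff = females.mean() - males.mean()   # roughly 0.07 in favour of females

# Permutation test: repeatedly shuffle the gender labels and recompute the
# difference in proportions, building up the null distribution of the statistic.
pooled = np.concatenate([females, males])
n_female = len(females)
diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    diffs.append(pooled[:n_female].mean() - pooled[n_female:].mean())
diffs = np.array(diffs)

one_sided = np.mean(diffs >= observed_diff)               # tail area to the right
two_sided = np.mean(np.abs(diffs) >= abs(observed_diff))  # both tails
print(f"one-sided P = {one_sided:.2f}, two-sided P = {two_sided:.2f}")
```

Shuffling the gender labels imposes the null hypothesis directly: if gender really has nothing to do with arm-crossing, every reallocation of the labels is equally plausible, so the spread of reshuffled differences shows what 'no association' looks like.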

Arbuthnot provided the first recorded example of this process: under the null hypothesis that boys and girls were equally likely to be born, the probability of boys exceeding girls in all 82 years was 1/2⁸². This only defines extremeness in terms of boys exceeding girls, and we would also doubt the null hypothesis if girls exceeded boys, and so we should double this number to 1/2⁸¹ to give the probability of such an extreme result in either direction. So 1/2⁸¹ might be considered the first recorded two-sided P-value, although the term was not used for another two hundred and fifty years.
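The arithmetic behind Arbuthnot's figure is easy to check, for example using Python's exact fractions:

```python
from fractions import Fraction

# Under the null hypothesis, each year boys exceed girls with probability 1/2,
# so boys exceeding girls in every one of the 82 years has probability (1/2)^82.
p_one_sided = Fraction(1, 2) ** 82

# Allowing the same extremity in either direction doubles this to 1/2^81.
p_two_sided = 2 * p_one_sided

print(p_one_sided)   # 1/2**82
print(p_two_sided)   # 1/2**81
```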

Incidentally, my small sample indicated no link between gender and arm-crossing, and other, more scientific studies have not found a relationship between arm-crossing behaviour and gender, handedness or any other feature.

 

 

Statistical Significance


The idea of statistical significance is straightforward: if a P-value is small enough, then we say the results are statistically significant. This term was popularized by Ronald Fisher in the 1920s and, in spite of the criticisms we shall see later, continues to play a major role in statistics.

Ronald Fisher was an extraordinary, but difficult, man. He was extraordinary because he is regarded as a pioneering figure in two distinct fields—genetics and statistics. Yet he had a notorious temper and could be extremely critical of anyone who he felt questioned his ideas, while his support for eugenics and his public criticism of the evidence for the link between smoking and lung cancer damaged his standing. His personal reputation has suffered as his financial connections with the tobacco industry have been revealed, but his scientific reputation is undiminished, as his ideas find repeated new applications in the analysis of large data sets.

As I mentioned in Chapter 4, Fisher developed the idea of randomization in agricultural trials while working at Rothamsted Experimental Station. He further illustrated the ideas of randomization in experimental design with his famous tea tasting test, in which a woman (thought to be a Muriel Bristol) claimed to be able to tell by tasting a cup of tea whether the milk had been added before or after tea was poured into a cup.

Four cups with milk first, and four with tea first, were prepared and the eight cups were presented in a random order; Muriel was told there were four of each, and had to guess which four had milk first. She is said to have got them all right, which another application of the hypergeometric distribution shows has a probability of 1 in 70 under the null hypothesis that she was guessing. This is an example of a P-value, and would by convention be considered small, and so the results could be declared to be statistically significant evidence that she could in fact tell whether the milk was put in first or not.
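The 1-in-70 figure follows directly from the hypergeometric distribution: there are 70 ways of choosing 4 cups out of 8, and only one of them matches the true milk-first set. A short check in Python (the scipy call is included only as a cross-check and assumes that library is available):

```python
from math import comb
from scipy.stats import hypergeom  # assumed available; used only as a cross-check

# 8 cups, 4 of which are truly "milk first"; Muriel selects 4 cups as milk-first.
# Under the null hypothesis of pure guessing, the number she gets right follows
# a hypergeometric distribution.
p_all_correct = comb(4, 4) * comb(4, 0) / comb(8, 4)   # = 1/70, about 0.014
print(p_all_correct)

# The same probability via scipy: P(X = 4) with 8 cups, 4 milk-first, 4 chosen.
print(hypergeom.pmf(4, 8, 4, 4))
```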

To summarize, I have described the following steps:

1. Set up a question in terms of a null hypothesis that we want to check. This is generally given the notation H0.

2. Choose a test statistic that estimates something that, if it turned out to be extreme enough, would lead us to doubt the null hypothesis (often larger values of the statistic indicate incompatibility with the null hypothesis).

3. Generate the sampling distribution of this test statistic, were the null hypothesis true.

4. Check whether our observed statistic lies in the tails of this distribution and summarize this by the P-value: the probability, were the null hypothesis true, of observing such an extreme statistic. The P-value is therefore a particular tail-area.

5. ‘Extreme’ has to be defined carefully—if say both large positive and large negative values of the test statistic would have been considered incompatible with the null hypothesis, then the P-value has to take this into account.

6. Declare the result statistically significant if the P-value is below some critical threshold.
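As a purely illustrative sketch, the simulation-based version of steps 2 to 6 can be wrapped up in a few lines of Python; the function and argument names here are my own invention and not part of any standard library.

```python
import numpy as np

def significance_test(observed_stat, simulate_null_stat, n_sims=10_000,
                      two_sided=True, alpha=0.05, rng=None):
    """Schematic skeleton of steps 2-6: given a test statistic computed from the
    data and a function that simulates the same statistic under the null
    hypothesis, return the P-value and whether it falls below the threshold."""
    rng = rng if rng is not None else np.random.default_rng()
    # Step 3: generate the sampling distribution of the statistic under H0.
    null_stats = np.array([simulate_null_stat(rng) for _ in range(n_sims)])
    # Steps 4-5: tail area, counting both directions as extreme if required.
    if two_sided:
        p_value = np.mean(np.abs(null_stats) >= abs(observed_stat))
    else:
        p_value = np.mean(null_stats >= observed_stat)
    # Step 6: compare with a conventional critical threshold.
    return p_value, p_value < alpha
```

For the arm-crossing example, `observed_stat` would be the 7% difference in proportions and `simulate_null_stat` a function that shuffles the gender labels and returns the reshuffled difference, exactly as in the earlier sketch.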

 

Ronald Fisher used P < 0.05 and P < 0.01 as convenient critical thresholds for indicating significance, and produced tables of the critical values of test statistics needed to achieve these levels of significance. The popularity of these tables led to 0.05 and 0.01 becoming established conventions, although it is now recommended that exact P-values should be reported. And it is important to emphasize that the exact P-value is conditional not only on the truth of the null hypothesis, but also on all other assumptions underlying the statistical model, such as lack of systematic bias, independent observations, and so on.

This whole process has become known as Null Hypothesis Significance Testing (NHST) and, as we shall see below, it has become a source of major controversy. But first we should examine how Fisher’s ideas are used in practice.

 

 

Using Probability Theory


Perhaps the most challenging component in null-hypothesis significance testing is Step 3—establishing the distribution of the chosen test statistic under the null hypothesis. We can always fall back on computer-intensive simulation methods as in the permutation test for the arm-crossing data, but it is far more convenient if we can use probability theory to work out the tail areas of test statistics directly, as Arbuthnot did in a simple case, and Fisher did with the hypergeometric distribution.

Often we make use of approximations that were developed by the pioneers of statistical inference. For example, around 1900 Karl Pearson developed a series of statistics for testing associations in cross-tabulations such as Table 10.1, out of which grew the classic chi-squared test of association.*

These test statistics involve calculating the expected number of events in each cell of the table, were the null hypothesis of no-association true, and then a chi-squared statistic measures the total discrepancy between the observed and expected counts. Table 10.2 shows the expected numbers in the cells of the table, assuming the null hypothesis: for example, the expected number of females with left arm on top is the total number of females (14), times the overall proportion of left-armers (22/54), which comes to 5.7.
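A short Python sketch shows the whole calculation. Since the full table is not reproduced in this excerpt, the observed counts below are illustrative values consistent with the quoted marginal totals (14 females, 40 males; 22 left-armers, 32 right-armers), and `scipy.stats.chi2_contingency` is used as one standard implementation of Pearson's test.

```python
import numpy as np
from scipy.stats import chi2_contingency  # one standard implementation of Pearson's test

# Rows: female, male; columns: left arm on top, right arm on top.
# These observed counts are illustrative, chosen to match the marginal totals
# quoted in the text (14 females, 40 males; 22 left-armers, 32 right-armers).
observed = np.array([[5, 9],
                     [17, 23]])

# Expected count in each cell under no association:
# row total x column total / grand total,
# e.g. females with left arm on top: 14 * 22 / 54 = 5.7.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson's chi-squared statistic: total discrepancy between observed and expected.
chi2 = ((observed - expected) ** 2 / expected).sum()

# chi2_contingency reproduces the same expected counts and gives the tail-area
# P-value (correction=False matches the plain Pearson statistic).
stat, p, dof, exp = chi2_contingency(observed, correction=False)
print(expected[0, 0])     # about 5.7
print(chi2, stat, p)
```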
