The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

This is the reasoning behind what is known as the Central Limit Theorem, first proved in 1733 by French mathematician Abraham de Moivre for the particular case of the binomial distribution. But it is not just the binomial distribution that tends to a normal curve with increasing sample size—it is a remarkable fact that, virtually whatever the shape of the population distribution from which each of the original measurements is sampled, for large sample sizes their average can be considered to be drawn from a normal curve.* This normal curve will have a mean equal to the mean of the original distribution, and a standard deviation that has a simple relationship to the standard deviation of the original population distribution and, as already mentioned, is often known as the standard error.*
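To make this concrete, here is a minimal simulation sketch (my own illustration in Python with NumPy, not code from the book): draw many samples of size 50 from a deliberately skewed, non-normal population and check that the sample means cluster around the population mean with a spread close to the theoretical standard error.

import numpy as np

rng = np.random.default_rng(42)

population_mean = 1.0     # mean of an Exponential(1) population
population_sd = 1.0       # its standard deviation
n = 50                    # size of each individual sample
n_repeats = 10_000        # number of repeated samples

# Each row is one sample of n observations from the skewed population.
samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

print("mean of the sample means:  ", sample_means.mean())           # close to 1.0
print("sd of the sample means:    ", sample_means.std(ddof=1))      # close to 0.141
print("theoretical standard error:", population_sd / np.sqrt(n))    # 1/sqrt(50) = 0.1414...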

Apart from his work on the wisdom of crowds, correlation, regression, and almost everything else, Francis Galton also considered it a true marvel that the normal distribution, then known as the Law of Frequency of Error, should arise in an orderly way out of apparent chaos:


I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

 

He was right—it really is an extraordinary law of nature.

 

 

How Does This Theory Help Us Work Out the Accuracy of Our Estimates?


All this theory is fine for proving things about distributions of statistics based on data drawn from known populations, but that is not what we are mostly interested in. We have to find a way of reversing the process: instead of going from known populations to saying something about possible samples, we need to go from a single sample back to saying something about a possible population. This is the process of inductive inference outlined in Chapter 3.

Suppose I have a coin, and I ask you for your probability that it will come up heads. You happily answer ‘50:50’, or similar. Then I flip it, cover up the result before either of us sees it, and again ask for your probability that it is heads. If you are typical of my experience, you may, after a pause, rather grudgingly say ‘50:50’. Then I take a quick look at the coin, without showing you, and repeat the question. Again, if you are like most people, you eventually mumble ‘50:50’.

This simple exercise reveals a major distinction between two types of uncertainty: what is known as aleatory uncertainty before I flip the coin—the ‘chance’ of an unpredictable event—and epistemic uncertainty after I flip the coin—an expression of our personal ignorance about an event that is fixed but unknown. The same difference exists between a lottery ticket (where the outcome depends on chance) and a scratch card (where the outcome is already decided, but you don’t know what it is).

Statistics are used when we have epistemic uncertainty about some quantity of the world. For example, we conduct a survey when we don’t know the true proportion in a population that consider themselves religious, or we run a pharmaceutical trial when we don’t know the true average effect of a drug. As we have seen, these fixed but unknown quantities are called parameters and are often given a Greek letter.* Just like my coin-flipping example, before we do these experiments we have aleatory uncertainty about what the outcomes may be, because of the random sampling of individuals or the random allocation of patients to the drug or a dummy tablet. Then after we have done the study and got the data, we use this probability model to get a handle on our current epistemic uncertainty, just as you were eventually prepared to say ‘50:50’ about the covered-up coin. So probability theory, which tells us what to expect in the future, is used to tell us what we can learn from what we have observed in the past. This is the (rather remarkable) basis for statistical inference.

The procedure for deriving an uncertainty interval around our estimate, or equivalently a margin of error, is based on this fundamental idea. There are three stages:

1. We use probability theory to tell us, for any particular population parameter, an interval in which we expect the observed statistic to lie with 95% probability. These are 95% prediction intervals, such as those displayed in the inner funnel in Figure 9.2.

2. Then we observe a particular statistic.

3. Finally (and this is the difficult bit) we work out the range of possible population parameters for which our statistic lies in their 95% prediction intervals. This we call a ‘95% confidence interval’.

This resulting confidence interval is given the label ‘95%’ since, with repeated application, 95% of such intervals should contain the true value.*

 

All clear? If it isn’t, then please be reassured that you have joined generations of baffled students. Specific formulae are provided in the Glossary, but the details are less important than the fundamental principle: a confidence interval is the range of population parameters for which our observed statistic is a plausible consequence.
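To see the inversion in action, here is a hedged sketch of the principle (my own illustration, not the book's code) for a hypothetical survey in which 37 people out of 100 say yes: for each candidate value of the population proportion we compute its central 95% prediction interval for the count (stage 1), then keep the candidates whose interval contains the observed count (stage 3). The surviving candidates form an approximate 95% confidence interval, which can be compared with the usual 'estimate plus or minus two standard errors'.

import numpy as np
from scipy.stats import binom

n, k = 100, 37                          # hypothetical survey: 37 'yes' out of 100

candidates = np.linspace(0.001, 0.999, 999)
inside = []
for p in candidates:
    # Stage 1: central 95% prediction interval for the count under this p.
    lo, hi = binom.ppf(0.025, n, p), binom.ppf(0.975, n, p)
    # Stage 3: keep p if our observed count lies inside its prediction interval.
    if lo <= k <= hi:
        inside.append(p)

print(f"approximate 95% confidence interval: ({min(inside):.3f}, {max(inside):.3f})")

# For comparison, the usual estimate plus or minus two standard errors.
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"estimate +/- 2 SE:                   ({p_hat - 2*se:.3f}, {p_hat + 2*se:.3f})")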

 

 

Calculating Confidence Intervals


The principle of confidence intervals was formalized in the 1930s at University College London by Jerzy Neyman, a brilliant Polish mathematician and statistician, and Egon Pearson, Karl Pearson’s son.* The work of deriving the necessary probability distributions of estimated correlation coefficients and regression coefficients had been going on for decades beforehand, and in standard academic statistics courses the mathematical details of these distributions would be provided, and even derived from first principles. Fortunately the results of all these labours are now encapsulated in statistical software, and so practitioners can focus on the essential issues and not be distracted by complex formulae.

We saw in Chapter 7 how bootstrapping could be used to get 95% intervals for the gradient of Galton’s regression of daughters’ on mothers’ heights. It is far easier to obtain exact intervals that are based on probability theory and provided in standard software, and Table 9.1 shows they give very similar results. The ‘exact’ intervals based on probability theory require more assumptions than the bootstrap approach, and strictly speaking would only be precisely correct if the underlying population distribution were normal. But the Central Limit Theorem means that with such a large sample size it is reasonable to assume our estimates have got normal distributions and so the exact intervals are acceptable.
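A short sketch of both calculations, using simulated mother and daughter heights (Galton's actual data are not reproduced here, so the sample size, gradient and spreads below are purely illustrative assumptions): the 'exact' interval is the estimated gradient plus or minus a t-multiple of its standard error, while the bootstrap interval takes percentiles of the gradients refitted on 1,000 resamples.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 433                                          # illustrative sample size only
mothers = rng.normal(64.0, 2.4, size=n)          # heights in inches (assumed values)
daughters = 0.5 * mothers + 32.0 + rng.normal(0.0, 2.0, size=n)

# 'Exact' interval: gradient plus or minus t times its standard error,
# from normal-theory least-squares regression.
fit = stats.linregress(mothers, daughters)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"exact 95% interval:     {fit.slope - t_crit * fit.stderr:.3f} to {fit.slope + t_crit * fit.stderr:.3f}")

# Bootstrap interval: refit the regression on 1,000 resamples drawn with
# replacement, and take the 2.5th and 97.5th percentiles of the gradients.
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    boot_slopes.append(stats.linregress(mothers[idx], daughters[idx]).slope)
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"bootstrap 95% interval: {lo:.3f} to {hi:.3f}")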


 

 

Table 9.1

Estimates of the regression coefficient summarizing the relationship between mothers’ and daughters’ heights, with exact and bootstrap standard errors and 95% confidence intervals—the bootstrap is based on 1,000 resamples.


It is conventional to use 95% intervals, which are generally set as plus or minus two standard errors, but narrower (for example, 80%) or wider (for example, 99%) intervals are sometimes adopted.* The US Bureau of Labor Statistics use 90% intervals for unemployment, whereas the UK Office for National Statistics use 95%: it is essential to be clear which is being used.
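As a small worked sketch (with illustrative numbers only, not values from the book), it is the multiplier applied to the standard error that changes with the chosen level: roughly 1.28 for 80%, 1.645 for 90%, 1.96 for 95% (the 'plus or minus two standard errors' rule of thumb) and 2.58 for 99%.

from scipy.stats import norm

estimate, standard_error = 0.54, 0.03      # illustrative numbers only

for level in (0.80, 0.90, 0.95, 0.99):
    z = norm.ppf(0.5 + level / 2)          # e.g. 1.96 for a 95% interval
    print(f"{int(level * 100)}% interval: {estimate - z * standard_error:.3f} to {estimate + z * standard_error:.3f}")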
