The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

 

 

Figure 9.1

The probability distribution of the observed proportion of left-handers in random samples of 1, 2, 5, 10, 100 and 1,000 people, where the true underlying proportion of left-handers in the population is 0.2. The probability of getting at least 30% left-handers in the sample is found by adding all the probability in the bars to the right of 0.3.
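The tail probability described in this caption can be checked numerically. The following is a minimal Python sketch, assuming the number of left-handers in a sample of n people follows a binomial distribution with underlying proportion 0.2. The function names `binomial_pmf` and `prob_at_least`, and the log-space evaluation via `math.lgamma`, are illustrative choices rather than anything taken from the book.

```python
import math
from fractions import Fraction

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def prob_at_least(threshold, n, p):
    """P(observed proportion >= threshold) in a random sample of size n."""
    # threshold is a Fraction, so the smallest qualifying count is exact
    k_min = math.ceil(threshold * n)
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

P_LEFT = 0.2                 # true underlying proportion, from the caption
THRESHOLD = Fraction(3, 10)  # "at least 30%" held as an exact fraction

for n in (1, 2, 5, 10, 100, 1000):
    # Standard deviation of the observed proportion: sqrt(p(1-p)/n)
    sd = math.sqrt(P_LEFT * (1 - P_LEFT) / n)
    tail = prob_at_least(THRESHOLD, n, P_LEFT)
    print(f"n={n:5d}  sd={sd:.4f}  P(proportion >= 0.3)={tail:.3g}")
```

The standard deviation column shrinks in proportion to one over the square root of n, which is one way to quantify how the distributions tighten as the sample grows.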

 

 

Figure 9.1 has some distinctive features. First, the probability distributions tend to a regular, symmetric, normal shape as the sample size increases, just as we observed using bootstrap simulations. Second, the distributions get tighter as the sample size increases. The following example shows how a simple application of these ideas can be used to rapidly identify whether a statistical claim is reasonable or not.


Do some areas of the UK really have three times the bowel-cancer death rate of others?

 

The headline on the respected BBC news website in September 2011 was alarming: ‘Threefold Variation in UK Bowel Cancer Death Rates’. The article went on to explain that different areas in the UK had starkly different death rates from bowel cancer, with a commentator suggesting it was ‘extremely important for local NHS organizations to examine information for their own areas and use it to inform potential changes in delivery of services’.

A threefold difference sounds extraordinarily dramatic. But when the blogger Paul Barden came across the article, he wondered, ‘Do people in different parts of the country really face such large and important differences in their risk of dying from bowel cancer? What would cause such a discrepancy?’ He found it so implausible that he decided to investigate. Admirably, the data was openly available online, and he found that it did substantiate what the BBC piece had claimed: in 2008 there was more than a threefold variation between districts in the annual death rate from bowel cancer. It ranged from 9 per 100,000 people in Rossendale in Lancashire to 31 per 100,000 inhabitants of Glasgow City.1

But this was not the end of his investigation. He then plotted the death rates against the population in each district, which gave the picture shown in Figure 9.2. It is clear that the points (all apart from the extreme example of Glasgow City) form a sort of funnel shape, in which the differences between districts get larger as their population gets smaller. Paul then added control limits, which show where we would expect the points to land if the differences between the observed rates were just due to natural and unavoidable variability in the numbers that die of bowel cancer each year, rather than due to any systematic variation in the underlying risks experienced in different districts. These control limits are obtained by assuming that the number of bowel-cancer deaths in each area is an observation from a binomial distribution with a sample size equal to the adult population of the area, and an underlying probability of 0.000176 that any particular person would die from bowel cancer each year: this is the average individual risk over the whole country. The inner and outer control limits are set to contain 95% and 99.8% of the probability distribution respectively. This type of graph is called a funnel plot and is extensively used when examining multiple health authorities or institutions, since it permits the identification of outliers without creating spurious league tables.
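The construction of these control limits can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a reproduction of Paul Barden's calculation: the limits are taken as equal-tailed binomial quantiles, the helper names `binomial_quantile` and `funnel_limits` are invented for the example, and the three district populations are hypothetical. Only the risk of 0.000176 and the 95% and 99.8% coverage levels come from the text.

```python
def binomial_quantile(n, p, q):
    """Smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    pmf = (1.0 - p) ** n          # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < q and k < n:
        # Recurrence: P(X = k+1) = P(X = k) * (n - k)/(k + 1) * p/(1 - p)
        pmf *= (n - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return k

NATIONAL_RISK = 0.000176   # annual individual risk quoted in the text

def funnel_limits(population, coverage, p=NATIONAL_RISK):
    """Lower and upper control limits, as death rates per 100,000 people."""
    tail = (1.0 - coverage) / 2.0            # equal probability in each tail
    low_deaths = binomial_quantile(population, p, tail)
    high_deaths = binomial_quantile(population, p, 1.0 - tail)
    scale = 100_000 / population
    return low_deaths * scale, high_deaths * scale

# Hypothetical district sizes, chosen only to show the funnel narrowing
for population in (50_000, 200_000, 600_000):
    lo95, hi95 = funnel_limits(population, 0.95)
    lo998, hi998 = funnel_limits(population, 0.998)
    print(f"population {population:7d}: "
          f"95% limits {lo95:5.1f} to {hi95:5.1f}, "
          f"99.8% limits {lo998:5.1f} to {hi998:5.1f} per 100,000")
```

Because deaths are whole numbers, the limits move in steps for small districts; a plotted funnel would typically evaluate them over a fine grid of population sizes.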

The data fall within the control limits rather well, which means that differences between districts are essentially what we would expect by chance variability alone. Smaller districts have fewer cases and so are more vulnerable to the role of chance, and therefore tend to have more extreme results—the rate in Rossendale was based on only 7 deaths, and so its rate could be drastically altered by just a few extra cases. So despite the BBC’s dramatic headline, there is no big news story here—we would expect a threefold variability in the observed rates, even if the underlying risk in the different districts were precisely the same.
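The claim that chance alone can produce a wide spread of observed rates can be explored by simulation. The sketch below gives every district exactly the same underlying risk of 0.000176 and draws one year of deaths for each. Two things in it are assumptions rather than facts from the text: the 380 populations are hypothetical (spread evenly from 50,000 to 600,000, not the real district sizes), and the deaths are drawn from a Poisson distribution as a stand-in for the binomial, which is a close approximation when the risk is this small.

```python
import math
import random

NATIONAL_RISK = 0.000176   # same underlying risk in every simulated district

def poisson_draw(rng, mean):
    """Knuth's method: multiply uniform draws until the product drops below exp(-mean)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate_rates(populations, risk, rng):
    """One simulated year of death rates per 100,000 for each district."""
    rates = []
    for n in populations:
        deaths = poisson_draw(rng, n * risk)   # Poisson approximation to Binomial(n, risk)
        rates.append(deaths / n * 100_000)
    return rates

rng = random.Random(2008)   # fixed seed so the run is repeatable
# Hypothetical populations: 380 districts from 50,000 to 600,000 people
populations = [50_000 + round(i * 550_000 / 379) for i in range(380)]
rates = simulate_rates(populations, NATIONAL_RISK, rng)

lowest, highest = min(rates), max(rates)
print(f"lowest simulated rate:  {lowest:.1f} per 100,000")
print(f"highest simulated rate: {highest:.1f} per 100,000")
if lowest > 0:
    print(f"ratio of highest to lowest: {highest / lowest:.1f}")
```

Changing the seed or the assumed populations changes the exact ratio, but the extremes tend to come from the smallest districts, which is the funnel pattern described above.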

 

 

Figure 9.2

Annual bowel-cancer death rates per 100,000 population in 380 districts in the UK, plotted against the population of the district. The two sets of dashed lines indicate the regions in which we would expect 95% and 99.8% of districts to lie, if there were no real differences between the risks, and they are derived from an assumed underlying binomial distribution. Only Glasgow City shows any evidence of an underlying risk that is different from the average. This way of looking at the data is called a ‘funnel plot’.

 

 

There is a crucial lesson in this simple example. Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers.

This chart reveals that the only observation of any particular note is Glasgow City’s outlying data-point. Is bowel cancer a particularly Scottish phenomenon? Is this data-point actually correct? More recent data for the period 2009–2011 reveals that bowel-cancer mortality for Greater Glasgow was 20.5 per 100,000 people, in Scotland overall it was 19.6, and in England it was 16.4: these findings both cast doubt on the specific Glasgow City value and show that Scotland has higher rates than England. Typically, conclusions from one problem-solving cycle raise more questions, and so the cycle starts over again.

 

 

The Central Limit Theorem


Individual data-points might be drawn from a wide variety of population distributions, some of which might be highly skewed, with long tails such as those of income or number of sexual partners. But we have now made the crucial shift to considering distributions of statistics rather than individual data-points, and these statistics will commonly be averages of some sort. We have already seen in Chapter 7 that the distribution of the sample means of bootstrap resamples tends to a well-behaved symmetric shape, whatever the shape of the original distribution of the data, and we can now go beyond this to a deeper and rather remarkable idea, established around 300 years ago.

The example of left-handers shows that the variability in the observed proportion gets smaller as the sample size increases—this is why the funnel in Figure 9.2 gets narrower around the mean. This is the classic Law of Large Numbers, which was established by Swiss mathematician Jacob Bernoulli in the early eighteenth century—a single coin flip, taking on the value 1 if a head occurs, and 0 if a tail, is said to be a Bernoulli trial and have a Bernoulli distribution. If you keep on flipping a balanced coin, carrying out more and more Bernoulli trials, then the proportion of each outcome will get closer and closer to 50% heads and 50% tails—we say the observed proportion converges to the true underlying chance of a head. Of course, early on in the sequence the ratio may be some way from 50:50, say after a run of heads, and the temptation is to believe that tails is somehow now ‘due’ so that the proportion gets balanced out—this is known as the ‘gambler’s fallacy’ and is a psychological bias that (from personal experience) is rather difficult to overcome. But the coin has no memory—the key insight is that the coin cannot compensate for past imbalances, but simply overwhelms them by more and more new, independent flips.
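The convergence described above is easy to watch in a simulation. The following is a small Python sketch, assuming a fair coin modelled by a seeded pseudo-random generator; the function name `heads_at_checkpoints` and the choice of checkpoints are illustrative. Alongside the proportion of heads it prints the raw excess of heads over tails, since the Law of Large Numbers says only that the proportion settles down, not that the raw imbalance is cancelled out.

```python
import random

def heads_at_checkpoints(n_flips, checkpoints, seed=0):
    """Flip a fair coin n_flips times; record the running head count at each checkpoint."""
    rng = random.Random(seed)
    heads = 0
    counts = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5     # one Bernoulli trial: 1 for a head, 0 for a tail
        if i in checkpoints:
            counts[i] = heads
    return counts

checkpoints = {10, 100, 1_000, 10_000, 100_000}
counts = heads_at_checkpoints(100_000, checkpoints)

for n in sorted(counts):
    heads = counts[n]
    proportion = heads / n
    excess = heads - (n - heads)        # heads minus tails so far
    print(f"after {n:6d} flips: proportion of heads = {proportion:.4f}, "
          f"heads minus tails = {excess:+d}")
```

Each flip in the loop is generated independently of the running count, which mirrors the point that the coin has no memory.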

In Chapter 3 we introduced the classic ‘bell-shaped curve’, also known as the normal or Gaussian distribution, where we showed it described well the distribution of birth weights in the US population, and argued that this was because birth weight depends on a huge number of factors, all of which have a little influence—when we add up all those small effects we get a normal distribution.
