The Art of Statistics: How to Learn from Data

Author: David Spiegelhalter



Describing the Spread of a Data Distribution

It is not enough to give a single summary for a distribution—we need to have an idea of the spread, sometimes known as the variability. For example, knowing the average adult male shoe size will not help a shoe firm decide the quantities of each size to make. One size does not fit all, a fact which is vividly illustrated by the seats for passengers in planes.

Table 2.1 shows a variety of summary statistics for the jelly-bean guesses, including three ways to summarize the spread. The range is a natural choice, but is clearly very sensitive to extreme values such as the apparently bizarre guess of 31,337 beans.* In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers, in this case between 1,109 and 2,599 beans: the central ‘box’ of the box-and-whisker plots shown above covers the inter-quartile range. Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data* since it is also unduly influenced by outlying values. For example, removing the single (almost certainly mistaken) value of 31,337 from the data reduces the standard deviation from 2,422 to 1,398.*
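The contrast between the three measures of spread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the guesses below are made up (they are not the 915 responses behind Table 2.1), the helper name spread_summary is hypothetical, and the "inclusive" rule for locating quartiles is one of several conventions in use.

```python
import statistics


def spread_summary(values):
    """Return the range, inter-quartile range and standard deviation."""
    ordered = sorted(values)
    # quantiles(n=4) returns the 25th, 50th and 75th percentiles.
    # The "inclusive" interpolation rule is one convention among several.
    q1, _median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "sd": statistics.stdev(ordered),  # sample standard deviation
    }


# Hypothetical guesses, NOT the book's data. The last value mimics a wild outlier.
guesses = [800, 1100, 1300, 1500, 1700, 1900, 2200, 2600, 3000, 31337]

with_outlier = spread_summary(guesses)
without_outlier = spread_summary(guesses[:-1])

for name in ("range", "iqr", "sd"):
    print(f"{name}: {with_outlier[name]:.0f} with outlier, "
          f"{without_outlier[name]:.0f} without")
```

In this toy sample the single extreme value stretches the range from 2,200 to 30,537 and inflates the standard deviation many times over, while the inter-quartile range shifts only modestly. The small movement in the IQR comes from the sample being tiny: dropping one of ten values changes where the quartiles fall.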




Table 2.1

Summary statistics for 915 jelly-bean judgements. The true number was 1,616.

The crowd in our little experiment showed itself to have considerable wisdom, in spite of some bizarre responses. This demonstrates that data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations such as 31,337—these are known as robust measures, and include the median and the inter-quartile range. Finally, it shows the great value of simply looking at the data, a lesson that will be reinforced by the next example.



Describing Differences between Groups of Numbers


How many sexual partners do people in Britain report having had in their lifetime?


The purpose of this question is not simply to be nosey about people’s private lives. When AIDS first became a serious concern in the 1980s, public health officials realized that there was no reliable evidence about sexual behaviour in Britain, particularly in terms of the frequency with which people changed partners, how many had multiple simultaneous partners, and what sexual practices people engaged in. This knowledge was essential to predict the spread of sexually transmitted diseases through society and to plan health services, and yet people were still quoting from the unreliable data collected by Alfred Kinsey in the US in the 1940s—who made no attempt at obtaining a representative sample.

So beginning in the late 1980s, large, careful and costly surveys of sexual behaviour were established in the UK and US, in spite of strong opposition from some quarters. In the UK, Margaret Thatcher withdrew support from a major survey of sexual lifestyles at the last minute, but those conducting the study were fortunately able to find charitable funding instead, resulting in the National Sexual Attitudes and Lifestyle Survey (Natsal) which has been carried out in the UK every ten years since 1990.

The third survey, known as Natsal-3, was carried out around 2010 and cost £7 million.3 Table 2.2 shows the summary statistics concerning the number of (opposite-sex) sexual partners reported by people aged 35–44 in Natsal-3. It is a good exercise to use these summaries alone to try to reconstruct what the pattern of data might look like. We note that the most common single value (mode) is 1, representing those people who have only had one partner in their life, and yet there is also a massive range. This is also reflected by the substantial difference between the means and the medians, which is a telling sign of data distributions with long right-hand tails. The standard deviations are large, but this is an inappropriate measure of spread for such a data distribution, since it will be unduly influenced by a few extremely high values.

The responses of men and women may be compared by noting that men reported a mean-average of 6 more sexual partners than women, or alternatively that the average man (the median) reported 3 more sexual partners than the average woman. Or that, in relative terms, men report around 60% more partners than do women for both the mean and the median.
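The absolute and relative comparisons described here are simple arithmetic, and a short Python sketch may make the distinction concrete. The two lists are hypothetical counts, not Natsal-3 responses; they were chosen purely so that the differences mirror the ones quoted in the text. The function name compare_groups is likewise an assumption made for illustration.

```python
import statistics


def compare_groups(a, b):
    """Compare group a with group b on both the mean and the median."""
    result = {}
    for name, summary in (("mean", statistics.mean),
                          ("median", statistics.median)):
        value_a, value_b = summary(a), summary(b)
        result[name] = {
            # Absolute difference: how many more, in the original units.
            "difference": value_a - value_b,
            # Relative difference: how many percent more than group b.
            "percent_more": 100 * (value_a - value_b) / value_b,
        }
    return result


# Hypothetical, right-skewed counts invented for this example.
group_a = [1, 2, 4, 7, 8, 8, 15, 20, 40, 55]
group_b = [1, 1, 3, 4, 5, 5, 10, 12, 25, 34]

comparison = compare_groups(group_a, group_b)
print(comparison)
```

With these made-up lists the means differ by 6 and the medians by 3, yet both correspond to group a having 60% more than group b. The same relative difference can therefore sit behind quite different absolute gaps, depending on which average is used.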




Table 2.2

Summary statistics for the number of (opposite-sex) sexual partners over their lifetime, as reported by 806 men and 1,215 women aged 35–44, based on interviews carried out in Natsal-3 between 2010 and 2012. Standard deviations are included for completeness, although they are inappropriate summaries of the spread of such data.

This difference might arouse our suspicions about the data. In a closed population with the same number of men and women with a similar age profile, it is a mathematical fact that the mean number of opposite-sex partners should be essentially the same for men and for women!* So why are men reporting so many more partners than women in this age group of 35–44? This could partly be because of men having younger partners, but also because there appear to be systematic differences in the way men and women count and report their sexual histories. We might suspect that men may be more likely to overplay their number of partners, or women underplay them, or both.
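The reasoning behind this mathematical fact is that every opposite-sex partnership adds exactly one to the men's total and one to the women's total, so the two totals must match; with equal numbers of men and women, the means must match too. The simulation below is a minimal sketch of that argument, not a model of real behaviour. The function name, the population sizes and the skewed weights are all assumptions made for illustration.

```python
import random
import statistics


def simulate_closed_population(n_men, n_women, n_partnerships, seed=0):
    """Simulate opposite-sex partnerships and summarize partners per person."""
    rng = random.Random(seed)
    men = list(range(n_men))
    # Assumption for illustration: a few men are far more likely to form
    # partnerships than the rest, which gives the men's counts a long tail.
    weights = [1 / (i + 1) for i in men]

    partnerships = set()  # each (man, woman) pair is counted once
    while len(partnerships) < n_partnerships:
        man = rng.choices(men, weights=weights)[0]
        woman = rng.randrange(n_women)
        partnerships.add((man, woman))

    men_counts = [0] * n_men
    women_counts = [0] * n_women
    for man, woman in partnerships:
        men_counts[man] += 1      # every partnership adds one to a man...
        women_counts[woman] += 1  # ...and one to a woman

    return {
        "mean_men": sum(men_counts) / n_men,
        "mean_women": sum(women_counts) / n_women,
        "median_men": statistics.median(men_counts),
        "median_women": statistics.median(women_counts),
    }


result = simulate_closed_population(n_men=200, n_women=200, n_partnerships=1000)
print(result)
```

However unevenly the partnerships are shared out, the two means come out identical when the groups are the same size (here 1,000 partnerships over 200 people each gives 5.0 for both). The medians carry no such guarantee, since they depend on how the partnerships are distributed within each group.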

Figure 2.4 reveals the actual data distribution, which supports the impression given by the summary statistics of an extreme right-hand tail. But it is only by looking at this raw data that further important details are revealed, such as the strong tendency for both men and women to provide rounded numbers when there have been ten or more partners (except for the rather pedantic man, possibly a statistician, who said precisely, ‘Forty-seven’). You may, of course, wonder about the reliability of these self-reports, and potential biases in these data are discussed in the next chapter.

Large collections of numerical data are routinely summarized and communicated using a few statistics of location and spread, and the sexual-partner example has shown that these can take us a long way in grasping an overall pattern. However, there is no substitute for simply looking at data properly, and the next example shows that a good visualization is particularly valuable when we want to grasp the pattern in a large and complex set of numbers.



Figure 2.4

Data provided by Natsal-3 based on interviews between 2010 and 2012. The series have been truncated at 50 for reasons of space—the totals go up to 500 for both men and women. Note the clear use of round numbers for ten or more partners, and the tendency for men to report more partners than women.



Describing Relationships Between Variables


Do busier hospitals have higher survival rates?


There is considerable interest in the so-called ‘volume effect’ in surgery: the claim that busier hospitals get better survival rates, possibly since they achieve greater efficiency and have more experience. Figure 2.5 shows 30-day survival rates in UK hospitals conducting heart surgery on children plotted against the number of children being treated. Figure 2.5(a) shows the data on children aged under 1 over the period 1991–1995 that was featured at the start of the last chapter, since this age group is at higher risk and was the focus of the Bristol Inquiry. Figure 2.5(b) shows the data for all children under 16 in the period 2012–2015 that was previously shown in Table 1.1; specific data for children under 1 is not available for that period. Volume is plotted on the horizontal x-axis, and the survival rate on the vertical y-axis.*

