The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

observation = deterministic model + residual error.

This formula can be interpreted as saying that, in the statistical world, what we see and measure around us can be considered as the sum of a systematic mathematical idealized form plus some random contribution that cannot yet be explained. This is the classic idea of the signal and the noise.
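As a rough illustration of this decomposition, the sketch below fits a straight line by least squares to a handful of made-up points (the numbers are invented for this example and do not come from the book) and splits each observation into a fitted value plus a residual. The helper name fit_line is an assumption of this sketch.

```python
# Made-up illustrative data, not taken from the book.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return the least-squares intercept and gradient of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return intercept, gradient


intercept, gradient = fit_line(xs, ys)
fitted = [intercept + gradient * x for x in xs]    # the deterministic model
residuals = [y - f for y, f in zip(ys, fitted)]    # the residual error

for y, f, r in zip(ys, fitted, residuals):
    print(f"observation {y:5.2f} = model {f:5.2f} + residual {r:+5.2f}")
```

Each printed line adds up by construction, and with a least-squares line that includes an intercept the residuals sum to zero, so the 'noise' carries no leftover systematic shift.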


Do speed cameras reduce accidents?

 

This section contains a simple lesson: just because we act, and something changes, it doesn’t mean we were responsible for the result. Humans seem to find this simple truth difficult to grasp—we are always keen to construct an explanatory narrative, and even keener if we are at its centre. Of course sometimes this interpretation is true—if you flick a switch, and the light comes on, then you are usually responsible. But sometimes your actions are clearly not responsible for an outcome: if you don’t take an umbrella, and it rains, it is not your fault (although it may feel that way). But the consequences of many of our actions are less clear-cut. Suppose you have a headache, take an aspirin, and your headache goes away. How do you know it wouldn’t have gone away even if you had not taken a tablet?

We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous. A classic example concerns speed cameras, which tend to get put in places that have recently experienced accidents. When the accident rate subsequently goes down, this change is then attributed to the presence of the cameras. But would the accident rates have gone down anyway?

Strings of good (or bad) luck do not go on for ever, and eventually things settle back down—this can also be considered as regression-to-the-mean, just like tall fathers tending to have shorter sons. But if we believe these runs of good or bad fortune represent a constant state of affairs, then we will wrongly attribute the reversion to normal as the consequence of any intervention we have made. Perhaps this all seems rather obvious, but this simple idea has remarkable ramifications, such as:

• Football managers who get sacked after a string of losses, only to find their successors getting credit for the return to normal.

• Active fund managers dropping in performance, after being tipped (and perhaps getting large bonuses) after a couple of good years.

• The ‘Curse of Sports Illustrated’, in which athletes get featured on the cover of a prominent magazine following a series of achievements, only to subsequently have their performance plummet.

 

Luck plays a considerable part in the position that sports teams have in their league table, and regression-to-the-mean implies that we would expect teams that do well one year to decline the following year, and those that do badly to improve their position, particularly if the teams are fairly evenly matched. Conversely, if we see this pattern of changes, we might suspect that regression-to-the-mean is operating and not take too much notice of claims about the influence of, say, new training methods.

It is not only sports teams that are ranked in league tables. Take the example of the PISA Global Education Tables, which compare different countries’ school systems in mathematics. A change in league table position between 2003 and 2012 was strongly negatively correlated with initial position, meaning that countries at the top tended to go down, and those at the bottom tended to go up. The correlation was −0.60, and some theory shows that if the rankings were complete chance and all that was operating were regression-to-the-mean, the correlation would be expected to be −0.71, not very different from what was observed.4 This suggests the differences between countries were far less than claimed, and that changes in league position had little to do with changes in teaching philosophy.
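The figure of −0.71 can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the calculation behind the published analysis: it assumes 40 countries (an arbitrary number), makes the second ranking a random shuffle that is independent of the first, and averages the correlation between initial position and change in position over many repetitions. The function names are invented for this example. For two independent quantities with equal spread, the theoretical value is −1/√2, which is about −0.71.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def chance_rank_correlation(n_countries=40, n_sims=2000, seed=1):
    """Average correlation between initial rank and change in rank
    when the second ranking is pure chance (assumed set-up)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        first = list(range(1, n_countries + 1))
        second = first[:]
        rng.shuffle(second)                     # independent of the first ranking
        change = [b - a for a, b in zip(first, second)]
        total += pearson(first, change)
    return total / n_sims


print(round(chance_rank_correlation(), 2))
print(round(-1 / math.sqrt(2), 2))              # theoretical value under pure chance
```

Here rank 1 is the top of the table, so a positive change means moving down. Under these assumptions the simulated average sits close to −1/√2, which is why an observed correlation of −0.60 is hard to distinguish from chance alone.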

Regression-to-the-mean also operates in clinical trials. In the last chapter we saw that randomized trials were needed to evaluate new pharmaceuticals properly, since even people in the control arm showed benefit—the so-called placebo effect. This is often interpreted to mean that just taking a sugar pill (preferably a red one) actually has a beneficial effect on people’s health. But much of the improvement seen in people who do not receive any active treatment may be regression-to-the-mean, since patients are enrolled in trials when they are showing symptoms, and many of these would have resolved anyway.

So if we want to know the genuine effect of installing speed cameras in accident black spots, then we should follow the approach used for evaluating pharmaceuticals and take the bold step of randomly allocating speed cameras. When such studies have been conducted, it is estimated that about two-thirds of the apparent benefit from cameras is due to regression-to-the-mean.5
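The logic of random allocation can be sketched with a simulation. Everything in it is an assumption made for illustration: each site has a long-run accident rate drawn between 2 and 8 per year, yearly counts are Poisson, a 'black spot' is any site with 10 or more accidents in the first year, and cameras truly cut the rate by 20%. None of these numbers come from the studies mentioned above, so the output is not expected to reproduce the two-thirds figure.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count using Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def mean(values):
    return sum(values) / len(values)


def simulate(n_sites=20000, black_spot=10, true_reduction=0.2, seed=7):
    """Compare a naive before-and-after estimate with a randomized one."""
    rng = random.Random(seed)
    camera_before, camera_after, control_after = [], [], []
    for _ in range(n_sites):
        rate = rng.uniform(2.0, 8.0)            # assumed long-run rate at this site
        year1 = poisson(rng, rate)
        if year1 < black_spot:
            continue                            # not a black spot, so not studied
        if rng.random() < 0.5:                  # random allocation of a camera
            camera_before.append(year1)
            camera_after.append(poisson(rng, rate * (1 - true_reduction)))
        else:
            control_after.append(poisson(rng, rate))
    before = mean(camera_before)
    naive_drop = before - mean(camera_after)                  # before vs after
    randomized_drop = mean(control_after) - mean(camera_after)  # control vs camera
    return before, naive_drop, randomized_drop


before, naive_drop, randomized_drop = simulate()
print(f"camera sites, year 1 mean:     {before:.2f}")
print(f"naive before-and-after drop:   {naive_drop:.2f}")
print(f"drop relative to control arm:  {randomized_drop:.2f}")
```

The control sites were chosen for their bad first year in exactly the same way as the camera sites, so they fall back towards their long-run rates too. Comparing the two arms in the second year removes that shared regression-to-the-mean and leaves an estimate of the effect of the cameras themselves.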

 

 

Dealing With More Than One Explanatory Variable


Since Galton’s early work there have been many extensions to the basic idea of regression, vastly helped by modern computing. These developments include:

• having many explanatory variables

• explanatory variables that are categories rather than numbers

• having relationships that are not straight lines and adapt flexibly to the pattern of the data

• response variables that are not continuous variables, such as proportions and counts

 

As an example of having more than one explanatory variable, we can look at how the height of a son or daughter is related to the height of their father and their mother. The scatter of data-points is now in three dimensions and becomes much more difficult to draw on a page, but we can still use the idea of least-squares to work out the formula that best predicts offspring height. This is known as a multiple linear regression.* When we just had one explanatory variable the relationship with the response variable was summarized by a gradient, which can also be interpreted as a coefficient in a regression equation; this idea can be generalized to more than one explanatory variable.

The results for Galton’s families are shown in Table 5.3. How can we interpret the coefficients shown here? First, they are part of a formula that could be used to predict adult offspring height for a particular mother and father.* But they also illustrate the idea of adjustment of an apparent relationship, by taking account of a third, confounding factor.

For example, we saw in Table 5.2 that the gradient when regressing the height of daughters on their mother’s height was 0.33 (remember that the gradient of a line fitted to a scatter-plot is just another name for the regression coefficient). Table 5.3 shows that, if we also allow for the effect of father’s height, this coefficient is reduced to 0.30. When predicting a son’s height, the regression coefficient for the father is similarly reduced from 0.45 in Table 5.2 to 0.41 in Table 5.3, when the mother’s height is taken into account. So the height of a parent has a slightly reduced association with their adult offspring’s height, when allowing for the effect of the other parent. This could be due to the fact that taller women tend to marry taller men, so that each parent’s height is not a completely independent factor. Overall, the data suggest that a one-inch difference in a father’s height is associated with a bigger difference in an adult child’s height than a one-inch difference in a mother’s height. Multiple regression is often used when researchers are interested in one particular explanatory variable, and other variables need to be ‘adjusted for’ to allow for imbalances.
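The way a coefficient shrinks after adjustment can be reproduced on simulated families. The sketch below does not use Galton’s data: the heights (in inches), the tendency for taller fathers to have taller partners, and the 'true' coefficients of 0.4 for the father and 0.3 for the mother are all assumptions chosen for illustration, as are the function names. It fits a two-variable least-squares regression by solving the normal equations directly.

```python
import random


def simulate_families(n=10000, seed=3):
    """Synthetic (father, mother, child) heights in inches; all values assumed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        father = rng.gauss(69.0, 2.5)
        # Taller men tend to have taller partners (assumed strength 0.3).
        mother = 64.0 + 0.3 * (father - 69.0) + rng.gauss(0.0, 2.2)
        child = 20.0 + 0.4 * father + 0.3 * mother + rng.gauss(0.0, 2.0)
        rows.append((father, mother, child))
    return rows


def mean(values):
    return sum(values) / len(values)


def cross(us, vs):
    """Sum of products of deviations from the means."""
    mu, mv = mean(us), mean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


def simple_gradient(xs, ys):
    """Least-squares gradient with a single explanatory variable."""
    return cross(xs, ys) / cross(xs, xs)


def two_variable_fit(fs, ms, ys):
    """Least-squares coefficients for ys = a + b_f * fs + b_m * ms."""
    sff, smm, sfm = cross(fs, fs), cross(ms, ms), cross(fs, ms)
    sfy, smy = cross(fs, ys), cross(ms, ys)
    det = sff * smm - sfm ** 2
    b_f = (sfy * smm - sfm * smy) / det
    b_m = (sff * smy - sfm * sfy) / det
    a = mean(ys) - b_f * mean(fs) - b_m * mean(ms)
    return a, b_f, b_m


fathers, mothers, children = zip(*simulate_families())
unadjusted = simple_gradient(mothers, children)
intercept, b_father, b_mother = two_variable_fit(fathers, mothers, children)

print(f"mother's gradient, on her own:           {unadjusted:.2f}")
print(f"mother's coefficient, father adjusted:   {b_mother:.2f}")
print(f"father's coefficient, mother adjusted:   {b_father:.2f}")
```

In this set-up the mother’s unadjusted gradient is larger than her adjusted coefficient, because on her own she also stands in for part of the father’s contribution. Once the father’s height is in the equation, that borrowed association is removed.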

Table 5.3
