The Art of Statistics How to Learn from Data(16)
Author: David Spiegelhalter

But first we return to Francis Galton. He had the classic Victorian gentleman scientist’s obsessive interest in collecting data, and eliciting the wisdom of crowds about the weight of an ox is only one example. He used his observations to make weather forecasts, to assess the efficacy of prayer and even to compare the relative beauty of young women in different parts of the country.* He also shared his cousin Charles Darwin’s fixation on inheritance, and set out to investigate the way that personal characteristics change between generations. He was particularly interested in the following question:


Using their parents’ heights, how can we predict an adult offspring’s height?

 

In 1886 Galton reported the heights of a large group of parents and their adult children, and summary statistics for the majority of the data are shown in Table 5.1.1 Galton’s sample had similar heights to contemporary adults (the average heights for adult women and men in the UK in 2010 were reported to be 63 and 69 inches respectively), which suggests his subjects were well-nourished and of higher socioeconomic status.

Figure 5.1 shows a scatter-plot of 465 sons’ heights against their fathers’ heights. The heights of fathers and sons are clearly correlated, with a Pearson correlation of 0.39. What if we wanted to predict a son’s height from his father’s? We might start by choosing a straight line to make our predictions, since that will enable us, for any father’s height, to calculate a prediction for their son’s stature. Our immediate intuition might be to use a diagonal line of ‘equality’, so that an adult son is predicted to have the same height as his father. But it turns out we can improve on this choice.

For any straight line we choose, each data-point will give rise to a residual (the vertical dashed lines on the plot), which is the size of the error were we to use the line to predict a son’s height from his father’s. We want a line that makes these residuals small, and the standard technique is to choose a least-squares fitted line, for which the sum of the squares of the residuals is smallest.* The formula for this line is straightforward (see the Glossary), and was developed by both French mathematician Adrien-Marie Legendre and Carl Friedrich Gauss at the end of the eighteenth century. The line is generally known as the ‘best-fit’ line, and represents the best prediction we can make of a son’s height from knowing his father’s.
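The least-squares calculation can be sketched in a few lines of code. The closed-form formulas below are the standard ones (gradient equals the sum of products of deviations divided by the sum of squared deviations of the predictor); the heights used are illustrative made-up values, not Galton’s data.

```python
# A minimal sketch of fitting a least-squares line, using the closed-form
# formulas: gradient = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),
# intercept = mean_y - gradient * mean_x.

def least_squares(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

# Hypothetical father/son heights in inches (not Galton's data)
fathers = [65, 67, 68, 70, 72]
sons = [67, 68, 68, 70, 71]
gradient, intercept = least_squares(fathers, sons)
print(f"predicted height of a 69-inch father's son: "
      f"{intercept + gradient * 69:.1f} inches")
```

Among all possible straight lines, this one minimizes the sum of squared residuals, which is exactly the criterion described in the text.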


 

 

Table 5.1

Summary statistics of recorded heights (in inches) of 197 sets of parents and their adult children recorded by Galton in 1886. For reference, 64 inches is 1.63 metres, 69 inches is 1.75 metres. Even without plotting the data, the closeness of the mean and median suggests a symmetric data distribution.

 

 

Figure 5.1

Scatter of heights of 465 fathers and sons from Galton’s data (many fathers are repeated since they have multiple sons). A jitter has been added to separate the points, and the diagonal dashed line represents exact equality between sons’ and fathers’ heights. The solid line is the standard ‘best-fit’ line. Each point gives rise to a ‘residual’ (dashed line), which is the size of the error were we to use the line to predict a son’s height from his father’s.

 

 

The least-squares prediction line in Figure 5.1 goes through the middle of the cloud of points, representing the mean values for the heights of fathers and sons, but does not follow the diagonal line of ‘equality’. It is clearly lower than the line of equality for fathers who are taller than average, and higher than the line of equality for fathers who are shorter than average. This means that tall fathers tend to have sons who are slightly shorter than them, while shorter fathers have slightly taller sons. Galton called this ‘regression to mediocrity’, whereas now it is known as regression to the mean. The phenomenon also holds for mothers and daughters: taller mothers tend to have daughters who are shorter than them, and shorter mothers tend to have taller daughters. This explains the origin of the term in this chapter’s title: eventually any process of fitting lines or curves to data came to be called ‘regression’.
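Regression to the mean can be seen in miniature with a best-fit line whose gradient is below 1: a father who is d inches from the mean predicts a son only gradient × d inches from the mean. The means and gradient below are illustrative values in the spirit of Galton’s data, not his exact estimates.

```python
# Regression to the mean in miniature. Hypothetical values:
mean_father, mean_son = 68.0, 69.0   # assumed mean heights in inches
gradient = 0.5                       # assumed best-fit gradient, below 1

def predict_son(father_height):
    # The best-fit line passes through the point of means
    return mean_son + gradient * (father_height - mean_father)

tall = predict_son(74)    # father 6 inches above the mean
short = predict_son(62)   # father 6 inches below the mean
print(tall, short)        # both predictions are pulled toward the mean
```

A 74-inch father (6 inches above the mean) yields a predicted son only 3 inches above the mean son’s height, and a 62-inch father a son only 3 inches below it: exactly the pull toward the middle that Galton observed.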


In basic regression analysis the dependent variable is the quantity that we want to predict or explain, usually plotted on the vertical y-axis of a graph; it is sometimes known as the response variable. The independent variable is the quantity that we use to do the predicting or explaining, usually plotted on the horizontal x-axis, and is sometimes known as the explanatory variable. The gradient of the fitted line is also known as the regression coefficient.

Table 5.2 shows the correlations between parent and offspring heights, and the gradients of regression lines.* There is a simple relationship between the gradients, the Pearson correlation coefficient and the standard deviations of the variables.* In fact if the standard deviations of the independent and dependent variables are the same, then the gradient is simply the Pearson correlation coefficient, which explains their similarity in Table 5.2.
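The simple relationship referred to here is that the gradient equals the Pearson correlation multiplied by the ratio of the standard deviations (dependent over independent), so equal standard deviations make gradient and correlation coincide. A small sketch can confirm this numerically; the heights are illustrative, not from Table 5.2.

```python
# Checking that gradient = r * (sd of dependent) / (sd of independent),
# where r is the Pearson correlation coefficient.
import statistics as st

def pearson_r(x, y):
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

mothers = [62, 63, 64, 65, 67]    # hypothetical heights in inches
daughters = [63, 63, 65, 64, 66]

r = pearson_r(mothers, daughters)
grad_via_r = r * st.stdev(daughters) / st.stdev(mothers)

# The same gradient from the least-squares formula cov(x, y) / var(x)
n = len(mothers)
cov = sum((a - st.mean(mothers)) * (b - st.mean(daughters))
          for a, b in zip(mothers, daughters)) / (n - 1)
grad_direct = cov / st.variance(mothers)
print(f"{grad_via_r:.4f}  {grad_direct:.4f}")   # the two routes agree
```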

The meaning of these gradients depends completely on our assumptions about the relationship between the variables being studied. For correlational data, the gradient indicates how much we would expect the dependent variable to change, on average, if we observe a one unit difference for the independent variable. For example, if Alice is one inch taller than Betty, we would predict Alice’s adult daughter to be 0.33 inches taller than Betty’s adult daughter. Of course we would not expect this prediction to match their true difference in heights precisely, but it is the best guess we can make with the data available.
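The arithmetic behind the Alice-and-Betty prediction is simply the gradient multiplied by the observed difference; the 0.33 gradient follows the mother–daughter figure quoted in the text.

```python
# The gradient as an expected difference: a one-unit difference in the
# independent variable predicts a gradient-sized difference in the
# dependent variable. Gradient of 0.33 as quoted for mothers/daughters.
gradient = 0.33

height_gap_mothers = 1.0                        # Alice minus Betty, inches
predicted_gap_daughters = gradient * height_gap_mothers
print(f"{predicted_gap_daughters:.2f} inches")  # predicted daughters' gap
```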

If, however, we assumed a causal relationship then the gradient has a very different interpretation—it is the change we would expect in the dependent variable were we to intervene and change the independent variable to a value one unit higher. This is definitely not the case for heights since they cannot be altered by experimental means, at least for adults. Even with the Bradford Hill criteria outlined above, statisticians are generally reluctant to attribute causation unless there has been an experiment, although computer scientist Judea Pearl and others have made great progress in setting out the principles for building causal regression models from observational data.2


 

 

Table 5.2

Correlations between the heights of adult children and the parent of the same gender, and gradients of the regression of the offspring’s height on the parent’s.

 

 

Regression Lines Are Models


The regression line we fitted between fathers’ and sons’ heights is a very basic example of a statistical model. The US Federal Reserve define a model as a ‘representation of some aspect of the world which is based on simplifying assumptions’: essentially some phenomenon will be represented mathematically, generally embedded in computer software, in order to produce a simplified ‘pretend’ version of reality.3

Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction of a son’s height from his father’s. But the deterministic part of a model is not going to be a perfect representation of the observed world. As we saw in Figure 5.1, there is a big scatter of heights around the regression line, and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error—although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe. So in summary, we assume that
