The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

Results of a multiple linear regression relating adult offspring height to that of their mother and father. The ‘intercept’ is the average height of offspring (Table 5.1). The coefficients of multiple regression indicate the predicted change in adult offspring height for each one-inch change from the average parental height.


Let’s return to the Swedish study on brain tumours that we saw in Chapter 4 as an example of an inappropriate media interpretation of causation. A regression analysis had the rate of tumours as the dependent, or response, variable, and education as the independent, or explanatory, variable of interest. Other factors entered into the regression included age at diagnosis, calendar year, region of Sweden, marital status and income, all of which were considered to be potential confounding variables. This adjustment for confounders is an attempt to tease out a purer relationship between education and brain tumours, but it can never be wholly adequate. There will always remain the suspicion that some other lurking process might be at work, such as those with higher education seeking better health care and increased diagnoses.
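The effect of adjusting for a measured confounder can be sketched with a small simulation. Everything here is synthetic and hypothetical (the variable names `confounder`, `exposure` and `outcome`, the effect sizes, the sample size); it is not the Swedish data. The exposure has no real effect on the outcome, yet a crude regression finds one because both depend on the confounder. The adjusted coefficient is computed by regressing residuals on residuals, which gives the same coefficient as a multiple regression that includes the confounder.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """What is left of ys after removing its least-squares fit on xs."""
    n = len(xs)
    b = slope(xs, ys)
    a = sum(ys) / n - b * sum(xs) / n
    return [y - (a + b * x) for x, y in zip(xs, ys)]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical set-up: the confounder drives both exposure and outcome,
# and the exposure itself has NO effect on the outcome.
confounder = [rng.gauss(0, 1) for _ in range(n)]
exposure = [z + rng.gauss(0, 1) for z in confounder]
outcome = [2 * z + rng.gauss(0, 1) for z in confounder]

# Crude analysis: ignore the confounder.
crude = slope(exposure, outcome)

# Adjusted analysis: strip the confounder out of both variables first.
adjusted = slope(residuals(confounder, exposure),
                 residuals(confounder, outcome))

print(f"crude coefficient:    {crude:.2f}")
print(f"adjusted coefficient: {adjusted:.2f}")
```

The crude coefficient comes out close to 1 and the adjusted one close to 0. The adjustment only works here because the confounder was measured; a lurking variable that was never recorded cannot be removed this way.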

In a randomized trial, there should be no need to adjust for confounders, because random allocation should ensure that all factors other than the treatment itself are balanced between groups. But researchers often carry out a regression analysis anyway, just in case some imbalances have slipped in.

 

 

Different Types of Response Variables


Not all data are continuous measurements such as height. In much of statistical analysis, the dependent variables may be the proportions of events that either happen or not (for example, the proportion of people who survive surgery), counts of the numbers of events (for example, how many cancers occur per year in a certain area), or the length of time before an event occurs (for example, years of survival following surgery). Each type of dependent variable has its own form of multiple regression, with a correspondingly different interpretation of the estimated coefficients.

Consider the child heart surgery data discussed in Chapter 2, where Figure 2.5(a) showed the proportions surviving surgery and the number of cases treated in each hospital between 1991 and 1995. The scatter-plot is shown again in Figure 5.2 with a regression line that has been fitted without using the outlying data-point corresponding to Bristol.

While we could have fitted a linear regression line through these points, naïve extrapolation would suggest that if a hospital treated a huge number of cases, their survival would be predicted to be greater than 100%, which is absurd. So a form of regression has been developed for proportions, called logistic regression, which ensures a curve which cannot go above 100% or below 0%.
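The difference between the two kinds of model can be sketched in a few lines. The intercepts and slopes below are invented for illustration and are not the fitted values from the child heart surgery data; the point is only that a straight line eventually predicts a survival proportion above 1, while the logistic curve stays strictly between 0 and 1.

```python
import math

def linear_prediction(cases, intercept=0.80, slope=0.0002):
    """Straight-line prediction of the survival proportion (hypothetical values)."""
    return intercept + slope * cases

def logistic_prediction(cases, intercept=1.4, slope=0.001):
    """Logistic prediction: the straight line is on the log-odds scale,
    then converted back to a proportion between 0 and 1 (hypothetical values)."""
    log_odds = intercept + slope * cases
    return 1 / (1 + math.exp(-log_odds))

for cases in (0, 500, 1000, 2000, 10000):
    print(f"{cases:>6} cases: linear {linear_prediction(cases):.3f}, "
          f"logistic {logistic_prediction(cases):.3f}")
```

With these made-up numbers the linear model passes 100% survival somewhere above 1,000 cases, whereas the logistic prediction keeps rising towards 100% without reaching it.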

Even without taking Bristol into account, hospitals with more patients had better survival rates, and the logistic regression coefficient (0.001) means the mortality rate is expected to be around 10% lower (relatively) for each additional 100 operations that a hospital conducts on under-1s over a four-year period.* Of course, to use what is now rather a cliché, correlation does not mean causation, and we cannot conclude that bigger throughput is the reason for the better performance: as we mentioned previously, there could even be reverse causation, with hospitals with a good reputation attracting more patients.
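The arithmetic behind the "around 10%" figure can be sketched as follows, assuming that the coefficient of 0.001 is the change in the log-odds of survival per additional operation (the usual scale for logistic regression, but an assumption here). The baseline mortality of 12% is a hypothetical value chosen only to show the conversion from odds back to a rate.

```python
import math

coef_per_operation = 0.001   # coefficient quoted in the text
extra_operations = 100

# On the log-odds scale the effect of 100 extra operations is additive.
log_odds_change = coef_per_operation * extra_operations   # 0.1

# Exponentiating turns it into a multiplier on the odds.
survival_odds_ratio = math.exp(log_odds_change)     # about 1.105
mortality_odds_ratio = math.exp(-log_odds_change)   # about 0.905

# Convert to a change in the mortality rate for a hypothetical baseline.
baseline_mortality = 0.12    # assumed for illustration only
baseline_odds = baseline_mortality / (1 - baseline_mortality)
new_odds = baseline_odds * mortality_odds_ratio
new_mortality = new_odds / (1 + new_odds)
relative_reduction = 1 - new_mortality / baseline_mortality

print(f"odds of death multiplied by {mortality_odds_ratio:.3f}")
print(f"mortality {baseline_mortality:.1%} -> {new_mortality:.1%} "
      f"({relative_reduction:.1%} relative reduction)")
```

The odds of death fall by a little under 10%, and because mortality is fairly low the relative fall in the mortality rate itself is of a similar size.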

 

 

Figure 5.2

Fitted logistic regression model for child heart surgery data for under-1s in UK hospitals between 1991 and 1995. Hospitals treating more patients have better survival. The line is part of a curve that will never reach 100%, and is fitted ignoring the outlying data-point representing Bristol.

 

 

This was a controversial finding when it was released in 2001, and has contributed to prolonged, and still unresolved, disputes about how many hospitals in the UK should conduct this form of surgery.

 

 

Beyond Basic Regression Modelling


The techniques outlined in this chapter have worked remarkably well since their introduction more than a century ago. But both the availability of large amounts of data and the extraordinary increase in computing power have allowed far more sophisticated models to be developed. Very broadly, four main modelling strategies have been adopted by different communities of researchers:

• Rather simple mathematical representations for associations, such as the linear regression analyses in this chapter, which tend to be favoured by statisticians.

• Complex deterministic models based on scientific understanding of a physical process, such as those used in weather forecasting, which are intended to realistically represent underlying mechanisms, and which are generally developed by applied mathematicians.

• Complex algorithms used to make a decision or prediction that have been derived from an analysis of huge numbers of past examples, for example to recommend books you might like to buy from an online retailer, and which come from the world of computer science and machine learning. These will often be ‘black boxes’ in the sense that they may make good predictions, but their internal structure is somewhat inscrutable—see the next chapter.

• Regression models that claim to reach causal conclusions, as favoured by economists.

 

These are huge generalizations, and fortunately professional barriers are breaking down and we shall see later that a more ecumenical approach to modelling is developing. But whatever the strategy adopted, common issues arise when building and using a model.

A good analogy is that a model is like a map, rather than the territory itself. And we all know that some maps are better than others: a simple one might be good enough to drive between cities, but we need something more detailed when walking through the countryside. The British statistician George Box has become famous for his brief but invaluable aphorism: ‘All models are wrong, some are useful.’ This pithy statement was based on a lifetime spent bringing statistical expertise to industrial processes, which led Box to appreciate both the power of models and the danger of actually starting to believe in them too much.

But these cautions are easily forgotten. Once a model becomes accepted, and especially when it is out of the hands of those who created it and understand its limitations, then it can start acting as a sort of oracle. The financial crisis of 2007–2008 has to a large extent been blamed on the exaggerated trust placed in complex financial models used to determine the risk of, say, bundles of mortgages. These models assumed only a moderate correlation between mortgage failures, and worked well while the property market was booming. But when conditions changed and mortgages started failing, they tended to fail in droves: the models grossly underestimated the risks because the correlations turned out to be far higher than supposed. Senior managers simply did not realize the frail basis on which these models were built, losing track of the fact that models are simplifications of the real world: they are the maps, not the territory. The result was one of the worst global economic crises in history.
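Why the assumed correlation matters so much can be sketched with a toy simulation. This is not the model the banks used; it is a simple one-factor set-up chosen for illustration, in which each mortgage defaults when a mix of a shared "market" shock and its own private shock falls below a threshold. The function name `prob_many_defaults`, the 5% default probability, the bundle of 100 mortgages and the two correlation values are all hypothetical.

```python
import random
from statistics import NormalDist

def prob_many_defaults(correlation, n_mortgages=100, default_prob=0.05,
                       bad_fraction=0.20, n_trials=2000, seed=0):
    """Estimate the chance that at least `bad_fraction` of a bundle defaults.

    Each mortgage defaults when a weighted mix of a shared market shock
    and its own shock falls below a threshold. `correlation` controls how
    much weight the shared shock gets. All settings are hypothetical.
    """
    rng = random.Random(seed)
    # Threshold chosen so each mortgage alone defaults with `default_prob`.
    threshold = NormalDist().inv_cdf(default_prob)
    w_common = correlation ** 0.5
    w_own = (1 - correlation) ** 0.5
    bad_outcomes = 0
    for _ in range(n_trials):
        market = rng.gauss(0, 1)   # one shock shared by the whole bundle
        defaults = sum(
            1 for _ in range(n_mortgages)
            if w_common * market + w_own * rng.gauss(0, 1) < threshold
        )
        if defaults >= bad_fraction * n_mortgages:
            bad_outcomes += 1
    return bad_outcomes / n_trials

assumed = prob_many_defaults(correlation=0.05)   # what the model supposed
actual = prob_many_defaults(correlation=0.50)    # what conditions turned out to be
print(f"chance of 20+ defaults, low correlation:  {assumed:.3f}")
print(f"chance of 20+ defaults, high correlation: {actual:.3f}")
```

Every individual mortgage is equally risky in both runs; only the correlation changes. Under the low-correlation assumption a wave of defaults is very rare, while under the higher correlation it becomes many times more likely.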

 

 

Summary


• Regression models provide a mathematical representation of the relationship between a set of explanatory variables and a response variable.

• The coefficients in a regression model indicate how much we expect the response to change when the explanatory variable is observed to change.

• Regression-to-the-mean occurs when more extreme responses revert to nearer the long-term average, since a contribution to their previous extremeness was pure chance.
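Regression-to-the-mean can be sketched with a short simulation. The set-up is hypothetical: each of 10,000 individuals has a fixed underlying level (`skill`), and each measurement adds independent chance variation of the same size. Those who scored in the top 10% the first time are then measured again.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 10000

# Each observation = stable underlying level + pure chance (hypothetical set-up).
skill = [rng.gauss(0, 1) for _ in range(n)]
first = [s + rng.gauss(0, 1) for s in skill]
second = [s + rng.gauss(0, 1) for s in skill]

# Select the individuals in the top 10% on the first measurement.
cutoff = sorted(first)[int(0.9 * n)]
top = [i for i in range(n) if first[i] >= cutoff]

mean_first_top = sum(first[i] for i in top) / len(top)
mean_second_top = sum(second[i] for i in top) / len(top)

print(f"top group, first measurement:  {mean_first_top:.2f}")
print(f"top group, second measurement: {mean_second_top:.2f}")
print("overall average is 0 by construction")
```

The top group is still above average on the second measurement, because part of its first score reflected a genuinely high underlying level, but its average moves back towards the overall mean because the lucky part of the first score does not repeat.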
