The Art of Statistics How to Learn from Data(11)
Author: David Spiegelhalter

Perhaps the most crucial lesson from this example is that the dark-grey shaded area in Figure 3.2(d) plays two roles:

1. It represents the proportion of this population of babies being low birth weight.

2. It is also the probability that a randomly chosen baby in 2013 weighs less than 2,500 g.

 

So a population can be thought of as a physical group of individuals, but also as providing the probability distribution for a random observation. This dual interpretation will be fundamental when we come to more formal statistical inference.
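The two roles can be checked numerically. The sketch below is only an illustration: it assumes birth weights follow a normal distribution, and the mean of 3,390 g and standard deviation of 480 g are hypothetical values chosen for the example, not figures taken from this section. It computes the shaded area directly, then estimates the chance that a randomly drawn baby weighs less than 2,500 g by simulating many random draws.

```python
import random
from statistics import NormalDist

# Hypothetical population parameters, assumed purely for illustration.
MEAN_G = 3390
SD_G = 480
LOW_BIRTH_WEIGHT_G = 2500

population = NormalDist(mu=MEAN_G, sigma=SD_G)

# Role 1: the proportion of the population below the threshold,
# i.e. the area under the curve to the left of 2,500 g.
proportion_low = population.cdf(LOW_BIRTH_WEIGHT_G)

# Role 2: the probability that one randomly chosen baby falls below the
# threshold, estimated here by picking many babies at random.
rng = random.Random(0)
draws = [rng.gauss(MEAN_G, SD_G) for _ in range(200_000)]
estimated_probability = sum(w < LOW_BIRTH_WEIGHT_G for w in draws) / len(draws)

print(f"proportion of population:   {proportion_low:.4f}")
print(f"estimated draw probability: {estimated_probability:.4f}")
```

The two printed numbers should agree closely, which is the dual interpretation in miniature: the same area describes both the make-up of the group and the chance attached to a single random pick.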

Of course in this case we know the shape and parameters of the population, and so we can say something both about the proportions in the population, and the chances of different events occurring for a random observation. But the whole point of this chapter is that we do not generally know about populations, and so want to follow the inductive process and go the other way around, from data to population. We have seen that the standard measures of mean, median, mode, and so on, which we developed for samples, extend to whole populations—but the difference is that we do not know what they are. And that is the challenge we face in the next chapter.

 

 

What Is the Population?


The stages of induction outlined above work well with planned surveys, but a lot of statistical analysis does not fit as easily into this framework. We have seen that, particularly when using administrative records such as police reports on crime, we may have all the possible data. But although there is no sampling, the idea of an underlying population can still be valuable.

Consider the children’s heart surgery data in Chapter 1. We made the rather bold assumption that there were no measurement problems; in other words, we have a complete record of both the operations and the 30-day survivors in each hospital. So our knowledge of the sample (Stage 2) is perfect.

But what is the study population? We have data on all the children and all the hospitals, and so there is no larger group from which they have been sampled. Although the idea of a population is usually introduced rather casually into statistics courses, this example shows it is a tricky and sophisticated idea that is worth exploring in some detail, as a lot of important ideas build on this concept.

There are three types of populations from which a sample might be drawn, whether the data come from people, transactions, trees, or anything else.

• A literal population. This is an identifiable group, such as when we pick a person at random when polling. Or there may be a group of individuals who could be measured, and although we don’t actually pick one at random, we have data from volunteers. For example, we might consider the people who guessed at the number of jelly beans as a sample from the population of all maths nerds who watch YouTube videos.

• A virtual population. We frequently take measurements using a device, such as taking someone’s blood pressure or measuring air pollution. We know we could always take more measurements and get a slightly different answer, as you will know if you have ever taken repeat blood pressure measurements. The closeness of the multiple readings depends on the precision of the device and the stability of the circumstances—we might think of this as drawing observations from a virtual population of all the measurements that could be taken if we had enough time.

• A metaphorical population, when there is no larger population at all. This is an unusual concept. Here we act as if the data-point were drawn from some population at random, but it clearly is not—as with the children having heart surgery: we did not do any sampling, we have all the data, and there is no more we could collect. Think of the number of murders that occur each year, the examination results for a particular class, or data on all the countries of the world—none of these can be considered as a sample from an actual population.

 

The idea of a metaphorical population is challenging, and it may be best to think of what we have observed as having been drawn from some imaginary space of possibilities. For example, the history of the world is what it is, but we can imagine history having played out differently, and we happen to have ended up in just one of these possible states of the world. This set of all the alternative histories can be considered a metaphorical population. To be more concrete, when we looked at childhood heart surgery in the UK between 2012 and 2015, we had all the data on surgery in those years and knew how many deaths and how many survivors there were. Yet we can imagine counterfactual histories in which different individuals might have survived, through unforeseeable circumstances that we tend to call ‘chance’.
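One way to picture these counterfactual histories is to simulate them. The sketch below is a minimal illustration under stated assumptions: the counts of 1,000 operations and 980 survivors are hypothetical, not the real UK figures, and the function name simulate_histories is invented for this example. It treats each operation as having the same underlying chance of survival and replays the whole period many times to see how much the number of survivors might have varied through chance alone.

```python
import random

def simulate_histories(n_operations, survival_prob, n_histories, seed=0):
    """Replay the same set of operations many times.

    Each operation is treated as an independent event with the same
    survival probability (a simplifying assumption). Returns the number
    of survivors in each imagined history.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < survival_prob for _ in range(n_operations))
        for _ in range(n_histories)
    ]

# Hypothetical observed data, assumed purely for illustration.
observed_operations = 1000
observed_survivors = 980

# Use the observed survival rate as the assumed underlying chance.
assumed_survival_prob = observed_survivors / observed_operations

histories = simulate_histories(
    observed_operations, assumed_survival_prob, n_histories=2000
)

print("observed survivors:", observed_survivors)
print("range across imagined histories:", min(histories), "to", max(histories))
```

The history that actually happened is a single value; the spread of the simulated values gives a rough sense of the metaphorical population from which that one history can be imagined to have been drawn.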

It should be apparent that rather few applications of statistical science actually involve literal random sampling, and that it is increasingly common to have all the data that is potentially available. Nevertheless it is extremely valuable to keep hold of the idea of an imaginary population from which our ‘sample’ is drawn, as then we can use all the mathematical techniques that have been developed for sampling from real populations.

Personally, I rather like acting as if all that occurs around us is the result of some random pick from all the possible things that could happen. It is up to us whether we choose to believe it is truly chance, whether it is the will of a god or gods, or any other theory of causation: it makes no difference to the mathematics. This is just one of the mind-stretching requirements for learning from data.

 

 

Summary


• Inductive inference requires working from our data, through study sample and study population, to a target population.

• Problems and biases can crop up at each stage of this path.

• The best way to proceed from sample to study population is to have drawn a random sample.

• A population can be thought of as a group of individuals, but also as providing the probability distribution for a random observation drawn from that population.

• Populations can be summarized using parameters that mirror the summary statistics of sample data.

• Often data does not arise as a sample from a literal population. When we have all the data there is, then we can imagine it drawn from a metaphorical population of events that could have occurred, but didn’t.

 

 

CHAPTER 4


What Causes What?

 

Does going to university increase the risk of getting a brain tumour?

 

Epidemiology is the study of how and why diseases occur in the population, and Scandinavian countries are an epidemiologist’s dream. This is because everyone in those countries has a personal identity number which is used when registering for health care, education, tax, and so on, and this allows researchers to link all these different aspects of people’s lives together in a way that would be impossible (and perhaps politically controversial) in other countries.

A typically ambitious study was conducted on over 4 million Swedish men and women whose tax and health records were linked over eighteen years, which enabled the researchers to report that men with a higher socioeconomic position had a slightly increased rate of being diagnosed with a brain tumour. This was one of those worthy but rather unexciting studies that would typically not attract much attention, so a university communications officer thought it would be more interesting to say in a press release that ‘High levels of education are linked to heightened brain tumour risk’, even though the study was about socioeconomic position rather than education. And by the time this got to the general public, a subeditor in a newspaper produced the classic headline, ‘Why Going to University Increases Risk of Getting a Brain Tumour’.1
