The Art of Statistics: How to Learn from Data
Author: David Spiegelhalter

But fortunately you don’t have to agree with my (rather controversial) position that numerical probabilities do not objectively exist. It is fine to assume that coins and other randomizing devices are objectively random, in the sense that they give rise to data that are so unpredictable as to be indistinguishable from those we would expect to arise from ‘objective’ probabilities. So we generally act as if the observations are random, even when we know that this is not strictly true. The most extreme examples of this are pseudo-random-number generators, which are in fact based on logical and completely predictable calculations. They contain no randomness whatsoever, but their mechanism is so complex that they are in practice indistinguishable from truly random sequences—say, those obtained from a source of subatomic particles.*
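To see just how predictable such a generator is, here is a minimal sketch in Python (the seed value is arbitrary, chosen only for illustration): two generators started from the same seed churn out identical 'random' sequences.

```python
import random

# Two pseudo-random generators initialized with the same (arbitrary) seed:
# every value that follows is fully determined by that seed.
gen_a = random.Random(2024)
gen_b = random.Random(2024)

seq_a = [gen_a.randint(0, 9) for _ in range(10)]
seq_b = [gen_b.randint(0, 9) for _ in range(10)]

print(seq_a)           # looks haphazard to the eye...
print(seq_a == seq_b)  # ...but True: pure computation, no randomness at all
```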

This somewhat bizarre ability to act as if something is true, when you know it really isn’t, would usually be considered dangerously irrational. However, it will come in handy when it comes to using probability as a basis for the statistical analysis of data.


We now come to the crucial but difficult stage of laying out the general connection between probability theory, data and learning about whatever target population we are interested in.

Probability theory naturally comes into play in what we shall call situation 1:

1. When the data-point can be considered to be generated by some randomizing device, for example when throwing dice, flipping coins, or randomly allocating an individual to a medical treatment using a pseudo-random-number generator, and then recording the outcomes of their treatment.

 

But in practice we may be faced with situation 2:

2. When a pre-existing data-point is chosen by a randomizing device, say when selecting people to take part in a survey.

 

And much of the time our data arises from situation 3:

3. When there is no randomness at all, but we act as if the data-point were in fact generated by some random process, for example in interpreting the birth weight of our friend’s baby.

 

Most expositions do not make these distinctions clear: probability is generally taught using randomizing devices (situation 1) and statistics is taught through the idea of ‘random sampling’ (situation 2), but in fact the majority of applications of statistics do not involve any random devices or random sampling whatsoever (situation 3).

But first consider situations 1 and 2. Just before we operate the randomizing device, we assume we have a set of possible results that might be observed, together with their respective probabilities—for example a coin can be heads or tails, each with probability of ½. If we associate each of these possible outcomes with a quantity, say in this case 0 for tails and 1 for heads, then we say we have a random variable with a probability distribution. In situation 1, the randomizing device ensures the observation is generated at random from this distribution, and when it is observed, the randomness has gone and all these potential futures have collapsed down on to the actual observation.* Similarly, in situation 2, if we draw an individual at random and, say, measure their income, then we have essentially drawn an observation at random from a population distribution of incomes.
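As a minimal sketch in Python (seed arbitrary), we can code the coin flip as a random variable taking the value 1 for heads and 0 for tails, each with probability ½; before the function is called we have only a distribution, and afterwards a single fixed observation:

```python
import random

rng = random.Random(1)  # arbitrary seed, for reproducibility

def flip():
    """One coin flip as a random variable: 1 = heads, 0 = tails."""
    return rng.randint(0, 1)

# Before the call: a probability distribution, {0: 1/2, 1: 1/2}.
# After the call: the potential futures collapse to one observed value.
observation = flip()
print(observation)

# Over many flips the proportion of heads settles near 1/2.
flips = [flip() for _ in range(10_000)]
print(sum(flips) / len(flips))
```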

So probability is clearly relevant when we have a randomizing device. But most of the time we simply consider all the measurements available to us at the time, which may have been collected informally, or, as we saw in Chapter 3, even represent every possible observation: think of survival rates for children’s heart surgery at different hospitals or all examination results for British children—both of these comprise all the data available, and there has been no random sampling.

In Chapter 3 we discussed the idea of a metaphorical population, comprising the possible eventualities that might have occurred, but mainly didn’t. We now need to brace ourselves for an apparently irrational step: we need to act as if data were generated by a random mechanism from this population, even though we know full well that it was not.

 

 

If We Observe Everything, Where Does Probability Come In?

 

How often do we expect to see seven or more separate homicide incidents in England and Wales in a single day?

 

When extreme events happen in close succession, such as multiple plane crashes or natural disasters, there is a natural propensity to feel they are in some sense linked. It then becomes important to work out just how unusual such events are, and the following example shows how we can make such a call.

To assess how rare a ‘cluster’ of at least seven homicides in a day might be, we can examine data for the three years (1,095 days) between April 2014 and March 2016, in which there were 1,545 homicide incidents in England and Wales, an average of 1,545/1,095 = 1.41 per day.* Over this period there were no days with seven or more incidents, but it would be very naïve to therefore conclude that such an occurrence was impossible. If we can build a reasonable probability distribution for the number of homicides per day, then we can answer the question posed.

But what is the justification for building a probability distribution? The number of homicides recorded each day in a country is simply a fact—there has been no sampling, and there is no explicit random element generating each unfortunate event. Just an immensely complex and unpredictable world. But whatever our personal philosophy about luck or fortune, it turns out to be useful to act as if these events were produced by some random process driven by probability.

It might be useful to imagine that at the start of each day we have a large population of people, each of whom has a very small chance of being a homicide victim. Data of this kind can be represented as observations from a Poisson distribution, originally developed by Siméon Denis Poisson in France in the 1830s to represent the pattern of wrongful convictions per year. Since then it has been used to model everything from the number of goals scored by a football team in a match and the number of winning lottery tickets each week, to the number of Prussian officers kicked to death by their horses each year. In each of these situations there is a very large number of opportunities for an event to happen, but each with a very low chance of occurrence, and this gives rise to the extraordinarily versatile Poisson distribution.
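This 'many opportunities, each with low probability' reasoning can be checked numerically: a binomial count from a very large number of trials n, each with a tiny success probability p, is almost indistinguishable from a Poisson distribution with mean n × p. Below is a sketch using scipy.stats, with a purely illustrative population of one million and p chosen so that the mean matches the 1.41 daily average computed above:

```python
from scipy.stats import binom, poisson

n = 1_000_000   # a large 'population at risk' (illustrative figure only)
p = 1.41 / n    # tiny per-person chance, chosen so that n * p = 1.41

for k in range(8):
    print(k, binom.pmf(k, n, p), poisson.pmf(k, 1.41))
# The binomial and Poisson columns agree to many decimal places:
# many opportunities, each at low probability, give rise to the Poisson.
```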

Whereas the normal (or Gaussian) distribution in Chapter 3 required two parameters—the population mean and standard deviation—the Poisson distribution depends only on its mean. In our current example this is the expected number of homicide incidents each day, which we take to be 1.41, the average number per day over this three-year period. We should, though, carefully check whether the Poisson is a reasonable assumption before we act as if the number of homicides each day were a random observation drawn from a Poisson distribution with mean 1.41.

For example, just from knowing this average, we can use the formula for the Poisson distribution, or standard software, to calculate that there would be a probability of 0.01134 of exactly five homicides occurring in a day, which means that over 1,095 days we would expect 1,095 × 0.01134 = 12.4 days on which there were precisely five homicide incidents. Amazingly, the actual number of days over a three-year period on which there were five homicides was… 13.
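That calculation can be reproduced directly; a sketch with scipy.stats (any library with a Poisson probability mass function would give the same figures):

```python
from scipy.stats import poisson

mean_per_day = 1545 / 1095        # about 1.41 incidents per day
p_exactly_five = poisson.pmf(5, mean_per_day)

print(p_exactly_five)             # about 0.0113
print(1095 * p_exactly_five)      # about 12.4 expected days with exactly five
```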

Figure 8.5 compares the expected distribution of the daily number of homicide incidents under the Poisson assumption with the actual empirical distribution over these 1,095 days—the match is very close indeed, and in Chapter 10 I will show how to test formally whether the Poisson assumption is justified.
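The same distribution also lets us answer the question that opened this section. A sketch of the tail calculation, again with scipy.stats, where poisson.sf(6, mean) gives the probability of strictly more than six, that is seven or more, incidents in a day:

```python
from scipy.stats import poisson

mean_per_day = 1545 / 1095

p_seven_or_more = poisson.sf(6, mean_per_day)  # P(X >= 7), i.e. P(X > 6)
print(p_seven_or_more)      # roughly 0.0006
print(1 / p_seven_or_more)  # i.e. about once every 1,500 days
```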
