In answer to the question posed at the start of this section, we can calculate from the Poisson distribution the probability of getting seven or more incidents in a day, which turns out to be 0.07%. This means we can expect such an event to happen on average every 1,535 days, or roughly once every four years. We can conclude that such an event is fairly unlikely in the normal run of things, but is not impossible.
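As a rough check, this tail probability can be reproduced in a few lines of Python. The mean daily rate of incidents is not quoted in this passage, so the value below (about 1.4 incidents per day) is an illustrative assumption chosen to give a tail probability near 0.07%, not the book's exact figure.

```python
from scipy.stats import poisson

# Assumed mean daily rate of homicide incidents (not quoted in this passage);
# a value of about 1.4 per day gives a tail probability close to 0.07%.
daily_rate = 1.4

# P(7 or more incidents in a day) = 1 - P(6 or fewer)
p_seven_or_more = 1 - poisson.cdf(6, daily_rate)

print(f"P(>= 7 incidents in a day) = {p_seven_or_more:.4%}")
print(f"Expected about once every {1 / p_seven_or_more:.0f} days")
```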

The fit of this mathematical probability distribution to the empirical data is almost disturbingly good. Even though there is a unique story behind every one of these tragic events, most of which are unpredictable, the data act as if they were actually generated by some known random mechanism. One possible view is to think that other people could have been murdered, but they weren’t—we have observed one of many possible worlds that could have occurred, just as when we flip coins we observe one of the many possible sequences.

 

 

Figure 8.5: Observed and expected (assuming a Poisson distribution) daily number of recorded homicide incidents, 2014 to 2016, England and Wales.3

 

 

Adolphe Quetelet was an astronomer, statistician and sociologist in Belgium in the mid 1800s, and was one of the first to draw attention to the astonishing predictability of overall patterns made up of individually unpredictable events. He was intrigued by the occurrence of normal distributions in natural phenomena, such as the birth-weight distribution in Chapter 3, and coined the idea of ‘l’homme moyen’ (the average man), who took on the mean value of all these characteristics. He developed the idea of ‘social physics’, since the regularity of societal statistics seemed to reflect an almost mechanistic underlying process. Just as the random molecules of a gas come together to make predictable physical properties, so the unpredictable workings of millions of individual lives come together to produce, for example, national suicide rates that barely change from year to year.


Fortunately we don’t have to believe that events are actually driven by pure randomness (whatever that is). It is simply that an assumption of ‘chance’ encapsulates all the inevitable unpredictability in the world, or what is sometimes termed natural variability. We have therefore established that probability forms the appropriate mathematical foundation for both ‘pure’ randomness, which occurs with subatomic particles, coins, dice, and so on; and ‘natural’, unavoidable variability, such as in birth weights, survival after surgery, examination results, homicides, and every other phenomenon that is not totally predictable.

In the next chapter we come to a truly remarkable development in the history of human understanding: how these two aspects of probability can be brought together to provide a rigorous basis for formal statistical inference.

 

 

Summary


• The theory of probability provides a formal language and mathematics for dealing with chance phenomena.

• The implications of probability are not intuitive, but insights can be improved by using the idea of expected frequencies.

• The ideas of probability are useful even when there is no explicit use of a randomizing mechanism.

• Many social phenomena show a remarkable regularity in their overall pattern, while individual events are entirely unpredictable.

 

 

CHAPTER 9


Putting Probability and Statistics Together


Warning. This is perhaps the most challenging chapter in this book, but persevering with this important topic will give you valuable insights into statistical inference.

 

In a random sample of 100 people, we find that 20 are left-handed. What can we say about the proportion of the population who are left-handed?

 

 

In the last chapter we discussed the idea of a random variable—a single data-point drawn from a probability distribution described by parameters. But we are seldom interested in just one data-point—we generally have a mass of data which we summarize by determining means, medians and other statistics. The fundamental step we will take in this chapter is to consider those statistics as themselves being random variables, drawn from their own distributions.

This is a big advance, and one that has not only challenged generations of students of statistics, but also generations of statisticians who have tried to work out what distributions we should assume these statistics are drawn from. And given the discussion of the bootstrap in Chapter 7, it would be reasonable to ask why we need all that mathematics, when we can work out uncertainty intervals and so on using simulation-based bootstrap approaches. For example, the question posed at the start of this chapter could be answered by taking our observed data of 20 left-handed and 80 right-handed individuals, and repeatedly resampling 100 observations from this data set, with replacement, and looking at the distribution of the observed proportion of left-handed people.
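As a concrete illustration of that resampling recipe, here is a minimal sketch in Python using numpy; the seed and the 10,000 resamples are arbitrary choices of mine rather than anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed sample: 20 left-handers (coded 1) and 80 right-handers (coded 0)
sample = np.array([1] * 20 + [0] * 80)

# Resample 100 observations with replacement, many times,
# recording the proportion of left-handers in each resample
boot_proportions = np.array([
    rng.choice(sample, size=100, replace=True).mean()
    for _ in range(10_000)
])

# A 95% bootstrap uncertainty interval for the proportion of left-handers
print(np.percentile(boot_proportions, [2.5, 97.5]))
```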

But these simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions. We will explore this in more detail later, but first, having already appreciated the value of the normal and Poisson, we need to introduce another important probability distribution.


Suppose we draw samples of different sizes from a population containing exactly 20% left- and 80% right-handed people, and calculate the probability of observing different possible proportions of left-handers. Of course this is the wrong way round—we want to use the known sample to learn about the unknown population—but we can only get to this conclusion by first exploring how a known population gives rise to different samples.

The simplest case is a sample of one, when the observed proportion must be either 0 or 1 depending on whether we select a right- or left-hander; these events occur with probabilities of 0.8 and 0.2 respectively. The resulting probability distribution is shown in Figure 9.1(a).

If we take two individuals at random, then the proportions of left-handers will either be 0 (both right-handers), 0.5 (one of each) or 1 (both left-handers). These events will occur with probabilities 0.64, 0.32 and 0.04 respectively,* and this probability distribution is shown in Figure 9.1(b). Similarly we can use probability theory to work out the probability distribution for the observed proportions of left-handers in the 5-, 10-, 100- and 1,000-person samples, which are all shown in Figure 9.1. These distributions are based on what is known as the binomial distribution, and can also tell us the probability, for example, of getting at least 30% left-handed people if we sample 100, known as a tail-area.
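For readers who want to check these numbers, the same probabilities fall straight out of scipy's binomial distribution; the sample of two and the 30% tail-area below simply mirror the examples in the text.

```python
from scipy.stats import binom

p = 0.2  # population proportion of left-handers

# Sample of two: probabilities of observing 0, 1 or 2 left-handers,
# i.e. observed proportions of 0, 0.5 and 1
print(binom.pmf([0, 1, 2], n=2, p=p))   # 0.64, 0.32, 0.04

# Tail-area: probability of at least 30 left-handers out of 100,
# an observed proportion of 30% or more
print(binom.sf(29, n=100, p=p))         # P(X >= 30) = 1 - P(X <= 29)
```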

The mean of a random variable is also known as its expectation, and in all these samples we expect a proportion of 0.2 or 20%: all the distributions shown in Figure 9.1 have 0.2 as their mean. The standard deviation for each is given by a formula which depends on the underlying proportion, in this case 0.2, and the sample size. Note that the standard deviation of a statistic is generally termed the standard error, to distinguish it from the standard deviation of the population distribution from which it derives.
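The formula alluded to here is the usual one for the standard error of an observed proportion, the square root of p(1 − p)/n. A short sketch evaluating it for the proportion of 0.2 and the sample sizes shown in Figure 9.1:

```python
import math

p = 0.2  # underlying proportion of left-handers

# Standard error of the observed proportion in a sample of size n:
# sqrt(p * (1 - p) / n)
for n in [1, 2, 5, 10, 100, 1000]:
    se = math.sqrt(p * (1 - p) / n)
    print(f"n = {n:4d}: standard error = {se:.4f}")
```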
