
Calibration plots allow us to see how reliable the stated probabilities are, by collecting together, say, the events given a particular probability of occurrence, and calculating the proportion of such events that actually occurred.

Figure 6.5 shows the calibration plot for the simple classification tree applied to the test set. We want the points to lie near the diagonal since that is where the predicted probabilities match the observed percentages. The vertical bars indicate a region in which we would, given reliable predicted probabilities, expect the actual proportion to lie in 95% of cases. If these include the diagonal line, as in Figure 6.5, we can consider our algorithm to be well calibrated.
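As a sketch of how such a plot can be put together (the function below, its variable names and the use of a simple binomial interval for the vertical bars are illustrative assumptions, not code from the book), one might write:

```python
# Rough sketch of a calibration plot: group cases by their predicted
# probability, then compare the mean prediction in each group with the
# proportion of events that actually occurred.
import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(y_prob, y_true, n_bins=10):
    """y_prob: predicted probabilities; y_true: 0/1 observed outcomes."""
    y_prob, y_true = np.asarray(y_prob, float), np.asarray(y_true, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.digitize(y_prob, edges[1:-1])        # bin index for each case
    for b in range(n_bins):
        in_bin = which == b
        if not in_bin.any():
            continue
        p_hat = y_prob[in_bin].mean()               # mean predicted probability
        obs = y_true[in_bin].mean()                 # observed proportion of events
        n = in_bin.sum()
        se = np.sqrt(p_hat * (1 - p_hat) / n)       # binomial standard error
        # vertical bar: approximate 95% region for the observed proportion,
        # assuming the stated probabilities are reliable
        plt.errorbar(p_hat, obs, yerr=1.96 * se, fmt='o', color='C0')
    plt.plot([0, 1], [0, 1], 'k--')                 # diagonal = perfect calibration
    plt.xlabel('Predicted probability')
    plt.ylabel('Observed proportion')
    plt.show()
```

Points whose vertical bars cross the diagonal are consistent with a well-calibrated algorithm.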

 

 

A Combined Measure of ‘Accuracy’ for Probabilities


While the ROC curve assesses how well the algorithm splits the groups, and the calibration plot checks whether the probabilities mean what they say, it would be best to find a simple composite measure that combines both aspects into a single number we could use to compare algorithms. Fortunately weather forecasters back in the 1950s worked out exactly how to do this.

If we were predicting a numerical quantity, such as the temperature at noon tomorrow in a particular place, the accuracy would usually be summarized by the error—the difference between the observed and predicted temperature. The usual summary of the error over a number of days is the mean-squared-error (MSE)—this is the average of the squares of the errors, and is analogous to the least-squares criterion we saw used in regression analysis.
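In symbols (notation added here, not the book's): if yᵢ is the observed value and ŷᵢ the prediction on day i, then over n days

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$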

The trick for probabilities is to use the same mean-squared-error criterion as when predicting a quantity, but treating a future observation of ‘rain’ as taking on the value 1, and ‘no rain’ as being 0. Table 6.2 shows how this would work for a fictitious forecasting system. On Monday a probability of 0.1 is given to rain, but it turns out to be dry (the true response is 0), and so the error is 0 − 0.1 = −0.1. This is squared to give 0.01, and so on across the week. Then the average of these squared errors, B = 0.11, is a measure of the forecaster’s (lack of) accuracy.* This mean-squared-error is known as the Brier score, after the meteorologist Glenn Brier, who described the method in 1950.
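A minimal sketch of this calculation in code; only Monday's values (a forecast of 0.1 followed by no rain) are quoted above, so the example uses just that single pair rather than the full week of Table 6.2:

```python
# The Brier score is simply the mean-squared-error of probability forecasts
# against outcomes coded as 1 (the event happened) or 0 (it did not).
def brier_score(forecasts, outcomes):
    """forecasts: probabilities of rain; outcomes: 1 = rained, 0 = stayed dry."""
    squared_errors = [(obs - p) ** 2 for p, obs in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)

# Monday from Table 6.2: forecast 0.1, no rain, so the squared error is 0.01.
print(brier_score([0.1], [0]))   # ~0.01
```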

Unfortunately the Brier score is not easy to interpret on its own, and so it is difficult to get a feeling of whether any forecaster is doing well or badly; it is therefore best to compare it with a reference score derived from historical climate records. These ‘climate-based’ forecasts take no notice whatever of current conditions and simply state the probability of precipitation as the proportion of times in climate history in which it rained on this day. Anyone can make this forecast without any skill whatsoever—in Table 6.2 we assume this means quoting a 20% probability of rain for every day that week. This gives a Brier score for climate (which we call BC) of 0.28.


 

 

Table 6.2

Fictional ‘probability of precipitation’ forecasts of whether it will rain or not at midday the next day at a specific location, with the observed outcome: 1 = did rain, 0 = did not rain. The ‘error’ is the difference between the observed outcome and the predicted probability, and the mean-squared-error is the Brier score (B). The climate Brier score (BC) is based on using simple long-term average proportions of rain at this time of year as probabilistic forecasts, in this case assumed to be 20% for all days.


Any decent forecasting algorithm should perform better than predictions based on climate alone, and our forecast system has improved the score by BC − B = 0.28 − 0.11 = 0.17. Forecasters then create a ‘skill score’, which is the proportional reduction of the reference score: in our case, 0.61,* meaning our algorithm has made a 61% improvement on a naïve forecaster who uses only climate data.
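Written out, with B and BC as defined above, the skill score is the improvement divided by the reference score:

$$\text{skill score} = \frac{B_C - B}{B_C} = \frac{0.28 - 0.11}{0.28} \approx 0.61.$$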

Clearly our target is 100% skill, but we would only get this if our observed Brier score is reduced to 0, which only happens if we exactly predict whether it will rain or not. This is expecting rather a lot of any forecaster, and in fact skill scores for rain forecasting are now around 0.4 for the following day, and 0.2 for forecasting a week in the future.2 Of course the laziest prediction is simply to say that whatever happened today will also happen tomorrow, which provides a perfect fit to historical data (today), but may not do particularly well in predicting the future.

When it comes to the Titanic challenge, consider the naïve algorithm of just giving everyone a 39% probability of surviving, which is the overall proportion of survivors in the training set. This does not use any individual data, and is essentially equivalent to predicting the weather using climate records rather than information on the current circumstances. The Brier score for this ‘skill-less’ rule is 0.232.

In contrast, the Brier score for the simple classification tree is 0.139, which is a 40% reduction from the naïve prediction, and so demonstrates considerable skill. Another way of interpreting this Brier score of 0.139 is that it is exactly what would be obtained had you given all survivors a 63% chance of surviving, and all non-survivors a 63% chance of not surviving.
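As a quick check on these figures, using only the numbers quoted above (the variable names are mine, not the book's):

```python
# Comparing the naive 'everyone gets 39%' rule with the classification tree,
# using the Brier scores quoted in the text.
b_naive, b_tree = 0.232, 0.139

skill = (b_naive - b_tree) / b_naive
print(round(skill, 2))       # 0.4, i.e. roughly a 40% improvement on the naive rule

# A Brier score of 0.139 is what you would get by giving every passenger a
# probability of about 0.63 for their actual outcome, since (1 - p)^2 = 0.139
# when p is about 0.63.
p = 1 - b_tree ** 0.5
print(round(p, 2))           # 0.63
```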

We shall see if we can improve on this score with some more complicated models, but first we need to issue a warning that they should not get too complicated.

 

 

Over-fitting


We do not need to stop at the simple classification tree shown in Figure 6.3. We could go on making the tree more and more complex by adding new branches, and this will allow us to correctly classify more of the training set as we identify more and more of its idiosyncrasies.

Figure 6.6 shows such a tree, grown to include many detailed factors. This has an accuracy on the training set of 83%, better than the smaller tree. But when we apply this algorithm to the test data its accuracy drops to 81%, the same as the small tree, and its Brier score is 0.150, clearly worse than the simple tree’s 0.139. We have adapted the tree to the training data to such a degree that its predictive ability has started to decline.
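The same phenomenon can be sketched on simulated data rather than the actual Titanic set; the dataset, split and tree depths below are illustrative assumptions, not the book's analysis:

```python
# Illustrative over-fitting demo on simulated data (not the Titanic set):
# a deeper classification tree fits the training data better, but its
# Brier score on held-out test data can be worse than a shallow tree's.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=1300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for depth in (3, None):                      # shallow tree vs fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    p_test = tree.predict_proba(X_test)[:, 1]
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}, "
          f"test Brier score={brier_score_loss(y_test, p_test):.3f}")
```

With the fully grown tree, the training accuracy is typically higher while the test-set Brier score is typically no better, and often worse, than the shallow tree's, mirroring the drop described above.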

This is known as over-fitting, and is one of the most vital topics in algorithm construction. By making an algorithm too complex, we essentially start fitting the noise rather than the signal. Randall Munroe (the cartoonist known for his xkcd comic strip) produced a fine illustration of over-fitting, by finding plausible ‘rules’ that US Presidents had followed, only for each to be broken at subsequent elections.3 For example,

• ‘No Republican has won without winning the House or Senate’—until Eisenhower did in 1952.

• ‘Catholics can’t win’—until Kennedy in 1960.

• ‘No one has been elected President after a divorce’—until Reagan in 1980.

 

 

Figure 6.6

Over-fitted classification tree for the Titanic data. As in Figure 6.3, the percentage at the end of each branch is the proportion of passengers in the training set who survived, and a new passenger is predicted to survive if this percentage is greater than 50%. The rather strange set of questions suggests the tree has adapted too much to individual cases in the training set.

 

 

and so on, including some clearly over-refined rules such as

• ‘No Democratic incumbent without combat experience has beaten someone whose first name is worth more in Scrabble’—until Bill (6 Scrabble points) Clinton beat Bob (7 Scrabble points) Dole in 1996.
