
 

 

Figure 6.3

A classification tree for the Titanic data in which a sequence of questions leads a passenger to the end of a branch, at which point they are predicted to survive if the proportion of similar people in the training set who survived is greater than 50%; these surviving proportions are shown at the bottom of the tree. The only people predicted to survive are third-class women and children from smaller families, and all women and children in first and second class, provided they do not have rare titles.
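The rule the caption describes can be written out directly. The sketch below is one plausible Python rendering of it; the title groupings, the set of ‘rare’ titles and the family-size cut-off are illustrative assumptions, not the actual split values used in the book's tree.

```python
# A hedged sketch of the prediction rule described in Figure 6.3.
# The title groupings, 'rare' titles and family-size cut-off are assumptions
# for illustration only, not the tree's actual split values.

RARE_TITLES = {"Dr", "Rev", "Major", "Col", "the Countess"}   # assumed examples

def predicted_to_survive(title, passenger_class, is_child, family_size):
    """Follow the branches to a yes/no survival prediction."""
    woman_or_child = is_child or title in {"Mrs", "Miss", "Ms"}
    if not woman_or_child:
        return False                 # adult men ('Mr') are predicted not to survive
    if title in RARE_TITLES:
        return False                 # women and children with rare titles: not predicted to survive
    if passenger_class in (1, 2):
        return True                  # all other women and children in 1st and 2nd class
    return family_size <= 4          # 3rd class: only those from smaller families

# Example: a third-class child from a family of three is predicted to survive.
print(predicted_to_survive("Miss", 3, is_child=True, family_size=3))   # True
```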

 

 

Before seeing how such a tree is actually constructed, we need to decide what performance measures to use in our competition.

 

 

Assessing the Performance of an Algorithm


If algorithms are going to compete to be the most accurate, someone has to decide what ‘accurate’ means. In Kaggle’s Titanic challenge this is simply the percentage of passengers in the test set that are correctly classified, and so after competitors build their algorithm, they upload their predictions for the response variable in the test set and Kaggle measures their accuracy.* We will present results for the whole test set at once (emphasizing that this is not the same as the Kaggle test set).

The classification tree shown in Figure 6.3 has an accuracy of 82% when applied to the training data on which it was developed. When the algorithm is applied to the test set the accuracy drops slightly to 81%. The numbers of the different types of errors made by the algorithm are shown in Table 6.1—this is termed the error matrix, or sometimes the confusion matrix. If we are trying to detect survivors, the percentage of true survivors that are correctly predicted is known as the sensitivity of the algorithm, while the percentage of true non-survivors that are correctly predicted is known as the specificity. These terms arise from medical diagnostic testing.
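To make the link between these summaries and the error matrix concrete, here is a minimal Python sketch; the counts are placeholders for illustration, not the actual figures in Table 6.1.

```python
# Minimal sketch: accuracy, sensitivity and specificity computed from an
# error (confusion) matrix. The counts are placeholders, not Table 6.1's values.

true_positives  = 240   # true survivors predicted to survive
false_negatives = 100   # true survivors predicted not to survive
true_negatives  = 460   # true non-survivors predicted not to survive
false_positives = 90    # true non-survivors predicted to survive

total = true_positives + false_negatives + true_negatives + false_positives

accuracy    = (true_positives + true_negatives) / total            # % correctly classified
sensitivity = true_positives / (true_positives + false_negatives)  # % of survivors correctly predicted
specificity = true_negatives / (true_negatives + false_positives)  # % of non-survivors correctly predicted

print(f"accuracy={accuracy:.0%}  sensitivity={sensitivity:.0%}  specificity={specificity:.0%}")
```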

Although the overall accuracy is simple to express, it is a very crude measure of performance and takes no account of the confidence with which a prediction is made. If we look at the tips of the branches of the classification tree, we can see that the discrimination of the training data is not perfect, and at every branch tip there are some who survived and some who did not. The crude allocation rule simply chooses the outcome in the majority, but we could instead assign new cases a probability of surviving corresponding to the proportion who survived in the training set. For example, someone with the title ‘Mr’ could be given a probability of 16% of surviving, rather than a simple categorical prediction that they will not survive.

Algorithms that give a probability (or any number) rather than a simple classification are often compared using Receiver Operating Characteristic (ROC) curves, which were originally developed in the Second World War to analyse radar signals. The crucial insight is that we can vary the threshold at which people are predicted to survive. Table 6.1 shows the effect of using a threshold of 50% to predict someone a ‘survivor’, giving a specificity and sensitivity in the training set of 0.84 and 0.78 respectively. But we could have demanded a higher probability in order to predict someone survives, say 70%, in which case the specificity and sensitivity would have been 0.98 and 0.50 respectively—with this more stringent threshold, we only identify half the true survivors but make very few false claims of surviving. By considering all possible thresholds for predicting a survivor, the possible values for the specificity and sensitivity form a curve. Note that the specificity axis conventionally decreases from 1 to 0 when drawing an ROC curve.
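The threshold-sweeping idea can be sketched in a few lines of Python, assuming we already have each passenger's predicted survival probability (probs) and true outcome (survived, coded 1 or 0); the grid of thresholds is arbitrary.

```python
# Sketch: trace out points on an ROC curve by sweeping the threshold used
# to turn a predicted probability into a 'survivor' / 'non-survivor' call.
# 'probs' and 'survived' are assumed inputs: predicted survival probability
# and true outcome (1 = survived, 0 = did not) for each passenger.

def roc_points(probs, survived, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    points = []
    for t in thresholds:
        calls = [p >= t for p in probs]                               # predict 'survivor' at or above t
        tp = sum(c and s for c, s in zip(calls, survived))            # true survivors called survivors
        fn = sum((not c) and s for c, s in zip(calls, survived))      # true survivors missed
        tn = sum((not c) and (not s) for c, s in zip(calls, survived))
        fp = sum(c and (not s) for c, s in zip(calls, survived))
        points.append((t, tp / (tp + fn), tn / (tn + fp)))            # (threshold, sensitivity, specificity)
    return points

# Example with a handful of made-up passengers:
print(roc_points([0.16, 0.45, 0.70, 0.93], [0, 1, 0, 1]))
```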


 

 

Table 6.1

Error matrix of classification tree on training and test data, showing accuracy (% correctly classified), sensitivity (% of survivors correctly classified) and specificity (% of non-survivors correctly classified).


Figure 6.4 shows the ROC curves for training and test sets. A completely useless algorithm that assigns numbers at random would have a diagonal ROC curve, whereas the best algorithms will have ROC curves that move towards the top-left corner. A standard way of comparing ROC curves is by measuring the area underneath them, right down to the horizontal—this will be 0.5 for a useless algorithm, and 1 for a perfect one that gets everyone right. For our Titanic test set data, the area under the ROC curve is 0.82. It turns out that there is an elegant interpretation of this area: if we pick a true survivor and a true non-survivor at random, there is an 82% chance that the algorithm gives the true survivor a higher probability of surviving than the true non-survivor. Areas above 0.8 represent fairly good discriminatory ability.
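That pairwise interpretation also gives a direct way of computing the area, as in the following Python sketch; probs and survived are assumed inputs, and ties are counted as half, which is one common convention.

```python
# Sketch: the area under the ROC curve computed via its pairwise
# interpretation, i.e. the chance that a randomly chosen true survivor is
# given a higher predicted probability than a randomly chosen true
# non-survivor. 'probs' and 'survived' are assumed inputs; ties count as half.

def area_under_roc(probs, survived):
    survivor_probs     = [p for p, s in zip(probs, survived) if s]
    non_survivor_probs = [p for p, s in zip(probs, survived) if not s]
    wins = 0.0
    for p_surv in survivor_probs:
        for p_non in non_survivor_probs:
            if p_surv > p_non:
                wins += 1.0
            elif p_surv == p_non:
                wins += 0.5
    return wins / (len(survivor_probs) * len(non_survivor_probs))

# A perfect separation of survivors from non-survivors gives an area of 1.0:
print(area_under_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))   # 1.0
```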

The area under the ROC curve is one way of measuring how well an algorithm splits the survivors from the non-survivors, but it does not measure how good the probabilities are. And the people who are most familiar with probabilistic predictions are weather forecasters.


Suppose we want to predict whether or not it will rain tomorrow at a particular time and place. Basic algorithms might simply produce a yes/no answer, which might end up being right or wrong. More sophisticated models might produce a probability of it raining, which allows more fine-tuned judgements—the action you take if the algorithm says there is a 50% chance of rain might be rather different from the action you would take if it says 5%.

 

 

Figure 6.4

ROC curves for the classification tree of Figure 6.3 applied to training (dashed line) and test (solid line) sets. ‘Sensitivity’ is the proportion of survivors correctly identified. ‘Specificity’ is the proportion of non-survivors correctly labelled as not surviving. Areas under curves are 0.84 and 0.82 for training and test sets respectively.

 

 

How do we know how good ‘probability of precipitation’ forecasts are?

 

In practice weather forecasts are based on extremely complex computer models which encapsulate detailed mathematical formulae representing how weather develops from current conditions, and each run of the model produces a deterministic yes/no prediction of rain at a particular place and time. So to produce a probabilistic forecast, the model has to be run many times starting at slightly adjusted initial conditions, which produces a list of different ‘possible futures’, in some of which it rains and in some it doesn’t. Forecasters run an ‘ensemble’ of, say, fifty models, and if it rains in five of those possible futures in a particular place and time, they claim a ‘probability of precipitation’ of 10%.
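The arithmetic is simply a proportion over the ensemble members, as this minimal Python sketch illustrates; the ensemble itself is invented to match the worked example of rain in five out of fifty runs.

```python
# Sketch: 'probability of precipitation' as the proportion of ensemble
# members (possible futures) in which it rains. The ensemble is invented
# to match the worked example: rain in 5 of 50 runs.

ensemble_rains = [True] * 5 + [False] * 45        # one yes/no per model run
prob_of_precipitation = sum(ensemble_rains) / len(ensemble_rains)
print(f"probability of precipitation: {prob_of_precipitation:.0%}")   # 10%
```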

But how do we check how good these probabilities are? We cannot create a simple error matrix as in the classification tree, since the algorithm is never declaring categorically whether it will rain or not. We can create ROC curves, but these only examine whether days when it rains get higher predictions than when it doesn’t. The critical insight is that we also need calibration, in the sense that if we take all the days in which the forecaster says 70% chance of rain, then it really should rain on around 70% of those days. This is taken very seriously by weather forecasters—probabilities should mean what they say, and not be either over- or under-confident.
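A basic calibration check can be sketched in a few lines of Python: group days by their forecast probability and compare the average forecast in each group with the fraction of those days on which it actually rained. The inputs forecasts and rained are assumed, and the ten equal-width bins are one simple choice among many.

```python
# Sketch of a calibration check: bin days by forecast probability and compare
# each bin's average forecast with the observed frequency of rain.
# 'forecasts' (probabilities) and 'rained' (1/0 outcomes) are assumed inputs.

from collections import defaultdict

def calibration_table(forecasts, rained, n_bins=10):
    bins = defaultdict(list)
    for f, r in zip(forecasts, rained):
        b = min(int(f * n_bins), n_bins - 1)      # e.g. a 0.72 forecast falls in bin 7 of 10
        bins[b].append((f, r))
    rows = []
    for b in sorted(bins):
        fs, rs = zip(*bins[b])
        mean_forecast = sum(fs) / len(fs)
        observed_freq = sum(rs) / len(rs)         # well calibrated if close to mean_forecast
        rows.append((mean_forecast, observed_freq, len(fs)))
    return rows

# Example: days forecast at 70% should see rain on roughly 70% of occasions.
print(calibration_table([0.7, 0.7, 0.7, 0.7, 0.1, 0.1], [1, 1, 1, 0, 0, 0]))
```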

 

 

Figure 6.5

Calibration plot for the simple classification tree that provides probabilities of surviving the Titanic sinking, in which the observed proportion of survivors on the y-axis is plotted against the predicted proportion on the x-axis. We want the points to lie on the diagonal, showing the probabilities are reliable and mean what they say.
