*Read Part I*

Part II

What we’re after is a score that calculates how close a prediction is to its eventual observation when the prediction is a probability. Now there are no such things as unconditional probabilities (see the Classic Posts page at the upper right for dozens of articles on this subject). Every probability is thus conditional on some evidence, i.e. premises. This evidence is called the model. Assuming no mistakes in computation or measurement (which, unfortunately, are not as rare as you’d like), the probability put out by the model is correct.

The probability is correct. But what about the model? If the model itself were correct, then every probability it announced would be 0 or 1, depending on whether the propositions put to it were deduced as true or false. Which is to say, the model would be causal and always accurate. We thus assume the model will not be correct, even though its probabilities are.

Enter scoring rules. These are some function of the prediction and outcome, usually a single number where lower is better (even when the forecast is a vector/matrix). There is a bit of complexity surrounding proper scoring rules (a lot of fun math), but it doesn’t make much real-life difference. A proper score is defined as one which (for our purposes) is “honest”. Suppose our scoring rule awarded all forecasts for probabilities greater than 0.9 a 0 regardless of the outcome, but used (probability – I(event occurred))^2 for probabilities 0.9 or less (the I() is an indicator function). Obviously, any forecaster seeking a reward would forecast all probabilities greater than 0.9 regardless of what his model said. (The analogy to politics is obvious here.) What we’re after are scores which are “fair” and which accurately rate forecast performance. These are called proper.
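To see why that rule invites gaming, here is a minimal sketch that compares the expected penalty of an honest forecast against a forecast of 0.95, first under the rule just described and then under plain squared error. The function names and the belief of 0.3 are illustrative assumptions, not anything from a library.

```python
def squared_error(p, outcome):
    """Penalty (probability - I(event occurred))^2 for a single forecast."""
    return (p - outcome) ** 2


def gameable_score(p, outcome):
    """The rule from the text: a free 0 above 0.9, squared error otherwise."""
    return 0.0 if p > 0.9 else squared_error(p, outcome)


def expected_score(score, forecast, belief):
    """Average penalty if the event occurs with probability `belief`."""
    return belief * score(forecast, 1) + (1 - belief) * score(forecast, 0)


belief = 0.3  # illustrative: what the forecaster's model actually says

honest_gameable = expected_score(gameable_score, belief, belief)
gamed_gameable = expected_score(gameable_score, 0.95, belief)
honest_squared = expected_score(squared_error, belief, belief)
gamed_squared = expected_score(squared_error, 0.95, belief)

# Under the gameable rule the dishonest forecast has the lower expected penalty.
print(honest_gameable, gamed_gameable)
# Under squared error the honest forecast has the lower expected penalty.
print(honest_squared, gamed_squared)
```

The same comparison can be rerun with any other belief at or below 0.9; the gameable rule always hands the dishonest forecast a penalty of 0.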

A common score for dichotomous outcomes is as above: (probability – I(event occurs))^2, or its mean in a collection of forecast-observation pairs. This is the Brier score, which is proper. The score is symmetric, meaning the penalty paid for having large probabilities and no outcome equals the penalty paid for having small probabilities and an outcome. Yet for many decisions, there is an asymmetry. You’d most likely feel less bad about a false positive on a medical test for pancreatic cancer than about a false negative, though it must be remembered that false positives are not free. Even small costs for false positives add up when screening millions. But that is a subject for another time.
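As a minimal sketch, the mean Brier score over a collection of forecast-observation pairs can be computed as below. The function name and the sample numbers are made up for illustration.

```python
def brier_score(probs, outcomes):
    """Mean of (probability - I(event occurs))^2 over forecast-observation pairs."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Symmetry: a 0.75 forecast with no event costs the same as a 0.25 forecast
# with an event.
high_prob_no_event = brier_score([0.75], [0])
low_prob_event = brier_score([0.25], [1])
print(high_prob_no_event == low_prob_event)

# A small illustrative collection of forecasts and what happened.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 0]))
```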

The Brier score is thus a kind of compromise when the actual decisions made based on forecasts for dichotomous events aren’t known. If the costs are known, and they are not symmetric, the Brier score holds no meaning. Neither does any score which doesn’t match the decisions made with the forecast! (I keep harping on this hoping it won’t be forgotten.)
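Here is a hedged sketch of how a decision-matched score can disagree with the Brier score. Everything in it is an assumption made for illustration: the costs (1 for a false positive, 10 for a false negative), the rule that you act whenever the forecast reaches cost_fp / (cost_fp + cost_fn), and the two made-up forecasters.

```python
def mean_brier(probs, outcomes):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def mean_cost(probs, outcomes, cost_fp, cost_fn):
    """Average realized cost when acting on every forecast at or above the
    break-even threshold cost_fp / (cost_fp + cost_fn) (an assumed rule)."""
    threshold = cost_fp / (cost_fp + cost_fn)
    total = 0.0
    for p, o in zip(probs, outcomes):
        act = p >= threshold
        if act and o == 0:
            total += cost_fp      # acted, nothing happened: false positive
        elif not act and o == 1:
            total += cost_fn      # did not act, event happened: false negative
    return total / len(outcomes)


outcomes = [0] * 9 + [1]          # one event in ten (illustrative)
low_forecasts = [0.05] * 10       # hypothetical forecaster A
high_forecasts = [0.30] * 10      # hypothetical forecaster B

brier_a = mean_brier(low_forecasts, outcomes)
brier_b = mean_brier(high_forecasts, outcomes)
cost_a = mean_cost(low_forecasts, outcomes, cost_fp=1.0, cost_fn=10.0)
cost_b = mean_cost(high_forecasts, outcomes, cost_fp=1.0, cost_fn=10.0)

print(brier_a, brier_b)  # A has the lower (better) Brier score
print(cost_a, cost_b)    # B has the lower (better) realized cost
```

With these numbers the Brier score prefers forecaster A while the cost-based score prefers forecaster B, which is the mismatch the paragraph above warns about.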

There are only three main reasons to score a model. (1) To reward a winner: more than one model is in competition and something has to judge between them. (2) As a way to understand particularities of model performance: suppose our model did well (it had high probabilities) when events happened, but did poorly (it still had high-ish probabilities) when events didn’t happen; this is good to know. (3) As a way to understand future performance.

Senate hearing on EPA CO2 rules: EPA air chief says climate change science is 'clear.' Yes, — climate models failed. pic.twitter.com/815hke0lKF

— JunkScience.com (@JunkScience) February 11, 2015

The first reason brings in the idea of skill scores. You have a collection of temperature model forecasts and their proper score; say, it’s a 7. What does that 7 mean? What indeed? If your score was relevant to the decisions you made with the forecast, then you wouldn’t ask. It’s only because we’re releasing a general-purpose model into the wild which may be used for any number of different reasons that we must settle on a compromise score. That score might be the continuous ranked probability score (CRPS; see Gneiting and Raftery, Strictly Proper Scoring Rules, Prediction, and Estimation, 2007, *JASA*, 102, pp 359–378), which has many nice properties. This takes a probability forecast for a thing (which itself gives the probability of every possible thing that can happen) and the outcome and measures the “distance” between them (not all scores are distances in the mathematical sense). It gives a number. But how to rate it?
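For the curious, here is a minimal sketch of the CRPS when the forecast distribution is represented by a finite sample of equally likely values (that representation is an assumption; a forecast could just as well be a closed-form distribution). It uses the identity CRPS = E|X − y| − ½ E|X − X′| from the Gneiting and Raftery paper, written so that lower is better. The function name and numbers are illustrative.

```python
def crps_from_sample(sample, observed):
    """CRPS of the empirical distribution of `sample` against `observed`.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are independent
    draws from the forecast distribution. Lower is better; 0 is perfect.
    """
    m = len(sample)
    if m == 0:
        raise ValueError("sample must not be empty")
    mean_abs_error = sum(abs(x - observed) for x in sample) / m
    mean_spread = sum(abs(a - b) for a in sample for b in sample) / (m * m)
    return mean_abs_error - 0.5 * mean_spread


# Illustrative temperature forecast: three equally likely values, observed 2.0.
print(crps_from_sample([1.0, 2.0, 3.0], 2.0))

# A forecast that puts all its probability on one value reduces to absolute error.
print(crps_from_sample([5.0, 5.0, 5.0, 5.0], 7.0))
```

The double loop is quadratic in the sample size, which is fine for a sketch but would want a sorted-sample formulation for large ensembles.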

There has to be a comparator, which is usually a model which is known to be inferior in some aspect. Take regression (see the Classic Posts for what regression really is; don’t assume you know, because you probably don’t). A complex model might have half a dozen or more “x” variables, the things we’re using to predict the “y” variable, or outcome. A naive model is one with no x variables. This is the model which always says the “y” outcome will follow such-and-such fixed probability distribution. Proper scores are calculated for the complex and naive models and their difference (or some function of their difference) is taken.

This is the *skill score*. The complex model has *skill* with respect to the simpler model if it has a better proper score. It’s usually worked so that skill scores greater than 0 indicate superiority. If a complex model does not have skill with respect to the simpler model, the complex model should not be used. (Unless, of course, your real-life decision score is completely different from the proper score underlying skill.)
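A minimal sketch of one common way to work this out: take skill = 1 − (complex score) / (naive score), which is positive when the complex model beats the naive one and assumes a score where lower is better and 0 is perfect. The choice of the Brier score, the data, and the naive model (a constant probability equal to the observed base rate, fit on the same outcomes for simplicity) are all illustrative assumptions.

```python
def mean_brier(probs, outcomes):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def skill_score(model_score, naive_score):
    """1 - model/naive: above 0 means the model beat the naive comparator."""
    if naive_score == 0:
        raise ValueError("naive score is 0; relative skill is undefined")
    return 1.0 - model_score / naive_score


outcomes = [1, 0, 0, 1, 0]                  # illustrative observations
complex_probs = [0.8, 0.2, 0.1, 0.7, 0.3]   # hypothetical model with x variables

# Naive model: no x variables, always announces the same fixed probability.
base_rate = sum(outcomes) / len(outcomes)
naive_probs = [base_rate] * len(outcomes)

skill = skill_score(mean_brier(complex_probs, outcomes),
                    mean_brier(naive_probs, outcomes))
print(skill)  # greater than 0 here, so the complex model has skill
```

In practice the naive probability would come from data other than the outcomes being scored; using the same outcomes here just keeps the sketch short.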

We’ll wrap it up in Part III.

**Update** Here is an example of an improper score in real life. Tanking in sports.