# How Good Is That Model? Scoring Rules For Forecasts: Part I

Part I of III

All probability (which is to say, statistical) models have a predictive sense; indeed, they are only really useful in that sense. We don’t need models to tell us what happened. Our eyes can do that. Formal hypothesis testing, i.e. chasing after statistical “significance”, leads to great nonsense and is the cause of many interpretational errors. We need models to quantify the uncertainty of what has not yet been measured or made known to us. Throughout this series I take models in that sense (as all should).

Which is this. A model—a set of premises—is used to make predictions about some observable Y, a proposition. For example, a climate model might predict what the (operationally defined) global mean surface temperature will be at some time, and Y is the proposition “The temperature was observed at the time to be y”. What I have to say applies to all probability models of observable events. But I’ll use temperature as a running example because of its familiarity.

If a model said “The temperature at the time will be x” but it was really y, then the model has been falsified. The model is not true. Something is wrong with the model. The model said x would occur but y did. The model is falsified because it implied x would happen with certainty. Now the model may have hit at every time up to this point, and it may continue hitting forever after, but it missed this time, and all it takes is one mistake for a model to be falsified.

Incidentally, any falsified model must be tossed out. By which I mean that it must be replaced with something new. If any of the premises in a model are changed, even the smallest and least consequential, strictly speaking the old model becomes a new one.

But nobody throws out models for small mistakes. If our model predicted accurately every time point but one we’d be thrilled. And we’d be happy if “most of the time” our forecasts weren’t “too far off.” What gives? Since we don’t reject models which fail a single time or are not “too far off”, there must be hidden or tacit premises to the model. What can these look like?

Fuzz. A blurring that takes crystalline predictions and adds uncertainty to them, so that when we hear “The temperature will be x” we do not take the words at their literal meaning, and instead replace them with “The temperature will be about x”, where “about” is happily left vague. And this is not a problem because not all probability is (or should be!) quantifiable. This fuzz, quantified or not, saves the model from being falsified. Indeed, no probability model can ever be falsified unless that model becomes (at some point) dogmatic and says “X cannot happen” and we subsequently observe X.

Whether the fuzzy premises—yes, I know about fuzzy logic, the rediscovery and relabeling of classic probability, keeping all the old mistakes and adding in a few new ones—are put there by the model issuer or you is mostly irrelevant (unless you’re seeking whom to blame for model failure). The premises are there and keep the models from suffering fatal epistemological blows.

Since the models aren’t falsified, how do we judge how good they are? The best and most basic principle is how useful the models were to those who relied upon them. This means a good model to one person can be a poor one to another. A farmer may only care whether temperature predictions were accurate at distinguishing days below freezing, whereas the logistics manager of a factory cares about exact values for use in ordering heating oil. An environmentalist may only care that the forecast is one of doom while being utterly indifferent (or even hostile) to the actual outcome, so that he can sell his wares. The answer to “What makes a good model” is thus “it depends.”
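To make the point concrete, here is a minimal sketch (with made-up forecast and observation numbers) of how the same series of temperature forecasts earns two different scores depending on who uses it. The frost hit-rate and the mean absolute error below are illustrative choices, not official verification statistics from any agency.

```python
# Sketch: one forecast series, two users, two scores (invented numbers).
# The farmer only cares whether "frost / no frost" (below 0 degrees) was
# called correctly; the logistics manager cares about the average size
# of the error in degrees.

forecasts = [2.0, -1.0, 5.0, 0.5, -3.0]   # forecast temperatures
observed  = [1.0, -4.0, 8.0, -0.5, -2.5]  # what actually happened

# Farmer's score: fraction of days where the frost call was right.
frost_hits = sum((f < 0) == (o < 0) for f, o in zip(forecasts, observed))
farmer_score = frost_hits / len(observed)

# Manager's score: mean absolute error in degrees.
manager_score = sum(abs(f - o) for f, o in zip(forecasts, observed)) / len(observed)

print(f"Farmer's frost accuracy: {farmer_score:.2f}")       # 0.80
print(f"Manager's mean absolute error: {manager_score:.2f}")  # 1.70
```

The same forecasts look quite good to the farmer (four frost calls in five were right) while the manager sees an average miss of 1.7 degrees, which may or may not be tolerable for ordering heating oil. “It depends.”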

Of course, since many decisions fall into broad categories we can say a little more. But in so saying, we must always remember that goodness depends on actual use.

Consider the beautiful game of pétanque, wherein manly steel balls are thrown towards a target. Distance to the target is the measure of success. The throw may be thought of as a model forecast of 0 (always 0) and the observation the distance to the target. Forecast goodness is taken as that distance. Linear distance, or its average over the course of many forecasts, is thus a common measure of goodness. But only for those whose decisions are a linear function of the forecast. This is not the farmer seeking frost protection. Mean error (difference between forecast and observation) probably isn’t generally useful. One forecast error of -100 and another of +100 average to 0, which is highly misleading—but only to those who didn’t use the forecasts!
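The cancellation problem takes only a couple of lines to show. This toy sketch uses the two hypothetical errors from the text:

```python
# Sketch: why mean error can mislead. Two forecast errors of -100 and
# +100 average to zero, yet every single forecast was off by 100.

errors = [-100, 100]

mean_error = sum(errors) / len(errors)                       # the "bias"
mean_abs_error = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(mean_error)      # 0.0   -- looks like a perfect forecaster
print(mean_abs_error)  # 100.0 -- tells the true story
```

Mean error still has a use, as a measure of systematic bias, but only alongside a measure like mean absolute error that cannot cancel.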

You can easily imagine other functions of error as goodness measures. But since our mathematical imagination is fecund, and since there are an infinite number of functions, there will be no end to these analyses, a situation which at least provides us with an endless source of bickering. So it might be helpful to have other criteria to narrow our gaze. We also need ways to handle the fuzz, especially when it has been formally quantified. That’s to come.

Update Due to various scheduling this-and-thats, Part II of this series will run on Friday. Part III will run either Monday or Tuesday.

## 23 thoughts on “How Good Is That Model? Scoring Rules For Forecasts: Part I”

1. Classifying the fuzz:
Dust bunny
Dust rhino
Dust t-rex

(Sorry, couldn’t resist.)

2. I still think box plots (within +/- 25% of median or some such limits) would be a simple, non-parametric way of portraying data and seeing where the forecast fits in with observations.

3. JH says:

This means a good model to one person can be a poor one to another….But in so saying, we must always remember that goodness depends on actual use.

I think we should take this reasoning further. Since Taylor’s theorem has no actual use to many people, it is not good.

You can easily imagine other functions of error as goodness measures…since there are an infinite number of functions, there will be no end to these analyses, a situation which at least provides us with an endless source of bickering.

True, there are infinite numbers of functions, but not all of them can serve as a goodness measure. Seriously, if you can easily come up with a new function/criteria, you should do so and try to justify why it can serve as a goodness measure!

4. Is a model good at its job?
What’s it being used for?
Is it fit for that purpose?

The argument I’ve seen expressed here a few times is: But Mr Briggs’ test is unfair as no model of this type could pass his test. To which my reply is, is the test relevant to evaluating how the model is being used? If so, the test does fairly demonstrate how the model is being misused.

5. Briggs says:

Will,

Quite right. There is of course an enormous literature on (as it’s called) forecast verification. Everything from this series follows the mainstream of that field. It’s one of the few areas of statistical practice that gets things (mostly) right.

Bob,

Box plots have only a limited use. They can’t be implemented for dichotomous or field (vector or grid) variables.

JH,

Sorry to hear you don’t like Taylor series. That puts you in the minority (of one, I think) of mathematicians. Dare to be different!

6. Thanks Matt, but for one variable being measured (e.g. temperature) would your objection apply? And for a dichotomous variable, could not two-dimensional perspectives of three-dimensional box plots be interesting?

7. “Climate models have to be tested to find out if they work. We can’t wait for 30 years to see if a model is any good or not; models are tested against the past, against what we know happened. If a model can correctly predict trends from a starting point somewhere in the past, we could expect it to predict with reasonable certainty what might happen in the future.”

Weatherunderground says if the model can backcast, then logically it can forecast. Right? (Maybe not…….)

8. Briggs says:

Sheri,

If Weatherunderground says that, they’re wrong. Besides, since it is claimed climate models backcast well, then they should forecast well. Do they? No, they do not.

And they ought to know better because the kinds of measures we discuss in this series are routinely applied to weather prediction models.

Bob,

I can envision some small places they’re useful, but they are still limited. Try dot plots or other mechanisms.

9. Briggs: Weatherunderground did indeed say that. While I like to think of myself as creative, I don’t think I could make that kind of thing up (nor would I want to).

I’ve read in several places about how weather forecasts are scored as “accurate” or “not accurate”. One would think that accurate means within say 5 degrees of the predicted high and rain if it’s forecast, dry if it’s not, but not really. Scoring model outcomes is a very, very fuzzy business!

10. “Weatherunderground says if the model can backcast, then logically it can forecast. Right? (Maybe not…….)”

There is a very informative and educational read, but not difficult, focused on this question:

http://www.amazon.com/The-Predictors-Maverick-Physicists-Fortune/dp/0805057579

The question was, can one build computer models that predict the market? And hence become unfathomably rich. Hopefully my spoiling of the plot won’t ruin the tale itself. The bottom line was they created models that impressively modeled the past but never did manage to model the future. (Contrary to the misleading subhead.)

11. Actual subhead designed to sell as many copies as possible:

“The Predictors: How a Band of Maverick Physicists Used Chaos Theory to Trade Their Way to a Fortune on Wall Street”

“The Predictors: How a Band of Maverick Physicists Used Chaos Theory (and anything else they could think of) To Fool Themselves Into Believing They Could Trade Their Way to a Fortune on Wall Street, But Ended Up As Failures”

Although the honest version doesn’t have the same ring to it…

12. John B() says:

Will

I THINK the market predictor program was covered here once

13. PhysicsGroup says:

The models are wrong because of the initial assumption that without GH gases the troposphere would have been isothermal. We know this assumption is made because we know the 255K temperature is at about 5Km altitude, and yet they say the surface would have been the same 255K. From there they get their sensitivity by assuming water vapor makes rain forests about 30 to 40 degrees hotter than dry regions and carbon dioxide adds a bit of warming also. In fact none of that happens.

The assumption regarding isothermal conditions is inherently applying the Clausius “hot to cold” statement which is just a corollary of the Second Law which only applies in a horizontal plane. That we know because it is clearly specified (as here) that the entropy equation is derived by assuming that changes in molecular gravitational potential energy can be ignored. It is those changes which actually cause the temperature gradient to evolve, so we must always remember that sensible heat transfers are not always from warmer to cooler regions in a vertical plane in a gravitational field.
So they cannot prove that the Clausius statement they use to get their assumed isothermal conditions is correct in a vertical column of a planet’s troposphere, and so they cannot prove the fundamental building block upon which they built the GH conjecture.

14. JH says:

Sorry to hear you don’t like Taylor series. That puts you in the minority (of one, I think) of mathematicians. Dare to be different!

I know, Mr. Briggs, isn’t it sad? I was simply trying to apply your rule of “goodness depends on actual use.” In fact, I wanted to apply the rule to all mathematics theorems to conclude mathematics has no goodness at all. However, I thought it would’ve made you weep and howl, so I didn’t.

There exist objective criteria for judging whether a model performs well and for comparisons between two postulated models for the same data set. Whom a model usefully serves is not one of the criteria.
(Comparing model results produced by historical data of different lengths is not fair.)

15. ‘I was simply trying to apply your rule of “goodness depends on actual use.” ‘

I don’t think this is Mr Briggs’ rule. Rather the rule of all people who are not insane. 😉

My flashy European sports car is excellent to drive in, but would not be so good to live in. I can criticize the car as a lousy place to live and sleep in, and such a criticism would, of course, be completely ‘unfair’ to my car. Because unlike a Winnebago, that’s not what it was designed for. The issue of fairness is irrelevant to the issue of usefulness. (I notice you subtly used the word ‘good’ instead of ‘useful’ to try to make your argument work.) If you keep failing to understand this distinction, you’re going to keep making the same mistake.

16. JH says:

Will N,

I don’t think this is Mr Briggs’ rule. Rather the rule of all people who are not insane.

I don’t know if it is the rule of all insane people. I do think that it’s not a well-defined, objective rule to judge the goodness of a statistical model or mathematics.

(I have heard people say that mathematicians are crazy and insane though.)

There are also differences between the goodness of a model and (fair) comparisons among different models/results. Yes, the issue of fairness is irrelevant to the issue of usefulness in this post, which is the reason that I used parentheses.

17. If that’s the only question you’re ever going to ask of a model, then the answer is only going to be of interest to those who are building similar types of models. Most of the time your answer will provide information nobody is interested in. For example, IPCC climate models are used by everyone to predict how global temperature will change in 10, 30, 50, 100 years. The question we want the answer to is, how well are they doing at this task? It’s a well defined question and its answer can be determined objectively.

18. Will, you make a good point, but let me also ask–do you believe the temperature reports by NOAA and other organizations? I don’t.

It’s a little frustrating. In my field (engineering) or even in philosophy, or to a great extent in science, all you need do is demonstrate a single failure of the method or argument to force everyone back to the drawing board. With temperature modelling of this type there is no serious accountability. Mosher can come here and claim it’s all good, ignore clear examples of failures which disprove his assertions, and he and others can just hand-wave the problems away as having no ‘material impact’ on their final results. And since nobody ‘really’ knows what the global temperature trends actually are, they are safe in the knowledge that even if they do bad quality work, it’s not going to be easy to notice. (Although people are starting to notice.) If I were a lousy engineer I’d work in this field, because I could get away with being sloppy and overconfident. So I’m not taking these modelling efforts too seriously because of the lack of accountability.

But that doesn’t mean the planet isn’t gradually warming. It’s been warming at various rates, on and off, for 300 years. I have no reason to believe the long term trend is on the cusp of suddenly changing.