This is a technical addendum to the main series. I would have skipped this, but Climategate 2.0 revealed many misapprehensions of verification statistics that I want to clear up, particularly about R2 and skill. This will be fast and furious, and directed at those who already have sufficient background.
A model is created for some observable y. The model will be conditional on certain probative information x and at least some unobservable parameters θ. All terms can of course be multidimensional. Classical procedure—in physics, climatology, statistics, wherever—first gathers a sample of (y,x) and uses this to find a best guess of the parameters, called hat-θ (no pretty way to display this in HTML). The “hat” indicates a guess. It does not matter to us how this guess is derived, merely that it exists. Nobody—and I mean nobody—believes the guess to be perfectly accurate.
The next classical step is to form “residuals”, which are derived by plugging the hat-θ into the model and then back-solving for y: the results are called hat-y. From this we calculate R2, which is a function of the norm of (y – hat-y), i.e. of the “residuals”—in one dimension, this amounts to the sum of squared residuals normalized by the total sum of squares.
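For concreteness, here is a minimal sketch of the classical procedure in Python, using ordinary least squares as a stand-in fitting method; the data and variable names are purely illustrative:

```python
import numpy as np

# Illustrative data: y depends on a single x plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=100)

# Classical step 1: best guess of the parameters (hat-theta) by least squares.
X = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical step 2: plug hat-theta back in to get hat-y, then residuals and R2.
y_hat = X @ theta_hat
ss_res = np.sum((y - y_hat) ** 2)        # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(f"theta_hat = {theta_hat}, R2 = {r2:.3f}")
```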
The problem is that since nobody believes the guess of hat-θ, nobody should believe the residuals. If we base our verification solely on R2 we will be too certain of ourselves. If you’ve based your confidence in a climate model solely on measures like R2, or on any other norm/utility that takes as input a guess of the parameters, you think you know more than you do. This is utterly indisputable. Every temperature reconstruction I’ve ever seen uses R2-like measures for verification: they are thus too certain. To eliminate over-certainty, you must account for the inaccuracy in the guess of θ.
This is easy to do in Bayes: one simply integrates out the parameters, giving as a result the probability distribution of y given x. This speaks directly in terms of the observables and only assumes the model is true—which is what R2 also assumes, but R2 adds the assumption that hat-θ is error free.
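To see what integrating out the parameters buys you, here is a minimal sketch of the simplest conjugate case: a normal observable with known variance and a normal prior on its unknown mean (all numbers are made up). The posterior predictive distribution for a new y is wider than the plug-in distribution built from the point estimate alone:

```python
import numpy as np

# Illustrative sample and a known observation variance.
y = np.array([1.2, 0.7, 1.9, 1.1, 1.5])
sigma2 = 1.0                  # known variance of each observation
mu0, tau2 = 0.0, 10.0         # normal prior on the unknown mean

# Posterior for the mean (conjugate normal-normal update).
n = len(y)
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * (mu0 / tau2 + y.sum() / sigma2)

# Plug-in predictive: pretend the point estimate is error free.
plugin_sd = np.sqrt(sigma2)

# Posterior predictive: parameter uncertainty integrated out.
predictive_sd = np.sqrt(sigma2 + post_var)

print(f"plug-in sd = {plugin_sd:.3f}, predictive sd = {predictive_sd:.3f}")
# predictive_sd > plugin_sd: ignoring the error in hat-theta understates uncertainty.
```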
Now it gets tricky. Using Bayes, we indeed have the probability distribution of new y’s given new x’s and given the information contained in the original sample (y,x). But all we have in front of us is that original sample. We can do one of two things: one weak, one strong.
Weak: We see how well we could have predicted the old sample assuming it is new. We have Pr(y-new | x-new, (y,x) ), which is the prediction of new observables y given new observables x and given the information contained in the old sample (the parameters are integrated out). We take each pair of old data (y,x)i and use the x from this pair as the x-new. We produce the prediction Pr(y-new | xi, (y,x) ). We compare this prediction of y-new with yi. The prediction is of course a probability distribution, and yi is a (possibly multi-dimensional) point. But we can use something like the continuous ranked probability score (CRPS) or some other measure to score this prediction. Many other scores exist which will work: use the one that makes most sense to a decision maker who uses the prediction.
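As an illustration of scoring a probabilistic prediction against an observed point, here is the closed-form CRPS for a normal predictive distribution; the predictive means and spreads below are placeholders for whatever Pr(y-new | xi, (y,x) ) your model actually produces:

```python
import numpy as np
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """CRPS of a Normal(mu, sigma) predictive distribution at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# Hypothetical predictive distributions at each old x_i, scored against the old y_i.
mu_pred = np.array([1.0, 2.1, 0.4])   # predictive means from Pr(y-new | x_i, (y,x))
sd_pred = np.array([0.8, 0.9, 0.7])   # predictive standard deviations
y_obs   = np.array([1.3, 1.8, 0.1])   # the observed y_i being "re-predicted"

scores = crps_normal(mu_pred, sd_pred, y_obs)
print("CRPS per point:", scores, "mean CRPS:", scores.mean())
# Lower is better; the mean CRPS is the weak (in-sample) verification score.
```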
This is weak because it double-uses the sample. But so does R2, and in just the same way. Everybody double-uses their sample. Even cross-validation is a double-use (or more!). It’s not wrong to do the double-use, since it does give some idea of model performance. But since—are you ready?—it is always—as in always—possible to build a model that fits (y,x) arbitrarily well, you will always—as in always—go away more confident about your model than you have a right to be.
I’ll repeat that: R2 (and similar measures) double-uses the original sample and does not account for uncertainty in the parameters. Over-certainty is not just likely, it is guaranteed. This is not Briggs’s opinion. This is true. Using Pr(y-new | xi, (y,x) ) also double-uses the original sample and also causes over-certainty.
Strong: Wait for x-new and y-new to come along—ensure they are never seen before in any way, brand-spanking new observables! Produce your prediction Pr(y-new | x-new, (y,x) ) and then compare it to the y-new using CRPS or whatever loss function makes sense to the user of the prediction. This is the only way—as in the only way—to avoid over-certainty. This again is not opinion, this is just true.
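Here is a sketch of the strong version under the assumption that the predictive distributions were issued before the new observations existed. The negative log predictive density is used here simply as another proper score (the CRPS above would do just as well); all numbers are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Strong verification: the model was fit only on the old sample; the predictive
# distributions below were produced BEFORE these y values were observed.
mu_new = np.array([0.9, 1.7])    # hypothetical predictive means for the new cases
sd_new = np.array([0.85, 0.95])  # hypothetical predictive standard deviations
y_new  = np.array([1.6, 1.2])    # observations that arrived only after the fit

# Mean negative log predictive density (a proper score; lower is better).
log_score = -norm.logpdf(y_new, loc=mu_new, scale=sd_new).mean()
print("strong (out-of-sample) mean log score:", log_score)
```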
Physicists, chemists, electronics engineers and the like of course do this sort of thing all the time. They are not satisfied to produce one version of the model, report once on R2 and call it a day. They test the models out again and again, and on new data. Models that perform poorly are scrapped or re-built. Statisticians should do the same.
Skill: I often speak of model skill and I want to give the technical definition. Skill also comes in weak and strong versions.
Weak: Take the verification measure from your model as above—whether it’s R2, CRPS, whatever—and save it. Then build a new model which should look like your old model, except that it should be “simpler.” Perhaps in your original model the dimension of θ is 12, but in the simpler model it is only 7. The choice is yours. For climate models the natural choice is called “persistence”, which is a model that says “the next time period will be exactly like this time period.” The choice of the simpler model should be directed by the question at hand.
Skill is when the more complex model beats the simpler model in terms of the verification score. Simple as that. If the more complex model cannot beat the simpler model, then the simpler model (of course) is better and should be used in preference to the complex model.
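One conventional way to put a number on this comparison is the skill score 1 − (model score)/(reference score), which is positive when the complex model beats the reference and negative when it does not. A sketch with made-up yearly anomalies, point forecasts, and mean absolute error as the verification measure:

```python
import numpy as np

# Hypothetical yearly anomalies and two sets of point forecasts for them.
y            = np.array([0.31, 0.40, 0.36, 0.45, 0.42])   # observed values
model_fcst   = np.array([0.33, 0.37, 0.41, 0.40, 0.46])   # the "complex" model
persist_fcst = np.array([0.28, 0.31, 0.40, 0.36, 0.45])   # persistence: last period's value

score_model   = np.abs(y - model_fcst).mean()     # mean absolute error of the model
score_persist = np.abs(y - persist_fcst).mean()   # mean absolute error of persistence

# Skill score: 1 is perfect, 0 is no better than persistence, negative is worse.
skill = 1.0 - score_model / score_persist
print(f"model MAE={score_model:.3f}, persistence MAE={score_persist:.3f}, skill={skill:.3f}")
```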
Of course, if you’ve used R2 or the weak-Bayes prediction (re-used the sample), your estimate of skill will be too certain. Re-use is still re-use, even for skill.
Strong: Same as the weak version, except the comparisons of performance are done on the models’ predictions of new data, as above.
Climate models don’t have strong skill (as far as I’ve seen) at predicting yearly global average temperature. They do not (again, as far as I’ve seen) beat persistence. Thus, their predictions cannot yet be trusted.
Last word: a sufficient sample of performance measures must be built to demonstrate there is high probability that skill is positive. We build these models (of future skill) as we build all probability models.
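One simple way (certainly not the only way) to attach a probability to “skill is positive” is to treat each new-data forecast occasion as yielding a score difference between the reference and the model, and then to model the distribution of those differences. A bootstrap sketch with made-up numbers:

```python
import numpy as np

# Hypothetical per-occasion score differences: reference score minus model score.
# Positive values mean the model beat the reference on that occasion.
diffs = np.array([0.05, -0.02, 0.08, 0.01, 0.03, -0.01, 0.06, 0.04])

# Bootstrap over occasions to estimate the probability that mean skill is positive.
rng = np.random.default_rng(0)
boot_means = np.array([
    rng.choice(diffs, size=len(diffs), replace=True).mean()
    for _ in range(10_000)
])
print("estimated Pr(skill > 0):", (boot_means > 0).mean())
```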
We’re all weary of this, so that is all I want to say on models and model performance at this time.
On the no $\hat{\theta}$ in html: you could try MathJax. Just requires adding a one-liner to your post templates (here’s what I did for a blogger site).
“Climate models don’t have strong skill (as far as I’ve seen) at predicting yearly global average temperature. They do not (again, as far as I’ve seen) beat persistence. ”
This is simply not true. First, you don’t seem to grasp that climate models are not fits to temperature trends. Second, here are two examples (and there are more) of models having skill with respect to persistence for the global mean temperature: Hargreaves (2010), & Hansen et al (1992). Third, if you want a statistical prediction of next year’s global mean temperature anomaly (not a particularly interesting number scientifically, but one that people pay attention to), trend+Dec Nino 3.4 would almost certainly beat persistence – though of course this isn’t a climate model in the sense that almost everyone uses.
I think you’ll find that it is you who is too certain in his conclusions, not the climate modellers.
I’m enjoying this series, thanks.
Could you say something about the pitfalls of judging skill by splitting your data into two parts, using the first part to set your parameters, and the second part of the data to judge the skill of the model?
Also when testing skill with a new model, is it imperative that the new model is less complicated?
Please do not comment on this since we’re all waiting on your climate model post(s), but I read that the recent “lower CO2 sensitivity” paper did judge the skill of other models with data derived from the Last Glacial Maximum and found that the high sensitivity models got stuck in an ice age due to positive feedback.
A minor point:
In the paragraph – “Now it gets tricky.” The last line starts – “We can one of two things …” Should read – “We can do one of two things … ”
A comment on emphasis:
It seems to me the absolutely fundamental issue about building a model is that one needs to be constantly aware that it is “possible to build a model that fits (y,x) arbitrarily well …” The statement is there but it is buried deep in the post.
So what the post does is say ‘keeping in mind the ways you can screw up if you don’t pay attention’ here is a list of things to keep in mind and do that will help you reduce the likelihood of seriously screwing up.
Of course, it’s more polite to talk about ‘skill’ and ‘strong’ and ‘weak’ but I think it’s useful to remind oneself that it is possible, indeed likely, that we can easily wander down the wrong path and, not to mince words, screw up.
What this series of essays, and many others along similar lines of thinking (e.g. addressing how p-values can be misleading), shows is that the overall “community” of PhD non-statisticians applying statistical analyses to various real problems don’t seem to have much greater grasp of the subtleties involved than undergraduate students taking program-specific mandated (and only the mandated) statistics courses. That is, undergrad engineering, physics, etc. programs appear to remain deficient in ensuring students get more than the very basics – ‘enough to be dangerous,’ to paraphrase the cliché, is taught and little more (except to that small minority of people who pursue the extra course or two beyond the minimum degree-mandated courses).
I recall this same basic complaint from a number of stats instructors moonlighting at U of Mich. – people whose primary jobs were with the auto industry, where they really needed to get things right to avoid recalls, lawsuits, etc. Later, when I was involved with other technologies in another industry, the same problem was voiced; there, one had to take training in statistical techniques few if any had even heard of, and/or training in familiar techniques tailored to specific technologies, before becoming actively involved.
One thing you left out Mr. Briggs is the concept of a validation set. It’s very easy to set aside 20% of the samples and use them post-training/fitting to validate both the reference model and the new model.
In fairness to Gavin, he is quite right about “the” climate models; they’re not producing just one output.
Gavin; The press, and therefore most discussions, seem centered around global temperature anomaly. Given its topical nature, what is wrong with examining the forecast skill of this one attribute? Personally I agree with you that it’s not a very interesting number. Clouds and wind on the other hand…..
Gavin: the paper (2010) you referenced wasn’t even able to reproduce the results of the model they were reporting on. In addition, how many model runs were NOT able to outperform the null hypothesis? They mention at least 2 others, but that’s only for that given year!
I’m sorry, but that paper is a very strong case against accepting anything “statistics” from the people wanting to make forecasts using current climate models. The very suggestion of creating yet more models to help simulate the existing models so as to better understand predictive ability is shocking. I am floored by the author’s implication that the current models are untestable in terms of skill. His conclusion that because one model, which can’t be reproduced, was able to do better than a reference way-back-when should mean current models must be okay is madness!
Hansen et al’s paper is a poster child for why everyone should be paying more attention to what this blog has been saying. Their paper is making the mistake Briggs has been talking about this entire time. The authors report probabilities of an event happening in real life given that the models are 100% skillful. Even if this were true, we are presented with 3 different estimations of input parameters, so the reported results are, as Briggs has been explaining, overconfident even if we assume the model is perfect!! Just because one prediction was right (and many others wrong; look at the 12 maps they provided) does not mean that the confidence of their predictions was correct.
I’m not putting down the climate models. This is entirely about measuring and reporting uncertainty.
I have a reasonable suspicion that x drives y, and I collect my x’s and my y’s.
I build my model and see a reasonable fit. I collect some out-of-sample data. x still drives y in the out-of-sample data, but not as strongly as suggested by the original data set. Is it acceptable to tweak my model with this new data, or is this considered “bad form”?
How long are we supposed to wait?
TRUE? How about demonstrating your claim that it’s the only way to avoid over-certainty by
a) using the temperature data up to year 2008 from any station and
b) pretending that we have waited two years, i.e., we have data from 2009 and 2010.
How should an administrator make a yes-or-no decision accordingly?
Note that over-certain means you have a benchmark in mind.
If you can’t show me, I’d say that I agree with Gavin!
The 1st paper Gavin referenced is at: http://onlinelibrary.wiley.com/doi/10.1002/wcc.58/pdf
Reading that paper after Gavin’s remarks recalls Ross Perot’s (the Independent Presidential candidate that lost to B. Clinton) remarks about Washington problem-solving: ‘if they talk about it enough they think they actually did something.’
Gavin cites this paper as providing evidence that climate models do have predictive skill. Apparently he uses “skill” in the unqualified sense as many shysters use “quality” to describe an inferior product they’re hawking — one typically assumes that “quality” when used in a sentence means “high quality” … but, technically, everything has “quality” which covers the gamut. Ditto with “skill” as applicable here.
That 2010 paper concludes, by the way, with:
“In the first section, it was argued that it is impossible to assess the skill (in the conventional sense) of current climate forecasts. Analysis of the Hansen forecast of 1988 does, however, give reasons to be hopeful that predictions from current climate models are skillful, at least in terms of the globally averaged temperature trend.”
That IS or IS NOT impressive…depending on how one looks at it.
Relative to the rigor endorsed by this blog/blogger (and many of its readers) as an example of climate model “skill,” Gavin’s reference merely reinforces this blogger’s essay’s original assertion–the models lack predictive skill. …unless “hopeful” counts.
That’s the same kind of equivocation we read about by cult leaders that missed their prediction of the end of the world, 2nd coming, etc. That IS impressive, but not of the sort many of this blog’s readers consider truly “significant.”
Talking about “skill” and actually achieving it in a model are two very different things. Ole Ross Perot’s observation applies here as well.
Gavin,
I’m curious. If the models aren’t attempting to predict temperature then what is it they do? Why is it said that the models show AGW influence on Global Temperature when they aren’t trying to predict temperature at all? Why then isn’t RC actively discouraging that use? But I think you simply mean that temperature wasn’t the basis for the model. But something was used. They don’t seem very good at predicting future temperatures, or at least can’t be shown to do that. So what is it you use to evaluate them? Why should anyone consider them worthwhile?
Will,
“His conclusion that because one model, which can’t be reproduced, was able to do better than a reference way-back-when should mean current models must be okay is madness!”
If you think about it, saying that model X does better than model Y is tacit admission that skill can be measured. Otherwise, how would anyone know if X is better than Y? The “inability to measure skill” is either TRUE, in which case any model is as good as any other, or FALSE, which may be telling us that the measured skill level is being downplayed and preferably not for public consumption. Kinda like the loser of a game saying “winning isn’t everything” or “I wasn’t trying to win”.
Doug M.,
Sure it’s OK — you just shouldn’t use that data for testing the model anymore. The problem with tweaking with all of your data is that the “skill” is unknown. Until more (and/or unused) data becomes available, any claims for “skill” are on the same level as altering the results of a school lab report when the “answer” is known.
JH: what are you agreeing/disagreeing with exactly? What he described is exactly what you described. You had to wait for 2010 before you could build a model using data up to 2008 and run a validation test on 2009 to 2010. Right?
Hargreaves: “In the first section, it was argued that it is impossible to assess the skill (in the conventional sense) of current climate forecasts. Analysis of the Hansen forecast of 1988 does, however, give reasons to be hopeful that predictions from current climate models are skillful, at least in terms of the globally averaged temperature trend.” [Conclusion, p. 561]
Hopeful != is. This paper analyzes a single model and claims* that it’s skillful. I’m not convinced based on the discussion, and quite frankly I’d like to have seen Ei plotted for the prediction period. Hansen may be the best of the models, but that doesn’t make it skillful as an absolute, only relative to some other model.
Gavin, Napoleon-like (no, not the French guy but the one from Animal Farm), swooped in and looked at Snowball’s (Briggs) plans and whizzed on it and left the barn without saying another word. Leaving the other lowly animals to explain his actions.
Gavin,
Are you the “Gavin” (Schmidt)? Are you perusing random websites on Government time? Should the taxpayers be paying for your foolish playtime?
We should welcome Gavin and show a little less snark. I would like to see an honest discussion from all sides on how we evaluate model predictions; everyone should bite their tongue and welcome his participation.
Will,
Right, it’s exactly what he described! But do you know exactly how he is going to “validate” it? I don’t.
So let’s go along with it, and pretend that we have future data: show me how it can be applied to the data I suggested.
I’d like to know what he meant by it’s the only way to avoid over-certainty. Yes, in a way, he is correct; hindsight is 20/20.
How long are we supposed to wait?
R^2 can be said to be a measure of the skill of a LINEAR model, so can AIC and others. The mean squared error resulting from a cross-validation method can be seen as a measure of skill too. I know all those definitions by heart!
Where is the technical definition for a so-called “strong skill”? I didn’t see it, did you?
How would he quantify or decide the skill of a model with one or two or more new observations? How exactly should one decide whether a model is skillful in this case?
Is a model said to have skill if the new observation is exactly the same value as the point estimate based on the model? Of course not!
Is a model said to have skill when the new observation falls in the prediction interval? The chance is that it will if the statistical model is diagnosed to be adequate.
Last word: a sufficient sample of performance measures must be built to demonstrate there is high probability that skill is positive. We build these models (of future skill) as we build all probability models.
Sounds noble and righteous. I can’t say it’s wrong. But show us how? Skill is positive… so Mr. Briggs does have a measure of skill. Show it!
So show me how it’s done, otherwise, Mr. Briggs is too certain of himself… possibly too full of himself.
(Yes, I know Mr. Briggs would also be reading this comment.)
JH,
I would be happy to provide you with a pointer to the skill literature, which is very well known in statistics and meteorology. The literature is vast. Send me an email and I’ll give you my bibtex file of references, or just use your favorite search engine to find papers, they’re everywhere. Look up particularly “Tilmann Gneiting Skill” for some of my favorites. Also look up Mark Schervish. Look up “proper scoring rules.”