
Seasonal Climate Forecasts Do Not Have Skill

David Lavers is the lead author of a GRL paper, “A multiple model assessment of seasonal climate forecast skill for applications.”

Lavers et al. checked eight different climate models and found that, “Results suggest there is a deficiency of skill in the forecasts beyond month-1, with precipitation having a more pronounced drop in skill than temperature.”

Nature magazine summarized:

‘Skill’ is the degree to which predictions are more accurate than simply taking the average of all past weather measurements for a comparable period.

…existing climate models show very little accuracy more than one month out. Even during the first month, predictions are markedly less accurate for the second half than the first. Current models simply cannot account for the chaotic nature of climate, researchers say.
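
To make that definition concrete, here is a minimal sketch of one common form of skill score (the mean-squared-error version, scored against the climatological average); the function name and numbers are purely illustrative, not from the Lavers paper:

```python
import numpy as np

def mse_skill_score(forecasts, observations, climatology):
    """MSE skill score: 1 is a perfect forecast, 0 means no better than
    always issuing the climatological average, negative means worse."""
    forecasts, observations, climatology = map(np.asarray, (forecasts, observations, climatology))
    mse_model = np.mean((forecasts - observations) ** 2)
    mse_clim = np.mean((climatology - observations) ** 2)
    return 1.0 - mse_model / mse_clim

# purely illustrative numbers
print(mse_skill_score([0.5, 0.7, 0.4], [0.6, 0.9, 0.2], [0.5, 0.5, 0.5]))
```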

My friend and former advisor Dan Wilks at Cornell (who wrote the most influential meteorological statistics textbook) completed a similar analysis about a decade ago and found much the same thing.

I’ve done work on these (earlier) climate models, too (peer-reviewed!). Things have not really changed. The climate is so complicated that these models just aren’t that good.

But these seasonal models aren’t necessarily the same as the global climate models that have people so flustered about Global “Don’t Call Me Climate Change” Warming. Some seasonal models are more statistical, some more physical. But they all try to guess the future, with, as we now know, limited success.

At the same time, Kevin Trenberth, an IPCCer, announces that the 2013 AR5 report will have at least one chapter “devoted to assessing the skill of climate predictions for timescales out to about 30 years.”

I’m guessing he’s using the word “skill” in a different way than is usual. In order to check for skill, we need two things: independent model predictions over a period of years, and the subsequent observations for those years.

The key word is “independent.” The models have to forecast data that was never before seen. It is no good—no good at all—to show how well a model “forecast” data it already knew. You can’t fit the model to data on hand and then show how close that model is to the old data. That’s cheating.

Every statistician knows this is a no-know. And by “no-know”, I mean that we cannot learn how good the model actually is until it shows it can make accurate predictions of new data. (It’s not just climate models that suffer from the lack of independent predictions: most statistical models are like this. See yesterday’s post.)
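
A toy sketch of the distinction, with an invented temperature series and a trivial trend “model” standing in for a real climate model: fit only on data up to 1979, then judge the model on the years it never saw. The in-sample error says little; only the out-of-sample error speaks to skill.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1960, 2011)
temps = 0.01 * (years - 1960) + rng.normal(0, 0.2, years.size)  # invented series

train = years <= 1979                      # the data "on hand" when the model is built
coef = np.polyfit(years[train], temps[train], 1)

in_sample = np.mean((np.polyval(coef, years[train]) - temps[train]) ** 2)
out_sample = np.mean((np.polyval(coef, years[~train]) - temps[~train]) ** 2)
print(in_sample, out_sample)               # only the second number is honest evidence
```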

I stress “independent”, because of Trenberth’s use of “30 years.” In order to reach that figure and make it into the 2013 report, climate models set in stone and unchanged in 1980 would have had to make yearly predictions out to 2010. That’s 30 years.

If those set-in-stone-1980 models beat average-climate forecasts, then climate modelers will be able to tout actual skill.

But since no climate model has sat itself in stone since 1980, there does not exist 30 years’ worth of independent evidence. We will still be relying on how closely those models fit the already-observed data. The closeness of that fit will be touted as “skill.”

But I’m guessing. Maybe Jim Hansen created a model in 1980 that has secretly been making predictions all along. This will be revealed in 2013. If so, and if that model does have skill, and if it continues to predict increasing temperature, I will personally write emails to every global warming skeptic telling them to cease and desist their efforts. (Technical note: a reanalysis doesn’t count.)

Actually, it’s even more complicated than this. That “1980” model might have made predictions for 1981, 1982, and so on. At the end of 1981, the “1980” model would remain fixed, as far as the physics and model code are concerned, but it would be fed the data observed from 1981. This data-updated, but computer-code-fixed, model would make a new one-year-ahead (1982) prediction, plus a two-year-ahead (1983) prediction, and so forth.
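
A sketch of that scheme: the model code is frozen, but at each forecast origin it is handed everything observed so far and asked for one-, two-, and three-year-ahead forecasts, with the errors binned by lead time. The “model” here is just persistence of the last observed value, a stand-in for illustration only.

```python
import numpy as np

def hindcast_errors(model, series, first_origin, max_lead=3):
    """Rolling-origin hindcast: the model code never changes, but at each
    origin it is handed every value observed so far and asked for 1-, 2-,
    ...-year-ahead forecasts. Absolute errors are collected by lead time."""
    errors = {lead: [] for lead in range(1, max_lead + 1)}
    for t in range(first_origin, len(series) - max_lead):
        history = series[: t + 1]            # data known at the forecast origin
        for lead in range(1, max_lead + 1):
            forecast = model(history, lead)  # physics/code fixed; only the data grow
            errors[lead].append(abs(forecast - series[t + lead]))
    return {lead: float(np.mean(errs)) for lead, errs in errors.items()}

# stand-in "model": persistence of the last observed value
persistence = lambda history, lead: history[-1]

anomalies = np.cumsum(np.random.default_rng(0).normal(0, 0.1, 40))  # invented anomalies
print(hindcast_errors(persistence, anomalies, first_origin=10))
```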

All experience shows that the shorter the lead time, the higher the skill. Lavers found that in the seasonal models, as Wilks did, as I did, as everybody does.

It might exist, but I have never seen a climate or weather model that improved as the lead time increased. There is no earthly reason to expect global warming models will be any different. Trenberth would agree with this.

The point is that few, if any, ten-year-ahead forecasts can exist. Climate models just aren’t that old. Worse, even if these forecasts did exist, and even if they were as accurate as you like, it would be nearly impossible to use them to prove skill.

Skill is a statistical, probabilistic measure. In order to say skill exists with any confidence, a reasonable sample size must exist. Say—just guessing here—at least twenty. That means we would have to have, in hand, twenty separate independent predictions.
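
A rough illustration of why the sample size matters: here is a crude bootstrap of a skill score computed from twenty hypothetical, independent forecast errors. Even with a genuine improvement over climatology built into the made-up numbers, the interval on the score is not narrow.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20  # roughly the sample size guessed at above
model_err = rng.normal(1.0, 0.3, n) ** 2  # hypothetical squared errors, model
clim_err = rng.normal(1.2, 0.3, n) ** 2   # hypothetical squared errors, climatology

# crude bootstrap of the skill score to show its uncertainty at n = 20
scores = []
for _ in range(10_000):
    idx = rng.integers(0, n, n)
    scores.append(1.0 - model_err[idx].mean() / clim_err[idx].mean())
print(np.percentile(scores, [2.5, 50, 97.5]))  # a wide interval even at n = 20
```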

We don’t have that. Climate models have not remained static. We do not know whether they have skill. And all experience with other, similar physical models suggests that demonstrating skill is tough.

So why do people believe in the models so strongly? Because they fit the old data (not perfectly, but pleasingly). But any statistician can tell you that this is no great trick.

Categories: Statistics

33 replies

  1. I understand that weather forecasts are an initial conditions problem — I see on the radar that some rain is moving this way so I predict that it will rain — while climate forecasts are based on boundary conditions — the atmospheric blanket around the earth is becoming better (or worse, or isn’t changing) at retaining energy so I predict that the climate will warm (or cool, or not change). Seasonal forecasts rely on “big picture” initial conditions such as ENSO and snow cover, whose influence on weather is not well understood (hence the lack of skill), while the forecast period is too short for the boundary conditions to have an effect.

  2. Why do they not use the known data up to 1979, put it in the model, run it, and see how well the output matches the known data from 1980 to 2009? They could still use the real known solar, volcanic and CO2 figures as input over the thirty-year period.

  3. Your misconceptions of what is being discussed with respect to initialised decadal predictions are even more profound than your usual climate-related confusions. Perhaps if you consider, for a moment, that the people involved are not complete idiots, you might realise that they are perfectly aware that there is no track record of such predictions that can be tested, but that there are other ways to test the usefulness of any such approach.

    The basic idea is to see whether the ‘slow’ internal variability component (associated with ocean circulation and heat storage) of the climate has any predictability, separate from the predictability of the forced component (driven by GHGs, volcanoes, solar etc).

    You can do ‘perfect model’ set ups where you try and predict the track of a coupled model using only a subset of the data (as could be observed in the real world up to the present) and see if you can successfully predict the subsequent evolution. You can do what is done in the re-analysis process – use historic data up until a point and then see if the free running prediction had skill. Indeed, both of these ideas are part of the CMIP5 protocol on decadal predictions that will be assessed in part of AR5.

    You appear to be especially confused since you imply that a climate model built today somehow secretly knows about the temperature and rainfall evolution of the last 30 years or more and that each year we add one more set of annual values to the models to make them better. This is a nonsense. The climate models are not statistical models trained on simplistic indices – not even close. Indeed, you can do a perfectly valid test with a model built yesterday, initialise it with ocean conditions in 1970 or 1980 or 1990 and evaluate its 10-year predictions of internal variability against the actual observations. They’ll be badly off when there is a big volcano, but that’s life.

    Personally, I am not involved in any of these efforts and have yet to be convinced that they will show any useful skill, but it is a worthwhile endeavour and we’ll learn a lot about the limits to such predictability. It is an effort that is worth far more than the unjustified sniping on show here.

  4. Sapere Aude: I think that’s better, but still has flaws because there’s also the question of selection of model. We may have an idea now that model X from 1980 is rubbish, because the 1990s refuted it, but model Y is pretty good. It’s then reasonable to stop even considering model X, and only publish the skill of model Y. If you do that, even if you don’t feed post-1980 readings into model Y, you’ve still implicitly used the post-1980 data to tune a choice of parameter of your overall model, because your overall model is not just Y – it’s “X or Y depending on which is better”.

    This happens all the time, anyway, as the various models’ predictions get aggregated with different weights depending on flavour of the month. The process is also pretty much unavoidable, as many models are related to each other, and bad models get dropped or tuned to give better results. At what point is it a different model? Does it matter? Maybe it’s not a bad thing, it just has the feel of a (small) data-dredge, and presumably all the same problems when you come to interpret the data.

  5. (This is being typed for me as I am in the Oakland airport and the internet connection doesn’t work.)

    Gavin, How I’ve missed you!

    Come back sometime late today when I’ll show you how you’re wrong. How naughty you are for saying that I suggested people are “complete idiots.” It must be your wonderful sense of fun.

  6. So it appears Gavin is saying the professor and this crowd are picking on “scientists” who have usurped statistical terminology and methodology for another – higher – branch of scientific inquiry, one that is far beyond the capacity of those here to comprehend and critique, and that we should shut up and let those folk go on about their extremely important work. OK, I can accept his point of view.

    But dear Gavin seems to “believe” in some way [he apparently has “faith” of a sort] that those dedicated uber-scientists, working with manipulated and shaped data which has been systematically hidden from lower-level scientists, and working in a tightly controlled and limited peer community, will somehow discover the holy grail of climatological prediction – and then won’t that show us!

    Prophets like Gavin only come along once a millennium, it seems. We are blessed to receive his insight. Hail, Gavin.

  7. Should be an interesting post. A good initial test might be a hold-out method but when it comes down to nuts and bolts, I don’t think the climate models came close to predicting the declining temperatures in the last decade (oh, yeah, I forgot — that didn’t happen). If the models have any credibility which world is it they are modelling? Now it is becoming necessary to hide the embarrassing decline of believers. Many of whom have been led astray by godless sceptics and heathen statisticians.

  8. “The key word is “independent.” The models have to forecast data that was never before seen. It is no good—no good at all—to show how well a model “forecast” data it already knew. You can’t fit the model to data on hand and then show how close that model is to the old data. That’s cheating.”

    that’s pretty much the whole issue

  9. George,

    The real problem comes about if both X and Y are kept and the one with the current right answer is moved to the forefront. I can see model averaging when there is an initial conditions problem or the mixture of variables is unknown but these are shots in the dark and one is left with the problem of how to weight them. Using new data to tune the weights is definitely incestuous.

    Sapere Aude,

    What you’ve described is the essence of the hold-out procedure. One problem is verifying that the training data and testing data are truly representative. A model predicting human running abilities might be hopelessly optimistic if trained only with data from the School for the Olympic-bound and this would likely become obvious if applied to attendees of Couch Potato Vocational. How one goes about showing that weather data, which only extends back about 200 years or so, is representative is open to question. Also, the claim is that the recent decade or two are unprecedented. By definition then, they are unrepresentative. OTOH, the basis for the “unrepresentative” claim is the output of the models. Chicken and egg, eh?

    I understand that the models are tuned periodically. One can only wonder what that really means but it sounds suspiciously like making the models fit current data. Unfortunately, that’s using testing data for training. It’s like giving students the answers to the test prior to administering it. One can only wonder: what’s the point?

  10. My regression toward the mean model has stood up pretty well over the past century. It has been right much more often than wrong. It is only useful for one year ahead though. Each year it predicts the temperature to move the opposite direction it did last year.
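
    A minimal sketch of one literal reading of this rule (the size of the predicted flip is not specified above, so an equal-and-opposite move is assumed here, purely for illustration):

    ```python
    def contrarian_forecast(temps):
        """Toy one-year-ahead rule: predict that next year's temperature moves
        in the opposite direction to last year's change (equal size assumed)."""
        last_change = temps[-1] - temps[-2]
        return temps[-1] - last_change

    # e.g. contrarian_forecast([14.2, 14.5]) gives roughly 14.2
    ```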

  11. Briggs (in absentia) I can’t wait – but I’m still waiting for the proof that George Box was wrong when he said some models are useful, and the proof that averaging anything in any way is statistically impermissible. But take your time, don’t make me rush you into making any more mistakes.

    For reference, here are the CMIP5 protocols so that you can have at least some actual facts upon which to draw. I usually find that helpful myself. YMMV.

  12. The prediction that there will be an AR5 may not be skillful. The IPCC has shot off both its feet, and appears headed for the dust bin of history. That’s life.

  13. Bob Koss,

    If it can only be used one year ahead does that mean it doesn’t predict an oscillation? If so, it’s much like a NWS forecast for a DC summer day: sunny with a chance of thunderstorms. Not only does it cover most bases but the daily thunderstorm prediction eventually becomes reality.

  14. Gavin,
    Please keep this up! I can smell a learning experience in the making.

    Briggs,
    I would like you to know you have infected me. I went to my first appointment with those people I will refer to as the human Roto Router folk. They got to the warning part and the nice lady mentioned that there is risk in this procedure and that 1% of patients have bad experiences. I immediately tuned her out and started thinking about whether or not it was more dangerous to have the procedure than not. Even got to the point that I wanted to know where I could find numbers to start tossing into equations.

  15. “Perhaps if you consider, for a moment, that the people involved are not complete idiots”: well, I have considered that possibility. I don’t know about “complete” but I have concluded that they are, compared to many of the people I have worked with, duds. Really, they have done too many stupid things over the years. That is, of course, a different issue from the question of how many of them are crooks too.

  16. I would like to request both Gavin and Matt to please leave as much snark at home as possible while discussing this, and further request that those of us in the peanut gallery additionally refrain from insulting and OT comments. I look forward with great anticipation to following an informed and civil discussion.

    bob

  17. Gavin,
    It’s over, finished.
    One last thing. ‘England confides’. Never forget. He meant you, too.

  18. All,

    Apologies for slow response. Travel took longer and more out of me than expected. But here we all are.

    Gav,

    I’ll do, today—but it might not appear until tomorrow—a “Box” post. But it will echo this paper, which you can download. There is much more to induction/logic than is in that paper, but a sketch of the Box proof is in there. Let me know what you think (in the new post, which I’ll call “All models are not wrong.”).

    Now here is why the climate models (probably) won’t be able to demonstrate actual skill. I say “probably” because, as I mentioned above, it is possible a set-in-stone model has been churning away, producing forecasts all these years, only we haven’t been told yet.

    However, I doubt this is true because you claim it is false. You say that the GCM folks—perspicacious, all—“are perfectly aware that there is no track record of such predictions that can be tested.” To which I say, Amen.

    And since that was the very point I wished to make—that there is no track record of predictions that can be tested, therefore there can be no demonstrable skill—then I may rest.

    But, for our audience, I’ll answer the second part of your comment, which was “there are other ways to test the usefulness of any such approach.” To which I say, I agree; there are (and you give examples). But demonstrate usefulness they may; demonstrate skill they have not.

    What you’re advocating is that the match—the closeness—between a model and observations be used as evidence for that model’s accuracy. It is evidence, too, but weak evidence. For a model to be useful, it is a necessary condition that a model “look like” previously observed data. If it cannot, then the chance that it “looks like” future data is small.

    Incidentally, what I say goes for any kind of model: subjective, purely mathematical, physical, statistical, probabilistic mixture, etc.

    Suppose it is 1990 and we have Model A which makes forecasts for 1991, and in 1991 it makes forecasts for 1992, and so on. 2000 rolls around and we discover that its predictions have been poor. “Aha,” somebody smarter than both of us says, “It turns out that if we divide by 2 there, insert a superior parameterization here, and pop in an extra sigma level in the ocean over there, the model will improve.”

    Call that improved creation Model B. It is no longer Model A, but a close relative. We then go back to 1990 and re-forecast 1991, then we go back to 1991 and re-forecast 1992, and so on. We then notice that Model B has done a much better job of predicting that decade than Model A. We announce this with satisfaction.

    But Model B has not shown skill. It has used previously-observed data in its creation. We knew that data when we decided to fix A: we tweaked A (and turned it into B) so that it would fit that data better. We knew where A should have gone and made a B that did so.

    Model B has still not demonstrated it can make skillful independent predictions. Not that it cannot. Just that it has not.
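
    A toy numerical version of the A-versus-B story (invented numbers; a one-parameter trend stands in for a model): B, tuned on the very decade it then “re-forecasts”, will always look closer to the observations, but that closeness is not skill.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    decade = np.arange(1991, 2001)
    obs = 0.02 * (decade - 1990) + rng.normal(0, 0.1, decade.size)  # invented observations

    model_a = lambda yr: 0.05 * (yr - 1990)        # fixed before 1991; never saw the 1990s
    coef_b = np.polyfit(decade - 1990, obs, 1)     # "Model B": re-tuned on the 1990s themselves
    model_b = lambda yr: np.polyval(coef_b, yr - 1990)

    print(np.mean((model_a(decade) - obs) ** 2))   # honest out-of-sample error
    print(np.mean((model_b(decade) - obs) ** 2))   # smaller, but B saw the answers first
    ```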

    Since this is so, we do not have sufficient confidence in the model to use it to make global-life-changing decisions. However, it is possible to disagree with this conclusion. One’s views of the possible future horrors that await us can be so overwhelming that even a slight chance the model is right is enough to use that model to make decisions.

    I concede this theoretical possibility. But I do not agree with it.

  19. Gav,

    I am usually a master of wit and such forth, but you lost me with “YMMV.” What’s that?

  20. You fundamentally misunderstand what models are calibrated against. None of the GCMs are tweaked because of the change in temperature pattern from one year to another – none. They get tuned on the climatologies (the 30 year mean), seasonality and the magnitude of the interannual variability. None of these things have changed appreciably or even noticeably with a few years of extra data.

    Second, you still seem to think that initialised predictions with climate models are things that have been done before. They aren’t. This is a brand new effort. So there is no (even implicit) fixing of models to improve these predictions. How can I improve something that has not been done before? There was no ‘Model A prediction’.

  21. Gav,

    My example was meant to demonstrate a theoretical point. Choosing “years” made it simple. But if you don’t like years, go back and change it to decades, or thirty-year chunks, or whatever you like. My conclusion stands. As it does if you change the accurate “tweak” to the more pleasing “tune.”

    Second, if there are no predictions—which I dispute, but that’s irrelevant here—then you are asking people to make decisions based on models that have not demonstrated skill. Which again proves my claim.

    I don’t mean this in a snarky way at all, but I do not see where we differ on this. We are stating the same thing. We only differ in that you believe the models’ “closeness” to previous observations is (1) sufficient proof of their ability to make skillful predictions, and/or (2) that the theory underlying those models is almost certainly true (you might even say, is true).

    I believe (1) does not hold (yet), and that it follows that (2) is also not known with sufficient certainty.

    (I also dispute your “implicit” suggestion that no model fixing has been occurring, which is an empirical question; however, let that pass, too. Oh, and the Box post is up.)

  22. ‘…you might realise that they are perfectly aware that there is no track record of such predictions that can be tested…’ – Gavin
    And this is reasonable?

  23. Sorry to be blunt, but this is pretty pathetic. The measure (or “skill”, which is a pretty funny word to use) of a model is not how well it matches some future time series, but how well it generates information that was not fed into it. It’s a question of metrics. Models which are not usable to predict the future, as in astrophysics, have to match more data than the data which was used to create them. This means that the physics involved is producing something useful.

  24. lnocsifan,

    My first instinct, on seeing your thoughtful comment, was to reply, “Sorry to return the bluntness, but you have lost your mind.” However, this is the new and improved, snark-free Briggs, so I’ll say something instructive instead.

    Skill has a precise technical meaning, one which meteorologists and climatologists, like my good pal Gav, are well aware of. Further, it is used regularly as a true test of model goodness. Lastly, I reject your claim that, for example, astrophysical models “are not usable to predict the future.” That’s a mighty bold statement, obviously false, and it misses the point (which others have pointed out in comments on recent posts) that when I said “future data” I meant “data not yet known.”

    To dismiss the concept as you have, without even the pretense of investigating it, is a mistake. Surf on over to my Resume page and look for some papers on skill (the Biometrics one is fair) and then come back and re-comment. The literature on this topic is vast.

    Incidentally, what is an “inocsi” that one can be a fan of?

  25. I’m not asking people to make decisions based on these runs at all – the whole thing is highly experimental, and as I said I don’t know how much practical skill they will have. Perhaps you think that this is what all transient climate simulations are? The difference is that normally there is no attempt to synchronise the internal variability to the real world, and so the only prediction that is useful is the forced component (ie. the ensemble mean). You may have confused these kinds of predictions with erroneous claims that ‘the models’ predict that every year will be warmer than the last, but these are of course different statements. In the standard set up, it is easy to demonstrate that short term trends (say less than 15 years) are not strongly constrained. I would have thought you would have been supportive of trying to find ways to do better.

    In general, matches to observations improve confidence to the degree that the matches are out of sample. A new test of a model is a priori out-of-sample since no test of any previous model exists and no possibility exists of having tuned the model to fit. Whether someone somewhere knew about this data ahead of time is irrelevant if it didn’t go into building the model.

  26. Gavin,

    Your point is important, and we need to separate it from the subject of skill. I will do so in a different post (if I am tardy or forget, please remind me).

  27. I’m currently researching ideas while working on a day trading system, and I stumbled upon this discussion. I find the talk about models, prediction, and skill very relevant to evaluating the performance of any trading system (especially the idea of setting something up, seeing how it would have performed in the past, seeing how it performs going into the future, and the significance of different variations on that). Thanks for the discourse.
