On Global Warming Apoplexy: Temperature Trends

It is a sure sign that Sanity has packed her bags and headed for the door when otherwise sober scientists begin slinging around terms like “denier” and “denialist.” Language like this displays willful, pretended, or real ignorance of the historical context of these words. Anybody who talks like this makes himself an ass. They’s fightin’ words which start any discussion on an angry footing, their presence a certain indication we are dealing with zealotry, not science.

Let’s look again at the claim made by the scientists at the Wall Street Journal, over which many have popped their corks:

The lack of warming for more than a decade—indeed, the smaller-than-predicted warming over the 22 years since the U.N.’s Intergovernmental Panel on Climate Change (IPCC) began issuing projections—suggests that computer models have greatly exaggerated how much warming additional CO2 can cause.

There are two claims made here. Given the observational evidence we have, both claims appear true. The first (A) is that for the last ten years it has not grown warmer. Since it has grown warmer in some places and colder in others, this is evidently a claim about some global average and not any individual station. The second claim (B) says that the IPCC forecasts have been systematically too large: it is also concerned with some global average.

Both of these claims are quantitative and subject to easy verification. A person’s politics surely has no bearing on whether they are true or false claims. Now, the “global average” referenced is not a static thing, in the sense that, say, measurements from identical (and identically situated) thermometers at fixed locations are averaged together and called (arbitrarily, of course), the global average. Instead, the global average as it is operationally defined mixes sources and locations freely each year (and even within years). Therefore, when the “average” is computed there will be some uncertainty in it. Further, the uncertainty is larger in times historical than in times present. (There is even some uncertainty at individual locations, because no measurement apparatus is perfect, but this is generally small, though not always, especially in the past or when using proxies: see this series.)

The BEST people, for instance, recognized this and attempted to account for measurement uncertainty by speaking not just of averages, but of averages plus-or-minus. We can, and I did, argue over the better way to calculate and display this uncertainty. All we need to understand here is that some techniques underestimate this uncertainty. Actually, we don’t even need to agree about that: but we do need to see that some uncertainty is present, however small.

This is necessary because if we make claim (A), as the WSJ fellows did, we need to take uncertainty over the global average into account or we cannot know whether the claim is true or false. It is at this point when a lack of understanding of statistics can become a real hindrance. Sloppy language also hurts immeasurably. Let’s work through this slowly.

Suppose we have ten years of uncertainty-free global average temperature measurements. We can line them up and ask questions of this series. Was the temperature ten years ago warmer or colder than the temperature this year? All we have to do is look: it will be true or false at a glance. Was the temperature nine years ago warmer or colder than this year? True or false at a glance. And so on.

What does this mean in the context of claim (A)? Well, (A) says that temperatures have not gone up over the last decade. To verify this, all we need do is look to see if any of the temperatures of the last decade are lower than they are this year. If any are, the claim is false. If none are, the claim is true.

Maybe. Because claim (A) can also be taken to mean that at no time over the last decade have the temperatures increased (they could have stayed constant from year-to-year). Again, we can verify this claim with a glance at the data.
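For uncertainty-free data, both readings of claim (A) reduce to a one-line check. A minimal Python sketch follows; the function names and the temperature values are invented for illustration and are not real measurements.

```python
def no_earlier_year_cooler(temps):
    """Reading 1: claim (A) holds if no earlier year in the window is
    cooler than the final (current) year."""
    current = temps[-1]
    return not any(t < current for t in temps[:-1])


def never_increased(temps):
    """Reading 2: claim (A) holds if the temperature never rose from one
    year to the next (staying constant is allowed)."""
    return all(later <= earlier for earlier, later in zip(temps, temps[1:]))


# Hypothetical annual averages, oldest first (made-up numbers).
series = [14.6, 14.4, 14.5, 14.4]
print(no_earlier_year_cooler(series), never_increased(series))  # prints: True False
```

The two readings can disagree on the same series, which is why the claim has to be pinned down before it can be checked.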

Which of these definitions is right? Evidently neither, because we all understand that the temperatures have some uncertainty in them. Because of that, we cannot just look at the data to say whether it has gone up or down; we instead have to speak of changes in probabilistic terms. And that means hauling in some kind of model.

The simplest (but not so good) model is to imagine each year’s data is irrelevant to knowing any other year’s data. That is, we take this year’s data and display it as an average with a plus-or-minus attached to indicate our uncertainty in it. That plus-or-minus can only come from some kind of probability model, meaning that the range of uncertainty will change when the model changes. Which is the best and most proper model? Nobody knows. But let’s imagine we all agree on one, such that displayed before us is a temperature series of averages and plus-or-minuses.

Now, if claim (A) means that temperatures this year are less than or equal to temperatures ten years ago, then we can make a comparison as before, but our comparison will be accompanied by a measure of uncertainty. Using predictive techniques (yes, this is the proper word: see this series), we can ask questions like, “Given the data and assuming our model is true, what is the probability this year’s temperature is less than or equal to temperatures ten (or nine, etc.) years ago?” Notice that this is not the same as a “t-test” or any other kind of statement about parameters of probability models: it is a statement about observable temperatures.

Or, if claim (A) means that temperatures did not increase even once over ten years, then we can get the probability of this just as simply. In support of either version of claim (A), I said that we cannot know with probability greater than 90% that temperatures have increased (over this last decade). In other words, it is likely that claim (A) is true.
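Both probabilities can be sketched under the simple independent-years model. The sketch below assumes, purely for illustration, that each year’s uncertainty is normal with a known mean and standard deviation. The normal form, the function names, and all numbers are assumptions, not the model or data used in this post.

```python
import math
import random
from statistics import NormalDist


def prob_not_warmer(mean_now, sd_now, mean_then, sd_then):
    """P(T_now <= T_then) when the two years are independent normals."""
    diff_sd = math.hypot(sd_now, sd_then)  # sd of (T_now - T_then)
    return NormalDist().cdf((mean_then - mean_now) / diff_sd)


def prob_never_increased(means, sds, draws=20000, seed=1):
    """Monte Carlo estimate of P(no year-to-year increase at all)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        sample = [rng.gauss(m, s) for m, s in zip(means, sds)]
        if all(b <= a for a, b in zip(sample, sample[1:])):
            hits += 1
    return hits / draws


# Made-up example: this year 14.50 +/- 0.05, ten years ago 14.48 +/- 0.07.
print(prob_not_warmer(14.50, 0.05, 14.48, 0.07))
```

Changing the assumed distribution changes the answer, which is the point made above: the plus-or-minus is only as good as the model behind it.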

This is so using the probability model I indicated. But what if we instead change the model to a linear regression—i.e. a straight line—drawn through the data? Well, we could go through the same steps and ascertain claim (A) in light of this model. But before we can begin we have several things to decide. Why a straight line? Just because it’s easy? Lazy, that. From what year do we start? See this post for the ways that choice can lead you wrong. Do we start with a date (as I joked) in the Jurassic? Or, for fun, in 1973? Every different start date will give a different answer. I will repeat that: every different start date will give a different answer. It is also a stretch, to say the least, to assume temperature always has been increasing in a straight line from whatever start date we pick. (Before the politicization of this subject, every physical scientist would have agreed with that last statement.)
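The start-date problem is easy to reproduce. The sketch below fits a least-squares line to a synthetic series that is a pure cycle with no trend at all; the series, the 60-year period, and the chosen start years are all invented for illustration.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic series: a pure 60-year cycle with NO underlying trend.
YEARS = list(range(1950, 2012))
TEMPS = [0.3 * math.sin(2 * math.pi * (y - 1950) / 60) for y in YEARS]


def trend_from(start_year):
    """Fitted slope (degrees per year) using data from start_year onward."""
    xs = [y for y in YEARS if y >= start_year]
    ys = [t for y, t in zip(YEARS, TEMPS) if y >= start_year]
    return ols_slope(xs, ys)


for start in (1950, 1965, 1995):
    print(start, round(trend_from(start) * 10, 3), "degrees per decade")
```

Starting at the trough (1995) yields a warming line and starting at the peak (1965) yields a cooling one, even though the series has no trend by construction.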

But suppose we do agree on a date: 1964, say, a very fine year. Are we done? No, because we cannot forget that the data that goes into the straight-line model is still measured with uncertainty. We must, just as we did in the first model, account for this uncertainty. That means drawing any kind of naive line (even bold red ones) guarantees over-certainty.

Even if we were to agree on a date—in real life we do not—we could use a model of the measurement error, incorporate that into the model of straight-line change, and then assess claim (A): it is still probably true.

The best thing to do is to model the data in an intelligent way, taking into account the correlations of year-to-year (both auto-regressive and moving average), the measurement error, etc., etc. Hard work! As Doug Keenan has pointed out (often), it’s too much like work for anybody to do. I’d do it myself, but my check from Big Oil hasn’t yet arrived.

Whatever else you do in life, you must not, you must never, look at the pretty red (or blue, etc.) straight line you have just drawn and claim it is, or think of it as, the real data. (It is only in climatology where I have seen scientists forget error bars, and then pitch a fit when somebody points out the omission. You at least have to put predictive, and not parameters-based, error bars on the line, even ignoring measurement uncertainty of the data.)
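The gap between the two kinds of error bars shows up even in the textbook least-squares setting. This sketch uses the classical formulas for the standard error of the fitted line and of a new observation. That is a frequentist stand-in chosen for brevity, not the predictive method advocated here, and the data are invented. It also ignores measurement uncertainty in the data, so both widths are still too narrow.

```python
import math


def line_uncertainty(xs, ys, x0):
    """Return (fitted value, se of the line, se of a new observation) at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    leverage = 1.0 / n + (x0 - mean_x) ** 2 / sxx
    se_line = math.sqrt(s2 * leverage)        # uncertainty in the line itself
    se_new = math.sqrt(s2 * (1.0 + leverage))  # uncertainty in a new data point
    return intercept + slope * x0, se_line, se_new


# Made-up anomalies for ten consecutive years.
xs = list(range(10))
ys = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.4, 0.1, 0.3, 0.2]
print(line_uncertainty(xs, ys, 12))
```

The error bar for a new observation is always the wider of the two, because it adds the scatter of the data around the line to the uncertainty in the line.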

What about claim (B)? Also likely true, as is generally recognized. We still have to incorporate the uncertainty in the global temperature measurements (there is little or no uncertainty in the forecasts), but this is no different than before.

What about the counter-claim (C) that the 2000s were the “warmest years on record” or the like? It is trivially false. The 2000s simply were not the warmest. Four billion years ago, Earth was much hotter. “Wait! It’s obvious we weren’t talking about billions of years ago. Cheater! Denier!” Well, it isn’t obvious. What years did you have in mind as comparators? Ah, that’s the real question, isn’t it.

Did we mean just the last century? The last 1000 years? The last 10,000? What? You must supply a starting year. To make the claim (C) that it’s hotter now than before, you must tell us what you mean by before. If you say “before” means the last ten years, then claim (C) is identical with claim (A). If you say the last 200 years, then you have to do what BEST tried and incorporate the non-parameter error bars, otherwise there is no way to compare what happened a century ago with what happened last year. Obviously, the further you go back, the larger those uncertainty bars become, therefore the more difficult it becomes to claim (with any certainty) that now was hotter than then.

As I often say, over-certainty abounds in this field. People speak of models (statistical and physical) as if they were truth, as if the data that goes into them were granted some kind of special immunity from ordinary criticism. And when the critiques come, that’s when the asinine language breaks out. All sense of humor evaporates.

You would think that because both claims (A) and (B) are likely true (and claim (C) is unproved or likely false) that we have found a reason to celebrate! Perhaps our worst fears won’t be realized after all. This is good news! Wouldn’t it be great if we really did over-emphasize feedback in climate models and that whatever changes we do make to the climate are easily mitigated and not as horrific as posited?

Why so glum that things are so good?

Update: See this cartoon, which shows that the IPCC has been known to employ the technique of variable start dates.

Update: It is imperative that all read this series, where I describe just how so many people make mistakes. Those below who have been shouting the loudest are most in need.

Comments

On Global Warming Apoplexy: Temperature Trends — 135 Comments

  1. @Will, In a statistical ensemble (I use them a lot as well, mainly bagging) each individual model is intended to be a predictor of the value of the response variable as a function of the attributes, and ensembling provides a useful variance reduction that on average will improve predictions.

    In a GCM ensemble, the individual model is not intended as an accurate predictor of observed climate (as that is effectively impossible), but a simulation of how the climate might evolve. The way that climate might evolve has two components, a deterministic (hopefully non-chaotic) response to the forcings (the forced response) and a chaotic component (the unforced response) which is essentially “weather noise” comprised of things like ENSO. So with a GCM ensemble, averaging over many runs cancels the unforced component in each run and leaves you with an estimate of the forced response, which is what we need to know for planning a course of action.

    Now the observed climate will have a forced response and an unforced response, so we should not expect it to match the ensemble mean (which is an estimate of the forced response) even if the ensemble is exactly correct. How close we should expect the observations to match the ensemble mean depends on how large we can expect the unforced response to be. At the moment the best way to determine this is to simulate future climate many times and see what range of outcomes we might see around the forced response. This is exactly what the spread of the ensemble runs tells you.

    Sadly we can’t estimate the magnitude of the unforced response from the observations as we have only one realisation of the actual climate to look at.

    I have found a double pendulum is a good analogy to the way a climate model operates http://www.skepticalscience.com/on_consensus.html#20068
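The averaging argument in this comment can be illustrated with a toy simulation. Everything in the sketch is an assumption made for illustration: the linear forced response, the AR(1) noise standing in for unforced variability, and all parameter values. It is not a climate model.

```python
import random
import statistics


def forced_response(year):
    """Hypothetical forced response: a steady 0.02 degrees per year."""
    return 0.02 * year


def simulate_run(n_years, rng, phi=0.6, sigma=0.1):
    """One run: forced response plus AR(1) noise standing in for the
    unforced ("weather noise") component."""
    noise = 0.0
    run = []
    for year in range(n_years):
        noise = phi * noise + rng.gauss(0.0, sigma)
        run.append(forced_response(year) + noise)
    return run


def ensemble_summary(n_runs=200, n_years=30, seed=0):
    """Return all runs plus the per-year ensemble mean and spread (sd)."""
    rng = random.Random(seed)
    runs = [simulate_run(n_years, rng) for _ in range(n_runs)]
    means = [statistics.fmean([run[y] for run in runs]) for y in range(n_years)]
    spreads = [statistics.stdev([run[y] for run in runs]) for y in range(n_years)]
    return runs, means, spreads


runs, means, spreads = ensemble_summary()
print("final-year ensemble mean:", round(means[-1], 3))
print("final-year forced value: ", round(forced_response(29), 3))
print("final-year spread (sd):  ", round(spreads[-1], 3))
```

In this toy setup the ensemble mean tracks the forced response closely, while any single run (the stand-in for the one observed climate) wanders around it by roughly the ensemble spread.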

  2. @Will Nitschke There is no double standard, the trends in the SkS escalator (middle figure) are not statistically significant and essentially meaningless (which is why you shouldn’t do that), the ones in the IPCC diagram (bottom figure) are statistically significant, which is a very different matter.

    Amusing cartoon, but don’t take it too seriously.

  3. Dikran Marsupial, Will Nitschke,

    But then “statistically significant” is what we agreed is not a way to measure model truth, as the opening comments to this thread show.

    Yes, the IPCC makes the same mistake. “Statistical significance” is, as regular readers of this blog know, trivially easy to find. Why those lines for the IPCC? How can we know those models were true, etc.? It’s the same question as we started with.

    As always, the best test is a model which predicts new data well (claim (B)). So far, the IPCC does not do well at that. We are right to suspect their models.

    Update: be sure to see the post on how to fool yourself and others with time series.

  4. Dikran, Yes, I am aware that Douglass 2007 has a statistical error which is why I asked for your opinion about MMH 2010, not Douglass. MMH uses econometric techniques to test models versus observations and to my knowledge its conclusions and techniques have not as yet been refuted.

    As you are familiar with Douglass 2007, you will know that it was refuted by Santer 2008. The latter used an improved statistical test as opposed to your simplistic spread of the ensemble. You should look at the list of co-authors on Santer 2008. You appear to be in a small minority in believing in your simple spread test. Was Santer 2008 also incorrect?

    Further, MMH 2010 showed that if you apply Santer’s own approach to latest data, the model projections are significantly different from observed trends.

  5. Pingback: William M. Briggs, Statistician » Bad Astronomer Does Bad Statistics: That Wall Street Journal Editorial

  6. @certy, I suspect the test in Santer is slightly too conservative (I am not sure it fully accounts for the uncertainty in the physics, although I believe it does account properly for unforced variability). However it is rather a long time since I read it and I haven’t analysed the data to determine whether this is a substantial problem.

    I’ve not read MMH2010 yet, so I can’t comment on that. It is worth noting however that the majority of papers that are incorrect are never formally refuted, they just get ignored. There is just so much published these days that not many researchers write comments papers any more as there just isn’t time.

    IIRC Gavin Schmidt is also a co-author of the Santer paper, and he wrote an article on RealClimate discussing model-data comparison, and he used the “lies in the spread of the models” test (I suspect he used +- 2 standard deviations, but it amounts to pretty much the same thing). So there is precedent for the test, even if you don’t accept the explanation of why it is a reasonable test.

    Note that as I mentioned I think the Santer test is slightly conservative, so I am not unduly surprised if the test gives different results for different timespans. It is because the observations are around the limit of what the models can currently explain by natural variability. This could be a problem with the models, or with the observations, or it could be that the realisation of natural variability that we observe is unusual, or a combination of all three.

  7. Dr Briggs, science is going to continue using frequentist hypothesis tests for the foreseeable future, and they do provide a useful sanity check. The trends in the SkS escalator fail this sanity check, which is a good reason not to use them (which is the point of the diagram). The trends in the IPCC diagram have at least passed that basic sanity check, so it is a non-sequitur as I pointed out to suggest there is a double standard between the escalator and the IPCC diagram.

    “How can we know those models were true,”

    I’m with GEP Box on that one, all models are false, but some are useful.

    “As always, the best test is a model which predicts new data well (claim (B)). ”

    I would agree, however (i) we don’t have enough new data to reliably assess the projections of the model. (ii) the test should be a test of what the models actually predict, which (B) is not, for the reasons I have given.

    “We are right to suspect their models.”

    All models should be subject to suspicion.

  8. @dikran marsupial: Thanks for the information and the link. I appreciate you taking the time to help answer my questions. Hopefully I can return the favor someday. :)

    I will have a look at the double pendulum example you provided.

  9. @dikran marsupial: just read the example. Very well written btw.

    The pendulum example you provided was very well explained, but the analogy is misleading; you are comparing a system where the functions describing the dynamics are known to be perfectly precise and there is only one external influence which can be turned on or off at will.

    This is not the case with our moist blue sphere. We have multiple unknown forces, with an unknown number of interactions which are still not understood, and a method of simulation (box models, etc..) which is known to be imprecise.

    To use the pendulum example: the pendulum has two to infinity joints, between 0 and infinity electromagnets, the force of gravity is changing, and sometimes the pendulum is a superconductor, and we only get to see the pendulum animation as a 2×2 pixel image. (That’s my attempt at being funny.)

    Is my revised analogy in the right ballpark, or am I still horribly confused?

  10. @Will, yes the double pendulum is indeed much simpler than climate modelling for exactly the reasons you suggest; however even when you know the physics exactly and there is only one forcing, you can still only expect the observations to lie within the spread of the ensemble of models.

    Another good analogy is to consider having a time machine that could visit alternate realities where Earth had identical climate physics and identical forcings, but different initial conditions (perhaps a different butterfly flapped its wings in different universes). These alternate Earths are perfect models of climate change on our Earth, with perfect physics and essentially infinite temporal and spatial resolution. Climate modellers could not even theoretically improve on this model (without having exact information on initial conditions). We could make an ensemble from these alternate Earths, but even then you could only expect the climate on our Earth to lie in the spread of the ensemble somewhere.

    There would be no reason to expect the observed climate to be any closer to the ensemble mean than any of the alternate Earths comprising the ensemble (there is nothing special about our Earth).

  11. @Dikran Marsupial: Thanks for the response.

    What you are describing as a model ensemble sounds more like a sensitivity analysis. The model results are not actually combined to create an ‘ensemble estimate’, but rather are used in the way results of a risk assessment would be. Really, the models themselves become parameters in a way. Am I correct in thinking this?

    I’m really stuck on something though.. Using the pendulum example:

    If you have an unknown number of electromagnets and the pendulum could have an infinite number of moving segments, how can you be confident that any approximation of the pendulum system is a valid approximation?

    In other words, how can you know that you’re exploring the bounds of a valid, or even plausible, model? If you’re using a 2-10 arm model in your simulation, what reason is there to not use a 50 arm model? What reason is there not to include wind and air viscosity in to the pendulum model?

  12. @Will, it seems to me that the best way to view the ensemble is as a subjective Bayesian posterior distribution of the plausible outcomes given the climate modellers’ current understanding of climate physics. The spread is every bit as important as the mean in characterising the posterior. This will be very familiar to Bayesians as statistical decision theory would suggest that we integrate over the whole posterior in evaluating the expected loss for different courses of action, rather than just concentrating on the most plausible outcome (the ensemble mean). Of course our knowledge of climate physics is imperfect, to say the least, but it would be irrational to plan our course of action based on anything other than our best understanding of climate physics, including all uncertainties, as embodied in the models.

    The climate system does not have an infinite number of forcings, the major forcings such as solar activity, GHG radiative forcings, aerosols are well characterised in the models. Some minor forcings are less well characterised (e.g. clouds), but there isn’t currently a great deal of evidence to suggest that they are dominant, and no model can include all of the features of climate (without an Earth in a parallel universe). This is what GEP Box was getting at when he said “all models are false, but some are useful”; all models are necessarily abstractions or simplifications of reality in order to be tractable, and it is always worth bearing in mind that the model is not “true”, no matter how well it fits the observations. Similarly the temporal and spatial resolution of the model don’t need to be infinite for the model to be useful. Such models are routinely used in science and engineering. The higher the resolution the better the simulation of climate, but that is the nature of approximations.

    The important thing I am trying to say is that if you want to compare observations against the multi-model mean, then there is no reason to expect the observations to be any closer to the mean than the plausible range of effects of unforced variation. The problem is that we cannot estimate that range from a single data point (the observed climate on our Earth); the best we can do is to estimate it from climate models. This makes any reasonable test of the models rather circular (other than the lies in the spread test) as it would rely on the spread of the model ensemble not being an underestimate of the plausible spread due to unforced variability. As the models are less complex than reality, this doesn’t seem a reasonable assumption to me.

    The bottom line is that the model is wrong, we know that as all models are wrong (GEP Box), the question is, “are the models useful”, and if the answer is “no” what are you going to replace them with that will be better? (Statistical forecasting won’t be, as it will involve extrapolating from the model beyond the conditions under which it was calibrated, which is very risky unless the model captures causal rather than statistical relationships.)
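The decision-theory point in this comment (integrate the loss over the whole ensemble rather than evaluate it at the ensemble mean) can be shown in a few lines. The outcomes and the quadratic loss function below are hypothetical, chosen only to make the arithmetic visible.

```python
import statistics


def loss(outcome):
    """Hypothetical convex loss: damage grows with the square of the outcome."""
    return outcome ** 2


# Made-up ensemble of plausible outcomes (for example, degrees of warming).
ensemble = [1.5, 2.0, 2.5, 3.0, 4.5]

loss_at_mean = loss(statistics.fmean(ensemble))
expected_loss = statistics.fmean([loss(x) for x in ensemble])

print("loss at ensemble mean:", round(loss_at_mean, 2))
print("expected loss over ensemble:", round(expected_loss, 2))
```

With a convex loss the expected loss exceeds the loss at the mean, so the spread of the ensemble changes the answer even when the mean is unchanged.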

  13. @Dikran Marsupial: We are on the same page regarding the ensemble. Having never made a perfect model of anything, I happily agree with you that a model can be imperfect while still being useful.

    That said, there are a number of assumptions made in the description you provide that seem to be important to the outcome, and leave me a little unsure of how the process works.

    - How do you know that a forcing is a major or minor one if the system is untestable? From what I have read elsewhere it seems as though the forcings haven’t been nailed down. Like in the pendulum example, simply by observing the pendulum’s location we would have no way of deducing the number (or strength) of electromagnets and arms.

    - How do you know that the model embodies the forcing properly? Back to the pendulum example; how do you know how powerful each magnet is? What is the coefficient of friction for each segment on the pendulum’s arm? We could make a bunch of guesses, but would have no way of knowing if they are right or wrong; it’s untestable.

    I guess what I’m getting at is this: How do you know that the ensemble output is any better, or worse, than simply picking a number at random? How do you test it to make sure that it is better/worse than simply picking a number at random? I understand you check the ensemble spread, but anyone could say ‘0 +/-5’ and be pretty much right.

  14. Dikran,

    I think you miss the point when you say: “I am not unduly surprised if the (Santer) test gives different results for different timespans”.

    The key point is that when you use the full time period of the observations, incorporating all available data from 1979-2009, the model projections significantly differ from the obs. This should give one pause about the models.

  15. @Will “- How do you know that a forcing is a major or minor one if the system is untestable? ” mainly from the observations (including paleoclimate) and from physics/experiments. I didn’t say that the projections are not testable, just that a fair test should be based on what the modellers claim to be able to do, not what they claim not to be able to do. If the observations lie within the stated uncertainty of the model, then the model is essentially performing as well as it claims to be able to perform.

    “How do you know that the model embodies the forcing properly?” The forcings are the input to the model, not part of it. However if you mean “how do you know the model responds to the forcing properly” then strictly speaking we know that they don’t because all models are wrong, as they are only approximations to reality. The question is whether the approximation is good enough to be useful.

    “I guess what I’m getting at is this: How do you know that the ensemble output is any better, or worse, than simply picking a number at random? ” O.K., take a random number generator and see if it would have better predictive ability than Hansen’s 1988 projections. You will find that it won’t.

    However, you are missing the point of the models, which is to tell us the plausible consequences of our actions according to our knowledge of climate physics. Which is more rational, making decisions based on the expert understanding of climatologists, or based on a random number generator? Yes obviously there is a flippant answer to that question, but ask yourself why we have doctors diagnose disease rather than flip a coin. If you think climatology is an imperfect science, then medical science is way worse, we are barely beginning to map out gene regulatory networks for example, which are vitally important in understanding the body’s reaction to drugs, so why do we trust the knowledge of doctors, but not climatologists (I can assure you there is WAY more money involved in drug research than there ever will be in climatology)?

  16. certy, as I have pointed out, the observations are only “significantly” different if you choose the right start and end dates, and choose a test that as I have pointed out is over-conservative. It shouldn’t give anyone “pause about the models” because scrutiny and consideration of the models should be continuous and ongoing, which is in fact exactly what the climatologists actually do. Go read the relevant article at Real Climate and you will find Gavin Schmidt, a climate modeller, openly discussing the shortcomings of the models.

    As I said, the observations are on the borderline of what the models consider plausible and there are three reasons why this may be (the models, the observations, that the unforced variation is currently highly unusual). I don’t see any reason why we should fixate on the models, we should keep an open mind about all three possibilities.

  17. Dikran, Again, no, it is not a matter of whether you “choose the right start and end dates”. This is simply choosing all the available data. When you are doing this and using 30 years of data, the old accusation of cherry picking just doesn’t hold water.

    I also don’t get your point that this is not a reflection on the models because “scrutiny and consideration of the models should be continuous and ongoing.” This is a non sequitur.

    We have a situation whereby a number of different published statistical methodologies show a significant difference between observations and model trends. Of course we should keep an open mind about why this is, but your reluctance to accept that this creates questions about the models is puzzling. As is your continual refusal to concede that the appropriate test is not your simple “spread of the ensemble”. Can you point to a single peer reviewed paper that has used this spread test?

  18. @Dikran Marsupial:

    I think you answered my question. :) The range of the ensemble is used to determine if the models are accurate by measuring the observation against the range of predictions.

    The models are tested, and they are validated (possibly against unseen data.)

    That sounds like a predictive measure to me. No need to talk about monte-carlo or pendulums. Data in, prediction out. :)

    Is this correct?

  19. @certy I didn’t say cherry picked, I was merely pointing out that the result of the test is not robust because if you change the start and end dates slightly, you change the result of the test. Given ENSO it is quite likely that if we wait a little longer the difference will go back to being not significant again, but that won’t actually mean that anything has suddenly changed either.

    “I also don’t get your point that this is not a reflection on the models because “scrutiny and consideration of the models should be continuous and ongoing.” This is a non sequitur.”

    Given that I didn’t say that it is not a reflection on the models, then of course it is a non-sequitur. What I did say is that the change in the results of the test is not a cause for pause (for thought) regarding the models; the observations were bumping along the threshold already, which is reason for scrutiny of the model already BEFORE the result of the test flipped.

    “We have a situation whereby a number of different published statistical methodologies show a significant difference between observations and model trends. Of course we should keep an open mind about why this is, but your reluctance to accept that this creates questions about the models is puzzling.”

    I don’t know how more clearly I can say this than I have already, but the questions about the models were already there and being discussed in the litterature before the likes of Douglass et al were published.

    “As is your continual refusal to concede that the appropriate test is not your simple “spread of the ensemble”.”

    Well if you could point out why the test is incorrect (i.e. point out the flaw in the reasoning) then I might. However so far all you have done is point out that there are other tests, at least one of which is fundamentally incorrect, and one that I have pointed out is conservative.

    ” Can you point to a single peer reviewed paper that has used this spread test?”

    No, not off hand; to be honest I was rather surprised that Santer et al didn’t use it. Their test is only slightly conservative compared to the “spread of the ensemble” test, so it doesn’t make a great deal of difference. If the models currently fail the Santer test, they are very close to failing the spread of the ensemble test if they haven’t already. It is a bad idea to think of the tests as simply pass-fail, especially if the test is updated so that it is sequentially repeated.

    This is why I prefer Bayes factors, which are a continuous assessment of the relative plausibility of two hypotheses.
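
    As a rough illustration of the Bayes-factor idea, the sketch below compares two hypotheses about an observed trend. Treating each hypothesis as a normal distribution, and every number used, are assumptions made purely for illustration; none of this comes from any of the papers discussed.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor(observed, h1, h2):
    """Ratio of the likelihood of `observed` under h1 to that under h2.

    h1 and h2 are (mean, sd) pairs; both are illustrative assumptions.
    """
    return normal_pdf(observed, *h1) / normal_pdf(observed, *h2)

# Hypothetical trends in degrees C per decade (not real data).
h_models = (0.20, 0.10)    # H1: trend as projected by an ensemble
h_no_trend = (0.00, 0.10)  # H2: no underlying trend

for obs in (0.05, 0.10, 0.15):
    print(obs, round(bayes_factor(obs, h_models, h_no_trend), 3))
```

    The ratio moves smoothly as the observation moves between the two hypotheses, rather than flipping from pass to fail at a threshold.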

  20. @will, sorry it seems to me that the discussion is no longer being taken seriously, so I will leave it there.

  21. Dikran,

    Perhaps we are in agreement on one point – the fact that the models and observations are, in your words, “bumping along the threshold” means there is doubt about the accuracy of the models. Matt’s claim (B) that “the IPCC forecasts have been systematically too large” seems buttressed by this (though of course not proven).

    I already pointed out why the spread of the ensemble is a poor test. You can have a very poor model that is biased to the high side and another very poor model that is biased to the low side. The average of the ensemble is unchanged but you will have a spread so wide as to make your suggested test meaningless. As has been described in a number of blog discussions, this seems to be the case we are facing in reality. Some less sophisticated models from less sophisticated countries were included in AR4. Ideally they would have been weeded out, but politics and courtesy dictated otherwise. So, if you are going to attempt a test of an ensemble spread nature, at a minimum you would need to eliminate all outliers. I can’t see you getting a simple spread test past peer review.

    As an aside, I have never said anything contrary to your point that “the questions about the models were already there and being discussed in the litterature (sic) before the likes of Douglass et al were published”. Why do you think otherwise?

    I also disagree with your statement that “the result of the test is not robust because if you change the start and end dates slightly, you change the result of the test.” MMH did not change dates “slightly”, they increased the time period by a full 50% (20 years to 30 years). Thirty years is a robust time period for a trend.
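
    The objection to the spread test can be sketched numerically. In the example below, the model trends and the observed trend are hypothetical values chosen only to show the effect: adding one model biased high and one biased low leaves the ensemble mean where it was, yet widens the range enough that the observation falls inside it.

```python
import statistics

def ensemble_summary(trends):
    """Return (mean, min, max) of a list of model trends."""
    return statistics.mean(trends), min(trends), max(trends)

# Hypothetical model trends in degrees C per decade (not real data).
core_models = [0.18, 0.20, 0.22]
# Add one model biased high and one biased low by the same amount.
with_poor_models = core_models + [0.20 + 0.30, 0.20 - 0.30]

observed = 0.05  # hypothetical observed trend

for label, trends in (("core", core_models), ("with poor models", with_poor_models)):
    mean, low, high = ensemble_summary(trends)
    inside = low <= observed <= high
    print(f"{label}: mean={mean:.2f} range=[{low:.2f}, {high:.2f}] inside={inside}")
```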

  22. @Dikran: I’m feeling a bit neglected here as you focused on a couple of firefights and missed my question of 3 February 2012 at 3:28 pm. I’ll rephrase and make it a bit briefer:

    The IPCC ensemble, as I understand your description, seems to refer to a single model with let’s say a dozen parameters of which a couple (CO2 being one) have been singled out as “forcings” for exploration through a series of scenarios. The other parameters remain unchanged and their uncertainties are not reflected in the ensemble, nor is uncertainty in the original data. Is this correct?

    If so, it seems that the ensemble does not reveal any uncertainties of the model, but only the usual uncertainties associated with forecasting covariates in order to forecast a desired outcome. You’re playing “what if” assuming the model (and the data upon which it was tuned) itself had no uncertainties.

    Or am I misunderstanding it again?

  23. @Dikran: Reading @certy’s comment after just reposting my question, it appears that there actually is an ensemble of different models from different sources, as with weather forecasting ensembles. This affects my previous question to the extent that the models truly come from different perspectives and thus explore the space.

    I seem to remember you saying in an earlier posting that the ensemble was basically all based on an IPCC model, or something like that, and perhaps in my previous question I misconstrued that to mean a single model. Nonetheless, if the scenarios are common, the models must all be using the same visible parameters (forcings) and a host of invisible parameters (everything that’s tuned with historical data, or to back-cast reasonably). I don’t think there’s any statistical argument that would say that their invisible parameters would cancel each other out, even if they didn’t have many in common and even if they weren’t tuning to the same training set.

    Thoughts?

  24. Dikran – First, I want to thank you for providing the first easy to understand explanation of why some (e.g., Santer, Schmidt, and you) feel that the appropriate method for evaluating models is to see whether the observations fall within the range of the model spread. Never quite got it from either Gavin’s explanation or reading Santer. And I agree with you that this is an appropriate method IF the models are being used for their original purpose of trying to understand how climate works and how various components of climate interact. However, if the models are being used to make projections or predictions (use whichever term you prefer) about how future climate might unfold, then we need to have a more stringent test that tells us something about the skill of the model at matching whatever is our parameter of interest. The compare-to-the-spread method simply is not useful for that, since it essentially says all models, from those that should get a D- to those that deserve an A, are good enough. In that case, if the claim is that the ensemble mean is the best estimate of what will happen, then it is fair to directly compare the observations (and associated error bars) to the ensemble mean (and associated error).

  25. My point is that sometimes different questions are better answered by different approaches.

  26. @Dikran Marsupial: Why are you saying that this discussion is not being taken seriously? If that’s how you feel then I can assure you that there has been a misunderstanding.

    I appreciate your willingness to answer the questions of a complete stranger, but try looking at it from the ‘other side’. When I brought up a ‘random number’ model you said:

    O.K., take a random number generator and see if it would have better predictive ability than Hansen’s 1988 projections. You will find that it won’t.

    But haven’t you been saying, all along, that the very test you just proposed isn’t applicable??

    You also said:
    “If the observations lie within the stated uncertainty of the model, then the model is essentially performing as well as it claims to be able to perform”

    This would imply that the model is not performing as well if the observations lie outside of the stated uncertainty of the model. Great! We are out of the world of Monte Carlo and chaos, and in the land of quantifiable testing. That’s a bad thing?

  27. @certy I already said that if you are concerned about the test depending on the two most extreme models you can define the spread as 2 times the standard deviation of the model runs, which addresses that issue. There is no such thing as an outlier in the ensemble; they all represent outcomes that are plausible according to our understanding of the physics.

    Regarding the point about non-robustness of the test, I was discussing the Santer et al test. I haven’t read the MMH paper yet, so I can’t comment on that test.

    The point that is continually being missed is that, when discussing whether the models are accurate or not, you need to know how accurate we could reasonably expect a model to be even if it were perfect. This error would not be zero; it would depend on the plausible magnitude of unforced variability, which we have no means of estimating other than by the models we seek to test. This is the key point.
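
    A minimal sketch of the spread test described above, with the spread defined as 2 times the standard deviation of the model runs. The run values are hypothetical, and using the sample standard deviation centred on the ensemble mean is an assumption of this sketch, not a detail taken from Santer et al.

```python
import statistics

def within_spread(observed, model_runs, k=2.0):
    """True if `observed` lies within k sample standard deviations of the ensemble mean."""
    mean = statistics.mean(model_runs)
    sd = statistics.stdev(model_runs)
    return abs(observed - mean) <= k * sd

# Hypothetical trends from model runs, degrees C per decade (not real data).
runs = [0.10, 0.15, 0.20, 0.25, 0.30]

print(within_spread(0.05, runs))
print(within_spread(-0.10, runs))
```

    Because the threshold depends on the standard deviation of all the runs rather than on the two most extreme ones, a single unusual run moves it less than it would move a min-max range.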

  28. Wayne, the forcings are the input to the models, they are not parameters. We don’t know what these inputs will actually be in the future, so the IPCC have a set of scenarios that they consider to be representative of what might happen. The reason they use the term “projections” rather than “predictions” is to emphasise the fact that they are projecting what might plausibly happen IF the scenario of forcings happens.

    As you have gathered from certy, the multi-model ensemble has models from various different groups, which covers some of the uncertainty in the modelling community regarding climate physics. The modellers also perform “perturbed physics” runs, where they can examine the sensitivity of the projections to changes in the physics in the models.

    “You’re playing “what if” assuming the model (and the data upon which it was tuned) itself had no uncertainties”

    The models are indeed playing “what if” simulations, but the model uncertainties do contribute to the spread of the ensemble run, so they are considered.

    It is a mistake to think of the models as if they were statistical models calibrated on training data. They are mostly constructed from knowledge of physics and only some parts of the models are “parameterised” and calibrated with data. If the models were very sensitive to the training data you can be sure that skeptic scientists would by now have come up with a GCM that showed that the warming can be explained without CO2, but no such model has been implemented.

    Hope this helps; however, I am only a statistician (who has worked with model output), so for detailed questions about the models it would be better to ask someone like Gavin Schmidt at Real Climate, who is heavily involved in that side of things.

  29. BobN, we can’t assess the quality of the models without knowing how well a perfect model would be able to predict the observed climate, and we have no way of estimating that from a single datapoint (the climate on the Earth we can actually observe). The only estimate we have of this is from the models themselves.

    If we just choose the model that gives the best prediction of the observed climate it may just be that the systematic bias in that particular model by random chance matches the effects of unforced variability on the observations, and hence may well give worse predictions than the multi-model mean.

    It is a bit like predicting the outcome of one roll of a biased die using an ensemble of biased dice. Say the die in question is biased so that it gives high numbers more frequently than low numbers, but on the occasion we actually observe a roll we get a two. In this case the model from the ensemble giving the best prediction is likely to be one that is biased low and hence will give worse predictions of future rolls and will probably be worse than just averaging over the ensemble.

    We know that none of the models in the ensemble are “true”, but our knowledge of the physics means they are all plausible, and we don’t have enough data to confidently rule any of them out. So we are better off keeping all of the models and having a broad spread of projections which honestly represents our uncertainty.
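
    The dice analogy can be worked through exactly, without simulation. Everything in the sketch below is a hypothetical construction: the linear "tilt" used to bias each die, the four dice in the ensemble, and the choice of squared error as the measure of prediction quality.

```python
FACES = [1, 2, 3, 4, 5, 6]

def die(tilt):
    """Probabilities for a die whose face weights are 1 + tilt * (face - 3.5).

    tilt > 0 favours high faces, tilt < 0 favours low faces (hypothetical form).
    """
    weights = [1.0 + tilt * (face - 3.5) for face in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean_roll(probs):
    """Expected value of one roll of a die with the given face probabilities."""
    return sum(f * p for f, p in zip(FACES, probs))

def expected_sq_error(prediction, true_probs):
    """Expected squared error of a fixed prediction against future rolls."""
    return sum(p * (f - prediction) ** 2 for f, p in zip(FACES, true_probs))

true_die = die(0.3)                                  # the real die is biased high
ensemble = [die(t) for t in (-0.3, -0.1, 0.1, 0.3)]  # ensemble of biased dice
observed_roll = 2                                    # the one roll we happened to see

# "Best" model judged on the single observation: mean closest to the roll.
best_single = min(ensemble, key=lambda p: abs(mean_roll(p) - observed_roll))
ensemble_mean = sum(mean_roll(p) for p in ensemble) / len(ensemble)

print(expected_sq_error(mean_roll(best_single), true_die))
print(expected_sq_error(ensemble_mean, true_die))
```

    Under these assumptions the die picked for matching the single low roll is the one biased low, and its expected error on future rolls is larger than that of the ensemble average.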

  30. @Will, the uncertainty of the projections is of vital importance. In statistical decision theory we should evaluate the expected loss by averaging the losses for each outcome, weighted by its plausibility according to the model. You need to stop focussing on the accuracy of the models, and consider the importance of the uncertainty.

    We can only determine if the models are accurate if we know the plausible magnitude of the effects of unforced variability. If you know how to estimate that without the models we are testing, then explain how.
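
    The expected-loss calculation mentioned above can be sketched as follows. The outcomes, their plausibilities, and the quadratic loss function are all hypothetical; the two plausibility vectors share the same central estimate and differ only in how much uncertainty surrounds it.

```python
def expected_loss(outcomes, plausibility, loss):
    """Average the loss over outcomes, weighted by each outcome's plausibility."""
    total = sum(plausibility)
    return sum(loss(o) * p for o, p in zip(outcomes, plausibility)) / total

def quadratic_loss(warming):
    """Illustrative loss that grows with the square of the warming."""
    return warming ** 2

# Hypothetical warming outcomes (degrees C) and plausibilities (not real data).
outcomes = [1.0, 2.0, 3.0, 4.0]
narrow = [0.0, 1.0, 0.0, 0.0]    # all weight on a single outcome
broad = [0.25, 0.5, 0.25, 0.0]   # same mean, but spread over several outcomes

print(expected_loss(outcomes, narrow, quadratic_loss))
print(expected_loss(outcomes, broad, quadratic_loss))
```

    With a loss that rises faster than linearly, the broader distribution gives the larger expected loss even though the central estimate is unchanged, which is one way to see why the uncertainty matters and not just the accuracy of the mean.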

  31. @dikran marsupial: I don’t think I am ignoring the importance of the uncertainty. I don’t have a problem with saying that a prediction has an upper limit in terms of accuracy. I accept your position regarding upper and lower bounds. You seem determined to avoid any kind of testable metric relating to the models though, and this I find puzzling.

    I cannot abandon the concept of testing against observations, and that is what you seem to be asking me to do. To do otherwise would be an act of faith.

  32. @will there is an absolute testable metric, which is whether the models can explain the observations (in the sense that the observations lie within the spread of outcomes that the models consider plausible). There are relative metrics as well: we can see if one model predicts the observations better than another in a least-squares sense, for example. However, the problem is: what does that actually mean?

    Nobody has abandoned the idea of testing the models against the observations. That is exactly what Ben Santer’s research group does, for example. There is a chapter in the IPCC report devoted to exactly that topic. The point is that in testing you have to understand what you can reasonably expect in terms of accuracy (e.g. Santer et al.), and what would be an unreasonable expectation (e.g. Douglass et al). The key to that is understanding what the modellers claim their models actually do and basing the test on that, rather than on something they do not claim their models do (or even worse something they would tell you they can’t do).

    I am not against testing, I am for fair testing, and open mindedness and skepticism where the statistical evidence is equivocal.
