Why multiple climate model agreement is not that exciting

There are several global climate models (GCMs), produced by many different groups: a half dozen from the USA, some from the UK Met Office, a well-known one from Australia, and so on. GCMs are a truly global effort. These GCMs are of course referenced by the IPCC, and each version is known to the creators of the other versions.

Much is made of the fact that these various GCMs show rough agreement with each other. People have the sense that, since so many “different” GCMs agree, we should have more confidence that what they say is true. Today I will discuss why this view is false. This is not an easy subject, so we will take it slowly.

Suppose first that you and I want to predict tomorrow’s high temperature in Central Park in New York City (this example works just as well for anything we want to predict, from stock prices to the number of people who will vote for a certain USA presidential candidate). I have a weather model called MMatt. I run this model on my computer and it predicts 66 degrees F. I then give you this model so that you can run it on your computer, but you are vain and rename the model to MMe. You make the change, run the model, and announce that MMe predicts 66 degrees F.

Are we now more confident that tomorrow’s high temperature will be 66 because two different models predicted that number?

Obviously not.

The reason is that changing the name does not change the model. Simply running the model twice, or a dozen times, or a hundred times gives us no more evidence than running it once. We reach the same conclusion if, instead of predicting tomorrow’s high temperature, we use GCMs to predict next year’s global mean temperature: no matter how many times we run the model, or in how many different places in the world we run it, we are no more confident of the final prediction than if we had run the model only once.

So Point One of why multiple GCMs agreeing is not that exciting is this: if all the different GCMs are really the same model, each just carrying a different name, then we have gained no new information by running the models many times. And we might suspect that if somebody keeps telling us that “all the models agree” to imply there is greater certainty, he either does not understand this simple point or has ulterior motives.

Are all the many GCMs touted by the IPCC the same except for name? No. Since they are not, we might hope to gain much new information from examining all of them. Unfortunately, they are not, and cannot be, that different either. We cannot here go into the detail of each component of each model (books are written on these subjects), but we can make some broad conclusions.

The atmosphere, like the ocean, is a fluid and it flows like one. The fundamental equations of motion that govern this flow are known. They cannot differ from model to model; or to state this positively, they will be the same in each model. On paper, anyway, because those equations have to be approximated in a computer, and there is neither universal agreement on, nor a proof of, the best way to do this. So the manner in which each GCM implements this approximation might differ, and these differences might cause the outputs to differ (though this is not guaranteed).
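To make that concrete, here is a toy illustration (nothing below comes from an actual GCM; the equation, the grid, and the two schemes are stock textbook choices): the very same continuous equation, approximated in two different ways, yields two different sets of numbers.

```python
import numpy as np

# Toy version of the point above: du/dt + c du/dx = 0 on a periodic domain,
# approximated with two common finite-difference schemes. The equation "on
# paper" is identical; only the discretization differs.
nx = 200
c = 1.0
dx = 1.0 / nx
dt = 0.4 * dx / c                       # Courant number 0.4, stable for both schemes
C = c * dt / dx
x = np.arange(nx) * dx
u0 = np.exp(-((x - 0.5) / 0.05) ** 2)   # initial Gaussian bump

def step_upwind(u):        # first-order upwind
    return u - C * (u - np.roll(u, 1))

def step_lax_wendroff(u):  # second-order Lax-Wendroff
    return (u - 0.5 * C * (np.roll(u, -1) - np.roll(u, 1))
              + 0.5 * C ** 2 * (np.roll(u, -1) - 2 * u + np.roll(u, 1)))

u_a, u_b = u0.copy(), u0.copy()
for _ in range(round(1.0 / dt)):        # advect the bump once around the domain
    u_a, u_b = step_upwind(u_a), step_lax_wendroff(u_b)

print("peak of the exact solution :", round(u0.max(), 3))
print("peak from scheme A (upwind):", round(u_a.max(), 3))   # badly smeared out
print("peak from scheme B (Lax-W) :", round(u_b.max(), 3))   # much closer
```

Both schemes approximate the same physics; they simply disagree about the arithmetic, and after enough time steps that disagreement is visible in the output.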

The equations describing the physics of a photon of sunlight interacting with our atmosphere are also known, but these interactions happen on a scale too small to model, so the effects of sunlight must be parameterized: a semi-statistical, semi-physical guess at how the small-scale effects accumulate to the large scales used in GCMs. Parameterization schemes can differ from model to model, and these differences almost certainly will cause the outputs to differ.
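Here is a cartoon of why parameterization is unavoidable and why it introduces choices (the "process", the thresholds, and the coefficient below are invented purely for illustration; they are not taken from any real scheme): a nonlinear small-scale process evaluated at the grid-box average is not the same as the average of the process over the unresolved variability, so the modeler has to guess a functional form and tune a coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented small-scale "process": something switches on sharply above a
# humidity threshold (a stand-in for any nonlinear sub-grid physics).
def process(humidity, threshold=0.8):
    return np.clip(humidity - threshold, 0.0, None)

# One coarse grid box contains many unresolved points with varying humidity.
subgrid = np.clip(rng.normal(0.75, 0.10, size=10_000), 0.0, 1.0)

truth = process(subgrid).mean()            # what actually happens in the box
at_the_mean = process(subgrid.mean())      # what the coarse model can compute directly

# The coarse model cannot see `truth`, so it uses a parameterization: a guessed
# form with a lower threshold and a tunable coefficient alpha.
def parameterized(box_mean, alpha):
    return alpha * np.clip(box_mean - 0.7, 0.0, None)

print("true grid-box average effect:", round(truth, 4))
print("process at the mean state   :", round(at_the_mean, 4))   # zero: badly wrong
for alpha in (0.4, 0.8):
    print(f"parameterized, alpha={alpha}  :",
          round(parameterized(subgrid.mean(), alpha), 4))
# Different groups can defensibly choose different forms and different alphas,
# and those choices are one reason model outputs differ.
```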

And so on for the other components of the models. Already, then, it begins to look like there might be a lot of different information available from the many GCMs, so we would be right to make something of the cases where these models agree. Not quite.

The groups that build the GCMs do not work independently of one another (nor should they). They read and write for the same journals, attend the same conferences, and are familiar with each other’s work. In fact, many of the components used in the different GCMs are the same, even exactly the same, in more than one model. The same person or persons may be responsible, through some line of research, for a particular parameterization used in all the models. Computer code is shared. Thus, while there are some reasons for differing output (and we haven’t covered all of them yet), there are many more reasons that the output should agree.

Results from different GCMs are thus not independent, so our enthusiasm generated because they all roughly agree should at least be tempered, until we understand how dependent the models are.
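One hedged way to put a number on "how dependent": if the errors of n models share an average pairwise correlation rho, the ensemble behaves roughly like n / (1 + (n - 1) * rho) independent models. The rho values below are invented for illustration; nobody knows the true figure for the GCMs.

```python
# Effective number of independent models when model errors share an average
# pairwise correlation rho: rho = 0 recovers n, rho = 1 collapses to a single model.
def n_effective(n, rho):
    return n / (1 + (n - 1) * rho)

for rho in (0.0, 0.3, 0.6, 0.9, 1.0):
    print(f"20 models with error correlation {rho:.1f} are worth about "
          f"{n_effective(20, rho):.1f} independent ones")
```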

This next part is tricky, so stay with me. The models differ in more ways than just the physical representations previously noted. They also differ in strictly computational ways and through different hypotheses of how, for example, CO2 should be treated. Some models use a coarse grid point representation of the earth and others use a finer grid: the first method generally attempts to do better with the physics but sacrifices resolution, the second method attempts to provide a finer look at the world, while typically sacrificing accuracy in other parts of the model. While the positive feedback in temperature caused by increasing CO2 is the same in spirit for all models, the exact way it is implemented in each can differ.

Now, each climate model, as a result of the many approximations that must be made, has, if you like, hundreds (even thousands) of knobs that can be dialed to and fro. Each twist of a dial produces a difference in the output. Tweaking these dials is therefore a necessary part of the model-building process. The models are tuned so that, as closely as possible, they first produce climate that looks like the past, already observed, climate. Much time is spent on this tuning and tweaking so that the models can, at least roughly, reproduce past climate. Thus, the fact that all the GCMs can roughly represent the past climate is again not as interesting as it first seemed. They had better, or nobody would seriously consider them contenders.

Reproducing past data is a necessary but not sufficient condition for the models to be able to predict future data. It is thus not at all clear how these tweakings affect the accuracy of predictions of new data, meaning data that was not used in any way to build the models: that is, future data. Predicting future data has several components.
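A toy illustration of "fits the past, fails the future", with synthetic data and ordinary polynomial knobs standing in for model parameters (this is not a claim about how any actual GCM is tuned): enough adjustable knobs will always reproduce the observed period, and the out-of-sample error is the only part that tells you anything.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "climate history": a gentle trend, a slow oscillation, and noise.
years = np.arange(1900, 2021)
x = (years - 1900) / 100.0                       # rescaled time, for numerical stability
truth = 0.5 * x + 0.2 * np.sin(2 * np.pi * x / 0.6)
obs = truth + rng.normal(0.0, 0.1, size=x.size)

past = years <= 2000                             # data available for tuning
future = years > 2000                            # data the model has never seen

for knobs in (2, 10):                            # number of adjustable parameters
    coeffs = np.polyfit(x[past], obs[past], deg=knobs)
    fitted = np.polyval(coeffs, x)
    rmse_past = np.sqrt(np.mean((fitted[past] - obs[past]) ** 2))
    rmse_future = np.sqrt(np.mean((fitted[future] - obs[future]) ** 2))
    print(f"{knobs:2d} knobs: past RMSE {rmse_past:.3f}, out-of-sample RMSE {rmse_future:.3f}")
# More knobs always improve the fit to the past; the out-of-sample error is
# under no such obligation.
```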

It might be that one of the models, say GCM1, is the best of the bunch in the sense that it matches future data most closely. If this is always the case, if GCM1 is always closest (using some proper measure of skill), then it means that the other models are not as good; they are wrong in some way, and thus they should be ignored when making predictions. The fact that they come close to GCM1 should not give us more reason to believe the predictions made by GCM1. The other models are not providing new information in this case. This argument, which is admittedly subtle, also holds if a certain group of GCMs is always better than the remainder of the models. Only the close group can be considered as providing independent evidence.

Even if you don’t follow—or believe—that argument, there is also the problem of how to quantify the certainty of the GCM predictions. I often see pictures like this:
[Figure: GCM predictions, showing the outputs of several GCMs with a probability curve drawn over their spread]
Each horizontal line represents the output of a GCM, say predicting next year’s average global temperature. It is often thought that the spread of the outputs can be used to describe a probability distribution over the possible future temperatures. The probability distribution is the black curve drawn over the predictions, which neatly captures the range of possibilities. This particular picture appears to say that there is about a 90% chance that the temperature will be between 10 and 14 degrees. It is at this point that people fool themselves, probably because the uncertainty in the forecast has been so prettily quantified by some sophisticated statistical routine. But the probability estimate is just plain wrong.
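Written out, the calculation the picture implies looks something like the following (the eight numbers are invented to mimic the cartoon). Its hidden assumption is that the spread of the models measures the uncertainty about the real world, rather than merely the disagreement among the models.

```python
import numpy as np

# Eight invented GCM outputs (degrees), mimicking the cartoon above.
outputs = np.array([10.3, 10.9, 11.4, 11.8, 12.1, 12.6, 13.2, 13.7])

# The naive step: treat the model spread as if it were the forecast uncertainty.
mu = outputs.mean()
sigma = outputs.std(ddof=1)
z90 = 1.645                       # normal quantile for a two-sided 90% interval
print(f"naive 90% interval: {mu - z90 * sigma:.1f} to {mu + z90 * sigma:.1f} degrees")
# With these made-up numbers the interval comes out near 10 to 14 degrees,
# which is exactly the sort of statement the black curve encourages.
```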

How do I know this? Suppose that each of the eight GCMs predicted that the temperature will be 12 degrees. Would we then say, would anybody say, that we are now 100% certain in the prediction?

Again, obviously not. Nobody would believe that if all GCMs agreed exactly (or nearly so) that we would be 100% certain of the outcome. Why? Because everybody knows that these models are not perfect.

The exact same situation was met by meteorologists when they tried this trick with weather forecasts (this is called ensemble forecasting). They found two things. First, the probability forecasts made by this averaging process were far too sure: the probabilities, like our black curve, were too tight and had to be made much wider. Second, the averages were usually biased, meaning that the individual forecasts all had to be shifted upwards or downwards by some amount.
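A stripped-down sketch of the correction the weather forecasters ended up making (the numbers are simulated, and real schemes such as ensemble model output statistics are more elaborate): use past forecast-observation pairs to estimate the bias and the spread the errors actually had, then issue the shifted, widened distribution instead of the raw ensemble spread.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated history: 500 past cases of (raw ensemble mean, observed value).
# This toy ensemble runs warm by 1.5 degrees and its raw spread is far too narrow.
truth = rng.normal(15.0, 5.0, size=500)
ens_mean = truth + 1.5 + rng.normal(0.0, 1.0, size=500)
raw_spread = 0.4                                        # overconfident raw ensemble spread

bias = np.mean(ens_mean - truth)                        # estimated from past pairs
real_spread = np.std(ens_mean - truth, ddof=1)          # spread the errors really had

print(f"estimated bias        : {bias:+.2f} degrees (to be subtracted)")
print(f"raw ensemble spread   : {raw_spread:.2f} degrees")
print(f"spread the errors had : {real_spread:.2f} degrees (widen to this)")

# Tomorrow's corrected forecast: shift the raw mean, use the widened spread.
raw_tomorrow = 20.0
print(f"corrected forecast    : {raw_tomorrow - bias:.1f} "
      f"+/- {1.645 * real_spread:.1f} (90%)")
```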

This should also be true for GCMs, but the fact has not yet been widely recognized. The amount of certainty we have in future predictions should be less, but we also have to consider the bias. Right now, all GCMs are predicting warmer temperatures than are actually occurring. That means the GCMs are wrong, or biased, or both. The GCM forecasts should be shifted lower, and our certainty in their predictions should be decreased.

All of this implies that we should take the agreement of GCMs far less seriously than is often supposed. And if anything, the fact that the GCMs routinely over-predict is positive evidence of something: that some of the suppositions of the models are wrong.

Categories: Statistics

120 replies

  1. Each horizontal line represents the output of a GCM, say predicting next year’s average global temperature.
    Vertical?

    The same person or persons may be responsible, through some line of research, for a particular parameterization used in all the models.
    Yes. This happens for many reasons. One is that when a model requires many parameterizations, each must be developed by specialists. One or two seem to work best in the limited circumstances in which they are tested. (The reason testing is limited is that it costs too much to test under any and all circumstances.) These begin to be adopted, and if you are writing a full model containing many parameterizations, in most fields it ends up being easier to publish your paper if you stick with the parameterizations that are most commonly used: this minimizes the likelihood that a reviewer will turn down your work because of the choice of parameterizations.

    If it is a great parameterization, that’s great. But if it’s not, then many models end up biased in the same way.

    In the end, the proof that the chosen parameterizations result in predictive ability is to test against data collected after the projections were made. Simpler tests are better than more complicated ones, and right now, the predictions are on the high side.

  2. Your examples are mostly fine, but your conclusion does not follow. Firstly, the models *are* different – some have similar pedigrees but in no case is the same model being run under a different name – that is just a strawman argument. You are however correct in stating that they are not completely independent and so cannot be assumed ‘a priori’ to be independent draws from some underlying distribution of all possible models.

    The models are based on similar underlying assumptions (conservation of energy, momentum, radiative transfer etc.) but these are implemented independently and with different approximations. If you ask the question, what are the consequences of the underlying assumptions that are independent of the implementation, you naturally look for metrics where the models agree. Those metrics can be taken as being reflective of the underlying physics that everyone agrees on. This is clearly not sufficient to prove it ‘true’ in any sense (there may be shared erroneous assumptions), but it clearly must be necessary.

    Think about the converse of your claim – i.e. that disagreement among models makes their results more believable. That is clearly absurd. Therefore, agreement between models should increase the believability of any result. Other reasons to think any result more credible would be clear theoretical and observational support for such an effect and a match between the various predicted amplitudes of any signal.

    You also make errors in assuming that a) ‘hundreds’ of parameters are tweaked to improve the model performances. This is nothing like the case (at most people play with maybe half a dozen); b) that there is only one metric that is useful and therefore only one model that is best. Unfortunately this is woefully simplistic. There are hundreds of interesting metrics, and no one model is the best at all of them. Instead most models are in the top 5 for some and in the bottom 5 for others. You could maybe discard 3 or 4 of them on such an analysis, but not enough to make any difference. The fact that the average of all models is a better match to the observations than any single model, implies that there is some unbiased and random component to their errors.

    Finally, you make a fundamental error in your treatment of model-data discrepancies. Any such comparison is based on multiple assumptions – that the data are valid and represent what they say they do, that the hypothesis being tested is appropriate (i.e. what is the driver of any changes in the model), that there really is a signal that can be extracted from the noise that is not part of the hypothesis and that the model is a valid representation of the real world. Thus a mismatch must perforce drive an examination of all these aspects and prior to that you cannot simply say the models are incorrect. Indeed, climate science is littered with past controversies where it turned out the data were the problem and not the models (CLIMAP, MSU etc.), or where the noise in short period data precluded a significant identification of a signal.

    In the particular case you appear to be alluding to, none of the blogosphere analyses have even looked at the real distribution of the model output on such short time periods and so their claims of dramatic mismatches (based IMO on an underestimation of the impact of intrinsic variability) are very likely to be wrong. This data is available from PCMDI and I would encourage you and others to download it and do this properly. You will find that for such short periods, the models give very varied results depending on (amongst other things) the phase of their tropical oscillations.

    You recently visited a modelling center where I’m sure any of us would have been happy to discuss the philosophy of modelling and how it really works – it’s a shame you did not avail yourself of that opportunity. You are of course welcome to return at any time.

  3. I guess I don’t find the above particularly trenchant because anyone who has been around a race track or tried to predict the stock market has run into the same issues on a much simpler scale – but IMHO it is the same issue. It might be more informative to think about models of any physical process that were strongly relied upon for forecasting and were replaced by new models that were far more successful at forecasting. The question then becomes one of the nature of the changes to the model that result in dramatic improvements in their ability to forecast. I suspect, but clearly don’t know, that the differences in the two sets of models are dramatic – capturing paradigm shifts rather than greater computational precision or knob twiddling.

  4. Gavin,

    Thank you very much for your comments. You make some excellent points, including, though I’m not sure you would say this, agreeing with me on most details; though certainly not in philosophy.

    We agree the models are different. I see that I did say so in the original post; the weather model example was to focus attention on the problem, and I hope we can agree that is a useful example in this sense. I say the GCMs are not very different. I think you are saying they are very different. I say they are similar enough so that they are alike in at least the sense that ensemble weather forecasts are. I believe that last example I gave is relevant, by which I mean that “averaging” the output of them and doing nothing else makes you more certain than you should be.

    I am confused where you say, in your second paragraph, “This is clearly not sufficient to prove it ‘true’ in any sense.” I am puzzled why you have the word true in quotes, for one, and I admit to not being able to follow the rest of the argument. Would you care to re-explain it?

    We disagree about the number of tweaks that can be made. I do claim hundreds. I will concede that all tweaks are not equally important. Every numerical approximation scheme, of which there are many, consists of parameters or choices of one routine over another somewhere in the code, and these are an example of the very small knobs that can be twisted. But that is neither here nor there, and I don’t think we really disagree a lot about this in the sense that we both believe that tweaking parameters can cause changes in the output. And that models are certainly tuned to fit past data. That is all I really want to show here.

    I do not, however, in any way, claim the converse of my original argument, that is, “the more the models disagree, the more we should believe them.” I’m with you 100% here: that is a silly statement. I think, unlike you, and I might be wrong about what you are saying, I do believe that there is one correct model, which would be the one that always predicts the observations perfectly. Of course, we do not know what this model is. But I stand by the philosophical argument I made about “GCM1” being the best and therefore the rest not being independent, etc.

    You’re quite right that I presented a very simplistic example of model output. The actual output of GCMs is hugely multi-dimensional, and one model can be good in one dimension and bad in another, in just the same places where a second model is bad then good. However, while this has importance, it is not a convincing argument that we should ignore the “one” output which even the IPCC routinely highlights, and which is the one I used in my examples.

    Now, you say that “the fact that the average of all models is a better match to the observations than any single model, implies that there is some unbiased and random component to their errors.” Hmmm. It is false, actually, to claim that because the average does better there is “some unbiased” component to the errors. Each model may be biased, but the average can still be better. I hope you will trust my mathematics here, but if not, we can probably work up an example. It is true that there are “random” components to the error, but only if you believe, as I do, that “random” means “unknown.” So if you look at the output, for example, you might not be able to guess why the model went wrong (in a particular place and dimension).
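    Here is a quick numerical version of that example, with made-up forecasts rather than GCM output:

```python
import numpy as np

rng = np.random.default_rng(3)
truth = np.zeros(10_000)                  # 10,000 made-up verification cases

# Three "models", every one of them biased high, each with its own noise.
biases = [1.0, 1.5, 2.0]
models = [truth + b + rng.normal(0.0, 2.0, truth.size) for b in biases]
average = np.mean(models, axis=0)

rmse = lambda f: np.sqrt(np.mean((f - truth) ** 2))
for b, m in zip(biases, models):
    print(f"model with bias {b:+.1f}: RMSE {rmse(m):.2f}")
print(f"average of the three (bias {average.mean():+.1f}): RMSE {rmse(average):.2f}")
# The average is still biased (about +1.5) yet has the lowest RMSE, because
# averaging cancels the independent noise but not the shared tendency to run high.
```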

    Do you mind if I hold off on your claim where you (in paragraph 5) say that I “make a fundamental error in your treatment of model-data discrepancies”? This is, I do agree, a very tricky point to understand, and I think if we went over it in the comments it would get lost. I do promise to answer, however. This is our biggest point of disagreement, and represents a truly fundamental difference of philosophy. To give you a highlight: if a model predicts 7 and the observation is 8, then I say the model is wrong. I mean, in plain English, it is wrong. But I think I understand your (classical) argument, and I want a fuller discussion of it.

    I am appreciative of your generous offer to teach me how modeling really works and so forth. I might just take you up on your offer of another visit.

    Briggs

  5. This brings me to one of the major problems I have with the models. They have different input values of very basic variables, like climate sensitivity, yet they can all be made to fit the observed changes in surface temperature. How can this be? The reason is pretty obvious, actually: the models all did so not because they are all correct (which is impossible) but because they were ~made~ to. Every modeler knew the answer ahead of time. They use “aerosols” and “ocean delay” as highly “adjustable” fudge factors. Natural forcings are also unknown, and can be “adjusted”. The models can match history, not because they are good models (they aren’t) but because they have been ~made~ to do so. On the other hand, if you test the models with measurements other than those they were adjusted to fit, they almost invariably fail miserably, every one of them, to match what we see there. For example:
    http://icecap.us/images/uploads/DOUGLASPAPER.pdf
    If every model agrees, it probably is because they are all doing the same thing wrong.

  6. BTW Gavin, I find your statement that measurements are frequently wrong rather than the models disturbing. Measurements can be wrong, but the reality is that no good scientist assumes that theory is right and measurements wrong. If we go down that road, we get epicycles and aether. MSU turned out to have problems, but contrary to the constant assertions of the thugs at RC, they still haven’t been brought in line with models. David Douglass (one of the authors of the above paper) and I have the same opinion of all of this-we were taught that if your theory does not match the observations, your theory is wrong. Did we get bad educations?

  7. i) Does model agreement imply ‘truth’? Truth is in quotes because neither you nor I can ever define the true state of the climate (or any observed feature within it), and so every statement is about an approximation to the real state of the world. Specifically, take the situation of stratospheric chemistry in the early 1980s. Most of the chemistry involved in ozone depletion was known and all these models agreed that the decline in strat. ozone would be smooth but slow (in the absence of CFC mitigation). They were all wrong. The decline in strat ozone in the Antarctic polar vortex was fast and dramatic. The missing piece was the presence of specific reactions on the polar stratospheric clouds that enhanced by orders of magnitude the processing of the chlorine. Thus the model agreement did demonstrate the (correct) implications of the (known) underlying chemistry, but obviously did not get the outcome right because the key reactions didn’t turn out to be included in any model. Hence agreement, while necessary, is not sufficient.

    ii) models are not tweaked to reproduce all past data. They are tweaked to fit modern climatology and some intrinsic variability (like ENSO). That tweaking does not involve hundreds of parameters (consider how long it would take to do even if someone thought it useful). Read my description of the model development process (http://pubs.giss.nasa.gov/abstracts/2006/Schmidt_etal_1.html ) and see what is actually done. Specifically the models are not tweaked to produce the ‘right’ sensitivity (even if we knew what that was). That emerges from everything else. If we really played with hundreds of parameters don’t you think we’d do a better job?

    iii) IPCC has hundreds of graphs showing different model metrics, and rightly so. Most of the important impacts are in some way tied to the global mean temperature change, and so that is used as a useful shorthand. But don’t confuse iconisation of specific graphs with a real statement about importance. Would you pick the one model that has the best annual mean temperature, or the best seasonal cycle or the best interannual variability, in Europe? in N. America? in Africa? I guarantee no one model is ‘best’ on all those metrics.

    iv) It is not a priori obvious that the mean of multiple models should outperform the best of any individual one. This remains an unexplained but interesting result. The upshot is that you can treat the model ensemble like a random sample to reduce errors. Of course all of the models are biased (as is the mean, but less so) and if I suggested anything different, I apologise.

    v) if the model predicts 7 and the observations say 8, you cannot say anything about the usefulness of the model without a) an understanding of the uncertainties in the model predictions (as a function of the underlying physics, their implementation and the hypothesised driver(s)), b) the uncertainties in the observations, and c) what a naive model would have predicted (so that you can judge whether your model had any skill). For instance, if the model predictions were for 7+/-1 and the obs were 8+/-1 and my naive model (say no change) implied 0, I’d be pretty happy with the prediction. If the uncertainties were an order of magnitude smaller, then there would be a clear discrepancy, but the model would still be skillful compared to the naive prediction. Remember George Box’s admonition – all models are wrong, but some are useful.
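    For concreteness, the comparison with the naive model, using those same illustrative numbers and one common skill score (one minus the ratio of squared errors; other scores exist):

```python
# Model predicts 7, observations say 8, a naive "no change" model says 0.
obs, model, naive = 8.0, 7.0, 0.0

mse_model = (model - obs) ** 2      # 1
mse_naive = (naive - obs) ** 2      # 64
skill = 1.0 - mse_model / mse_naive
print(f"skill relative to 'no change': {skill:.3f}")   # about 0.98: useful despite the miss
```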

  8. Andrew, if you want a serious discussion, don’t start by calling me a thug. The Douglass et al paper is fundamentally flawed as has been pointed out many times (maybe Matt can go through the argument for you – it relates to the uncertainty in the estimate of the mean being compared to any draw from the same distribution). The statement that sometimes the models have been right and the data wrong is factually accurate. Why then must we assume that any mismatch is because the theory is wrong? Instead, I suggest you should maintain an open mind and examine every point at which there might be an error – that includes the models, but it also includes the data and the hypothesis. Anything else is just dumb.

  9. Andrew:
    Gavin is essentially correct. At any instance, the measurement can be accurate or inaccurate and the model can be accurate or inaccurate. The trick is figuring out which cell you are in at any particular point in time and how to get to the sweet spot, i.e., accurate data and accurate model. One of Matt’s points is that just because you appear to be in a sweet spot at one point in time does not mean that you actually are, viz., Gavin’s Ozone example.

    This is a very courteous site and I for one would like to keep it that way.

  10. The problem I have with models is that they are only “Models” of reality. The problem I have with climate models and modelers is that they refuse to declare or define testable cases that would invalidate their pet models.

  11. Gavin, will you take on this challenge from Roger Pielke?

    The test of the dynamical core fits into these evaluations and assessments of the global climate models as prediction tools. As a necessary condition, when configured to run in a multi-decadal predictive mode they should still be used to make short-term global weather predictions in order to assess their skill at simulating the development and movement of major high and low pressure systems, including tropical cyclones. Moreover, they should be run as seasonal weather predictions using inserted sea surface temperatures at the initial time in order to see if they can skillfully predict the development of El Nino and La Nina events, as well as other circulation patterns such as the North Atlantic Oscillation. If they cannot accurately predict these short-term and seasonal weather patterns, they should not be believed to be valid and societally useful prediction tools on the regional (and even the global average) scale decades into the future.

  12. I would normally not jump into this arena, but I think Bob is expecting too much out of these models…

    I like to think that GCMs and Climate similarly compare in how Newtonian Mechanics and Quantum Mechanics/Relativistic Motion fit together.

    Sure, you can model a lot of interactions that humans can see using GCMs and Newtonian mechanics, but for extreme events in either direction the model will never really be “true”.

    In the spirit of education, I would love to find info on any GCMs that “project” the next glaciation. I haven’t had time to research that many models, or maybe the MSM just doesn’t report about them, but are there any places I can find info on models that predict climate that follows the ice core data?

  13. First, I would like to point out that I also fundamentally disagree with the statement “if a model predicts 7 and the observation is 8, then I say the model is wrong. I mean, in plain English, it is wrong.” I think the choice is not binary and that Gavin is correct in that as long as we understand that the model is imperfect, we can still learn useful information from it. Also, as I pointed out in the previous post, models are necessarily approximations, and judging the usefulness of an approximation is often an exercise in judgment and intuition – unfortunately these are not rigorously quantitative, but that is how science actually works. Briggs, you have promised a fuller discussion of this which I eagerly await.

    As far as the model parameter tweaking is concerned, I agree that in a literal sense there are probably hundreds of approximations and parameterizations used in a climate model (although, as previously stated, I have no direct experience in this particular field). However, the point is that many of these are not tuned at all. I would argue that the fact that they are not specifically tuned, and that many different codes making slightly different approximations still yield similar results, means that these are probably adequate assumptions and not terribly pertinent to the models’ output. Obviously, this is not a rigorously true statement (the fact that someone made a good guess at a parameter does not imply that the parameter is unimportant), but it seems to me to be the only way to proceed if we want to even attempt to make such models. I suppose that’s an argument someone could make – that we have no business making models in the first place – but I would strongly disagree.

  14. Wade, then I would submit that by your logic the models are unsuitable for any policy decisions, such as the actions suggested by the Kyoto agreement.

  15. The numbers generated by GCM calculations cannot be shown to converge, where converge is used in the numerical-methods sense: as the sizes of the discrete temporal and spatial increments are refined, the numbers for all dependent variables should uniformly approach constant values at all spatial and temporal locations. That is, the solutions of the discrete approximations (or series expansions) should approach the solutions of the continuous equations.

    Given this situation, the numbers generated by the GCMs are simply numbers and nothing more: numbers that are unrelated to solutions of the continuous equations of the models. These approximate ‘solutions’ to the discrete equations then become the model. The continuous equations are not the model, no matter how many times ‘conservation of mass, momentum, and energy’ is repeated. If the numbers generated by the discrete approximations cannot satisfy the continuous equations, the mass, momentum, and energy of the physical system are not calculated, much less conserved. And that is even granting that the continuous equations exactly describe the physical system; we’re all in agreement that they do not. The model continuous equations do not describe conservation principles of the physical system; they are a model of the physical system.

    The real-world-application order of the discrete approximations is not known. Given that deterioration in some performance metrics is observed as the sizes of the discrete increments are refined, I suspect the order is actually less than one. Algebraic parameterizations, especially those that are functions of the independent variables, can lead to the observed behavior.
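    A minimal illustration of the kind of check being described, on a toy problem rather than a GCM: refine the discrete increment, watch whether the numbers settle down, and estimate the observed order of accuracy from three successive refinements.

```python
import math

# Toy problem with a known exact answer: du/dt = -u, u(0) = 1, so u(1) = exp(-1).
def euler(n):
    """Forward Euler for du/dt = -u on [0, 1] using n equal steps."""
    dt, u = 1.0 / n, 1.0
    for _ in range(n):
        u -= dt * u
    return u

f1, f2, f3 = euler(25), euler(50), euler(100)        # coarse, medium, fine

# Observed order of accuracy from three solutions with refinement ratio r = 2:
# p = log(|f1 - f2| / |f2 - f3|) / log(r). Forward Euler should give p near 1.
p = math.log(abs(f1 - f2) / abs(f2 - f3)) / math.log(2.0)
print(f"solutions: {f1:.6f}, {f2:.6f}, {f3:.6f}   (exact {math.exp(-1):.6f})")
print(f"observed order of accuracy: {p:.2f}")
# If the numbers failed to settle down under refinement, as described above,
# this estimate would be meaningless and the discrete results would not be
# solutions of the continuous equations in any useful sense.
```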

  16. Solution meta-functionals can be calculated by any number of incorrect models and methods ‘that work’. That does measure anything at all relative to Validation and Qualification for the intended applications.

    I would hope that models/codes whose results might impact the health and safety of people all over the planet are based on science much more fundamental and sound than “E pur funzionano” (“and yet they work”).

  17. Gavin, have you fully considered the implications of considering those error estimates? No? Think hard on it, then you will see the problem.

    An anonymous post at RC doesn’t count as a real rebuttal. I believe Christy remarked that it must have been written by someone of “significant inexperience”.

    Bernie, it is okay to consider the possibility that both are wrong, but Gavin, whether he admits it or not, certainly hasn’t and won’t consider the possibility of the models being wrong. This has been going on for quite some time. The search for measurement errors can go on till the cows come home, but sooner or later you’ll have to recognize how improbable it is that the measurements, rather than the models, are wrong.

    BTW, I’m not trying to be rude, and I too would like to maintain polite discourse.

  19. BTW, calling a paper “fundamentally flawed”, um, who’s closed-minded? Just curious.

  19. Gavin, I have a few questions/comments

    When you write:

    1) “There are hundreds of interesting metrics, and no one model is the best at all of them. Instead most models are in the top 5 for some and in the bottom 5 for others.”

    Do modelers/scientists know in advance which model is better at handling/forecasting each specific metric, or is that determined only after they are able to compare the models with new data?

    2) “The models are based on similar underlying assumptions (conservation of energy, momentum, radiative transfer etc.) but which are implemented independently and with different approximations.”

    If models are based on assumptions, who is to say that they are the right assumptions?

  20. Gavin makes the excellent point, attributed to George Box, that “all models are wrong, but some are useful.” The usefulness of models falls into two broad classes: theory and prediction. Theoretical models attempt to map known physical, chemical, and biological relationships. Predictive models (sometimes called “black box”) attempt to make accurate predictions.

    There is a strong tendency to confuse or combine these utilities, and that is true in any modeling (my specialty is forest growth and yield models). Proponents of theoretical models are often adamant that their models are best (a value judgement) and insist that they be used in predictive situations. Predictive modelers, in contrast, may use crude rules of thumb that are unattractive to theoreticians, but predictive modelers emphasize that their goal is accurate prediction.

    Hence the assertion that models are wrong must also be bifurcated. Theoretical models are wrong if the theories behind them are invalid. Predictive models are wrong if they make poor predictions. It is easy (but not useful) to confuse these wrong-itudes.

    The best weather prediction models are more empirical than theoretical. They look at current conditions (fronts, pressure gradients, jet streams, etc.) as they are cadastrally arrayed across the globe, and compare those to past dates when the same or very similar arrays occurred. Then the weather outcomes of the similar past conformations are examined, and used to predict the immediate future weather. Not much theory to that, more of a data mining of the past; hence the descriptor “empirical.”

    Climate models are much more theoretical because we basically lack empirical data about past climate. Some attempts are made to use proxies, sunspot data, Milankovitch cycles, etc., but the data are sparse and time frames vary widely. In general we can predict a decline in temperatures and a return to Ice Age conditions based on fairly good evidence at long time scales, but when and how that slide will occur is imprecise at short time scales. When theoretical GHG “forcings” are included in climate models, empiricism is almost completely absent.

    So we are in a situation where theoretical climate models are being used to make short-term predictions. Further, those predictions have generated some fairly Draconian suggested measures that are extremely distasteful, at least to many people. More taxes, less freedom, “sacrifices”, economic disruptions etc. are being recommended (imposed) based on the predictions of theoretical models. Political “solutions” to fuzzy predictions from “wrong” and improperly classed models are greatly feared, and I think properly so.

    The discourse cannot help but become impolite in this situation. Neither “side” is immune. How much better it would be if we realized that we cannot predict the climate (in the short term) and instead prepared to be adaptable to whatever happens, while preserving (enhancing) as much freedom, justice, and prosperity as we possibly can.

  21. I find it funny that when confronted with the task of declaring benchmarks or testable conditions the “modelers” just become silent. This is an unsupportable position, and I predict sometime in the near future this will come to a head and they will be forced to answer.


  22. Gentlemen,

    I was forced to do the work of my masters yesterday, and will likely have to do so today, so I did not and might not have the time to follow all the comments until tomorrow (except I might tackle Box’s popular but false statement today).

    However, there can be no more use of words like “thug”. If these sorts of things crop up, I will delete them in the future.

    Whether or not GCMs are useful in predicting the future is a matter of fact, and we should be able to decide the question without resort to uncivil language.

    Also, calling a paper or statement “fundamentally flawed” is perfectly reasonable if that paper or statement can so be shown. There is nothing inherently closed minded or ungracious about this.

    Thanks.

    Briggs

  23. oops, the second sentence in the first paragraph of my comment at #17 should read:

    That does not measure anything at all relative to Validation and Qualification for the intended applications.

  24. There is a wide range of knowledge among the posters on this thread and a fair few mistaken statements. I’ll try to address some of the more relevant ones.

    First off, weather prediction models are not empirical searches for similar patterns in the past; instead, they are very similar to climate models in formulation (though usually at higher resolution). The big difference is that they are run using observed initial conditions and try to predict the exact path of the specific weather situation. Climate models are run in boundary condition mode and try to see how the envelope of all weather situations is affected. The actual calculations are very similar and, depending on the configuration, a climate model can do weather forecasts and weather forecasting models can do climate projections.

    Pielke’s suggestion is interesting but not a necessary condition for climate models to be useful. There is no evidence that climate sensitivity or the climatology (for instance) is correlated to performance in weather forecasts. In any case, these tests are being done. The Hadley Centre for instance uses the same model for both weather forecasts and climate.

    However, statements that climate model projections include ‘less freedom’ among their outputs are just ridiculous. There is no ‘politics’ subroutine in these models and I have no idea how the freedom-CO2 feedback would be quantified let alone coded. Confusing a scientific situation (i.e that increasing GHGs lead to warming) with the political decision about what to do about that information is extremely unhelpful. Model outputs do not determine political decisions, politicians do, and if you have a problem with them (as I’m sure we all do) take it to them. Implying that climate models are wrong because some politicians use them to justify political decisions you don’t like is fundamentally unscientific. Radiative transfer does not care who you vote for.

    Back to actual modelling though, Dan’s point, which he makes all the time, is that because models are approximations to the real world and he can’t check every line, they can’t possibly be useful. My slightly tongue-in-cheek reply was a shorthand for explaining (again) that model outputs have been tested on hundreds of ‘out of sample’ cases (paleoclimate LGM, 8.2kyr, mid-Holocene, responses to volcanoes, ENSO, 400+ papers from the PCMDI archive) and found to perform well (if not perfectly) in many of them. If the models were so impossible to make, why do they do so well? That isn’t to say they couldn’t be made better, or clearer, of course they can – but declaring that until they are perfect, they are useless, is a logical leap too far.

    Finally, I’m impressed that Matt thinks that Box’s aphorism is false. It seems self-evidently true – models are models of reality, reality is more complicated than we can ever model, therefore all models will fail to match the real world in some detail and therefore all models must, perforce, be ‘wrong’. That some are useful is shown very clearly by weather forecasting models. QED. What is relevant from this is that focusing on binary issues like right/wrong is not as worthwhile as the less rhetorically satisfying (but more constructive) quantification of the degree of usefulness. But I look forward to the contrary argument.

  25. Gavin, you still have not addressed my posts about suitable test cases. If Roger Pielke’s test case is not acceptable, then you propose one. In the end, if climate models cannot be validated then they are not worth more than a bucket of warm spit.

  26. Read it again Bob. Evaluation of models is going on all the time on ‘out of sample’ data. Where you may be having a problem is in realising that scientific predictions from a model are not limited to what has to happen in the future. They can predict consequences to changes in the past that we might not know about yet, or they can make predictions for things that might be seen in future analyses of current data. But even for future projections, the models have been shown to work pretty well – Hansen’s 1988 runs are a great example. As are the predictions (made before the fact) for the magnitude of the Pinatubo cooling (Hansen et al 1992). These evaluations will be ongoing of course, but you appear to think that we are starting from scratch here. We aren’t.

  27. The main problem I see is that usefulness is tightly related to correctness. The difficulty is with the long term. The farther you look into the future, the more accurate the model must be for it to provide useful data; otherwise, you will get a propagation of errors. When dealing with tenths of a degree Celsius, being out by 10% over 5 years is not so bad, but over 50 or 100 years it becomes a big problem. The absolute amount of the difference between reality and the forecasted value only grows over time.
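    As rough arithmetic, with invented numbers: if the true trend were 0.2 C per decade and a model's trend were off by 10%, the accumulated gap grows in proportion to the lead time.

```python
true_trend, model_trend = 0.20, 0.22      # invented values, degrees C per decade

for years in (5, 50, 100):
    gap = abs(model_trend - true_trend) * years / 10.0
    print(f"after {years:3d} years the accumulated error is about {gap:.2f} C")
# 0.01 C after 5 years is invisible; 0.2 C after a century is not.
```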

    Another key difference is to whom the models are useful. Both scientists and politicians can use the models, but they use them differently. How they get used also determines their usefulness to each group.

    To a scientist, the models can be constantly tweaked and modified as new observations arise. As Gavin suggested, they can also be used to indicate other areas where data may be captured through observation. This can aid in advancing our understanding of the world around us. A scientist can hypothesize that increasing CO2 levels will cause the globe to warm. That seems to fit the data for a recent 20 year period, but not during all times. Of course, it could be that the evidence of the past is not as accurate as required, so more observations are continuously made in an effort to prove the theory. If the hypothesis is wrong, up to several thousand scientists will be affected. The net result is that scientists gain a better understanding of the relationship between CO2 levels and climate, or stock market prices and other influences during a recession.

    Politically though, other factors come into the equation. Outside of funding the science, when it comes to making policies, cause and effect become more important. Politicians must balance the uncertainty of the models against other models, like economic and environmental ones. We have already found that solutions to model-predicted scenarios cause other hardships, like rising food prices. The effects of the predictions themselves have to be realized and evaluated before solutions are determined. Politically, there is much more at stake. If the hypothesis is wrong, billions of people will be affected. The models are just predicting a small piece of the puzzle. Political solutions need much more than that. Politically, the usefulness of the models is overrated. They would have to be much more accurate to be useful to a politician than to a scientist.

    John M Reynolds

  28. A little clarification.

    Blog posts and comments don’t have the advantage of internal peer review and review by ‘upper management’ before thoughts are exposed to the public. Those processes are very useful for filtering out lax, imprecise, or inappropriate use of language. I have very likely used imprecise language in the heat of Blog discussions. As an aside, I would sometimes insert into my reports and papers words and phrases that I knew would get the attention of ‘upper management’ so as to help them focus on something and get them distracted from my main points.

    It seems to me, after a few years of attempting to discuss real issues with the GCM software, that diversion into unstated areas is a tactic that is frequently employed. Gavin simply threw out a phrase that diverts the discussion from useful practices that are SOP for all other software, and these practices are especially applied to software that might affect public policy decisions. Other diversions are ‘it can’t be done for our software’ and ‘it costs too much’, when many know that it is done on a daily basis and that the costs must be weighed against the consequences of inappropriate applications of black-box software by unqualified users.

    Gavin says: ” … but declaring that until they are perfect, they are useless, … ”

    I am aware that perfect is basically unattainable for the class of software that is under consideration. And I am also certain that it is very unlikely that I have ever used that word in this context. I have not insisted on perfection.

    I have probably used worthless, and maybe useless, but I hope that I have used those within the context of my major objections. The main objection that I have is the use of research-grade software to set policies that will affect the health and safety of the public. So far as I know this has never, and I know never is a very long time, before been done. In the case of GCMs, I view them as being based on a process-model approach to a very difficult problem resulting in research-grade software that can in fact be useful as a research tool. I frequently also object to the over-blown mother-hood and apple-pie use of ‘conservation of mass, energy, and momentum’; especially the mass and energy parts. Let’s face it, the fundamental un-approximated forms of the complete continuous equations have yet to be coded and the modified and coded equations have yet to be actually solved.

    I do state explicitly that software that has not had Independent Verification and Validation procedures and processes applied is worthless relative to being used to set such policies. I also require that such software be maintained and released for production applications under approved and audited Software Quality Assurance procedures. There are other equally important aspects of production-grade software that I would also require before release of the code for production use. Qualification of the users, for example, when doing applications that might affect public policy. A more general and detailed discussion is available here. I have specific citations to peer-reviewed publications scattered around several discussion sites and can supply these to anyone interested.

    Inter-comparisons of numbers calculated by different models/codes is the least acceptable method of demonstrating ‘correctness’. The peer-reviewed literature on engineering and scientific software Verification and Validation explicitly discusses the many issues associated with this approach. Its application can provide guidance, but even then only under very limiting conditions. It is considered to be one of the seven deadly sins of software verification and validation. To propose use of this method as a best defense doesn’t buy any points in the software Verification and Validation community.

  29. Additional, more explicit, clarification.

    I just noticed that in his introduction Gavin says, “… and a fair few mistaken statements.” And then, when discussing my comment, says, “… but declaring that until they are perfect, they are useless, is a logical leap too far.”

    I did not make that statement. Gavin did.

  30. An interesting discussion. I too am looking forward to Matt’s discussion of model correctness versus usefulness. Here is a few cents of input on the subject.

    To say that a model is “correct” is to say that it offers up skillful predictions, where “skillful” means improvement over a naive baseline. The definitions of “improvement” (in the context of uncertainties and ignorance) and “naive baseline” (in the context of trends, and well-known relationships) are, as we have seen, contested.

    To say that a model is “useful” opens up an entirely different set of complications. One definition is that the predictions from the model shape the treatment of alternative possible courses of action before a decision maker. This could include enlarging the set of possible options or reducing that set.

    But in a wide range of contexts there is no necessary relationship between correctness and usefulness, which might be a surprising claim to some. Consider that I may come up with an astrologically-based model (i.e., grounded in myth) that turns out to accurately predict the winner of the US presidential election this year, and based on that model I bet the farm on the outcome. Surely I will have judged that model “useful” as I count my winnings. Of course, an astrologically-based model will be far less useful in a long game of poker. But that is the point. The decision context matters a great deal.

    We discuss much of these complexities in the following book chapters, for those interested in a bit more detail:

    Pielke, Jr., R. A., 2000: Policy Responses to the 1997/1998 El Niño: Implications for Forecast Value and the Future of Climate Services. Chapter 7 in S. Changnon (ed.), The 1997/1998 El Niño in the United States. Oxford University Press: New York. 172-196.
    http://sciencepolicy.colorado.edu/admin/publication_files/2000.08.pdf

    Pielke, Jr., R.A., 2003: The role of models in prediction for decision, Chapter 7, pp. 113-137 in C. Canham and W. Lauenroth (eds.), Understanding Ecosystems: The Role of Quantitative Models in Observations, Synthesis, and Prediction, Princeton University Press, Princeton, N.J.
    http://sciencepolicy.colorado.edu/admin/publication_files/2001.12.pdf

    Pielke Jr., R. A., D. Sarewitz and R. Byerly Jr., 2000: Decision Making and the Future of Nature: Understanding and Using Predictions. Chapter 18 in Sarewitz, D., R. A. Pielke Jr., and R. Byerly Jr., (eds.), Prediction: Science Decision Making and the Future of Nature. Island press: Washington, DC.
    http://sciencepolicy.colorado.edu/admin/publication_files/resource-73-2000.06.pdf

  31. The Fleet Numerical Meteorology and Oceanography Center (or FNMOC), known prior to 1995 as the Fleet Numerical Oceanography Center (FNOC), is a meteorological and oceanographic center located in Monterey, California. A United States Navy facility, it prepares worldwide weather and oceanographic forecasts every six hours, which are made available to the public by the National Oceanic and Atmospheric Administration. Meteorological observations use an EMPIRICAL atmospheric data base which is queried for every weather prediction. Current methodologies have evolved from the global, primitive-equation model (GPEM) which used a staggered, spherical, sigma-coordinate system with real input data interpolated to the sigma surfaces to a constant feedback loop system using REAL DATA crunched in state-of-the-art silicon graphics super computers, enabling even higher-resolution meteorological and oceanographic products that are the BEST weather predictions in the world.

  32. re 27. Gavin, I still can’t get to the CMIP data. Any word on when they will open it up more?

    One thing would be really easy: ModelE output of GMST from
    1850 to 2000. That’s simple enough, a vector of 150 numbers. Just start with the simple stuff; when CMIP opens up a bit more the guys can amuse themselves with that and I’ll stop bugging you about it. Oh, Leif would like an answer to the question he asked you about eccentricity; he asked you over at Tamino.

  33. It is with trepidation that I enter this discussion.

    Am I right that Gavin is a stalwart of RealClimate?

    If so there is an interesting roundtable at The Bulletin Online,
    http://www.thebulletin.org/roundtable/uncertainty-in-climate-modelling/
    which Gavin kicks off.

    He mentions that there are 20 or so climate groups around the world developing climate models and that each group makes different assumptions about the physics to include and the parameterizations. He then goes on to say of the models –

    “Thus while they are different, they are not independent in any strict statistical sense”

    Isn’t this what William was saying in his blog?

  34. I believe the 1988 Hansen runs were discussed on CA and shown not to be valid in a strict sense.

  35. I do not wish to seem hard-nosed about it, but the statement that “a climate model can do weather forecasts and weather forecasting models can do climate projections” is simply factually incorrect. If this discussion is about models, we should be clear about what models we are talking about.

  36. My 2c, for grins: IMHO, the GCMs are just big overfitted models. Who was the famous statistician who said he could fit an elephant with 3 or 4 parameters? If you have “about” 6 tuneable parameters (as Gavin says GCMs do), I suppose you could then make the elephant dance and blow bubbles. I also note that there is a lot of disagreement about even the SIGN of the “feedbacks” that are incorporated into these models. They may ALL be using the wrong methods to get the “right” hindcasting. If so, they can’t be trusted for future predictions.

  37. None of the GHG forcings in Hansen’s scenarios A,B or C are valid therefore I consider it not to be a proper test case.

  38. Personally I would have preferred to see Gavin continue his discussion with Matt. Alas, the explicit and implicit animosity towards Gavin makes it unlikely he will think it worth the trouble. It is very unfortunate.

  39. 1. Why is 100% of the model(s) code used by the IPCC not available in the public domain?

    2. Has any of the model(s) code or data sets been updated since the last IPCC report? – If so, it invalidates all previous model runs for that model.

    3. Can the computer models fill in the blanks for what is not known in the real world? – If not then they are irrelevant for prediction unless everything in them is 100% understood and correct.

    There is no “close enough” with computer results. Computer results can only be right or wrong. And all the models are wrong.

  40. Dan,

    re# 35: Thanks. For some reason that bit doesn’t show up in my browser.

  41. All,

    Again I apologize for my delay (well, I suppose it only seems like a delay, given the speed at which things are expected these days; sometimes this speed causes us, me at least, to say things we later regret). I am very happy with the comments; many useful points are being made.

    I will hold off on my discussion of “what is a model and what makes a good one” until later, because the point I wish to make is philosophical, quite general, applies to models other than just GCMs, and I think it would be distracting here. For now, I will concede that “GCMs can be useful” and I’ll leave ‘useful’ vague. I am here only interested in the predictive ability of GCMs, though I of course agree that these models are useful in their explanatory power (again, leaving ‘useful’ vague). I suppose we can say, with Gavin, “E pur funzionano… tranne quando non” (“and yet they work… except when they don’t”) and leave it at that.

    Here I do want to make one major point that is getting a little buried. First some details.

    1) I am willing to weaken my argument and say that the number of knobs is small, even just one if you like (in general, the more knobs, the more the models might offer independent evidence). But this (aggregate) ‘knob’ is still tuned so that the model fits past data. Gavin is correct to point out that ‘past data’ does not mean ‘all past data’, which I accept. Still, the models are tuned to fit some past data, which, as I said before, is a necessary but not sufficient condition to ‘usefully’ predict future data. (By ‘past data’ I mean ‘data not in the future’, so this includes today’s known outcomes.)

    2) I think we all agree that different models can do better at predicting outcomes in various dimensions, and so might offer less dependent evidence in some of those dimensions. One of these dimensions might be, say, the height of a certain pressure surface at a given latitude and longitude. This dimension may be rich with interest and provide many deep insights to climatologists. However, it has no direct interest to those who are making decisions based on GCM output. Those dimensions of interest are small in number, and though my main point is in no way dependent on this, this is my reason for choosing the one dimension which is on everybody’s mind, global mean temperature (GMT).

    3) I hope we can all agree that if the GCMs cannot usefully predict future data, then there is something, who knows what, wrong with them (this statement is of course still dependent on what ‘useful’ means). For a crude and unrealistic example: if all GCMs predict a GMT greater than 20 for next year but the observation is 15, then something has gone wrong, and thus the raw GCM output should not have been used to make decisions.

    4) To say we can never “define the true state of the climate” is probably true because of the enormous multidimensionality of the problem, but to say we can never define “any observed feature within it” is false, even obviously false. If we cannot identify any feature, then nobody would ever attempt to write a GCM, because how could you ever tell if the thing worked? The model has to have some concordance, however you want to measure that, with actual climatological features. I do not think anybody claims that the measurement error in observations is so large that we can never use the observations to corroborate GCM output. Still, measurement error must be accounted for, and I am very happy to have this better known: any increase in measurement error should increase our uncertainty that we are making useful statements about the climate. But for the purposes of this discussion, let us assume that measurement error is negligible for the dimensions of interest, specifically GMT (if it is not, then my point is still valid, but what follows becomes more complicated).

    5) The cartoon graph I drew above is important, because it somewhat mimics the actual state of affairs with respect to our dimension of interest. We want to find some way of taking the different outputs and combine them to make a probability forecast for the future observable. We cannot just take the raw average, because that is still just a point estimate, nor, as it has been correctly argued, is there any a priori reason to assume the average is the best method of combining. We cannot also just use the raw output and form a probability forecast, in the manner that produced the black curve. I’ll repeat: suppose all the GCMs gave the same forecast for next year. Would we be 100% sure that the actual temperature will equal that forecast? Obviously not. The width of that black line needs to be widened to match the actual uncertainty. The analogy with weather forecasts is apt, because the techniques to treat output like our dimension of interest would be the same. What happens is that each model is corrected separately for bias (which may differ from model to model, or even not exist in some models), and then these corrected versions are “added” together in such a fashion that the width of the probability distribution of the combination is wider than it would be using just the raw output. I do not claim that the methods currently used in weather models are the best statistical models, merely that they have shown some utility. These methods, or whatever new ones arise, are just what are needed to quantify the amount of independent information each GCM offers. (The comments I made about “the most useful model or groups of models” are still true, but I can see that these are a distraction here: these arguments, too, are very general and apply to models of any kind, but you needn’t take my word for it here.)
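
    As a toy illustration of the combination just described (all numbers, model names and hindcasts below are invented, and this is only a minimal sketch of the bias-correct-then-dress idea, not the method actually used for operational weather or climate ensembles):

        import numpy as np

        rng = np.random.default_rng(0)

        # Illustrative past observations (say, ten years of GMT anomalies) -- invented.
        obs = np.array([0.10, 0.18, 0.05, 0.22, 0.30, 0.25, 0.35, 0.28, 0.40, 0.38])

        # Invented hindcasts from three hypothetical models, plus their raw forecasts
        # for next year.  None of these numbers come from real GCMs.
        hindcasts = {
            "model_A": obs + 0.15 + rng.normal(0, 0.05, obs.size),  # warm bias
            "model_B": obs - 0.10 + rng.normal(0, 0.08, obs.size),  # cool bias
            "model_C": obs + rng.normal(0, 0.10, obs.size),         # unbiased, noisier
        }
        raw_forecast = {"model_A": 0.60, "model_B": 0.35, "model_C": 0.48}

        corrected, samples = {}, []
        for name, hc in hindcasts.items():
            errors = obs - hc                        # this model's past errors
            bias = errors.mean()                     # systematic offset
            sigma = errors.std(ddof=1)               # spread of the remaining error
            corrected[name] = raw_forecast[name] + bias
            # "Dress" the corrected point forecast with that model's own error spread.
            samples.append(corrected[name] + rng.normal(0, sigma, 5000))

        pooled = np.concatenate(samples)
        lo, hi = np.percentile(pooled, [5, 95])
        for name, value in corrected.items():
            print(f"bias-corrected forecast, {name}: {value:.2f}")
        print(f"combined 90% forecast interval: {lo:.2f} to {hi:.2f}")
        # The pooled interval is wider than the scatter of the corrected point
        # forecasts alone, which is the sense in which the black curve gets widened.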

    6) Which brings me to the main point, which I still hold is true: we are too certain of the forecasts made by GCMs, either singly or in aggregate. I have seen little or no evidence that they have skill (see Pielke’s definition above). And GCMs are almost certainly not calibrated (I cannot prove this, but I have not seen much in the literature suggesting that this well-known criterion has been routinely considered for GCMs). For example, the (combined) output might claim that there is 90% certainty that the future GMT will be between 14.5 and 15.5 degrees. It is my guess that statements like these will be found to be true only around 40% of the time, which, to avoid any math, is another way of saying the forecasts are overconfident. In order for any forecast (GCM or not) to be useful in making decisions, it has to accurately quantify the uncertainty of the future observable. If it does not, then decisions made using this forecast will be sub-optimal at the least, or just plain wrong at the worst. And given the, let us say, enthusiasm to make decisions as quickly as possible re: global warming, my concern is that we are in danger of making many sub-optimal or wrong decisions.
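
    A minimal sketch of the calibration check implied here, on invented numbers (the ten interval forecasts and outcomes below are made up purely to show the bookkeeping):

        import numpy as np

        # Invented example: ten past forecasts, each issued as a "90%" interval
        # (lower, upper), together with the value later observed.
        intervals = np.array([
            (14.5, 15.5), (14.6, 15.4), (14.4, 15.2), (14.7, 15.5), (14.5, 15.3),
            (14.6, 15.6), (14.3, 15.1), (14.8, 15.6), (14.5, 15.5), (14.6, 15.4),
        ])
        observed = np.array([15.1, 15.1, 15.6, 14.2, 15.0, 14.1, 15.3, 14.9, 15.9, 14.3])

        inside = (observed >= intervals[:, 0]) & (observed <= intervals[:, 1])
        coverage = inside.mean()

        print("claimed coverage:   90%")
        print(f"empirical coverage: {coverage:.0%}")
        if coverage < 0.90:
            print("the intervals verify less often than claimed, i.e. the forecasts are overconfident")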

    I do promise to answer the questions of “model goodness” soon (‘soon’ as in ‘real life soon’ not ‘blogosphere soon’).

    Briggs

  42. If we are too certain of the forecasts made by GCMs, such that we are in danger of making many sub-optimal or wrong decisions, then we are back to the usefulness topic. Trying to avoid that area, I would like to add some thoughts as to the accuracy, how it is measured, and the effects of the base assumptions. This is the best way I can figure to argue the usefulness without having to define the word useful.

    Not enough past years are being matched accurately enough. We may not need to go back millennia, but we should be able to approximate the temperatures going back to the late 1800s. I say approximate since many of the temperature data are poor, especially from the oceans, due to difficulties in collecting them. Assuming that the data are close enough, how far into the future can we then predict? The longer back we match the data in the past, the more I would like to think that we can go a similar distance predicting into the future, but I fear that the propagation of error would be too great beyond the 20-year mark. We will end up with forecasts that are as accurate as the data collecting was 100 years ago. In other words, it will be easy to be off by half a degree for large areas of the globe.

    Models have to be proven to be accurate. That takes time. So far the GCMs have been shown to be inaccurate, so they are modified with each iteration of newly acquired yearly average data in an effort to match the new data. Each IPCC report has shown large differences in its forecasts/what-if scenarios due to these modifications to the models. We can evaluate which models have been most accurate. Have none been accurate at all over the past 20 years, or have they all required significant modifications? Then again, does it automatically follow that the models that were the most accurate for the past 5 years will continue to be when predicting a system as volatile as climate? Sometimes the small modifications to the models are not enough. That they have not taken into account the PDO and AMO would mean that they could be inaccurate for a decade or so. This would require us to trust the models even though they are not currently accurate. Sometimes something new is found or created that invalidates large chunks of the models.

    There are the assumptions upon which the models are based. In all models things change. Life happens. As people adapt to the systems that are put in place, by passing new laws for example, the assumptions of the model have to change. The forest yield models will change if a new predator like the pine beetle moves in. Another example: if laws prevent logging in a nearby forest, dead material builds up and allows a massive forest fire to spread much farther than it would have if the law had not been put in place, wiping out the area that was to be logged. The assumptions as to the best logging and forest fire techniques will have to change to match the new conditions. The climatology branch of science is so young. The initial guesses are being investigated and found lacking. The new data from the Aqua satellite is going to have a profound impact on the models.

    I submit that the climatology field is just too young to be relied upon to make political decisions. The models can’t be accurate enough until we learn more from all sides of the debate.

    John M Reynolds

  43. Dan Hughes said…

    “I do state explicitly that software that has not had Independent Verification and Validation procedures and processes applied is worthless relative to being used to set such policies. I also require that such software be maintained and released for production applications under approved and audited Software Quality Assurance procedures. There are other equally important aspects of production-grade software that I would also require before release of the code for production use. Qualification of the users, for example, when doing applications that might affect public policy. A more general and detailed discussion is available here. I have specific citations to peer-reviewed publications scattered around several discussion sites and can supply these to anyone interested.”

    Dan, of course, has hit the nail on the head here. The problem I have with codes like the GISS model E is that many of these codes (in particular model E) are poorly documented. If you go the GISS website, there are no documents which tell you basic things like what differential equations are being solved, what boundary conditions, how they are discretized, what numerical procedures/algorithms are being employed, stability and error analyses etc. And if you look at many of the FORTRAN subroutines in the listings provided, they are very poorly commented. This is entirely ** unacceptable ** for a code which is being used to shape public policy decisions, as there is no way anyone can do an independent verification of the procedures and algorithms embodied in the software.

    Until the GISS and others get serious about documenting model E and similar GCMs and submitting these codes to independent verification and validation procedures, I will find it hard to take the results they produce very seriously…

    PS
    For an example of good documentation, go here:

    http://www.ccsm.ucar.edu/models/atm-cam/docs/description/

  44. Contrary to our wishes, climate is not just a boundary-condition matter. It is also, at least for the time range we are dealing with, an initial-condition affair.
    Initial assumptions of current climate models are, therefore, wrong.

  45. Wow. The most recent above remarks by Briggs and Reynolds are so well-stated there is nothing to add. Thank you, gentlemen.

  46. The above criticisms of the models are far too loose IMHO. If I had spent 10 years developing a model I would see these comments as largely gratuitous with a lot of hand-waving.
    John Reynolds indicates the models are not accurate without quantifying how inaccurate they are or at a minimum providing a citation. That most models have not precisely predicted the recent global temperature trend may be true but what is the level of accuracy/inaccuracy and how is it to be measured? I don’t disagree with the overall sense that the climate models are imprecise – but my guess is just that – an intuition. I was kind of hoping for a more explicit exposition on how one evaluates the “usefulness”, “accuracy”, “validity” of these models.
    This is surely an implicit promise once Matt’s initial epistemological point is acknowledged – without knowing how to evaluate “accuracy” we have no way forward. Dr. Pielke’s challenge is one approach, though Gavin seems to have his doubts. Can someone put us back on track?

    I was kind of hoping for a more explicit exposition on how one evaluates the “usefulness”, “accuracy”, “validity” of these models.

    Bernie: How one evaluates all these depends on what you want a tool to do.

    With regard to climate science (or any field), models are tools used to do something. But everyone has different goals.

    One of the difficulties we see is that the IPCC projections, and the way the projections are disseminated, suggest that the IPCC document authors believe their models and methods can be used to predict a number of features of great interest to voters and policy makers. (And yes, I use the word predict, because dictionaries recognize these as synonyms.)

    One of the major features highlighted in IPCC document projections is GMST (global mean surface temperature).

    Matt seems to be discussing utility, validity and accuracy in predicting the metrics the IPCC actually discussed in detail in the published documents.

    Gavin seems to be discussing the general sorts of verifications undertaken by modelers to estimate the utility and validity for other features. He is also discussing the diagnostics associated with tracking down where things went wrong. If you are trying to improve an AOGCM, it is important to know whether the mismatch between what happened on the real earth and what happened in the model earth is due to applying incorrect forcings over time, or to a problem with some sort of parameterization in the model, or to other features.

    But it’s not entirely clear that matters as much when one simply observes that the final result of the modeling process overall is not skillful. That process begins with estimating the forcing and ends with graphs and tables predicting (or projecting) temperatures. Either one can come up with useful projections or one cannot. Either one can come up with meaningful uncertainty intervals or not.

  48. All,

    Lucia, your summary is spot on. John Reynolds’s comments are also pertinent to verification questions, and well stated. The technical comments about code accuracy and efficiency by Dan Hughes and Frank K. give an idea why verification can fail for GCMs (‘fail’ in the sense of the models not being ‘useful’).

    Bernie, I do apologize for the lack of precision. You might have been looking for a statement like “The CSIRO GCM, with respect to GMT, is miscalibrated at the 90% level at 38.8%” and so forth. At any rate, something meatier about what exactly is wrong with each or any of the models. There is some of this sort of thing in the literature, but, actually, there is shockingly little.

    Although this post is about multiple models, suppose there is just one model: pick a model from Hansen’s GISS group, so you know you are getting one of the better ones (I am not being sarcastic). Look at the forecast from that model for next year’s GMT. It will be X degrees. Do you, or does anybody, have 100% confidence that the temperature will be exactly X degrees? If not, then you have my point: simply stating “X degrees” is misleading because it does not give any indication of uncertainty, and therefore of usefulness. Of course, nobody ever does believe “X degrees” but many act as if the uncertainty that does exist is trivial.

    Ok, what is ‘useful’? If you want to see some (very) technical material on this, click over to my resume page and look for any of my papers on skill, verification, scoring, measurement error, or ROC curves. And go to my friend Tilmann Gneiting’s page and look at almost any of his recent papers. None of his or my papers is easy going; they are all exceedingly mathematical and make no attempt to explain things to a general audience. Still, what ‘useful’ means is described in great detail. I will be writing a summary of all this material, and going further, too, in talking about what a model is: this is where I will defend my claim that George Box’s “all models…” statement is false. I have a bit of that in my paper on Broccoli and Splenetic Fever, if you want a head start.

    Briggs

  49. Both Briggs and Lucia have declared that the GCMs are not skillful and that they routinely overpredict changes. Given that neither has done any analysis of what the models actually show, I’m a little puzzled as to where this certainty is coming from. In fact, they are not correct – the models are skillful compared to a naive assumption of no change and they are useful for certain metrics.

    Possibly they are being led astray by the thought experiment examples they have brought up. First off, the climate can be thought of as having a component that is changing due to an external forcing a(F) – where F is forcing, and a is the (uncertain) function that calculates the change that would occur because of F.
    But there is another component – e – the internal variability – which is chaotic and depends on exactly what the weather is doing. The atmospheric component of ‘e’ is only predictable over a few days, while for the ocean part, there might be some predictability for a few months to a few years (depending on where you are).

    So for any climate metric (whether it’s GM SAT, or the heat content of the oceans, or the temperature of the lower stratosphere), its evolution is:

    C(t) = a(F(t), F(t-1) …) + e(t)

    (i.e. the temperature depends on the history of the forcing and a stochastic component – which itself depends on the past trajectory). Climate models (as they were used in AR4) only claim skill in the first, forced, part of the equation. Given that ‘e’ is not zero, the usefulness of the model for any one metric depends on the relative magnitude of ‘a’ and ‘e’. If the forced part is 10 times the size of the stochastic part, then it would be useful to have a good estimate of ‘a’. If it was the other way around, then your estimate of ‘a’ may well be very good, but it wouldn’t be useful.

    In practice, climate models cannot estimate ‘a’ without having some realisation of ‘e’ as well. Since these models do not assimilate real-world data as part of their simulations, the model ‘e’ will be completely uncorrelated with the real-world ‘e’. Different realisations of the simulation will have independent ‘e’ as well, and since all these ‘e’ will be uncorrelated, averaging the results together will give a better estimate of ‘a’ – this is what is generally referred to as the projection.
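
    A toy illustration of that averaging argument, with everything synthetic (the forced signal, the AR(1) ‘weather’ noise and its parameters are invented for the purpose):

        import numpy as np

        rng = np.random.default_rng(1)
        years = np.arange(100)
        a = 0.02 * years                     # prescribed forced signal, 0.02 per year

        def red_noise(n, phi=0.6, sd=0.15):
            """AR(1) 'weather' noise, drawn independently for each realisation."""
            e = np.zeros(n)
            for t in range(1, n):
                e[t] = phi * e[t - 1] + rng.normal(0, sd)
            return e

        for n_members in (1, 5, 20, 100):
            runs = np.array([a + red_noise(years.size) for _ in range(n_members)])
            ens_mean = runs.mean(axis=0)
            rms = np.sqrt(np.mean((ens_mean - a) ** 2))  # distance of the mean from 'a'
            print(f"{n_members:3d} members: RMS distance of ensemble mean from 'a' = {rms:.3f}")
        # The leftover noise in the ensemble mean shrinks roughly like 1/sqrt(N),
        # which is why averaging uncorrelated realisations isolates the forced part.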

    Despite Lucia’s insistence, there is a difference between a prediction, a projection and a forecast in these contexts. Predictions are a very wide class of statements that provide an estimate of some quantity dependent on a very strict set of conditions. They are the backbone of the scientific method, but they aren’t restricted to statements about the future. Projections (in the IPCC sense) are a subclass of predictions that include, as one of the conditions, a particular scenario of future forcings, i.e. if a certain scenario comes to pass, then the model will predict x. A forecast (such as for the weather) can be thought of as a prediction that is not dependent on any scenario.

    Projections try to estimate ‘a’, forecasts try and estimate ‘a + e’. Since these are indeed different things, it makes sense to call them different names (whether this is reflected in the OED or not).

    Now, to the examples Matt used. The GMST in any one year is a great example of a metric where the ‘e’ is much larger in amplitude than the ‘a’ we expect from increasing GHGs. Therefore a good estimate of ‘a’ is not much use in estimating GMST in any one year. You need instead a forecast model that attempts to track ‘e’ (try Smith et al, 2007 for an attempt at that). Given that the forecast of the stochastic part will obviously get worse in time, when might the projections start to be more useful? This depends a little on the structure of ‘e’ – but in general, projections will start to be more useful on the 10-15 year timescale (the time needed for ‘a’ to be comparable to ‘e’).

    An examination of the output of the models will reveal that for shorter periods the spread of the model results and their trends will be wide and not particularly useful.

    For other forcings (say volcanic), or for another metric (say, stratospheric temperature), the relative magnitudes of ‘e’ and ‘a’ will be different and different timescales will be needed for ‘usefulness’. For volcanoes, ‘a’ is so large that you can see it in a year. For stratospheric temperatures, ‘e’ is small and so you don’t need many years before you see a signal. I pick these two metrics because they are well known examples of where the models show skill. There are many more. One can always find a new metric where ‘e’ is larger and show that for any particular short period, the model estimate of ‘a’ is not useful. That is not the same as saying the model itself is not useful in general.

    It is therefore incumbent on people who are judging ‘usefulness’ to be very careful about the uncertainty due to stochastic processes as well as the uncertainty in the forced component.

  50. Gavin, your explanation above is somewhat confusing. In your equation

    C(t) = a(F(t), F(t-1), …) + e(t)

    I assume the variable t is time. What is meant by (t-1)?

    But more to the heart of the matter, this model of climate metric temporal variations is simplistic at best (and perhaps misleading at worst). All of the climate models I’ve examined (including model E, sans documentation) solve a coupled system of ** non-linear ** partial differential equations for the temporal and spatial evolution of pressure, temperature, moisture, etc. I rather doubt that any realistic approximations of functions a(F,…) lead to robust solutions of C(t) except with simplistic forcing models and well-defined, controlled boundary conditions, many of which are evolving with the solution. So by tuning the boundary conditions and forcing/source terms, you can get any solution you want (or avoid solutions you don’t want).

    Hindcasting climate by tuning a(F) is certainly achievable, as has been demonstrated in numerous papers. Forecasting with any skill is, however, another matter…

  51. Gavin:
    You note:
    “Given that ‘e’ is not zero, the usefulness of the model for any one metric depends on the relative magnitude of ‘a’ and ‘e’.”

    This makes perfect sense, but doesn’t the usefulness of the models also depend upon the presumption of the stability of “a” – or else the whole model is reduced to your e(t) term? And isn’t the stability of “a” dependent upon the size and magnitude of all the + and – feedback terms in the model?

    Your points on volcanoes make good sense – a citation on the skill in projecting stratospheric temperatures would help me. The 10 to 15 year period for GMST is interesting, since that seems to me to in part address Matt’s confidence interval around any GMST projection. What would be the period for ocean temperatures? Would it be appreciably shorter?

  52. “… the models are skillful compared to a naive assumption of no change …” — Gavin

    You are creating a strawman argument. No one is claiming that there will be no change. Instead, we are complaining that the IPCC scenario A has been exceeded with respect to the level of increase in the CO2 concentration. Scenario A was the do-nothing approach. China’s emissions have increased significantly; meanwhile, the temperatures do not match. They are at or below the bottom end of scenario B. Indeed, the real difficulty is that the trend has reversed, contrary to the assumptions the models began with. That is not for a single year, as Gavin suggests; this has been over the past decade. The trend lines have been flattening even with the dearth of volcanic eruptions of the past decade.

    John M Reynolds

  53. Let me use the word Verification in the sense that the numbers generated by the IPCC “what-if” scenarios should be required to be correct. Let’s say that “should be required to be correct” means something along the following lines.

    Processes and procedures having defined and measurable objective evaluation criteria and associated success metrics have been applied to the coding, numerical solution methods, all data used in a calculation, and all user-defined options for a given calculation. And all the aspects tested by these procedures and processes have successfully met the success metrics. This is a good working definition for Verification for this discussion.

    I have the following questions of Gavin and anyone else who cares to express an opinion.

    Verification of the IPCC “what-if” scenarios is necessary.
    Verification of the IPCC “what-if” scenarios is not necessary.
    Verification of the IPCC “what-if” scenarios is necessary before mitigation strategies are implemented.
    Verification of the IPCC “what-if” scenarios is not necessary before mitigation strategies are implemented.

    The sole issue is Verification of the numbers produced by a calculation. Note that I have said nothing about Validation relative to physical reality.

    Footnote
    Let’s avoid extrapolation of the sense of the word to meaningless usages of “perfection” and “seeing every line of coding” and “models are impossible to make”.

  54. Gavin: please provide a ref for the assertion that the Douglass et al paper has been summarily rebutted. There has hardly been time for a published rebuttal.

  55. Granting that the models have SOME skill does not mean that they have ENOUGH skill on which to base a complete change in our economic system (ok, turn in your car, you have to take the bus now). If my system is tossing a coin, my null model is .5 and I can specify a priori how many tosses I need to tell if the coin is a trick coin or not. For a regression problem, I can summarize the goodness of fit with R2. What is the degree of fit of the models to what data? I can’t seem to find anything except statements that it is “good”, but the authors of the paper get to decide what they think is good. Other times, “good” means simply that the models agree, which brings us back to Briggs’ point above. In a detailed analysis of Hansen’s 1988 model projections on the globe at Climate Audit, it turned out that regional projections were not so good (oh, right, only GMT is valid…).
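
    A minimal sketch of the coin-toss calculation mentioned above, using the usual normal approximation (the alternative bias of 0.6 and the 5% size / 80% power targets are illustrative choices, not anything taken from the post):

        from statistics import NormalDist
        from math import sqrt, ceil

        p0, p1 = 0.5, 0.6           # null (fair coin) vs. the bias we want to detect
        alpha, power = 0.05, 0.80   # two-sided test size and desired power

        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)

        # Normal-approximation sample size for a one-proportion test.
        n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
        print(f"about {ceil(n)} tosses are needed to detect p = {p1} "
              f"at alpha = {alpha} with {power:.0%} power")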

  56. The F(t), F(t-1)…. formulation was just intended to indicate a dependence on the history of the forcing (as opposed to the instantaneous value).

    Whether ‘a’ is stable or not is undetermined. We know that it is a function of the current climate and current feedbacks. In a much warmer or cooler world that might well be different. But for relatively short periods I think we can assume that it doesn’t vary dramatically – it would be very difficult to tell on a practical basis in any case.

    For matches to stratospheric temperatures, look at Hansen et al 2001 for instance, but note that stratospheric cooling was predicted by radiative transfer models back in the 1960’s.

    Craig, patience grasshopper, patience. But actually you don’t need to wait – just think about it for a bit. It really is that bad an error.

    PS. The original Italian was a reference to Galileo – “and yet it moves” (Eppur si muove) – when he was forced to recant the Copernican worldview.

  57. Gavin says: “Craig, patience grasshopper, patience.” If it is such an easy error, surely such a superior scientist can easily refer me to a reference.

    I suggest that the paper:
    P.J. Gleckler, K.E. Taylor and C. Doutriaux, 2008. Performance metrics for climate models. J. Geophysical Res. 113:D06104
    is relevant to this discussion. Fig 11 for example shows pretty good skill (low RMS error) for surface temperature (past 20 yrs) but pretty bad for precipitation.

  58. I’m not entirely sure what “verification” means (even if the “numbers” are correct, going to Dan’s inquiry). A model which correctly projects the temperature from 2020-2050 could be perfect, or it could project the right number only by the luck that it overestimated radiative forcing and underestimated climate sensitivity, and those errors cancelled. Back to the title of this post, multiple climate model agreement IS exciting because in some sense we can take the averages of these results, and it is most certainly wrong that all these models are the “same thing” even if the basic physics is the same. I don’t do modelling work, but it seems to me that it is the degree to which the model closely models reality that is a measure of its “usefulness.” But which reality is useful? Models can get the decadal temperature change, the Pinatubo eruption, the stratospheric cooling, the ocean heat content, the radiative imbalance, polar amplification, and many other things well. Models do not yet do a very good job of El Nino, the MJO, etc. Radiative forcing is known very well, but differences in feedbacks contribute ~3x more to the range in climate sensitivity, much of it having to do with uncertainties in clouds. Or, for those on the blogosphere who make “uncertainty” and “no action” synonymous, maybe climate sensitivity itself is not a number but a probability distribution (Roe and Baker?). How do I treat uncertainties and/or errors in model outputs (which I’ll say is the model producing some aspect of reality which is outside the error range of what actually happens)? And how big is the error range for useful practical application – Hillary will win the next presidential election, +/- 50%. Whoopee. Back to Gavin’s example, if I say 8 +/- 8 and the result is 17, is that “useful”?

    For policy application, I’m guessing the people who make decisions and are very interested in basing policy off of GCMs would be more interested in some errors than others. For example, if I was going to look at a model for its “usefulness,” and how much CO2 my factory emitted was based on how much I thought that model could predict future change, I’d be a bit more interested in its ability to get temperature change right over the last century than in its inability to simulate the MJO. I’d be interested in the fact that Meehl et al 2004 could just about perfectly simulate the temperature changes over the last century with greenhouse forcing, but only perfectly up to 1970 or so, where the natural and anthropogenic+natural paths diverge significantly. The fact remains that there is no simulation of the coupled atmosphere-ocean-cryosphere system that spontaneously simulates a change as large as that of a doubling of carbon dioxide in Holocene-like conditions, and there is no simulation that can explain the warming trend over the last 30 years without a discernible human influence. The fact remains that the overwhelming consensus of the scientific community, and the modeling, empirical and paleoclimatic literature, supports the notion that the climate will change significantly if we keep emitting GHGs, and so knowing the century+ old physics might be enough for me to take action, and I’ll leave “improving models” for academics only.

    I need to put the uncertainties in perspective as well, because El Nino does not raise global mean temperatures on decadal to centennial timescales. If I can’t simulate El Nino, I might not base coastal activity decisions in Australia on a GCM, but it doesn’t make sense for me to say “AGW is more questionable now, because that GCM doesn’t get El Nino right.” For “practical” purposes, it seems the decision when confronted with imperfect models is whether the errors and/or uncertainties are sufficient to undermine the conclusion that a doubling of carbon dioxide will have dangerous effects on the climate system, or that human activities are changing the climate. But I’m an undergrad… what do I know.

    C

  59. Gavin: if your supposed simple flaw in the Douglass et al paper is that they compare single data sets (eg. UAH) to confidence intervals for the models, one can do this to test for outliers (you need not have two distributions, this is a 1-tailed test). In addition, Douglass et al in 4.1 case 2 state that “Even if the extreme of the confidence intervals of Christy and Norris (2006) and Christy et al. (2007) were applied to UAH data, the results would still show inconsistency between UAH and the model average” though it would have been nice if they showed this result.

  60. Gavin said:
    Both Briggs and Lucia have declared that the GCMs are not skillful
    No, Gavin. I said the IPCC modeling process as a whole is not skillful. According to the IPCC documents themselves, the process is a multi-step one involving a range of models for different tasks. The final projections, and the information conveyed to the public, rely on a hierarchy of models, including some simplified ones tuned to AOGCMs that include more model physics.

    You wish to translate “models” into “GCMs” only. But there are other types of climate models. And these other types are involved in the IPCC projections communicated to the public.

    Gavin then said: Projections (in the IPCC sense) are a subclass of predictions that include, as one of the conditions, a particular scenario of future forcings, i.e. if a certain scenario comes to pass, then the model will predict x. A forecast (such as for the weather) can be thought of as a prediction that is not dependent on any scenario.

    And interestingly enough, this means the projections for temperature trends in the recent IPCC AR4 are forecasts. For the near term, that document states, quite specifically, that for the full range of SRES investigated, the projections collapse in the near term. The projection has a central tendency of 2C/century. This means the distinction between different projections or predictions for different scenarios makes no difference with regard to this group of predictions. All predict the same thing.

    Second: The projections as described in the IPCC documents are, no matter how you read them, communicated to the public as predictions. The word itself has the dictionary definition of prediction. (And the only other definitions are such things as “we extended a line,” or “projected on a screen.”)

    Finally: The entire IPCC process is presumably intended to create projections based on scenarios that could conceivably occur. So, if the forcings that occur fall outside the full range of scenarios, then the IPCC prediction/projection method is flawed, and falsified.

    Your GCMs might be safe. But that has little relevance to the argument I’ve made about the process as a whole, which also involves developing models to predict likely scenarios.

    So, Gavin, if your only intention is to discuss GCMs in total isolation from every other model used by the IPCC to create predictions of climate, then it is true that those specifically are not falsified by GMST data. But they also aren’t confirmed in any sense – for precisely the same reasons you give.

    But the possibility that the fault does not lie in the AOGCMs hardly clears the more general category of “climate models”, and certainly doesn’t clear “the hierarchy of models used by the IPCC to predict climate”. The IPCC itself calls these climate models. And when I say the models are falsified by the data, I mean that something in the models, used collectively to create the recent projections, has gone wrong.

    All this means is someone, somewhere, needs to go back to the drawing board and improve something. But presumably, climate modelers are doing that just as every modeler in every field everywhere is doing that in their areas.

    For what it’s worth, I have no particular dispute with your Reynolds decomposition into ‘a + e’. We did that when I took turbulence too. 🙂

  61. lucia,

    the IPCC certainly does not disagree with the notion that if we curtail carbon emissions, then we will be able to avoid the worst-case scenarios. That is why there are different emission scenarios, and thus *projections.* No one is in the business of fortune telling; the statements are “if x, then y” or “if a, then b.” It’s up to us to decide x or a. The temperature responses to the different scenarios certainly diverge by 2050 into scenarios which are bad, and others that are more manageable. I can’t predict what policy makers will do, or how countries will start to act, but I can tell you that if we choose to double CO2 then we’ll get around 3 C of warming. That is projection vs. prediction.

  62. Craig, the error in Douglass et al is that their estimate of the uncertainty in the model projections is instead the uncertainty in the determination of the mean of the model projections, rather than the spread. It is exactly equivalent to throwing a die 100 times and calculating the mean throw to be 3.5 +/- 0.1 and then claiming that a throw of 2 is a mismatch because 2 is more than 2×0.1 away from the mean. It’s just wrong! There are additional problems with that paper – for instance, they were given more up-to-date analyses of the radiosonde data which they did not even mention (probably because it did not support their thesis). The fact of the matter is that since tropical variability is large and the expected trend small (big ‘e’, small ‘a’) you need a longer time series to have a chance of saying anything much. I would counsel you not to get stuck defending the indefensible.
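
    A quick sketch of the arithmetic behind the die analogy (synthetic throws only; nothing here uses real model output):

        import numpy as np

        rng = np.random.default_rng(2)
        throws = rng.integers(1, 7, size=100)      # 100 throws of a fair die

        mean = throws.mean()
        spread = throws.std(ddof=1)                # sigma of the distribution of throws
        se_mean = spread / np.sqrt(throws.size)    # sigma_SE: uncertainty of the mean

        print(f"mean of throws          : {mean:.2f}")
        print(f"spread of throws (sigma): {spread:.2f}")
        print(f"std. error of the mean  : {se_mean:.2f}")
        print(f"a throw of 2 sits {abs(2 - mean) / se_mean:.1f} standard errors from the mean,")
        print(f"but only {abs(2 - mean) / spread:.1f} spread-sigmas away, so it is no mismatch")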

    Lucia, simple energy balance climate models cannot possibly give an accounting of the weather noise. Therefore they are only appropriate for climate projections where weather noise is a small term, and thus they are just not valid for short time periods where weather is important. IPCC conclusions about the central tendency of the trend over ‘the next few decades’ cannot be read as implying that ‘this is the trend to be expected for every subset of 8 years’. You could do the real validation using the AR4 archive and use all the actual climate model output. You would quickly see the futility of trying to find little ‘a’ in the presence of big ‘e’.

    As to what the IPCC intended to be interpreted by the public, why not simply read the report:

    “Climate prediction
    A climate prediction or climate forecast is the result of an attempt to produce a most likely description or estimate of the actual evolution of the climate in the future, e.g. at seasonal, interannual or long-term time scales. See also: Climate projection and Climate (change) scenario.

    Climate projection
    A projection of the response of the climate system to emission or concentration scenarios of greenhouse gases and aerosols, or radiative forcing scenarios, often based upon simulations by climate models. Climate projections are distinguished from climate predictions in order to emphasise that climate projections depend upon the emission/concentration/ radiative forcing scenario used, which are based on assumptions, concerning, e.g., future socio-economic and technological developments, that may or may not be realised, and are therefore subject to substantial uncertainty.”
    http://www.grida.no/climate/IPCC_tar/wg1/518.htm

  63. “The F(t), F(t-1)… formulation was just intended to indicate a dependence on the history of the forcing (as opposed to the instantaneous value).”

    This makes no sense at all. Either F(t) is a continuous function of time or not.

    In any case, your description of e(t) as “weather noise” is interesting. This begs the question – why bother with an Eulerian core in a climate model? Why not just add a “noise” function which you could adjust to taste? If you do choose to use an Eulerian core, what equations are you really solving? I’ve heard some climate modelers say they are solving the “Navier-Stokes” equations, when it appears they’re really solving some filtered form of the equations applicable only to long time scales, perhaps similar to LES modeling (except without the subgrid-scale modeling). Nonetheless, the filtered equations are still non-linear, subject to time-dependent boundary conditions (many of which you don’t know a priori), and thus cannot be proven to be robust or accurate when integrated over long time periods.

  64. Here’s a model:

    “If we choose to double CO2 then we’ll get around 3 C of warming.”

    Stated without error, without a or e, without confidence intervals, indeed without empirical foundation, but with absolute confidence. And, stated with political implications since the premise was:

    “I can’t predict what policy makers will do, or how countries will start to act…”

    but if they don’t act in certain ways, it may lead to:

    “… scenarios which are bad.”

    That kind of modeling is not only without skill, it includes a dire conclusion, a threat, more or less, intended to spur and/or justify certain political actions.

    This is where climate modeling interfaces with reality, where climate modelers cannot back away and say their work is purely academic science and without consequence to political decision-making, abstract, ivory tower, somehow removed from the crudities of social interaction.

    History is filled with philosophers who would be kings, who espoused certainties that proved most uncertain, indeed wrong, and dire outcomes did indeed result from those errors of “science.”

    And this is the rub, though many would not care to say it or face it. The deficiencies in the model above are not academic and are not going to be fixed by fine-tuning equations, or by adding new factors, or by calculations of statistical significance or error bars. Confidence is not merely a statistical concept. Over-confidence is not merely a Bayesian peculiarity. There are real world consequences that will not be washed away with academic indifference or appeals to scientific theories.

  65. Re: Douglass et al 2007, “A Comparison of tropical temperature trends…” Gavin says: “Craig, the error in Douglass et al is that their estimate of the uncertainty in the model projections is instead the uncertainty in the determination of the mean of the model projections, rather than the spread. It is exactly equivalent to throwing a die 100 times and calculating the mean throw to be 3.5 +/- 0.1 and then claiming that a throw of 2 is a mismatch because 2 is more than 2×0.1 away from the mean.”
    I am afraid that your analogy to dice is without skill. It is more like this. I measure a group of students’ growth rates during 4 years of high school, in inches. I get 22 measurements of growth rate, and calculate the mean and confidence intervals on this estimate. Now I compare this distribution to the height growth of my mother, who at 85 is shrinking. I can validly say that her rate of height growth is outside the 95% confidence intervals of high school students. One can of course play a game with the GCM outputs by adding in uncertainty due to parameterization such that the confidence intervals become so wide that no observation will disprove them, but a power test would then show that fact up front (i.e., the models have no skill once you play this trick). You can’t have it both ways. By the way, I checked their calculations of confidence intervals from Table 2 and they are correct.

  66. Craig, look at it again. Their calculation of sigma_SE is the uncertainty of the estimate of the mean, not the sigma of the distribution. Just use their formula for your height data and calculate the rejection rate for members of that original sample (put aside your mother for a second). For n=22, the Douglass test will reject ~2/3rds of the original sample, which is odd for a test that supposedly has 95% confidence (i.e. it should only reject ~5%).
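
    A minimal sketch of the check proposed here, on synthetic data (the 22 normally distributed ‘growth rates’ are invented; the point is only to compare rejection rates under a mean ± 2·SE test and a mean ± 2·sigma test):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 22
        sample = rng.normal(loc=5.0, scale=1.5, size=n)   # synthetic "growth rates"

        mean = sample.mean()
        sigma = sample.std(ddof=1)
        se = sigma / np.sqrt(n)                           # the sigma_SE in question

        reject_se = np.mean(np.abs(sample - mean) > 2 * se)      # Douglass-style band
        reject_sigma = np.mean(np.abs(sample - mean) > 2 * sigma)

        print(f"rejected by mean +/- 2*SE    : {reject_se:.0%} of the sample")
        print(f"rejected by mean +/- 2*sigma : {reject_sigma:.0%} of the sample")
        # With n = 22 the +/- 2*SE band is only about 0.43 sigma wide on each side,
        # so a large fraction of perfectly consistent members gets "rejected".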

    Since Matt is happy to cheerfully undertake any statistical analysis, perhaps he’d like to chip in. Maybe you’ll trust his analysis over mine – but since this is just mathematics and not really climate science, I don’t see why it should make much difference (but it undoubtedly will, sigh…).

  67. Mike, in a more academic setting I’d discuss ranges and confidence intervals. Other than that, if you don’t like the conclusion, or if it doesn’t fit preconceived notions, that doesn’t make it wrong. That is what the science says, and the physics of radiative transfer and Clausius-Clapeyron and melting ice isn’t really concerned with whether I am a Democrat or a Republican, how people choose to use my information, etc. Your point that this is “with political implications” is ignorant at best.

  68. My understanding is that the Douglass and Christy paper chose only the model runs that produced a semi-realistic surface temperature, because a model run that got the surface temperatures wrong does not provide any meaningful insight into the tropospheric trends.

    I also find it troubling that model advocates acknowledge that most models get many metrics wrong yet insist that the collective set of models provides useful information. For example, a model might correctly produce results that follow the GMST but produce precipitation patterns that are completely wrong. In my opinion, such a model tells us nothing about GMST or precipitation patterns, because the model match to GMST could have been a fluke or a result of tuning.

    I also think it makes no sense to cherry pick metrics from different models and claim that they collectively provide meaningful insights into the climate.

  69. Raven, your understanding is wrong. They used all the models.

    Your point about which metrics to look at, though, is an interesting thing to discuss and goes to the heart of Matt’s original post, i.e. does it make sense to focus on metrics where models agree or not? The ones you hear talked about most often are the ones where the models tend to show similar things. For the metrics where the models disagree you don’t have much to go on – and so not much is said. For those metrics that are robust, where there is a good theoretical case to be made and there is some observational support, people are justified in taking that to be a reasonable projection (given our current understanding). If any of those aspects are missing, statements about the future must be more uncertain. GMST changes are in the first category – all the models agree, the theory is very solid, and there is a match to observations – therefore they get attention. Local rainfall changes are not robust among the models, there is little theory to guide you, and the observations are very noisy. Therefore local precip amount changes don’t get given that much attention. Some aspects of the precip changes are more robust – higher rainfall in high latitudes, sub-tropical drying, greater intensity of the rain as specific humidity increases – and have theoretical and observational support. It’s therefore not cherry picking to focus on metrics that pass some thresholds.

  70. I would like Gavin to respond to two points.

    1. In comment #8 you said: “The Douglass et al. paper is fundamentally flawed as has been pointed out many times…” You were asked twice to give the references but did not respond. As an author of that paper I also ask: Gavin, please give me the references.

    2. In comment #71 you state that there was data that we did not use “probably because it did not support their thesis.” This is not true. You are accusing me of scientific dishonesty, which I consider the most serious charge that can be made against any scientist. I ask you to withdraw that remark, and I would like an apology.

  71. For Craig and Douglass, the realclimate post has not yet been responded to and should be a sufficient “reference” for here; you’ll have to ask Gavin if a refereed response is in the doing. Rather than dwelling on a “reference,” you should address the criticisms that several people have put forth.

    I don’t know about Douglass, but John Christy has crossed the line of “shady science” too much, so I’d be skeptical about the apology as well.

  72. Dr. Douglass, My criticism of your methodology was put on the RC website in December and neither you nor your co-authors have chosen to respond to the central issue (the inappropriate nature of your statistical test) anywhere that I can see. Those statements were considerably higher profile than these comments, so I find it odd that this is where you choose to respond. But please do explain why a test that would reject the majority of any training sample as being mismatched is appropriate for model-data comparisons. Publications in the technical literature will no doubt soon be forthcoming.

    As to your honesty, I have made no statement about it, and have no particular opinion. However, you were sent three versions of the RAOBCORE radiosonde data (v1.2, 1.3 and 1.4). You chose to use only v1.2 – which has the smallest tropospheric warming. You neither mentioned the other, more up-to-date, versions, nor the issue of structural uncertainties in that data (odd, since you were well aware that the different versions gave significantly different results). Maybe you’d like to share the reasons for this with the readership here?

  73. Gavin:

    IPCC conclusions about the central tendency of the trend over ‘the next few decades’ cannot be read as implying that ‘this is the trend to be expected for every subset of 8 years’.

    I have never assumed the IPCC projections suggest we can expect that trend for every subset of 8 years. Who would?

    You could do the real validation using the AR4 archive and use all the actual climate model output. You would quickly see the futility of trying to find little ‘a’ in the presence of big ‘e’.

    No, Gavin. To comment on the skill of the IPCC’s projections or predictions, it is necessary to test the information the IPCC published against what actually occurred.

    If you want to dig through the archive to find whatever it is you think will redeem something about climate models, feel free to do so. But that exercise is irrelevant to any observations about the central tendency, or uncertainty intervals, the IPCC communicated to the public.

    I don’t know why you think it is not possible to compare a predicted trend to the possible ranges of trends consistent with noisy data collected on earth. Or, more precisely, I don’t know why you wish to use criteria like number of years rather than relying on standard statistical methods that result in uncertainty intervals for trends based on the actual variability due to weather.

    But rest assured, comparing predictions of mean values– like trends– against noisy data is done routinely in many fields. And making claims about noisy data is not futile. There are caveats associated with any statistical test, but these tests exist and are used in a wide range of fields involving very noisy data.

    The fact that sometimes one gets ambiguous results for a hypothesis test when the amount of data is small doesn’t mean that one can never get unambiguous results.

    Small amounts of data tend to result in large amounts of “beta” errors (false negatives), but it doesn’t mean we can’t get correct positive results to the degree of confidence we would like. (That is, with alpha errors below some critical threshold.)

    And the finding that a hypothesis – in this case the IPCC’s AR4 projections – is outside the range consistent with the data is a positive result, from the point of view of statistics. And, taking into account the weather variability that actually occurred, the projected trend, said to apply to all SRES, is inconsistent with the actual data.

    So, it is up to climate scientists to figure out whether this inconsistency occurred because a) the full range of SRES is inappropriate, b) the climate models used (whether simple or AOGCM) are not skillful given correct scenario information, c) the method of determining the uncertainty intervals doesn’t account for real uncertainties, or d) other.

    As for your observation that the simple models don’t account for weather noise: I am aware of this. Possibly, the IPCC should realize that it would be desirable to consider the effect of weather variability on the variability of measured trends using regressions. Anyone who understands the concept of filtering underlying LES would surely realize that it would be entirely possible to estimate the variability in global mean temperature expected as a result of weather, and then fold that into estimates for uncertainty intervals. It would not be that difficult.
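
    A minimal sketch of the kind of trend comparison being described, entirely on synthetic monthly data (the noise level is invented, the uncertainty is plain ordinary least squares, and a serious version would have to handle autocorrelation in the residuals; the 2 C/century figure is the central tendency quoted above):

        import numpy as np

        rng = np.random.default_rng(4)
        months = np.arange(96)                   # eight years of monthly anomalies
        true_trend = 0.010 / 12                  # synthetic truth: 1.0 C/century, per month
        y = true_trend * months + rng.normal(0, 0.12, months.size)

        # Ordinary least squares line, with the textbook standard error of the slope.
        X = np.column_stack([np.ones_like(months), months])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (months.size - 2)
        se_slope = np.sqrt(s2 / np.sum((months - months.mean()) ** 2))

        slope_century = beta[1] * 12 * 100       # convert per-month slope to C/century
        se_century = se_slope * 12 * 100
        lo, hi = slope_century - 2 * se_century, slope_century + 2 * se_century

        hypothesized = 2.0                       # C/century, the stated central tendency
        verdict = "inside" if lo <= hypothesized <= hi else "outside"
        print(f"estimated trend: {slope_century:.2f} C/century, rough 95% interval [{lo:.2f}, {hi:.2f}]")
        print(f"the hypothesized 2.0 C/century lies {verdict} that interval")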

  74. Gavin
    I was unaware of the RC website comments about our paper until some time later (why did you not send them to one of us directly for our response? We would have given you one). I am accustomed to scientific discussion at this level via published papers in refereed journals. We expected any criticism of our paper to be a comment in a refereed journal so that we could respond also in a refereed journal. Do you know of any such comment on our paper that has been written?

    The unrefereed comments on the RC blog were unsigned. To whom should we have addressed our unrefereed comments? (Did you just say that you were the author?) This is not the way good science is done.

    Contrary to your information, we were never sent the RAOBCORE ver1.4 data (check your source). However, we did realize that we had not explained our use of ver1.2 in our paper, so we sent an addendum to the journal on Jan 3, 2008, clarifying two points. The first point is quoted below.
    ——————–
    1. The RAOBCORE data: choice of ver1.2.
    Haimberger (2007) published a paper in which he discusses ver1.3 and the previous ver1.2 of the radiosonde data. He does not suggest a choice although he refers to ver1.2 as “best estimate.” He later introduces on his web page ver1.4. We used ver1.2 and neither ver1.3 nor ver1.4 in our paper for the satellite era (1979-2004). The reason is that ver1.3 and ver1.4 are much more strongly influenced by the first-guess of the ERA-40 reanalyses than ver1.2.
    (Haimberger’s methodology uses “radiosonde minus ERA-40 first-guess” differences to detect and correct for sonde inhomogeneities.) However, ERA-40 experienced a spurious upper tropospheric warming shift in 1991, likely due to inconsistencies in assimilating data from HIRS 11 and 12 satellite instruments — which would affect the analysis for the 1979-2004 period, especially as this shift is near the center of the time period under consideration. This caused a warming shift mainly in the 300-100 hPa layer in the tropics and was associated with (1) a sudden upward shift in 700 hPa specific humidity, (2) a sudden increase in precipitation, (3) a sudden increase in upper-level divergence and thus (4) a sudden temperature shift. All of these are completely consistent with a spurious enhancement of the hydrologic cycle. Thus ver1.3 and ver1.4 have a strange and unphysical vertical trend structure with much warming above 300 hPa but much less below 300 hPa (actually producing negative trends for 1979-2004 in some levels of the zonal mean tropics). Even more unusual is the fact that the near-surface air trend in the tropics over this period in ERA-40 is a minuscule +0.03 °C/decade (Karl et al. 2006) and so is at odds with actual surface observations, indicating problems with the assimilation process. This inconsistent vertical structure as a whole is mirrored in the direct ERA-40 pressure level trends and has been known to be a problem, as parts of this issue have been pointed out by Uppala et al. (2005), Trenberth and Smith (2006) and Onogi et al. (2007). Thus we have chosen ver1.2 as it is less influenced by the ERA-40 assimilation of the satellite radiances.

  75. At Comment #62 above I gave a working definition for Verification. A definition that has been successfully applied to many models/methods/codes. I asked GISS/NASA’s Gavin Schmidt for his opinion relative to application of Verification procedures to the what-if scenarios produced by the IPCC. He has not yet responded. And I suspect he has no intention of responding. I realize he has a day job, but he has engaged here several times since my comment was posted.

    This outcome is consistent with almost all my other experiences with Real Climate and GISS/NASA’s Schmidt. As shown by his comments above in this thread, he has generally extrapolated what I actually say into areas that are not useful for actual discussions of the real issues. My entire career has revolved around development of models of physical phenomena and processes and implementation of these into software. So how can it be possible that I said that making models is impossible?

    One objective I had for the very narrow focus of my Comment #62 was an attempt to avoid getting led off on a perpendicular. Apparently that has been successful :).

    I’m thinking Gavin is following his own advice given in #71 above:

    I would counsel you not to get stuck defending the indefensible.

    Otherwise I consider the following questions to be very straightforward and easily addressed.

    I have the following questions of Gavin and anyone else who cares to express an opinion.
    Verification of the IPCC what-if scenarios is necessary.
    Verification of the IPCC what-if scenarios is not necessary.
    Verification of the IPCC what-if scenarios is necessary before mitigation strategies are implemented.
    Verification of the IPCC what-if scenarios is not necessary before mitigation strategies are implemented.
    The sole issue is Verification of the numbers produced by a calculation. Note that I have said nothing about Validation relative to physical reality.
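
    To be concrete about what I mean by Verification, here is a minimal sketch on a toy problem (a single ODE with a known exact solution, nothing to do with climate codes): check that the computed numbers converge to the exact solution of the code’s own equations at the theoretical rate.

        import math

        # Verification in the narrow sense above: does the code solve its own
        # equations correctly?  Toy check: forward Euler for dy/dt = -y, whose
        # exact solution is y(t) = exp(-t), should show first-order convergence.
        def euler_solve(dt, t_end=1.0):
            n = round(t_end / dt)          # number of steps
            y = 1.0
            for _ in range(n):
                y += dt * (-y)
            return y

        exact = math.exp(-1.0)
        errors = [(dt, abs(euler_solve(dt) - exact)) for dt in (0.1, 0.05, 0.025, 0.0125)]

        # Observed order p from successive error ratios: error ~ C * dt^p
        for (dt1, e1), (dt2, e2) in zip(errors, errors[1:]):
            print(f"dt {dt1:.4f} -> {dt2:.4f}: observed order ~ {math.log(e1 / e2, 2):.2f}")
        # A coding bug usually shows up as an observed order that refuses to
        # approach the theoretical value (here, 1).

    The same sort of check, extended with grid-refinement studies and manufactured solutions, is the kind of thing I am asking about for the codes behind the what-if scenarios.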

    My views are as follows. So long as climate scientists and the IPCC want to play in the Academy Sandbox, Verification is not necessary. And the journals that publish these numbers will be playing in the Sandbox too. The journals, however, should openly acknowledge that the papers based on the numbers are purely research-grade what-if exercises and that the calculated results very likely have not been Verified to be correct, and so are not archival-quality papers. The authors of the papers should in fact supply this information in the paper itself.

    The IPCC should also in the Assessment Reports firmly state that they are publishing research-grade numbers that very likely have not been Verified to be correct.

    And I think it is very unwise to base mitigation and adaptation strategies on calculated results that have not been independently Verified. (And Validated too, of course, but those processes and procedures can only follow Verification.) Well actually, so long as what-if scenarios are the subjects of the analyses, how can focused strategies be developed?

  76. As a reader who is at an early stage of digesting the climate change debate, I sincerely hope that Gavin’s tactics are not representative of the current level of discourse.

    Gavin originally writes that Douglass et al did not use newer available data “probably because it did not support their thesis”. This is a quite serious accusation against an academic. He then refuses to retract this statement and claims “As to your honesty, I have made no statement about it, and have no particular opinion.” These two statements are clearly contradictory.

    I hope that Gavin takes the high road and produces the requested apology – particularly given David Douglass’s comprehensive response on this issue. If he refuses to do so, then it is not Dr. Douglass’s honesty that would be in question.

  77. Dr Douglass, thanks for the information. At minimum, I think we both agree that there are substantial structural uncertainties in the radiosonde data that are relevant to any model-data comparison. That these were not mentioned in the original paper is unfortunate, but this justification is better late than never.

    I am curious as to what the second point of clarification was.

    “So for any climate metric (whether it’s GM SAT, or the heat content of the oceans, or the temperature of the lower stratosphere), its evolution is:

    C(t) = a(F(t), F(t-1), …) + e(t)

    (i.e. the temperature depends on the history of the forcing and a stochastic component – which itself depends on the past trajectory). Climate models (as they were used in AR4) only claim skill in the first, forced, part of the equation. Given that ‘e’ is not zero, the usefulness of the model for any one metric depends on the relative magnitude of ‘a’ and ‘e’.”
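
    Taken at face value, the quoted decomposition is a claim about signal versus noise: how useful the forced part a is depends on how large it is relative to e. A minimal sketch with made-up numbers (a linear trend standing in for a, an AR(1) series standing in for e; nothing here comes from any model) shows what that means in practice, before turning to what is wrong with the expression itself.

        import numpy as np

        rng = np.random.default_rng(0)
        years = np.arange(30)              # a 30-year window, purely illustrative
        trend = 0.02                       # stand-in for "a": forced trend per year (made up)

        def ar1(n, phi=0.6, sigma=0.1):
            """Red noise standing in for "e": AR(1) with lag-1 correlation phi."""
            e = np.zeros(n)
            for i in range(1, n):
                e[i] = phi * e[i - 1] + rng.normal(0.0, sigma)
            return e

        for sigma in (0.05, 0.1, 0.2):     # noise amplitude relative to the trend
            slopes = np.array([np.polyfit(years, trend * years + ar1(len(years), sigma=sigma), 1)[0]
                               for _ in range(2000)])
            print(f"sigma={sigma:.2f}: estimated trend {slopes.mean():+.3f} +/- {slopes.std():.3f}, "
                  f"sign recovered {(slopes > 0).mean():.0%} of the time")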

    This statement rests on many implicit assumptions and several misconceptions.

    The right expression using the same variables would be:

    C(x,t) = G[a(F(x,t), F(x,t-1), …), e(x,t)]  where

    G is some nonlinear function of a and e
    e is the sum of unknown processes
    x is the space variable

    What is the difference between the first version and the second version?

    – The space variable is missing in 1. That means that C is an integral over space, which severely limits the metrics that have a physical meaning. For instance, temperature would correspond to the space integral of temperatures, which has no physical meaning. It could be restricted to temperatures on some arbitrary surface, which would also lack physical meaning.

    – G is assumed linear in 1. There is no reason that G should be linear, but there are many reasons that it should not be.

    – e is called “stochastic” or “chaotic” in 1. Yet chaos is the contrary of randomness and behaves stochastically only in the high-dimensional case. For example, contrary to traditional notions, new experimental evidence indicates that the small turbulent scales are anisotropic, reflecting the overall character of the flow (Warhaft), from which it follows that the use of classical statistics for such processes is not warranted.
    Clearly, treating e as Gaussian noise, which is what most climate “modellers” do, has no theoretical or experimental foundation.
    Sure, it makes the calculations easier, and if by chance the choice of C is such that e is “small” and G is approximately linear for the characteristic time scales considered, then it may even give results that don’t appear stupid, at least as long as one stays on those scales and uses only the “right” C.
    It is worth particular notice that for nonlinear G, the relative sizes of a and e do NOT matter!

    So the first expression is obviously wrong.
    Is it useful?
    Well, this question is similar to asking whether Reynolds-averaging the Navier-Stokes equations is useful.
    If the goal is to make a statement about Navier-Stokes solutions, then it isn’t. It even makes matters worse, because Reynolds averaging (which is technically a change of variables) introduces new, unphysical variables for which there are no additional equations.
    So it basically makes the system unsolvable and underdetermined.
    If the goal is to make a statement about a system whose properties are restricted to very high-dimensional chaos, then it is useful for predicting and interpreting a very narrow class of fluid dynamics problems.

    However, this remark doesn’t tell us whether the climatological analogue of RANS (expression 1) is useful, because we neither know the domain where such a simplification would remain valid, nor are we able to prove that e, which is the sum of all unknown processes, is stochastic.

    Seeing the large qualitative difference between expressions 1 and 2, it is probable that research strategies based on expressions like 1 lead nowhere, with the possible exception of some very restricted circumstances that have not yet been identified.

  79. Re: #84

    Dan Hughes,

    I’m afraid Gavin apparently will not respond to any questions or comments about validation or verification. This is truly a sad state of affairs, but based on my observation of NASA GISS, not surprising. It appears that certain members of the climate modeling community wish to spend more time blogging than doing useful work like, say, properly documenting their software. Alas, it would be difficult to validate a code like ModelE anyhow, since nowhere are the differential equations and parametric models documented in a systematic and orderly fashion. Yet we can count on the output from these undocumented, unverified codes to generate more “press releases” about our future climate…

  80. I might be naive (and late, since the last entry was 3 days ago), but could anyone comment on the intrinsic unreliability of these kinds of models over long times?

    I know very little about climate models, being involved in biological models, but the general rule I always put forward is that if a model is built on strongly coupled differential equations over which some parameter optimization (though sometimes it’s a pure estimation/guess) is also done, then the chances that there will be no deterministic chaos are close to zero. A Lyapunov exponent analysis, for example, is something I always do to estimate reliability over long times.
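
    To show what I mean on the simplest toy I know (the logistic map, which has nothing to do with climate), the largest Lyapunov exponent can be estimated in a few lines; a positive value means nearby states diverge exponentially and detailed long-range prediction is hopeless, even if the statistics of the orbit stay well behaved.

        import math

        # Largest Lyapunov exponent of the logistic map x_{n+1} = r x_n (1 - x_n),
        # estimated as the long-run average of log |f'(x_n)| = log |r (1 - 2 x_n)|.
        # For r = 4 the exact value is ln 2 ~ 0.693.
        def lyapunov(r, x0=0.2, n_transient=1000, n_iter=100_000):
            x = x0
            for _ in range(n_transient):                # discard transient behaviour
                x = r * x * (1.0 - x)
            total = 0.0
            for _ in range(n_iter):
                total += math.log(abs(r * (1.0 - 2.0 * x)))
                x = r * x * (1.0 - x)
            return total / n_iter

        for r in (3.5, 3.9, 4.0):
            print(f"r = {r:.1f}   lambda ~ {lyapunov(r):+.3f}")   # negative: periodic; positive: chaotic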

    Anyone can shed some light on this?

    Also, I have seen plenty of models that might show good agreement on certain dimensions but awful agreement on others (still in human biology). So I would like to know why I should accept a model that is only partially “right”. After all, all the outcomes are interconnected (if the equations are coupled), and I cannot accept that by “feeding” one or more equations with variables that are blatantly wrong I get some outcomes that are right. This can happen if 1) the optimization is “ad hoc”, 2) the multidimensionality of the system is not considered relevant (all variables are equal but some are more equal than others), 3) the equations are wrong, 4) the coupling of the equations is improper, 5) the scaling up/down of the different domains is improper.

    Finally, being Italian, a little trivia: the famous quote “Eppur si muove” (and nevertheless/yet/still, it does move) was never actually said by Galileo. And even if it’s true, you cannot use it as an endorsement of your model. At least I don’t believe so.

    Marco

  81. I am in public policy, with some experience in transportation modeling a few years back, and all I have to offer is that after reading the post and comments it is quite clear that the state of climate modeling is not sufficient to justify the policy burdens that some people want the models to carry: policies that would cost many billions of dollars per year in immediate costs and trillions of dollars in lost economic growth across the world. And per comment #73, it should indeed be incumbent upon the climate modeling and research community to be more forthright about what their work does and does not demonstrate, and at what level of confidence, something that many of them have been derelict in doing.

    It seems to me that in a policy sense, the climate change argument is driven by little more than a group many of whose members think their area of interest/expertise is the axis of all history and creation. They are hardly alone in this, but they are just as wrong as all the other specialties that see themselves as the most important thing in the world.

  82. This smacks of hanging onto a tiger by its tail. The word “projection” implies a mathematical concept, i.e. “if X then Y will be the outcome”. A prediction is a prophecy; “and so it will come to pass”.
    Unfortunately, whether intended or not, the latter is how the “projection scenarios” have been construed by interested parties who subsequently pass on the message. My viewpoint is that it is clear that modellers have been well aware of this misconception, and now have a long overdue duty of care to the public to make this distinction clear. Such science carries with it responsibility, and I therefore take this to mean that the confusion of these terms is either deliberate or in some way convenient to those crunching the numbers. To divorce oneself from the consequences of one’s work, whatever the field, could be described as a callous act. As I suspect that Gavin is not callous, I am left to conclude that the models have become “useful” in political terms only. They have not apparently advanced the science in a way that anyone has yet defined. If you can’t define how useful something is, then I would guess it’s not useful until you can describe its use. Would you buy a power tool if you didn’t know what it was or how powerful it was?

    Until then, Gavin, is there a way that better care might be taken of how the models are used?

  83. Actually, I went and got the data Gavin pointed me to. There are some minor annoyances (data formatting) but you can go check the hindcast of his model. In certain regimes it was “ok”, to use an advanced statistical term; in other parts it was less ok.

    I think when I checked the ModelE hindcast versus the HadCRU observations, the hindcast wasn’t out of the ±3-sigma zone very often. There were phenomena it wasn’t too good at:

    the warming in the 30s, and the El Niño spike of 98. The latter is less troublesome than the former. But all in all it wasn’t that bad.
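
    For anyone who wants to repeat the exercise, the bookkeeping is only a few lines; the arrays below are random placeholders, and a real comparison would load the actual ModelE hindcast runs and the HadCRU series for the same years and baseline.

        import numpy as np

        rng = np.random.default_rng(1)
        years = np.arange(1900, 2001)
        # Placeholder data: 20 fake "runs" and one fake "observed" series with a weak trend.
        ensemble = 0.005 * (years - 1900) + rng.normal(0.0, 0.15, size=(20, years.size))
        obs = 0.005 * (years - 1900) + rng.normal(0.0, 0.15, size=years.size)

        mean = ensemble.mean(axis=0)
        sigma = ensemble.std(axis=0, ddof=1)           # spread across runs, year by year
        outside = np.abs(obs - mean) > 3.0 * sigma

        print(f"{outside.sum()} of {years.size} years fall outside the +/- 3-sigma envelope")
        print("years outside:", years[outside].tolist())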

  84. Re: Gavin #86:

    As someone who tries to maintain a balanced approach in the AGW debate, replies like Gavin (Schmidt?)’s in #86 are quite revealing. First, he impugns Prof. Douglass’s honesty, then he makes a snarky, straw man reply to Douglass’s comprehensive reply.

    “I think we both agree that there are substantial structural uncertainties in the radiosonde data that are relevant to any model-data comparison.”

    No, they didn’t agree on that, even at a minimum. Douglass just explained why he used version 1.2 rather than 1.3 and 1.4, which WERE subject to specific problems regarding the ERA-40. Gavin then goes on to pen another unprofessional, snarky addendum claiming Douglass’s response, although clearly insufficient, was “better late than never”. Scientists used to be above this kind of conduct, at least most respected scientists were.

    Since when has scientific debate descended to “gotcha” games on internet blogs? Sure sounds like Gavin has a lot to hide by his rude, defensive tactics.

  85. Hey Briggs, you might want to wander over to CA and check out the debate about models under the troposphere temp thread.

  86. William,

    I was going to say exactly the same as #94, but Mosher beat me to it.

    As a non-statistician, I would be interested in your take on the argument Beaker has been making on the CA threads.

  87. #84 Dan, you have been raising a critical point and should write an article for public consumption about the issue of verification and validation of climate models.

    Papers in the climate science literature showing the complete unreliability of climate models are typically ignored by the AGW crowd, and this fact rarely finds its way into public view.

    But people deserve to know because they will pay the bills. Indeed, they pay the bills of all the research, too; the research that is bugled forth, and the research that is quietly hidden away.

    So please, Dan, make the effort and write the article. Publish it in a magazine like Skeptical Inquirer where the educated public will see it, and where the editors are publicly committed to a dispassionate and principled honesty. You’re the guy to do it.

    Here’s my own effort: http://www.skeptic.com/the_magazine/featured_articles/v14n01_climate_of_belief.html

  88. #93 — MrCPhysics wrote, “Scientists used to be above this kind of conduct, at least most respected scientists were.”

    Well-expressed, and I sure share this sentiment. Climate science has been subverted by partisanship, and an often-vile polemic has displaced honest debate. The scary part of it is that scientists themselves (physicists!) have willingly participated in this corruption of science.

  89. I was interested in Gavin’s choice of null hypothesis a while back in this thread: His null hypothesis is unchanging internal climate variability, explained more fully as: “Whether ‘a’ is stable or not is undetermined. We know that it is a function of the current climate and current feedbacks. In a much warmer or cooler world that might well be different. But for relatively short periods I think we can assume that it doesn’t vary dramatically – it would be very difficult to tell on a practical basis in any case.”

    I’m fascinated by this assumption that non-AGW climate change is so invariant in the short term. In light of historic realities, ranging from last year’s cooldown to past glaciations, that involve rather rapid change, what basis is there to presume such baseline stability?

  90. Does anyone here DEFEND the use of the SE of the mean? (Not deflect to other issues, but actually say this is the right way to assess the spread of a distribution? That they would go use this practice in polling and biology and all sorts of other similar tests?) Will Briggs speak out on this?

    The models have a lot of issues with them. Unpublished code. Poor documentation and validation. Poor calibration, testing and future testing predictions. I think they are in some ways wonders of nature in how much time, money and work have gone into them. My gut still says, nah… they don’t help much. They remind me of the computer in Blind Lake by Wilson that actually starts dreaming (it’s a fancy telescope signal-to-noise integrator that starts working even with no signal).

    I think AGW is still likely given two things: the recent temp rise, and the basic physical argument of GH warming plus water vapor from more temp.

  91. Mr. Pete
    re: 98

    Obviously the climate has a great deal of natural variability. Generally speaking, alarmist climate scientists seem to forget this and speak as if any change is driven by mankind. However, in this case, no change is probably the correct null hypothesis. I came to this opinion not because I am an expert in these things but because J. Scott Armstrong uses the same null hypothesis, and he is the expert in scientific forecasting. Climate scientists have shown little interest in scientific forecasting (indeed they have shown little awareness of the literature) until recently. But Armstrong and Kesten Green have written an audit of the IPCC projections and found them wanting. The IPCC truly needs to incorporate the principles of scientific forecasting into their future projections.

  92. I recall a fellow graduate student in physics many years ago who could consistently predict the weather with 66% accuracy, without any data or calculations. When asked how he did it, he replied,

    “I have observed that weather patterns tend to run in 3-day cycles. So, each day I predict tomorrow’s weather will be the same as today’s. I’m generally right 2 days out of 3…”
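
    The trick is easy to put numbers on: a persistence forecast scores exactly as well as the weather is persistent. A toy sketch with a made-up two-state “weather” series (real station data would go in its place):

        import numpy as np

        rng = np.random.default_rng(2)
        p_stay = 2.0 / 3.0                       # chance that tomorrow repeats today (made up)
        weather = [0]                            # 0 = fair, 1 = foul
        for _ in range(9999):
            weather.append(weather[-1] if rng.random() < p_stay else 1 - weather[-1])
        weather = np.array(weather)

        persistence_forecast = weather[:-1]      # forecast for day t+1 is simply day t
        accuracy = (persistence_forecast == weather[1:]).mean()
        print(f"persistence forecast accuracy: {accuracy:.2f}")   # ~0.67, by construction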

    Thus we might hazard a guess that next year’s weather will closely approximate this year’s, within the limits of normal variability. However, who could possibly get a paper published with the title, “Mother Nature will do what she wants to do without any assistance from Humankind.”

    And of course, who could get any funding in these times for “climate science” that proposed we simply sit and watch what Mother Nature has in store?

    Funding a cure for a nonexistent disease is far more exciting than spending our tax dollars on such mundane things as world hunger or cancer research. Just follow the money trail… like crumbs on the forest floor, it will always lead you directly to the Church of Al Gore.

  93. I thought they were ‘General Circulation Models’ rather than ‘Global Climate Models’. They purport to model atmospheric circulation and thus ‘climate’.

  94. Andrew wrote:

    “BTW Gavin, I find your statement that measurements are frequently wrong rather than the models disturbing. Measurements can be wrong, but the reality is that no good scientist assumes that theory is right and measurements wrong. If we go down that road, we get epicycles and aether.”

    If you go down that road, you also get Copernican astronomy, which, while theoretically more elegant than its Ptolemaic counterpart (the math was easier), did not really match the observational data any better for decades.

    I wish people would not base their contributions to these discussions on old cliches about the scientific method.

  95. First a question: How “open” are the GCMs? Is the source code for the models shared, discussed and dissected? Can other researchers view and analyze the source code? Is the source code public?

    Software must be tested. The only real test of a climate model is matching its output against real-world data. This makes climate models very difficult to validate, as the testing period spans decades. If the modelers ‘tweak’ the model during the testing period, the test must properly start over. For all practical purposes, this makes long-term climate models untestable and unverifiable.
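
    A sketch of the testing point, with stand-in numbers: fit on one period, score only on a later period the model was never tuned against, and accept that any re-tuning after peeking at the verification window restarts the clock.

        import numpy as np

        rng = np.random.default_rng(3)
        years = np.arange(1950, 2010)
        series = 0.01 * (years - 1950) + rng.normal(0.0, 0.12, size=years.size)  # fake record

        calib = years < 1990                     # calibration window
        verif = ~calib                           # untouched verification window

        coeffs = np.polyfit(years[calib], series[calib], 1)   # toy "model": a straight line
        pred = np.polyval(coeffs, years[verif])
        rmse = np.sqrt(np.mean((pred - series[verif]) ** 2))
        print(f"verification-period RMSE: {rmse:.3f}  (judge the model on this, not on the fit)")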

  96. From my layman’s perspective this discussion is a useful dialog among scientists, but it fails to deal with the broader issues raised by Vincent Gray, the introduction to which follows:

    “APRIL 2008
    THE GLOBAL WARMING SCAM
    by Vincent Gray
    Climate Consultant
    75 Silverstream Road, Crofton Downs, Wellington 6035, New Zealand.
    Email vinmary.gray@paradise.net.nz
    April 2008

    ABSTRACT

    The Global Warming Scam has been perpetrated in order to support the Environmentalist belief that the earth is being harmed by the emission of greenhouse gases from the combustion of fossil fuels.
    The Intergovernmental Panel on Climate Change (IPCC) was set up to provide evidence for this belief. They have published four major Reports which are widely considered to have proved it to be true. This paper examines the evidence in detail and shows that none of the evidence presented confirms a relationship between emissions of greenhouse gases and any harmful effect on the climate. It is the result of 18 years of scrutiny and comment on IPCC Reports and of a study of the scientific literature associated with it.
    In order to establish a relationship between human emissions of greenhouse gases and any influence on the climate, it is necessary to solve three problems:
    To determine the average temperature of the earth and show that it is increasing.
    To measure the concentrations of greenhouse gases everywhere in the atmosphere.
    To reliably predict changes in future climate.
    None of these problems has been solved.
    It is impossible to measure the average surface temperature of the earth, yet the IPCC scientists try to claim that it is possible to measure “anomalies” of this unknown quantity. An assessment of all the temperature data available, largely ignored by the IPCC, shows no evidence for overall warming, but the existence of cyclic behaviour. Recent warming was last recorded around 1950. An absence of warming for 10 years and a current downturn suggest that the cool part of the cycle is imminent.
    The chief greenhouse gas, water vapour, is irregularly distributed, with most of it over the tropics and very little over the poles. Yet the IPCC tries to pretend it is uniformly distributed, so that its “anomalies” can be treated as “feedback” to global temperature models.
    Carbon dioxide is only measured in extremely restricted circumstances in order to pretend that it is “well-mixed”. No general measurements are reported and 90,000 early measurements which show great variability have been suppressed.
    Although weather cannot be predicted more than a week or so ahead, the claim is made that “climate” can be predicted 100 years ahead. The claim is based on the development of computer models based on the “flat earth” theory of the climate, which assumes it is possible to model the climate from “balanced” average energy quantities. This assumption is absurd, since all the quantities have skewed distributions with no acceptable average. No resulting model has ever been tested for its ability to predict the future. This is even admitted, as the model outputs are mere “projections”. Since the projections are far into the future, nobody living is able to check their validity.
    Since no model has been validated, they are “evaluated” based on “simulations” which are mere correlations or fail to agree with observations. Future “projections”, which combine the untested models and exaggerated “scenarios”, are graded for their “likelihood” from the unsupported opinion of those paid to produce the models. A spurious “probability” attached to these opinions is without mathematical or scientific justification.
    Humans affect climate by changes in urban development and land use, but there is no evidence that greenhouse gas emissions are involved, except in enhancing plant growth.”

    (I am sure all of the commentators can find links to the paper.)

    Your analysis and evaluations please.

  97. I don’t really understand the above comment from bigcitylib. What do you mean? That if the experimental readings do not FIT the model, they should be “interpreted” or “adjusted”, as NASA did? What is an “OLD CLICHE”?

    Mr. “Beatdowner of the Conservative Menace”, should we point out that Copernicus was suggesting an alternate theory (one that Galileo used and that was PROVED right only several decades later) because Venus had “lunar phases” and Jupiter had satellites that could not be explained with the geocentric model? Go back and get your history of science facts right before pairing IPCC/AGW/Gavin with Copernicus.

    A scientist creates (if he/she is able) a new theory when experimental observations contradict the current state of the art; he does not declare that experimental observations are frequently wrong when they do not fit a theory.

    I wish people would not base their contributions to these discussions on their clearly stated political agenda. Here we talk about science.

  98. First, William Briggs, David Douglass, lucia, Dan Hughes, Gavin, and all the other posters, thank you for your participation in this discussion.

    I was struck by this comment from Gavin:

    Craig, The error in Douglass et al is that their estimate of the uncertainty in the model projections is instead the uncertainty in the determination of the mean of the model projections rather than the spread. It is exactly equivalent to throwing a dice 100 times, calculating the mean throw to be 3.5 +/- 0.1, and then claiming that a throw of 2 is a mismatch because 2 is more than 2×0.1 away from the mean. It’s just wrong!

    This is … well … it’s just wrong!

    The mean of a number of throws of a die is not an individual throw. It is never expected to equal an individual throw. Throw it a million times, the mean will never have a relationship with an individual throw. Accordingly, the proper measure to compare it with a single throw, as Gavin points out, is the standard deviation of the throws. But that’s not the situation with the models.

    The average of a number of models, on the other hand, is expected to equal the observed data. That is the basic assumption of the IPCC. The IPCC says:

    Multi-model ensemble approaches are already used in short-range climate forecasting (e.g., Graham et al., 1999; Krishnamurti et al., 1999; Brankovic and Palmer, 2000; Doblas-Reyes et al., 2000; Derome et al., 2001). When applied to climate change, each model in the ensemble produces a somewhat different projection and, if these represent plausible solutions to the governing equations, they may be considered as different realisations of the climate change drawn from the set of models in active use and produced with current climate knowledge. In this case, temperature is represented as T = T0 + TF + Tm + T′ where TF is the deterministic forced climate change for the real system and Tm = Tf − TF is the error in the model’s simulation of this forced response. T′ now also includes errors in the statistical behaviour of the simulated natural variability. The multi-model ensemble mean estimate of forced climate change is {T} = TF + {Tm} + {T′} where the natural variability again averages to zero for a large enough ensemble. To the extent that unrelated model errors tend to average out, the ensemble mean or systematic error {Tm} will be small, {T} will approach TF and the multi-model ensemble average will be a better estimate of the forced climate change of the real system than the result from a particular model.

    The observed data is not just another throw of the dice. According to the IPCC, the observed data is the value that the models are supposed to predict/project/forecast, and that their mean is supposed to “approach”. Therefore, the proper measure of the fit between models and data is the standard error of the mean.
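
    To put numbers on the two yardsticks, here is a minimal simulation; nothing in it is model-specific, it only shows how different the spread of individual throws is from the uncertainty of their mean.

        import numpy as np

        rng = np.random.default_rng(4)
        throws = rng.integers(1, 7, size=100)          # 100 throws of a fair die

        mean = throws.mean()
        sd = throws.std(ddof=1)                        # spread of the individual throws
        sem = sd / np.sqrt(len(throws))                # uncertainty of the estimated mean

        print(f"mean = {mean:.2f},  SD of throws = {sd:.2f},  SE of the mean = {sem:.2f}")
        print(f"a throw of 2 sits {abs(2 - mean) / sd:.1f} SDs from the mean,")
        print(f"but {abs(2 - mean) / sem:.1f} standard errors of the mean away")
        # Which yardstick is appropriate depends on what the observation is supposed
        # to be: another draw from the spread, or the quantity the mean should approach.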

    However, all of this is a smokescreen. The real issue is not the disagreement between the observations and the models. The issue is the disagreement between the observations and the theory.

    The fact that all of the GCMs except a couple mutants show increasing temperature trends aloft is not a coincidence. Nor is it a result of model tuning, or of real world observations. It is the result of the theory that GHGs are raising the temperature of the planet.

    IF the increase in global temperature is being caused by GHGs, it is physically necessary that there be increased, and increasing, warming as you go aloft. In a simplified version of the theory, you can imagine the greenhouse gases as being a kind of shell around the earth. This shell absorbs and radiates energy.

    Now suppose we increase the CO2. That GHG shell absorbs more energy; we can call that additional energy ∆Q. Half of that energy, ∆Q/2, is radiated out to space, and the other half goes back down to warm the earth. That’s what is called the “greenhouse effect”.

    Notice in this process that the greenhouse gases have warmed by ∆Q, but the surface has only been warmed by ∆Q/2. The important point is that the warming aloft is, and has to be, greater than the surface warming.
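
    For concreteness, the textbook one-layer version of this shell cartoon takes only a few lines (round illustrative numbers, not any model’s radiation code); the layer returns half of what it absorbs downward and sends half to space.

        SIGMA = 5.670e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
        S = 240.0          # absorbed solar flux, W m^-2 (round illustrative value)

        # One fully absorbing shell over a black surface:
        #   top of atmosphere: shell emission up = S    ->  sigma*Ta^4 = S
        #   surface balance:   solar + shell down       ->  sigma*Ts^4 = S + sigma*Ta^4 = 2S
        Ta = (S / SIGMA) ** 0.25
        Ts = (2.0 * S / SIGMA) ** 0.25

        absorbed_by_shell = SIGMA * Ts ** 4            # everything the surface emits
        emitted_up = emitted_down = SIGMA * Ta ** 4    # half of that goes each way

        print(f"shell temperature   ~ {Ta:5.1f} K")
        print(f"surface temperature ~ {Ts:5.1f} K")
        print(f"shell absorbs {absorbed_by_shell:.0f} W/m2, returns {emitted_down:.0f} W/m2 "
              f"downward and sends {emitted_up:.0f} W/m2 to space")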

    Now in the real world we don’t get warming aloft of two to one; there are losses, plus we have to convert ∆Q, the energy change, into ∆T, the temperature change, so it’s more complex. But at the end of the day, for the GHG driver theory to be correct, the atmosphere must warm more than the surface. It cannot warm less, because heat doesn’t run uphill, as we used to say.

    That’s the (admittedly simplified) physics behind the theory’s requirement that the rise in the middle troposphere has to have a positive sign. This is borne out by the overall trend of the majority of the models. All the models except a couple of mutants show a temperature trend rise beginning immediately above the surface and continuing to rise gradually to a peak at an elevation of around 300-200 hPa. It is not a chance outcome; it is required by the underlying theory.

    This is why it is so significant that the observational data shows cooling. The sign of the change is not an inconsequential zero point that you can be a little above or a little below. It is a physical limitation based on the theory that the additional warmth is caused by the increase in GHGs. If GHGs are what is making the earth warm, the atmosphere has to warm faster than the earth.

    Now as Steve McIntyre has pointed out over on his blog, the radiosonde data has problems of its own, discontinuities, the usual difficulties. It has not been audited. We should not believe it just because we like it.

    But IF the radiosonde observations are in fact correct (and the MSU satellite data seems to confirm it), then GH gases are not the driver of the warming. Because if they were, we’d see positive and increasing warming trends aloft, but our observations show negative and decreasing trends aloft.

    Which, of course, is why we are seeing such an effort to define the question as being one of the improper use of statistics: precisely because that’s not the point. Nor is the point dice throws and their distribution. These, along with Gavin’s egregious insult of David Douglass and Gavin’s subsequent refusal to apologize for it, are all just a variation on the time-tested “Wizard of Oz” ploy … “look over there, don’t pay any attention to the man behind the curtain, look at the dice, pay attention to how one man is insulting the other man” … none of that is the issue.

    The issue is that observations of negative and decreasing trends aloft flies directly in the face of the theoretical basis of the claim that the warming is caused by GHGs.

    Is this proven? No, nothing in science ever is. Are the observational data flawed? All observational data is flawed in some way.

    However, the increase in warming rates as we go aloft is one of the very, very few falsifiable predictions of the theory that current warming is caused primarily by GHGs. As such, observational evidence showing that the theory is false deserves more than a discussion of throwing dice and unsupported accusations of scientific dishonesty …

    My best to everyone, and special thanks to William Briggs for his persevering, unwavering, and yet gentlemanly King Canute quest to stem the rising tide of innumeracy …

    w.

  99. Marco wrote:

    I don’t really understand the above comment from bigcitylib. What do you mean? That if the experimental readings do not FIT the model, they should be “interpreted” or “adjusted”, as NASA did? What is an “OLD CLICHE”?

    Mr. “Beatdowner of the Conservative Menace”, should we point out that Copernicus was suggesting an alternate theory (one that Galileo used and that was PROVED right only several decades later) because Venus had “lunar phases” and Jupiter had satellites that could not be explained with the geocentric model? Go back and get your history of science facts right before pairing IPCC/AGW/Gavin with Copernicus.

    1) What I mean is that it is often a rational choice (and a choice often seen in the history of science) to choose a theory over recalcitrant data.

    In the case of Copernican astronomy, I think you are wrong on several points of fact. But regardless, what the episode really illustrates is that for several decades the choice of theory was underdetermined by observational data. So, if the choice had been determined on the basis of the available evidence, there would have been no compelling reason to choose Copernicus. However, I think the historical consensus is that that choice had already been made on other grounds when Galileo came onto the scene. I can’t remember who argues that his discoveries amounted to a “clean-up operation”.

    2) More generally speaking, favoring observation over theory can result in a conservative bias: a bias towards older, more established theories. New theories very seldom emerge fully formed, and therefore the preference for the new theory by its adherents cannot be explained on the grounds that it fits the data better, at least at first.

    3) As Gavin notes, the choice of theory over data in this case proved well-founded.

    4) You write:

    “What do you mean? That if the experimental readings do not FIT the model, they should be ‘interpreted’ or ‘adjusted’, as NASA did? What is an ‘OLD CLICHE’?”

    I don’t understand this. Are you accusing Gavin of being involved in a Great Global Warming Swindle? Cue the black helicopters and the UN blue helmets, etc. If so, no wonder he bailed on this forum. Nobody wants to argue with a Truther.

  100. I believe my opinions are just that: opinions. But facts are not, and since I have nothing better to do now, I think you deserve a reply.

    You really cannot get over this obsession with conservatism, can you? That’s why even in this second post there is not a single scientific contribution, just “old cliches” good for a find-the-conservative-and-tell-him-he’s-a-loser agenda.

    In logic, and therefore in science, a theory is second to experimental evidence. It actually stems from it. According to the NatAcSci: “In science, the word theory refers to a comprehensive explanation of an important feature of nature that is supported by many facts gathered over time. Theories also allow scientists to make predictions about as yet unobserved phenomena”. They do not allow scientists to declare the experimental evidence WRONG.

    I guess it all boils down to a little more humility. Even Einstein, Planck, de Broglie and all the fathers of modern quantum and relativity theories would be ready to move on to better theories were theirs proved wrong by experiments (actually, this is what you always do in science: prove a theory wrong to improve it).

    And so it was for Copernicus, who just stated his as a theory among others, but with the plus that his was able to explain experimental observations that seemed to contradict the Ptolemaic model. Again: get your history straight! Heliocentric theory was formulated because of the observational discrepancies with the commonly accepted theory. Nobody could see that the Earth was revolving around the Sun, but they could see that Jupiter had satellites and therefore that not everything was revolving around the Sun. And this was “unacceptable” in the Ptolemaic model. This is because a theory has to be based on some direct or indirect clues: you cannot make inferences about the unseen if you have not at least a minimal clue of its presence.

    For your reference, you are also wrong about the historical consensus: the first experimental demonstration of the Copernican model occurred in 1851, thanks to the physicist Léon Foucault and his pendulum. Note that Copernicus formulated his theory in 1543.

    But the bottom line, once again, is that you are committing an even greater sin of hubris than Galileo did. You (IPCC/Gavin/Gore/…) have all the experimental data that say that your model/theory is wrong. Galileo at least had the observations on his side.

    If your logic assumes Gavin is a Truther (capital “T”? who is he? God?) then all your outcomes cannot even be argued. Here we are discussing specifically about AGW/IPCC/him being wrong. And that’s why, in my opinion, he bailed on this forum once he found that he could not satisfactorily answer the pertinent questions posed.
    And FYI, GAVIN IS THE ONE WHO ACCUSED A FELLOW SCIENTIST OF DATA TAMPERING WITH NO PROOF!! AND NEVER APOLOGIZED.
    Shame on this attitude, which is not scientific but political.

    Good bye.

  101. Marco, as you say, Copernicus wrote in 1543. Galileo discovered the Galilean satellites in 1610. You are claiming that the latter drove the formulation of the former. I didn’t know Galileo discovered the time machine as well as all that other stuff!

  102. Even earlier than him, Aristarchus of Samos proposed the model. But Copernicus re-proposed it since:

    1. While the Ptolemaic model was very good at predicting the positions of the planets, it wasn’t precise, and over the centuries its predictions got worse and worse.

    2. He didn’t like the fact that the Ptolemaic model had big epicycles to explain the retrograde motions of the planets. He knew that this could be explained instead by having the Earth also moving around the Sun.

    These alone are indirect experimental evidence that the Pt. model was wrong. Exactly as the AGW/IPCC/Gavin/Gore models accumulate problem after problem. The difference is that nobody proposed multi-billion dollar solutions to fix the worsening predictions of a wrong model. Rather, Copernicus, Galileo, Kepler and other scientists worked out a better model (that is ultimately still wrong).

    And while the Pt. theory got the model almost right, the AGW/IPCC/Gavin/Gore one is really in trouble, since the experimental data do not agree. And if there is no experimental evidence, all this is, as we say in Italy, “fried air”.

    So if you really insist on going down this road, you are making more and more evident the similarities of the Ptolemaic model to the AGW/IPCC/Gavin/Gore model.

    But, again, here we do not want to discuss history or philosophy, only science. And, for the third time, you did not contribute a bit. So I am done.

  103. This is a very interesting discussion. In #55 Gavin talks about e(t), which describes the radiative component of the heat transfer. What about the convective and conductive components, which would be greatly affected by turbulence? Is it justifiable to describe these effects simply as noise, and hence as affecting weather as opposed to climate? After all, the average global temperature should depend on the difference between the energy coming in (from the sun) and the energy going out, and this should depend on all heat transfer phenomena. Would someone care to comment on the last paragraph in the attached essay by Prof. Tennekes? http://www.sepp.org/Archive/NewSEPP/Climate%20models-Tennekes.htm

  104. Willis,

    Thank you for your post. There has to be some ‘bottom line’, and I agree with your analysis. Well done and well said!

  105. In 113 above I meant to say a(t) rather than e(t). Also Tom Vonk in #87 seems to be saying things that are similar to the points made by Prof. Tennekes. I wonder if he would care to elaborate further.

  106. “This is nothing like the case (at most people play with maybe half a dozen)”

    Modelling choices alone give you hundreds of parameters. Let me name a few parameters you can tweak right off the bat: dx, dy, dz, dt. That’s four right there. You only have two left. Let me name some other things you can tweak: choice of boundary conditions, handling of initial conditions. There are probably a hundred ways of just deciding on grid schemes.

    I have done modeling with PDEs (E and M simulations), matched time series models, and calibrated SDE models. Never have I observed that there are only 6 parameters over which a modeler has freedom of choice. In general, the more complicated the model, the more choices a modeler is able to make. Unless climate modelling is somehow different, a hundred choices is probably correct; 6 choices for modelling the whole climate is absurd. Electromagnetics is a very well studied, accurate, well-tested branch of physics. Even in very simple electromagnetic simulations there are a huge number of different modelling choices that can be made. I have a hard time understanding why this would not be the case in climate modelling.
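
    To make the point concrete on the simplest PDE I can think of (plain 1-D diffusion, nothing to do with any GCM), even a toy solver forces a stack of choices, and the choices change the numbers:

        import numpy as np

        def diffuse(nx=50, dx=1.0, dt=0.25, kappa=1.0, nsteps=200, boundary="fixed"):
            """Explicit FTCS solver for du/dt = kappa * d2u/dx2 with a spike initial condition."""
            u = np.zeros(nx)
            u[nx // 2] = 1.0
            for _ in range(nsteps):
                if boundary == "fixed":            # u = 0 at both ends
                    left, right = 0.0, 0.0
                else:                              # "insulated": zero-gradient ends
                    left, right = u[0], u[-1]
                padded = np.concatenate(([left], u, [right]))
                u = u + kappa * dt / dx**2 * (padded[2:] - 2.0 * u + padded[:-2])
            return u

        for dt in (0.1, 0.25, 0.6):
            for bc in ("fixed", "insulated"):
                centre = diffuse(dt=dt, boundary=bc)[25]
                print(f"dt={dt:.2f}  boundary={bc:10s}  centre value = {centre:.4g}")
        # dt = 0.6 violates the stability limit kappa*dt/dx^2 <= 1/2 and the numbers blow up:
        # a reminder that "free" numerical choices are anything but.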

  107. How can anybody ever think Real Climate is run by professionals? They remain AGW hack activists to the core. I tried posting a response at RC and it was deleted:

    This from a CA post:

    The Information Ministers are back at it.

    http://www.realclimate.org/index.php/ar … /#more-562

    Note the subtle accusation of fraud again:

    We mentioned the RAOBCORE project at the time and noted the big difference using their version 1.4 vs 1.2 made to the comparison (a difference nowhere mentioned in Douglass et al’s original accepted paper which only reported on v1.2 despite them being aware of the issue)

    my bold

    However, Gavin may have a mild case of short-term memory loss, because this was discussed at Briggs’ blog recently.
    https://www.wmbriggs.com/blog/2008/04/08/why … -exciting/

    Search for Gavin’s posts, then find David Douglass responses.

    Anyone want to bet this gentle reminder wouldn’t see the light of day at RC?

  108. Sorry to arrive so late – I’ve only just found this interesting and informative blog.

    I’m a complete layman, so would this be a reasonable simplification of what this averaging of GCM ensembles achieves?

    We want to find a particular arithmetic calculation that yields the answer 10. We don’t know what the calculation is, but we know the answer: 10.

    So we devise lots of calculations, and constrain them to only ever produce results in the range 1 to 20. We take all these results and average them, arriving at a mean of, let’s say, 10.2.

    We now have an answer that is very close to being correct, but we’re actually none the wiser regarding the calculation that we originally sought.
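
    In code, the analogy might look like this (all numbers made up):

        import numpy as np

        rng = np.random.default_rng(5)
        guesses = rng.uniform(1.0, 20.0, size=50)   # 50 constrained but otherwise arbitrary results

        print(f"ensemble mean    : {guesses.mean():.1f}")          # near the midpoint, 10.5
        print(f"closest guess    : {guesses[np.abs(guesses - 10.0).argmin()]:.1f}")
        print(f"spread of guesses: {guesses.std(ddof=1):.1f}")
        # The mean lands near 10 because of the constraint alone; whether that tells us
        # anything about the "right calculation" is exactly the question being asked.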

    Is this anything like what is happening?
