The Answer to Senn will continue on Monday. Look for my Finger Lakes winery tour tasting notes Sunday!
Several readers asked me to comment on an ensemble climate forecasting post over at Anthony’s place, written by Robert G. Brown. Truthfully, quite truthfully, I’d rather not. I am sicker of climate statistics than I am of dice probabilities. But…
I agree with very little of Brown’s interpretation of statistics. The gentleman takes too literally the language of classical, frequentist statistics, and this leads him astray.
There is nothing wrong, statistically or practically, with using “ensemble” forecasts (averages or functions of forecasts as new forecasts). In weather forecasting they are often better than “plain” or lone-model predictions. The theory on which they are based is sound (the atmosphere is sensitive to initial conditions), and the statistics, while imperfect, are in the ballpark and not unreasonable.
Ignore technicalities and think of this. We have model A, written by a group at some Leviathan-funded university, model B, written by a different group at another ward of Leviathan, and so on with C, D, etc. through Z. Each of these is largely the same, but different in detail. They differ because there is no Consensus on what the best model should be. Each of these predicts temperature (for ease, suppose just one number). Whether any of these models faithfully represents the physics of the atmosphere is a different question and is addressed below (and not important here).
Let’s define the ensemble forecast as the average of A through Z. Since forecasts that give an idea of uncertainty are better than forecasts which don’t, our ensemble forecast will use the spread of these models as an idea of the uncertainty.
We can go further and say that our uncertainty in the future temperature will be quantified by (say) a normal distribution1, which needs a central and a spread parameter. We’ll let the ensemble mean equal the central parameter and let the standard deviation of the ensemble equal the spread parameter.
This is an operational definition of a forecast. It is sane and comprehensible. The central parameter is not an estimate: we say it equals the ensemble mean. Same with the spread parameter: it is we who say what it is.
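For concreteness, a minimal sketch (in Python, with made-up model outputs; the numbers are not from any real ensemble) of the operational forecast just described:

```python
# A minimal sketch of the operational ensemble forecast defined above:
# hypothetical outputs from models A..Z combined into one normal forecast.
import numpy as np
from scipy import stats

model_forecasts = np.array([14.8, 15.3, 16.1, 15.0, 15.7, 14.5])  # made-up numbers

central = model_forecasts.mean()       # we *declare* this the central parameter
spread = model_forecasts.std(ddof=1)   # and this the spread parameter

forecast = stats.norm(loc=central, scale=spread)

# The forecast is a statement of uncertainty; e.g., the probability the
# temperature exceeds 17 C under this declared distribution:
print(forecast.sf(17.0))
```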
There is no “true” value of these parameters, which is why there are no estimates. Strike that: in one sense—perfection—there is a true value of the spread parameter, which is 0, and a true value of the central parameter, which is whatever (exactly) the temperature will be. But since we do not know the temperature in advance, there is no point to talking about “true” values.
Since there aren’t any “true” values (except in that degenerate sense), there are no estimates. Thus we have no interest in “independent and identically distributed models”, or in “random” or “uncorrelated samples” or any of that gobbledygook. There is no “abuse”, “horrendous” or otherwise, in the creation of this (potentially useful) forecast.
Listen: I could forecast tomorrow’s high temperature (quantify my uncertainty in its value) at Central Park with a normal with parameters 15°C (central) and 8°C (spread) every day forever. Just as you could thump your chest and say, every day from now until the Trump of Doom, the maximum will be 17°C (which is equivalent to central 17°C and spread 0°C).
Okay, so we have three forecasts in contention: the ensemble/normal, my unvarying normal, and your rigid normal. Whose is better?
I don’t know, and neither do you.
It’s likely yours stinks, given our knowledge of past high temperatures (they aren’t always 17°C). But this isn’t proof it stinks. We’d have to wait until actual temperatures came in to say so. My forecast is not likely much better. It acknowledges more uncertainty than yours, but it’s still inflexible.
The ensemble will probably be best. It might be, as is usually the case with ensemble forecasts, that it will evince a steady bias: say it’s on average hot by 2°C. And it might be that the spread of the ensemble is too narrow; that is, the forecast will not be calibrated (calibration has several dimensions, none of which I will discuss today; look up my pal Tilmann Gneiting’s paper on the subject).
Bias and too-narrow spread are common failings of ensemble forecasts, but these can be fixed in the sense that the ensembles themselves go into a process which attempts a correction based on past performance and which outputs (something like) another normal distribution with modified parameters. Don’t sniff at this: this kind of correction is applied all the time to weather forecasts (it’s called MOS).
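For the curious, a toy sketch of that kind of post-processing (this is not the operational MOS; the regression-plus-inflation recipe and all numbers are merely illustrative):

```python
# Toy bias/spread correction of an ensemble forecast using past performance.
import numpy as np

past_ens_mean = np.array([16.0, 18.5, 14.2, 20.1, 17.3])  # past ensemble means
past_ens_sd   = np.array([1.1, 0.9, 1.3, 1.0, 1.2])       # past ensemble spreads
past_obs      = np.array([14.1, 16.7, 12.0, 18.2, 15.5])  # what actually happened

# Correct the bias with a simple regression of observations on ensemble means...
slope, intercept = np.polyfit(past_ens_mean, past_obs, 1)
# ...and inflate the spread so it matches the size of the past errors.
inflate = np.std(past_obs - (slope * past_ens_mean + intercept)) / past_ens_sd.mean()

raw_mean, raw_sd = 17.8, 1.0          # today's raw ensemble mean and spread
corrected_mean = slope * raw_mean + intercept
corrected_sd = inflate * raw_sd
print(corrected_mean, corrected_sd)
```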
Now, are the original or adjusted ensemble forecasts any good? If so, then the models are probably getting the physics right. If not, then not. We have to check: do the validation and apply some proper score to them. Only that would tell us. We cannot, in any way, say they are wrong before we do the checking. They are certainly not wrong because they are ensemble forecasts. They could only be wrong if they fail to match reality. (The forecasts Roy S. had up a week or so ago didn’t look like they did too well, but I only glanced at his picture.)
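What “do the validation and apply some proper score” can look like, in a minimal sketch using the (proper) logarithmic score of normal forecasts against verifying observations (all numbers invented):

```python
# Score normal forecasts against the observations that later came in.
import numpy as np
from scipy import stats

forecast_means = np.array([15.2, 16.8, 14.0])  # forecast central parameters
forecast_sds   = np.array([1.5, 1.5, 1.5])     # forecast spread parameters
observations   = np.array([14.7, 18.0, 13.1])  # what actually happened

# Mean log predictive density: higher is better. Compare competing forecasts
# (ensemble, my unvarying normal, your rigid point forecast) on the same data.
log_scores = stats.norm.logpdf(observations, loc=forecast_means, scale=forecast_sds)
print(log_scores.mean())
```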
Conclusion: ensemble forecasts are fine, even desirable since they acknowledge up front the uncertainty in the forecasts. Anything that gives a nod to chaos is a good thing.
Update Although it is true ensemble forecasting makes sense, I do NOT claim that they do well in practice for climate models. I also dispute the notion that we have to act before we are able to verify the models. That’s nuts. If that logic held, then we would have to act on any bizarre notion that took our fancy as long as we perceived it might be a big enough threat.
Come to think of it, that’s how politicians gain power.
Update I weep at the difficulty of explaining things. I’ve seen comments about this post on other sites. A few understand what I said, others—who I suspect want Brown to be right but aren’t bothering to be careful about the matter—did not. Don’t bother denying it. So many people say things like, “I don’t understand Brown, but I’m going to frame his post.” Good grief.
There are two separate matters here. Keep them that way.
ONE Do ensemble forecasts make statistical sense? Yes. Yes, they do. Of course they do. There is nothing in the world wrong with them. It does NOT matter whether the object of the forecast is chaotic, complex, physical, emotional, anything. All that gibberish about “random samples of models” or whatever is meaningless. There will be no “b****-slapping” anybody. (And don’t forget ensembles were invented to acknowledge the chaotic nature of the atmosphere, as I said above.)
Forecasts are statements of uncertainty. Since we do not know the future state of the atmosphere, it is fine to say “I am uncertain about it.” We might even attach a number to this uncertainty. Why not? I saw somebody say something like “It’s wrong to say our uncertainty is 95% because the atmosphere is chaotic.” That’s as wrong as when a rabid progressive says, “There is no truth.”
TWO Are the ensemble models used in climate forecasts any good? They don’t seem to be; not for longer-range predictions (and don’t forget that ensembles can have just one member). Some climate model forecasts—those for a few months ahead—seem to have skill, i.e. they are good. Why deny the obvious? The multi-year ones look like they’re too hot.
If that’s so, that means when a fervent climatologist says, “The probability the global temperature will increase by 1 degree C over the next five years is 95%” he is making a statement which is too sure of itself. But that he can make such a statement—that it makes statistical sense to do so—is certain.
If you don’t believe this, you’re not thinking straight. After all, do you not believe yourself that the climatologist is too certain? If so, then you are equivalently making a statement of uncertainty about the future atmosphere. Even saying, “Nobody knows” is making a statement of uncertainty.
See the notes below this line and in my comments to others in the text.
——————————————————————————————-
1I pick the normal because of its ubiquity, not its appropriateness. Also, probability is not a real physical thing but a measure of uncertainty. Thus nothing—as in no thing—is “normally distributed”. Rather we quantify our uncertainty in the value of a thing with a normal. We say, “Given for the sake of argument that uncertainty in this thing is quantified by a normal, with this and that value of the central and spread parameter, the probability the thing equals X is 0.”
Little joke there. The probability of the thing equaling any—as in any—value is always and forevermore 0 for any normal. Normal distributions are weird.
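To see the joke in numbers, assume for the sake of argument that our uncertainty is quantified by a normal with central 15 and spread 8: every exact value gets probability 0, while intervals get positive probability that shrinks with their width.

```python
# Probability of intervals around 17 under a normal(15, 8); width 0 gives 0.
from scipy import stats

forecast = stats.norm(loc=15, scale=8)
for width in (1.0, 0.1, 0.001, 0.0):
    print(width, forecast.cdf(17 + width / 2) - forecast.cdf(17 - width / 2))
```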
One (potential) problem with averaging the models is that a straightforward average assigns the same weight to each model. A really poor model is given the same credence as a not-so-bad one. That might be fine if most of them are in the ballpark but it seems no one has any idea. I guess the real problem might be that the model predictions aren’t actually checked — or so it seems.
DAV,
Of course, in actual ensembles constituent models can be and often are weighted differently.
But if you don’t know how well they perform what would be the basis for assigning weights?
DAV,
Just as I say above: based on past performance.
I have gotten the impression that the climate models aren’t verified by prediction. If they are, then the models get changed and you’re back at square one. OTOH, maybe “hindcast” really is a leave-one-out approach. If not, then it would seem they are weighted by how well they fit the input data.
DAV: you can weight each model based on past performance (i.e., its error against the training/historical data). Of course, at that point, you’re not so much interested in why the ensemble-model works and are more interested in how well it works. 🙂
DAV,
The problem whether climate models are verified is separate from whether making ensemble forecasts from them is reasonable.
But that they do not verify well is a reason to distrust the theory which underlies the climate models.
Will,
if you don’t use data not present in the training, you are only getting goodness of fit which is kind of circular. Good way to fool yourself.
Briggs, I totally agree. But what’s done for weather forecasting doesn’t seem to apply to climate models. The weather forecasts seem pretty good; the climate models not so much. If no real attempt is made to verify them against observations (only saying it doesn’t seem to be so) then averaging the models is rather pointless.
Briggs,
although Brown appears to make some unreasonable demands on statistics of ensembles, he does make a valid point in my view about the meaning of the multi-model ensemble statistics (I suppose referring to e.g. the spreads as in the graph in the AR5 draft). You write that it represents “our uncertainty”.
I think, that is true in the same sense as every subjective prior represents “uncertainty”. Which is essentially meaningless. Or, as I commented elsewhere, it is sociology of the climate modeling community, not physical science (All groups have access to the same observational data and to the same published literature on processes, parameterisations and numerical methods. With the data of the multi-model ensemble, one could study things like: What choices did they make? What motivated them to make these choices? How much did they interact, and how? etc. etc.)
Brown claimed that instead of just publishing the spread of the ensemble of “all models” (within a certain group) as in the AR5 draft, what should be done is pull them through empirical testing (of various relevant aspects) and try to weed out the ones showing inferior performance. As you suggest here as well Briggs. I think that Brown was making a good point here. Would you agree?
Averaging different runs of the same model, with a range of input values, to model the uncertainties in the input values, is a good idea.
But averaging different models, much less so. There is less pressure on the model builders to make their model better. And it is much better to throw away bad models than to add them with a very low weight. You can add so many bad models that their bad predictions overwhelm the good predictions from the good models.
While I can see the rationality of the practice, it is the rationality of the lesser of two evils.
DAV: the weighting is done as part of the model building process; the ensemble in this case IS the model. No fooling.
You will only know how well it works after the fact, as is the case with ALL models.
AOGCMs are not just “any” model, they are supposed to be applying the laws of physics and chemistry to our climate. One can get away with tweaking the parameters in models that produce weather forecasts, because observational experience has conclusively demonstrated how forecasting models are biased. We can’t wait for climate models to be refined by centuries of experience with changing levels of GHGs. Policymakers are relying on these models to represent the RANGE of possible future climates that are consistent with known physics and chemistry. The IPCC’s collection of national models (chosen for political reasons and interpreted using “model democracy”) is called an “ensemble of opportunity” (AR4 WG1 Section 10.1), because it makes no attempt to explore the full range of model parameters that are consistent with the physics of cloud formation and heat transfer. They acknowledge that statistical analysis of their ensemble is problematic.
“Many of the figures in Chapter 10 are based on the mean and spread of the multi-model ensemble of comprehensive AOGCMs. The reason to focus on the multi-model mean is that averages across structurally different models empirically show better large-scale agreement with observations, because individual model biases tend to cancel (see Chapter 8). The expanded use of multi-model ensembles of projections of future climate change therefore provides higher quality and more quantitative climate change information compared to the TAR. Even though the ability to simulate present-day mean climate and variability, as well as observed trends, differs across models, no weighting of individual models is applied in calculating the mean. Since the ensemble is strictly an ‘ensemble of opportunity’, without sampling protocol, the spread of models does not necessarily span the full possible range of uncertainty, and a statistical interpretation of the model spread is therefore problematic. However, attempts are made to quantify uncertainty throughout the chapter based on various other lines of evidence, including perturbed physics ensembles specifically designed to study uncertainty within one model framework, and Bayesian methods using observational constraints.”
Will,
I assumed that was implied. Often models that give the best fit are the worst predictors (called ‘over fit’). If you average individual predictors without regard to the predictive power of each and weight them according to goodness of fit then you could end up with the predictive power of the worst predictor.
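A toy sketch of the distinction being drawn: weight the models by their error on held-out data (data not used to build them), not by in-sample goodness of fit. The error numbers here are invented.

```python
# Inverse-error weighting of an ensemble using out-of-sample errors.
import numpy as np

holdout_error = np.array([0.8, 1.5, 3.0])   # mean absolute error on held-out data
weights = 1.0 / holdout_error               # better models get bigger weights
weights /= weights.sum()                    # normalise to sum to one

model_forecasts = np.array([15.2, 16.9, 19.5])
print(weights, np.dot(weights, model_forecasts))
```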
Frank,
We can’t wait for climate models to be refined by centuries of experience with changing levels of GHGs
Really? Why is that?
Policymakers are relying on these models to represent the RANGE of possible future climates that are consistent with known physics and chemistry.
To do what, exactly? Are they making preparations or just looking for a revenue source?
Clearly a model could be created using a lower effective climate sensitivity. And that model would likely match observations better over the past two decades.
It would be, of course, meaningless to declare this model “better” until it has shown better skill in forecasting going forward.
The real sin here is that there is no evidence this type of low sensitivity model is being built today, and there appears to be a great deal of political pressure to prohibit them from being built and analyzed going forward.
Many are deathly afraid of reading a headline “Low climate sensitivity models outperform high climate sensitivity models over the past two decades”. One avoids this scenario by never allowing the comparison to occur.
DAV: The questions you posed have nothing to do with the subject of this post. If you really don’t understand why we can’t wait centuries for observational validation of climate models, continued discussion is unlikely to be profitable.
Frank,
Sounds like some people only need a model for a sciencey veneer. If you can’t wait for validation, why bother with it at all?
How do we validate a model for the maximum load a bridge can carry? Are we required to validate our model by building a dozen identical bridges and increasing the load on them until they break? And then build one more for the public to use? We’ve found better ways that don’t rely on purely empirical methods.
We don’t necessarily need to conduct experiments with the composition of the earth’s atmosphere to predict how changes will affect our climate. The behavior and properties of the atmosphere have been studied in the laboratory and that information can be reliably used in climate models. Unfortunately, some of the needed parameters aren’t precisely known. Some aspects (convection, cloud microphysics) can’t be modeled with the needed precision and require approximate solutions. By selecting only a few models (with a single value for parameters that aren’t accurately known), the IPCC is misleading policymakers about the full range of possible futures that are compatible with well-established science.
Sciencey veneer? No. Inappropriate politicization of science? Absolutely
“Clearly a model could be created using a lower effective climate sensitivity.”
How?
Climate sensitivity is not an input parameter.
How do we validate a model for the maximum load a bridge can carry?
Engineering projects DO undergo validation, Frank, but the actual answer to your question is: by overdesign and incorporating safety margins. Even then, some have fallen down, but usually for other reasons than improper load calculations. There is a litany of engineering failures when projects have strayed: e.g., the infamous Tacoma Narrows bridge project from the last century is held up as an object lesson in inadequate design.
Not to mention bridge design is a very old art. Climate science is in its infancy. World climate is poorly understood and the current GCMs are simplifications of that.
Until a statistical model is validated it is merely a pretty toy. Why would you want to base your actions on what a toy tells you — particularly an infant’s toy? To do so is, indeed, using it as a veneer.
Models can never be correct if they continue to assume that knowing radiative flux we can somehow determine surface temperatures, completely disregarding non-radiative processes which remove two-thirds of the energy which transfers from the surface to the atmosphere before radiation does the rest. Because of this the surface acts nothing like a blackbody.
Even as of today, Principia Scientific International is still publishing an article “The Anthropogenic Global Warming Controversy” which refers to an article by Claes Johnson in which Claes quite incorrectly describes how thermal energy moves downwards in an atmosphere. I have added four comments pointing out the error, and written to Claes (copy John O’Sullivan) pointing out the error. The last of my comments on the PSI thread sums it up, and it’s worth repeating here …
The Second Law of Thermodynamics states that thermodynamic equilibrium will evolve spontaneously. In a gravitational field this thermodynamic equilibrium (with greatest accessible entropy) is isentropic. Hence, disregarding chemical and phase changes, the total of the gravitational potential energy and kinetic energy in any small region (even a few picograms of the atmosphere) will tend towards homogeneity at all altitudes in calm conditions. This can happen by diffusion (conduction between molecules) without any convection. Because PE varies, so will KE, and thus there will be an autonomous temperature gradient.
Thermal energy flows over a sloping temperature plane in a gravitational field in all accessible directions away from any source of new energy which disturbs thermodynamic equilibrium. That, in effect, is what the Second Law says will happen. This is how the base of the troposphere stays warm and supports the surface temperature.
In summary, PSI (and Claes Johnson) are right in saying what I say in my “Radiated Energy” paper of March 2012 about radiation from a cooler blackbody not transferring thermal energy to a warmer blackbody. But they are wrong in endorsing an article such as today’s, which cites what Claes Johnson has said about non-radiative heat transfers in planetary atmospheres.
“How do we validate a model for the maximum load a bridge can carry?”
Engineers break a bridge down into individual components (column, beams, girders, footing, etc.) that can be easily tested. The designs are also done in such a way that a failure of one bolt or beam will not bring the entire bridge (or building) down. Engineers don’t try and model the “world” as do the high priests of climatology – Engineers break big problems down into small manageable problems.
Ian,
The key word here is “effective” sensitivity. It is the theoretical positive feedbacks of CO2 that provide the amplifying effect that gets us from the laboratory ~1C warming from carbon alone to the higher numbers.
One can envision models that have lower feedback parameters. The implied assertion that models cannot be tuned for lower sensitivity makes no sense.
“How do we validate a model for the maximum load a bridge can carry?”
They build actual bridges and test them, obviously not to destruction, but to spec. That’s the only way to really know it works. There is also destructive testing of pieces of the bridge and specifications for all the components of the bridge that are tested.
In design they do their best with the imperfect tools they have. As the bridges built using these tools are verified, more trust is placed in the design tools.
I can tell you that no engineer is going to trust a brand new tool using never before tested methods. Sometimes using untested tools is necessity, in which case a lot more testing is done to validate as much of the tool as possible.
But you need to build bridges…
“We can go further and say that our uncertainty in the future temperature will be quantified by (say) a normal distribution1, which needs a central and a spread parameter.”
Is there a statistical test(s) that can be performed on the temperature forecasts of (say) 73 models to see how close they come to a normal distribution? Would it matter if they were/weren’t close to being normally distributed?
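The mechanics of such a check are simple enough, though see the reply further down about what it would and would not mean. A sketch with 73 stand-in forecasts and the Shapiro-Wilk test:

```python
# Test 73 stand-in model forecasts against the null of normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
forecasts = rng.normal(loc=15.0, scale=1.2, size=73)  # stand-in for 73 model outputs

stat, p_value = stats.shapiro(forecasts)  # null: sample drawn from a normal
print(stat, p_value)
```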
Frank on 20 June 2013 at 1:20 pm said: “DAV: The questions you posed have nothing to do with the subject of this post.”
To me, DAV’s questions don’t seem any more OT than the statements that provoked them. 🙂
I presume everyone knows the famous story about the Emperor of China’s nose? That seems to me to be the ultimate ‘ensemble of models’.
Nullius in Verba,
A not exactly apt comparison: presumably climate modelers have some information about that which they are forecasting.
Gary Hladik,
No no no no no no no no no no no and, finally, no. Temperatures are not “normally distributed”. Neither is an ensemble forecast. Neither is anything. It simply makes no sense to say “X is normally distributed.” It is an empty statement.
We can say, “Our uncertainty in the value X can take is quantified by a normal distribution (with these parameters).”
Whether that proposition is “good” is checked in the same way we check whether the ensemble of forecasts is “good.” Where “good” is in the eye of the decision maker.
How do we validate a model for the maximum load a bridge can carry?
It most certainly is not done by averaging results from an ensemble of opportunity comprised of models, methods, and software that have not been subjected to independent verification and validation, and are not maintained under approved SQA plans and procedures.
It will never be done by using an average of several models and methods of any kind.
Factors of safety ( uncertainty ) will be applied to those aspects that are critical to the proper performance of the bridge. These critical aspects will likely be determined by use of sensitivity studies by a single model and method and software.
Nick Stokes is referring to you positively, how about that, Briggsy?
“presumably climate modelers have some information about that which they are forecasting.”
Do they? What’s the evidence for that?
Brown’s point is that understanding of the uncertainty in a projection can only come from verification, it *cannot* come from the mere *assertion* of a range of hypotheses/models.
You, and much of the rest of the climate establishment, are assuming that they wouldn’t have been asserted if there was no verification, and that therefore their assertion implies that there is some consequent understanding of uncertainty related to them hanging around in the background.
But even if this is the case, you still cannot determine the uncertainty by looking only at the model outcomes. You have to include the verification evidence somehow in the calculation.
The Chinese people estimating the length of the Emperor’s nose surely know *something* about how long people’s noses generally are. Nobody is going to estimate longer than a foot, or shorter than an inch. Their average is most certainly not without value, but it is of much less value than the sample size would indicate, and without understanding the basis of their estimates, much harder to quantify.
Briggs,
I agree with you that there is nothing wrong with using an ensemble of models for a forecast.
The difficulties with AOGCM ensembles used to make projections are nevertheless difficulties which spring from the fact that these are simultaneously used to ‘forecast’ and to ‘understand physics’. Among them:
1) On the one hand, the mean of the ensemble and its dispersion can be (and have been) communicated as forecasts. On the other hand, some claim they aren’t forecasts at all and so should not be tested as forecasts. This is a problem– as one needs to decide whether they are or are not forecasts.
2) Because the individual AOGCMs are phenomenological and each is an attempt to simulate the earth, including the uncertainty that arises due to the chaotic nature of weather, some will try to identify features of the ensemble with properties of the weather. So some have claimed or implied that the spread of the ensemble corresponds to the variability of trends that arises from the chaotic nature of ‘weather’. That is, they identify the spread in the forecast for climate as a description of something physical.
This is incorrect.
If one does use the ensemble as the forecasts, and the person making the forecast chooses the spread of the ensemble as the spread of their forecast, then the spread of the ensemble is the spread of the forecast. This is a forecasting choice.
The difficulty arises later when these forecasting choices are described as corresponding to estimates of physical quantities that are of interest in and of themselves. For example: a person might be genuinely interested in the decadal variability (or centennial or annual) of trends on earth.
In this case, one could consider the following two quantities:
1) The spread in decadal trends about the mean of the full ensemble in the AOGCM.
2) The average of the spread of decadal trends about the mean for each AOGCM.
But if each model claims to be physically based and constructed in a way that the dispersion of runs in that model is intended to reproduce the dispersion of possible realizations of earth ‘weather’ under a set of specified boundary conditions, then the latter, (2), is the more sane estimate of the actual dispersion of decadal trends about the unknown true mean on a single planet, since — given the way AOGCMs are constructed and implemented — the spread of any quantity of interest “X” (be it a trend or something else) is supposed to be an estimate of the spread one would see on earth given our uncertainty in initial conditions.
Meanwhile the difference in the means over multiple runs from one individual model to another is supposed to reflect our uncertainty due to incomplete understanding of physics (or alternately due to aspects of numerical implementation, including discretization errors).
I think explaining the difference between (1) and (2) would not be required except for the fact that descriptions of model results do sometimes (and I would say often) conflate the two. Moreover, I think many weather forecasting methods do not attempt to estimate the inherent variability in ‘weather’ due to inability to specify initial conditions (and don’t do so because it’s not a quantity of interest; they are only interested in the variability to the extent that it makes forecasting uncertain).
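A toy illustration of the two quantities distinguished above, with invented decadal trends (rows are models, columns are runs of that model): (1) mixes model-to-model disagreement with run-to-run ‘weather’ noise, while (2) isolates the latter.

```python
# Compare spread about the full-ensemble mean with average within-model spread.
import numpy as np

trends = np.array([
    [0.10, 0.14, 0.12],   # model A's runs
    [0.22, 0.25, 0.20],   # model B's runs
    [0.05, 0.09, 0.07],   # model C's runs
])

spread_full_ensemble = trends.std(ddof=1)                  # quantity (1)
spread_within_models = trends.std(axis=1, ddof=1).mean()   # quantity (2)
print(spread_full_ensemble, spread_within_models)
```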
In this paper, ignore the hysterical alarmism about the possibility of high climate sensitivity and focus on the full range of futures that are consistent with the physical processes modeled by AOGCMs when only six parameters related to clouds and precipitation are varied within limits consistent with laboratory experiments. (Observations during the satellite era are clearly incompatible with a CS much greater than 3.) The authors mention 15 more parameters that can be adjusted. The most interesting parameters may be those that control diffusion of heat below the mixed layer, but this paper used a slab ocean to reduce computation.
The obvious truth is that the IPCC’s few models (which use a single value for all parameters) don’t come close to exploring the range of futures compatible with known physics and chemistry. The “optimization” process that led to the choice of a single value for each parameter is clearly suspect and subject to confirmation bias – one can clearly arrive at almost any climate sensitivity.
Uncertainty in predictions of the climate response to rising levels of greenhouse gases
D. A. Stainforth et al. NATURE VOL 433 403 (2005).
Abstract: The range of possibilities for future climate evolution needs to be taken into account when planning climate change mitigation and adaptation strategies. This requires ensembles of multi-decadal simulations to assess both chaotic climate variability and model response uncertainty. Statistical estimates of model response uncertainty, based on observations of recent climate change, admit climate sensitivities—defined as the equilibrium response of global mean temperature to doubling levels of atmospheric carbon dioxide—substantially greater than 5 K. But such strong responses are not used in ranges for future climate change because they have not been seen in general circulation models. Here we present results from the ‘climateprediction.net’ experiment, the first multi-thousand-member grand ensemble of simulations using a general circulation model and thereby explicitly resolving regional details. We find model versions as realistic as other state-of-the-art climate models but with climate sensitivities ranging from less than 2K to more than 11 K. Models with such extreme sensitivities are critical for the study of the full range of possible responses of the climate system to rising greenhouse gas levels, and for assessing the risks associated with specific targets for stabilizing these levels.
As a first step towards a probabilistic climate prediction system we have carried out a grand ensemble (an ensemble of ensembles) exploring uncertainty in a state-of-the-art model. Uncertainty in model response is investigated using a perturbed physics ensemble in which model parameters are set to alternative values considered plausible by experts in the relevant parameterization schemes. Two or three values are taken for each parameter (see Methods); simulations may have several parameters perturbed from their standard model values simultaneously. For each combination of parameter values (referred to here as a ‘model version’) an initial-condition ensemble is used, creating an ensemble of ensembles. Each individual member of this grand ensemble (referred to here as a ‘simulation’) explores the response to changing boundary conditions by including a period with doubled CO2 concentrations.
The analysis presented here uses 2,578 simulations (>100,000 simulated years), chosen to explore combinations of perturbations in six parameters.
Perturbations are made to six parameters, chosen to affect the representation of clouds and precipitation: the threshold of relative humidity for cloud formation, the cloud- to-rain conversion threshold, the cloud-to-rain conversion rate, the ice fall speed, the cloud fraction at saturation and the convection entrainment rate coefficient… As climateprediction.net continues, the experiment is exploring 21 parameters covering a wider range of processes and values.
Can we coherently predict the model’s response to multiple parameter perturbations from a small number of simulations each of which perturbs only a single parameter? The question is important because it bears on the applicability of linear optimization methods in the design and analysis of smaller ensembles. Figure 2c shows that assuming that changes in the climate feedback parameter combine linearly provides some insight, but fails in two important respects. First, combining uncertainties gives large fractional uncertainties for small predicted values of the climate feedback parameter and hence large uncertainties for high sensitivities. This effect becomes more pronounced the greater the number of parameters perturbed. Second, this method systematically underestimates the simulated sensitivity, as shown in Fig. 2c, and consequently artificially reduces the implied likelihood of a high response. Furthermore, more than 20% of the linear predictions are more than two standard errors from the simulated sensitivities. Thus, comprehensive multiple-perturbed-parameter ensembles appear to be necessary for robust probabilistic analyses.
Can either high-end or low-end sensitivities be rejected on the basis of the model-version control climates? Fig. 2b suggests not; it illustrates the relative ability of model versions to simulate observations using a global root-mean-squared error (r.m.s.e.) normalized by the errors in the unperturbed model (see Methods). For all model versions this relative r.m.s.e. is within (or below) the range of values for other state-of-the-art models, such as those used in the second Coupled Model Inter Comparison (CMIP II) project (triangles). The five variables used for this comparison are each standard variables in model evaluation and inter-comparison exercises (see Methods). This lack of an observational constraint, combined with the sensitivity of the results to the way in which parameters are perturbed, means that we cannot provide an objective probability density function for simulated climate sensitivity. Nevertheless, our results demonstrate the wide range of behaviour possible within a GCM and show that high sensitivities cannot yet be neglected as they were in the headline uncertainty ranges of the IPCC Third Assessment Report (for example, the 1.4–5.8 K range for 1990 to 2100 warming). Further, they tell us about the sensitivities of our models, allowing better-informed decisions on resource allocation both for observational studies and for model development.
Lucia,
Hey, there’s a difference between < and )!^_^
Briggs on 21 June 2013 at 6:26 am said: ‘Gary Hladik,
No no no no no no no no no no no and, finally, no. Temperatures are not “normally distributed”. Neither is an ensemble forecast. Neither is anything. It simply makes no sense to say “X is normally distributed.” It is an empty statement.’
Hmmm. Thanks…I think.
OK, let’s see if I have at least a rudimentary intuitive understanding. Here’s a variant of the Emperor of China’s Nose:
The future Emperor has been born. Confucius suspects the child’s REAL father (nudge nudge, wink wink) is the Lord Chamberlain, a man from Cyrano Province, where everybody–including the women–looks a lot like Cyrano de Bergerac. In an attempt to confirm his suspicion, Confucius surveys the public, asking them to guess the length of the future Emperor’s nose at age 1 year, 2, 3, etc up to age 25. Since he’s pretty sure the heir is illegitimate, however, he only surveys Cyrano Province.
Now if I understand WM Briggs, it’s perfectly valid statistics to take the mean of all the guesses as a better estimate than any of the guesses, and to take the spread of the guesses as an estimate of the uncertainty of the guesses. Right?
Continuing, Confucius through bribery obtains actual measurements of the heir’s nose at ages 1 though 5 and finds that his survey mean is running considerably higher than the actual measurements. Nevertheless, the measurements, with their own degree of uncertainty, are still barely within his calculated survey uncertainty, so he still thinks the heir may be illegitimate (“Just wait for the inevitable growth spurt!”). WM Briggs, if I understand him correctly, would advise Confucius to adjust his survey protocol in light of actual results, but nevertheless approves the statistical comparison between survey and measurements. Right?
One last question: Confucius assumes his survey participants guessed on the basis of their own experience with human noses, biased though that experience may be in Cyrano Province. Suppose, however, that the participants used other methods. Several, for example, guessed based on their knowledge of elephants’ trunks. Many threw dice; some two dice, others three or more. Some used Tarot cards, some consulted gypsies in other provinces, and so on. Confucius chuckles over the elephant-like guesses, but as an expert in noses throws out all guesses outside the known limits of nose size in Cyrano Province.
Can we take the mean of these survey results as a better estimate of the heir’s nose than any one guess, and the spread of the guesses as an estimate of the uncertainty?
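With invented numbers (centimetres), the comparison might look something like this; the measurement is chosen so it sits barely within two spreads of the survey mean, as in the story.

```python
# Survey mean and spread versus a bribed-for measurement.
import numpy as np

guesses = np.array([6.5, 7.0, 8.2, 7.5, 6.9, 8.0])  # surveyed guesses at age 5
survey_mean = guesses.mean()
survey_spread = guesses.std(ddof=1)

measurement = 6.2                                    # the actual measurement

# Barely within two spreads of the (biased) survey mean:
print(abs(measurement - survey_mean) <= 2 * survey_spread)
```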
What Nullius in Verba https://www.wmbriggs.com/blog/?p=8394&cpage=1#comment-96070 said.
Until all aspects of all the models, methods, software and application procedures, the latter including the users, have been subjected to approved independent verification and validation, the true source of the numbers produced by the codes remains unknown.
The continuous domain, discrete approximation domain, numerical solution methods used for the discrete approximations domain, the coding domain, and the application domain, must first be determined to be correct for the intended areas of applications and associated system response functions.
The distributions of the calculated numbers are not a measure of the “uncertainty” of the models, methods, software and application procedure relative to physical reality until all of the requirements listed above have been completed.
Using the word “uncertainty” implies that the GCMs correctly represent physical reality. Evidence that they do not continues to appear in the peer-reviewed literature.
What is the theoretical basis for parameter estimation, essentially what is being done when parameters are tweaked, within the framework of chaotic response? Especially when it is known that the size of the discrete increment associated with numerical solutions of systems of equations that exhibit chaotic response can change the response just as much as the parameters that supposedly represent physical phenomena do.
William, you say:
I know you’re a great statistician, and you’re one of my heroes … but with all respect, you’ve left out a couple of important priors in your rant ….
1. You assume that the results of the climate model are better than random chance.
2. You assume that the mean of the climate models is better than the individual models.
3. You assume that the climate models are “physics-based”.
As far as I know, none of these has ever been shown to be true for climate models. If they have, please provide citations.
As a result, taking an average of climate models is much like taking an average of gypsy fortunetellers … and surely you would not argue that the average and standard deviation of their forecasts is meaningful.
This is the reason that you end up saying that ensembles of models are fine, but ensembles of climate models do very poorly … an observation you make, but fail to think through to its logical conclusion. Because IF your claim is right, your claim that we can happily trust ensembles of any kind of models, then why doesn’t that apply to climate models?
All you’ve done here is say “ensembles of models are fine, but not ensembles of climate models” without explaining why ensembles of climate models are crap, when the performance of climate models was the subject and the point of Robert Brown’s post … sorry, not impressed. Let me quote from Robert’s post regarding the IPCC use of “ensembles” of climate models:
Note that he is talking about climate models, not the models used to design the Boeing 787 or the models that predicted the Higgs boson … climate models.
One underlying problem with the climate models for global average temperature, as I’ve shown, is that their output is just a lagged and resized version of the input. They are completely mechanical and stupidly simple in that regard.
And despite wildly differing inputs, climate models all produce very similar outputs … say what? The only possible way they can do that is by NOT being physics based, by NOT intersecting with reality, but by being tuned to give the same answer.
If climate worked in that bozo-simple fashion, all of your points would be right, because then, the models would actually be a representation of reality.
But obviously, climate is not that simple, that’s a child’s convenient illusion. So the clustering of the models is NOT because they are hitting somewhere around the underlying reality.
The clustering of the models occurs because they share common errors, and they are all doing the same thing—giving us a lagged, resized version of the inputs. So yes, they do cluster around a result, and the focal point of their cluster is the delusion that climate is ridiculously simple.
So if you claim that clustering of climate models actually means something, you’re not the statistician you claim to be. Because that is equivalent to saying that if out of an ensemble of ten gypsies, seven of them say you should give them your money to be blessed, that you should act on that because it’s an ensemble, and “get your money blessed” is within the standard deviation of their advice …
There are many reasons why models might cluster around a wrong answer, and you’ve not touched on those reasons in the slightest. Instead, you’ve given us your strongest assurances that we can happily trust model ensembles to give us the straight goods … except climate models.
w.
Willis,
No. You misunderstand because you (and others) are not keeping matters separate.
1. Do ensemble models make statistical sense in theory? Yes. Brown said no and wanted to slap somebody, God knows who, for believing they did and for creating a version of an ensemble forecast. He called such practice “horrendous.” Brown is wrong. What he said was false. As in not right. As in not even close to being right. As in severely, embarrassingly wrong. As in wrong in such a way that not one of his statistical statements could be repaired. As in just plain wrong. In other words, and to be precise, Brown is wrong. He has no idea of which he speaks. The passage you quote from him is wronger than Joe Biden’s hair plugs. It is wronger than Napoleon marching on Moscow. It is wronger than televised wrestling.
2. Are the ensemble climate models good? As I said originally, not for long-range predictions, but yes for very short-range ones. If Brown wants to claim long-range models are poor, even useless, then I am his brother. But if he wants to say that they do not make statistical sense, then I am his enemy. Being “good” and making “statistical sense” are different and no power in Heaven or on Earth can make them the same.
3. A model does not have to explain the physics to be good. Stop and re-read that before continuing.
A psychic could sit in her Upper East Side apartment and make climate prognostications. We could check these—applying proper scores and all that—and if her forecasts show consistent skill over many years, then we would have to admit she’s on to something. Even if she can’t describe vorticity or laminar flow. Even if her knowledge of cloud parameterization is non-existent. Even if she didn’t know how to add two numbers together.
But if her predictions were consistently awful, then we would rightly judge her knowledge of physics is wanting.
So too with climate models. Just why they are failing I’ll leave for you to say. Whether somebody “tuned” them to be wrong is not interesting to me. Because I don’t go on and on and on (and on (and on)) about why the models fail does not make me a supporter of these models. It merely means I don’t care. For me, it is enough that they are poor.
None of those three things you say I assume do I assume.
And, incidentally, “better than random chance” has no meaning.
The ensemble makes sense when one is trying to estimate uncertainty. However, to turn this around and imply that the ensemble increases certainty is nonsense. 10 wrong answers cannot improve on the 1 right answer out of 11. The problem is that climate science has no idea if they have 1 right answer, so they try and say all answers are right, and the average is even more right.
Briggs notes:
John von Neumann observed, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk”. Mayer et al. (2008), in “Drawing an elephant with four complex parameters”, demonstrate how to do exactly that and make the trunk wiggle!
The GCMs embed many more parameters, but with limited historic data to quantify them. The ensemble mean exhibits the beliefs of climate modelers based on fitting that brief historic record, with no validation. That the predictions of ALL 73 of the models are hotter than the realities of global temperatures since 1979 demonstrates severe collective bias. The IPCC’s statements of 90% certainty are prejudices, not facts based on validated models.
Robert Brown’s arguments suggest major limitations on the abilities of those unvalidated climate models with limited data to ever accurately predict future climate. Hindcast/forecast validated empirical models may actually do better.
The difficulty with that ensemble mean is that politicians are being pressured to redirect trillions of dollars of coerced taxpayer funds based on those unvalidated beliefs.
How do we communicate that the climate models are not fit for that purpose?
PS On Lucia’s #2, S. Fred Singer notes that the slopes of individual model runs can vary by an order of magnitude. He indicates about 400 model run years are needed to quantify the mean trend for each model.
William, Willis and others:
In my first comment above, I cited a passage from AR4 WG1, Section 10.1 explaining why it was problematic to perform statistical analyses on the IPCC’s “ensemble of opportunity”. Naturally, AR4 proceeded to ignore this problem in later sections. If one expects the IPCC’s ensemble to represent the range of future climates that are compatible with known chemistry and physics (and the chaotic behavior they can produce), I’d call this “wronger than televised wrestling”. The parameters used to construct these models have uncertainties and the IPCC makes no effort to explore this “parameter-space”.
Willis: One could survey a few dozen financial experts about where the Dow Jones will be at the close of 2014. There would be nothing scientific about their guesses, but the mean and standard deviation of their guesses would be statistically meaningful.
William: Let’s suppose these experts have scientific models that will predict the future earnings of the thirty companies that make up the Dow. Let’s assume these models are a lot like AOGCMs and contain a lot of reliable theory and produce chaotic behavior. Today’s Dow is analogous to today’s climate. Using multiple runs with slightly different starting points (to address the chaotic behavior), these experts then proceed to calculate the Dow for the end of 2014 by multiplying the projected earnings of these 30 companies by P/E ratios. To do this correctly, these experts would have to take into account the uncertainty in what the P/E ratio for each company will be at the end of 2014. This uncertainty could be addressed by randomly sampling historical P/E data, but our financial experts – imitating the IPCC modelers – don’t do it this way. Each expert’s model picks a single “optimum” P/E ratio (within the historical range) for each company, so that their predicted earnings for the past several years times those 30 P/E ratios provides a predicted Dow that closely matches the actual Dow over the past few years. These P/E ratios are analogous to the parameters used by each model; laboratory experiments (analogous to historical P/E ratios) have determined a likely range for each parameter. And by systematically ignoring the true uncertainty in P/E ratios, our panel of financial experts with their models are going to underestimate the full range of uncertainty for the Dow at the close of 2014. Even worse, if the modeler is a bull or a bear, he can pick P/E ratios (parameters) that predict a higher or lower Dow at the close of 2014. And the chaotic behavior of the Dow (and climate) hides the systematic errors.
William, thank you for your answer. You say:
Now that is what I can only term a brilliantly researched, unshakably logical, and well cited and referenced rebuttal of the nuanced statistical and theoretical issues that Robert Brown and I have raised …
You sure you want to make that your final answer?
…
If we could move on from your somewhat unsettling example of “argument by assertion”, I would say that one fundamental problem in this discussion is that Robert and I are not talking about computer models in general, as you seem to think. We are talking about climate models. Not the models that Boeing uses to model its airplanes. Not the physicists’ models of the Higgs boson. Not theoretical imaginary models of any kind. The subject under discussion is climate models, those mystical creatures used by the IPCC.
There’s an oddity about the climate models and their predictions of global temperature change. The model results, either individually or as a group, can be emulated with 99% fidelity by a single one-line equation with two parameters—climate sensitivity and a time constant. The only difference between the results of the climate models is that the two parameters have somewhat different values.
As a result, it is not surprising that the climate models give similar results— why wouldn’t they, when they are all running the same dang equation?
And given that they are not separate views of reality but just variations on a theme, and given that functionally they are all just spitting out the results of the same bozo-simple equation with slightly different parameters, and given that they are in no sense independent, then no, William, I’m sorry, but their mean and standard deviation is not a valid guide to anything.
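For concreteness, a sketch of this kind of two-parameter “lagged and resized” emulator; the functional form and numbers here are illustrative stand-ins, not the actual emulation.

```python
# Toy two-parameter emulator: temperature relaxes toward sensitivity*forcing
# with time constant tau (years), i.e., a lagged, resized version of the input.
import numpy as np

def emulate(forcing, sensitivity, tau):
    temp = np.zeros_like(forcing)
    for t in range(1, len(forcing)):
        temp[t] = temp[t - 1] + (sensitivity * forcing[t] - temp[t - 1]) / tau
    return temp

years = np.arange(1900, 2001)
forcing = 0.03 * (years - 1900)   # made-up, slowly ramping forcing (W/m^2)
print(emulate(forcing, sensitivity=0.8, tau=4.0)[-5:])
```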
THAT is the situation Robert is talking about, not some ensemble of imagined physics-based or other kind of computer models as you seem to think. What is under discussion are GCMs, massive iterative global climate models.
You say we are not “keeping matters separate”, meaning keeping “all models” separate from “climate models”. But all Robert and I are talking about are climate models.
So let me clarify the misunderstanding. You have taken Robert’s comments on climate models, thought that he was referring to “all models”, and shown that his comments are incorrect for all models … which is good, true, and valuable, but wasn’t what Robert said. He (and I) are talking about the GCMs.
Oh … you mean short-range like for decadal predictions? Or perhaps short-range means predictions for next year? If you have evidence that the GCMs are good for either one, then please produce it. I’ve never seen that evidence.
However, if by “very short-range” you mean a few days or a week, then that’s a weather model … anyhow, if you’re going to claim it, then bring on some evidence for the short-range validity of the models, please.
Perhaps we have a different idea of what “makes statistical sense”. I have shown that functionally, the global temperature results of the various climate models are all reproducible, individually or as a group average, by changing two parameters in a one line equation. To me, taking the average of two dozen variations of a couple of parameters in a one-line equation doesn’t make statistical sense. Why not?
Because they make no pretense of exploring the physical climate space. They are in no sense independent. They are all just minor variations on the same equation, so of course they give similar answers.
And by the same token, the fact that functionally the global temperature outputs of the various climate models are all just minor parameter variations on a one-line equation also means that measuring the spread of their results is not meaningful. It reveals nothing but the statistical deviation (or perhaps the scientific deviation) of the understandings, prejudices, and errors of the programmers … and that’s not of any use.
So I’d say, as Robert Brown says, that measuring the mean and the standard deviation of climate model results makes no sense at all.
I’ve read it three times, and I still find it baffling. I’m sure you have a point, but I don’t see it. Your brevity is working against you.
It’s not clear, for example, what you mean by “a model does not have to explain the physics …”. I was unaware that a model was supposed to be a physics teacher, so your meaning is mysterious. Typically, the job of a model is not to explain the physics, but to embody the physics. That’s why the modelers say (incorrectly) that their models are “physics-based”. (To be fair, I suppose they are physics-based, in the same sense that a Hollywood movie is “based on a true story” … but I digress.)
It’s also not clear what you mean when you say a model is “good”. Does “good” to you mean accurate? Or does good mean precise? Or understandable? Robust? Mathematically demonstrable? Validated and verified? Well documented? Parameter-free? Fault tolerant? We have no details on what you might think a “good” model might be.
I have followed your instructions, and I just re-read your statement once again. It still is far too vague to be understandable. Why should a model explain the physics, and what makes a model “good” on your planet?
Since the climate models have in general not been tested with any rigor, and when tested have failed either spectacularly or simply mundanely, I fail to see the relevance of an example of someone who shows “consistent skill” at forecasting. What is an example of “consistent skill” even doing in a discussion of climate models, which to date have shown no skill at all? I don’t get your point.
Finally, your argument would have more weight if you could give a real-world example, from some field other than climate science, of the use of the standard deviation of an “ensemble” of models as a measure of uncertainty of the mean of the model results. I don’t know of any other scientific field where they take two dozen different models of a chaotic process, average them to get the answer, and use the standard deviation as the uncertainty in the answer, but then I was born yesterday, and I could have missed it.
I know the IPCC does that, but who else does?
In your opening paragraph you were … mmm, let me say “rather emphatic” that use of the mean and standard deviation of model “ensembles” is an eminently reasonable and statistically defensible thing to do. GIven your passion and your certainty, I hope you are basing your rock-solid conviction on actual experience, from your knowledge of some real-world situation outside of climate science, where such a process of model averaging actually panned out.
You know, like say an example where some people took two dozen models of the stock market, and averaged them to get the forecast, and used the standard deviation of the models as the uncertainty of the forecast, and by gosh, it worked? That kind of example.
So I’ll stay tuned for your example of some other scientific field where they average two dozen functionally identical computer models to get their answer, and how they defend the practice, and how they explain the statistics …
In the meantime, I’ll continue to believe that in general, unvalidated and unverified computer model results are evidence of only one thing—the understandings and misunderstandings of the programmers.
My best to you, and thank you for your fascinating blog; I’m a regular reader.
w.
PS—Are there kinds of computer models where an average of their results might be appropriate? Absolutely.
One excellent example is a “Monte Carlo” analysis. Each realization of a Monte Carlo analysis can be considered as a different computer model of the situation. But note a few things you need to have for a successful Monte Carlo analysis. You need to ensure that:
1. N is large, typically in the hundreds of thousands if accuracy is an issue.
2. The parameters are randomly varied, that is to say the realizations are independent.
3. The model is able to accurately reproduce the important statistical characteristics of the situation being analyzed (e.g. mean, standard deviation, first differences, whatever matters to the situation).
4. The model fully explores the parameter space.
In particular, the choice of the model for the Monte Carlo analysis is a crucial and often misunderstood part of the analysis. If the model doesn’t accurately reflect reality, the Monte Carlo analysis is useless.
So yes, if those conditions are satisfied, then an average of those model results makes perfect sense.
Now note the opposite situation that exists with climate models:
1. N is very small, a couple dozen or so at best, half a dozen at worst.
2. The parameters are not independent and random, but are all tuned to reproduce the historical temperature record.
3. The models are not able to reproduce such things as the mean, standard deviation, and first differences of observed temperature variations on daily, monthly, or annual scales. They are very poor at reproducing anything but the temperatures to which they were tuned, with rainfall as a prime example of their inabilities.
4. They don’t even pretend to explore the climate parameter space, unless the entire climate parameter space is encompassed by climate sensitivity and a time constant …
These, inter alia, are some of the reasons that using the mean and the spread of climate models as a gauge of the underlying reality is a very bad idea.
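For readers who want the contrast spelled out, here is a minimal sketch of a Monte Carlo analysis that satisfies the conditions listed above; the toy model, parameter ranges, and forcing number below are invented purely for illustration and stand in for no real climate code.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(sensitivity, time_constant, forcing=3.7):
    # A made-up response function standing in for "the model"; nothing physical is claimed.
    return sensitivity * forcing * (1.0 - np.exp(-30.0 / time_constant))

N = 200_000                               # condition 1: N is large
sens = rng.uniform(0.2, 1.2, size=N)      # condition 2: parameters drawn independently
tau = rng.uniform(5.0, 50.0, size=N)      #   and at random, over their full assumed
                                          #   ranges (condition 4)
out = toy_model(sens, tau)

# Condition 3 (checking that the model reproduces the relevant statistics of the
# real situation) has to be done separately, against data; only then do these
# summary numbers mean much.
print(out.mean(), out.std())
```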
Willis, brother, you’ve been lazy, as you admitted (“I don’t know of any other scientific field…”). There is an enormous literature on ensemble forecasting which you (and others) have neglected. Best read up on it. It’s in both atmospheric and statistics journals. I can’t summarize it all in 800 words.
The CPC’s climate forecasts have skill out a few months, or seasons, as is well documented. These are the short-term forecasts which I spoke of, and which are (I thought) so well known that I didn’t think it necessary to spell them out. They are a mixture of physical and statistical procedures.
Every time a forecaster looks at two models, and uses in any way, information from both, he has made an ensemble forecast.
I can only conclude that people so badly want Brown to be right, and the climate models to be wrong, that they are unable to separate, or unwilling to consider or read about, the philosophical/theoretical points I have been making again and again. That you are “baffled” by the psychic example proves this. You spent most of your time explaining, again, why long-range (years, decades) models are bad, ignoring the part where I said “I don’t care.”
For the last time, whether any climate ensemble forecast is any good is irrelevant to the question whether ensemble forecasts make statistical sense. Irrelevant. Just as how a forecast (of any kind) was generated is irrelevant to how we judge the usefulness/goodness of that forecast. Why a model succeeded or failed is an entirely different, logically distinct question.
So allow me to amend my final answer (as you suggested). Brown, in saying that ensemble forecasts don’t make statistical sense, and in his argument for why they do not, is as wrong as the fervent, ill-educated activists who childishly use the word “denier.”
There is something I don’t understand. I don’t think Brown and Eschenbach are saying that “ensemble forecasts don’t make statistical sense”, but that some ensembles of models don’t make statistical sense: models with certain characteristics (independence was the one they highlighted), and the particular use of the ensemble they described. But I don’t see Briggs addressing these points. Maybe he did, and I didn’t realise.
There was another point made by Brown. I am curious about it. Why don’t they discard the worst models, and advance with the less horrible ones? Does not being able to select the best models have a statistical meaning?
One last question, related to the last one. Imagine the situation in 2050 is like this graphic.
http://plazamoyua.files.wordpress.com/2013/06/para-doom-1.png
Would they be able to say the temperature in 2050 (and all the time in between) is “consistent with” the models’ mean because it is (just) inside the spread of the models?
The problem is methodological, not theoretical. Each model contains somewhat different physics. The basic physical laws that the models try to capture are the same, but the way the models try to incorporate those laws differs. This is because the climate system is hugely more complicated than the largest computers can capture, so some phenomena have to be parameterized. These result in the different computed climatic reactions to the “forcing” of the climate by humanity and the natural drivers that are incorporated in the forecasts.
When you compare individual model’s predictions to the historical record, it becomes clear that some models do quite a bit better than the bulk of the ensemble. In fact, several models cannot be distinguished from reality by the usual statistical criteria [sic].
You know that what is done in science is to throw out the models that don’t fit the observational data and keep FOR NOW the ones that may be consistent with the data. In this way you can learn why it is that some models do better than others and make progress in understanding the dynamics of the system. This is research.
But this is not what is done. All the models are lumped into the statistical ensemble, as if each model were trying to “measure” the same thing, and all variations are a kind of noise, not different physics. The climate sensitivity and its range of uncertainty contained in the reports are obtained in this way. This enables all of the models to keep their status. But ultimately as the trend continues, it becomes obvious that the ensemble and its envelope are emerging out of the observational “signal.” This is where we are now.
Ah yes William, global warming proponents and skeptics alike have been very busy corrupting (misapplying?) statistical tools, unsupervised, on an industrial scale for years.
There’s really only one reason for using model ensembles – none of the individual models are trustworthy and you hope they all fail at different times and/or in different ways. So you’re kind of trading spectacularly wrong once in a while for a little wrong all the time.
It looks like wishes are fulfilled and it’s taken 22 years for a constant small error to make the actual GAT slowly but surely drift outside the 95% confidence bound of the ensemble prediction.
Significantly it drifted outside the lower bound of modeled temperature. I think all but the brain-dead and truly fanatic amongst the usual suspects are acknowledging something is happening that isn’t being modeled well in any ensemble members or something isn’t being measured well. My guess would be a combination of both.
There are at least a few good candidates for source of error and it could be some of each as they generally aren’t mutually exclusive.
A few more years of observation should help a lot.
The irony is that even if the nattering nabobs of negativity are right and fertilization of the atmosphere with CO2 is a very bad idea there is no practical political way to slow CO2 emission enough to make any difference. This should be evident by the fact that it has remained business-as-usual despite 25 years of global warming hyperbole. The answer to this problem, if it is indeed a problem, is on the shoulders of the science and engineering to come up with not better climate models but rather renewable energy that’s less expensive than fossil.
I believe synthetic biology is the go-to science and is the next transformative technology but that’s a different rant.
Ensemble forecasting is an application of the well-studied mixture model (http://en.wikipedia.org/wiki/Mixture_model) in both frequentist and Bayesian analyses. The same observed variables can be used in the different models that comprise a mixture model (an ensemble), which makes the estimation of the standard error of the mean ensemble forecast computationally complicated. Furthermore, the weights associated with each of the models can be estimated depending on the goodness of each model. The estimation would not require the “independence” of the models. It’s not news that practitioners can’t apply statistics correctly due to their misunderstandings, so why would Brown say that ensemble forecasting doesn’t make statistical sense? Might I suggest that he take a walk to the department of statistical science at Duke? It’s simply ineffective to discuss statistical theories on a blog.
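To make the weighting idea concrete, here is a minimal sketch of one simple scheme (inverse-mean-squared-error weights on invented numbers); a proper Bayesian model average would estimate the weights differently, and nothing here is the IPCC’s procedure.

```python
import numpy as np

# Hypothetical past forecasts from three models, and the matching observations.
past = np.array([[14.1, 15.0, 16.2],    # model A
                 [13.0, 13.8, 14.9],    # model B
                 [17.5, 18.1, 19.0]])   # model C
obs = np.array([14.0, 15.1, 16.0])

# Weight each model by its inverse mean squared error on the past data
# (one simple choice among many).
mse = ((past - obs) ** 2).mean(axis=1)
weights = (1.0 / mse) / (1.0 / mse).sum()

new_forecasts = np.array([16.5, 15.2, 19.4])   # the models' next forecasts (invented)
ensemble = np.dot(weights, new_forecasts)      # weighted ensemble forecast
print(weights, ensemble)
```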
William, I appreciate your prompt response. You say:
I have not admitted to laziness, but to an inability to locate actual examples of “ensemble forecasting” being used in the real world outside of climate science. For example, does Boeing use an ensemble of CFD models, all of which give greatly varying results, and then average their results to design planes? Because the IPCC does that with their models of a chaotic turbulent climate, but I’ve never heard of Boeing doing that with their models of an equally chaotic turbulent airflow … but like I said, I was born yesterday, so I might have missed it … that’s why I asked you for an example.
As a result, your allegation that I am intellectually lazy is both unpleasant and untrue, which is a bad start. You can truthfully accuse me of many things, I have the usual quota of faults, but if you look at the number, frequency, and variety of my scientific posts, it’s clear intellectual laziness is not among them …
I did not ask for a review of the literature on ensemble forecasting. I did not ask for an 800 word summary of ensemble forecasting. That, coupled with your misleading quotation of only the first few words of my request, is just handwaving to distract the rubes from the fact that you’re not answering my question.
What I actually requested was as follows:
I quite specifically asked for an actual example from outside the field of climate science, not an attack on my level of industriousness, not a survey of the field, not a summary. A real-world example. To make it crystal clear, I even gave a description of the type of example I was looking for, saying:
Instead of providing even one such example, you wave your hands, and accuse me of laziness. In the past, that has generally been a strong indication that the person making such accusations doesn’t have any more examples than I do. And that’s none.
However, the past is only a poor guide to the present, so perhaps you do have such an example. If so, now is the time to present it.
In addition to the example just discussed, I had also asked you for a citation to your claim that GCM ensembles gave good short-range forecasts, saying:
Again, like your first paragraph, your response to my request is totally unresponsive. You do understand what a “citation” and an “example” are, don’t you? Waving your hands at the Climate Prediction Center and claiming that their GCM forecasts “have skill out a few months, or seasons” and claiming that this is “well documented” is just a roundabout way of saying that either you don’t have any citations, or you’re unwilling to give citations. Neither one looks good.
If their successes are “well documented” as you claim, then surely you can point to the documents that show the ensembles have skill at predicting what will happen “a few seasons” out. I have never seen any such evidence. In fact, from their website I see little evidence that the CPC uses ensembles of climate models other than experimentally.
I do see examples at CPC of them using an average of repeated runs of a single weather model with different initialization points, but that’s not what we’re discussing here. We’re talking about groups of different individual global climate models, not repeated runs of a single weather model.
But I find very little about your claim that they are using an ensemble of GCMs which is giving good results with a time horizon of “a few seasons” as you claim. The only work I can find that they are doing with GCMs is an “experimental multi-model seasonal forecasting system consisting of coupled models from US modeling centers” here … and they have given a number of links to pretty pictures of their predictions using the ensembles, but they somehow didn’t give any obvious link to the report of their tests showing their skill at forecasting out “a few seasons”.
I’ve looked, and while I’ve found nice pictures of their results, I can’t find any analysis or testing of their results with the GCM ensemble. So could you provide a citation to what you say are their “well-documented” tests showing their experimental multi-model multi-seasonal results “have skill” as you claim? I couldn’t find them, but perhaps that’s just my intellectual laziness …
And please, no more handwaving. If you don’t have a citation, just say so.
Again, other than being a trivial example, it is unresponsive to the discussion at hand. Let me say once again what I asked for an example of, so you can see how badly your example fails. I asked for an example of:
In response, you tell me that whenever a scientist looks at two models of any type and uses information from them in any way, that’s an example of what we’re talking about …
Is my writing really that unclear? Is the concept of “using the standard deviation of a climate model ensemble forecast as a measure of the underlying uncertainty” that unclear? We’re not discussing some forecaster looking at the pressure predictions from one model and the temperature predictions from another model and making his own prediction. That may be “ensemble forecasting” to you, but that’s assuredly not the ensemble forecasting we’re talking about. We are discussing the use of the statistical properties of the results of dozens of GCMs which have never been shown to have any significant skill.
Oh. OK. I see. If someone is baffled by something you write, you don’t explain it. You just abuse them for being baffled, saying that their inability to understand what you’ve written just proves your point …
So as an admittedly amateur and allegedly intellectually lazy statistician, am I correct in assuming that the possibility that your writing is, well, inherently baffling at times is not a Bayesian prior on your planet?
I do love the statistical underpinnings of your argument itself, however. You’ve just claimed that the fact that people are unable to follow your logic is evidence that your logic is correct … coming from a statistician, that’s absolutely priceless.
Or maybe you just averaged your answer out with two dozen other ideas and you’re giving me an ensemble response, I don’t know … but what you never did is to explain what your example, of a psychic able to experimentally demonstrate that she could correctly forecast the future climate, has to do with climate models that haven’t demonstrated any ability to forecast the future climate. What on earth did that example have to do with climate models? It may be obvious to you, but on this side of the screen it doesn’t make sense.
William, I don’t have a clue what you mean by whether a climate model forecast is “any good”. Good for what? Good by what measure? I pointed this out above, and you didn’t respond.
Nor do I understand what you mean by “makes statistical sense”. Does it “make statistical sense” to average a dozen gypsy forecasts? According to your definition, absolutely, because you say whether their forecasts are correct is immaterial to whether it makes statistical sense to average their forecasts, and take their standard deviation and claim that the standard deviation is an unbiased estimator of the real-world uncertainty of their forecast … which is what the modelers are doing, and which is the subject under discussion.
According to my definition of “statistical sense”, on the other hand, averages and standard deviations of gypsy forecasts provide no valuable information at all. Why? Because, as Robert Brown pointed out about the climate models,
Here is a very important question, one which I would like very much for you to answer.
If nineteen out of twenty gypsies tell you to give them your money to be blessed or you’ll be doomed for life, is that finding statistically significant at a p-value of 0.05 regarding whether you will be doomed for life if you don’t comply?
Note that the question is not whether the finding has significance about what individual gypsies might forecast.
The question is whether it has significance about the real world, about what you should do. Because that is the exact claim that the modelers are making, that the statistics of the model results apply to the real world, that there is an X% chance that the actual result will fall within the statistical spread of the model.
Seriously, William. You need to answer that question, in public, for all of us. Does it make statistical sense to say that statistically significant agreement among gypsies means something about the real world?
Or does it maybe just mean that they all are operating under the same plan, to steal your money? That is to say, their opinions are “not independent”, and as a result it doesn’t make statistical sense to apply statistics to their forecasts in the first place?
Your claim, applied to the gypsies, is as follows:
I say that’s nonsense. I say that applying any statistical procedure to an ensemble of gypsy forecasts is an abuse of statistics, but heck, I’ve been wrong before. Obviously, that means that “makes statistical sense” must mean something different to you than to me.
So if you could explain just how and where I’m wrong about the gypsies, and why ensembles of gypsy forecasts make perfect statistical sense, and what kind of statistical sense they actually make, I’m all ears. I love to learn from guys that know their field, and you’re one of them in my book.
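For what it is worth, the arithmetic behind the nineteen-out-of-twenty question is easy to check. The sketch below assumes a chance model in which each (hypothetical) gypsy answers independently by a fair coin flip, and it illustrates only that the agreement can be wildly “significant” under that model without saying anything about whether you will in fact be doomed.

```python
from math import comb

n, k = 20, 19
# One-sided p-value for seeing 19 or more "give me your money" answers
# if each gypsy answered independently by a fair coin flip (the assumed null).
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(p_value)   # about 2e-05, far below 0.05
```

The tiny p-value speaks to the forecasters’ agreement under that null, not to the real-world outcome being forecast.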
William, that’s the same foolish thing you did in your previous post, argument by assertion, and it doesn’t strengthen your claims. I understand that you are firmly convinced you are right, but simply asserting it over and over in increasingly emphatic tones is as meaningless as averaging climate models … not to mention that it makes you look shrill, frantic, and lacking in arguments with substance. I don’t think any of those are true, I think you’re a gifted statistician, but when you do that kind of thing where you say something like “you’re wrong, you’re absolutely and inutterably wrongity-wrong, I say again you’re abysmally wrong and I really truly mean it this time” thing, well …
My best regards to you, and thank you again for your prompt responses … I can only wish that they were actually, well, more responsive to my questions and requests for citations and examples.
w.
JH,
“It’s simply ineffective to discuss statistical theories on a blog.”
Sister, that might be the wisest thing you’ve ever said.
Suppose you were to give a mathematical question in a final exam to your students. Would the best answer be the average of the answers turned in by the students? Wouldn’t just taking the answer turned in by an “A” student be better than that?
Most climate models, like “D” students, are consistently wrong. Averaging over the answers turned in by “D” students is not rational.
@Willis Eschenbach
In Artificial Intelligence, there is a concept called Ensemble Learning. A couple of hypotheses’ predictions are combined in the hope that the combined prediction is better.
There is also Bayesian Learning, where hypotheses are weighted according to some probability. This appears to be exactly the way that the climate models are combined into an ensemble.
“I can only conclude that people so badly want Brown to be right, and the climate models to be wrong, that they are unable to separate, or unwilling to consider or read about, the philosophical/theoretical points I have been making again and again.”
I think the problem may be that people can’t believe you’d write what looks like an extensive criticism to make what would appear, from the layman’s perspective, such a minor point.
If I’m reading you right, you’re just arguing about whether the term “meaningless” is the same as “wrong”. If I take 10 models for stock market moves, wheat crop projections, record sales for the latest Justin Bieber album, etc. add them all together to make a forecast of weather, and use the day of the week on which the model was last run as the uncertainty, this is all statistically ‘meaningful’, in the sense that we know what it means. We just don’t have any rational reason to think it’s right.
That’s all fine and dandy from a rigorous philosophical/mathematical point of view, but what Robert Brown is talking about is whether the mathematical procedure makes any sense in the context of making a useful (or at least vaguely justifiable) prediction with a verified (and ideally validated) uncertainty. It’s a different question. A different sense of “meaningful”.
Possibly not rigorously expressed, but what do you expect in blog comments?
Willis,
Model averaging is a common technique when employing Bayes Networks. Often the averaging is done when the causal direction between two nodes can’t be positively established. Why exactly do you think the models must be independent? Anyway, certainly you’ve heard of Google. From not-the-best source there are these:
http://en.wikipedia.org/wiki/Ensemble_forecasting
http://en.wikipedia.org/wiki/Ensemble_learning
and they have links to other places. Time to widen your experience.
DAV,
I don’t think your Wikipedia links will answer Willis’s question.
The first link only gives weather forecasting as an application, which Willis specifically excluded, and says “It is common for the ensemble spread to be too small to incorporate the solution which verifies, which can lead to a misdiagnosis of model uncertainty” which is exactly the point Willis is complaining about. The ensemble spread is not a valid way to calculate the uncertainty.
The second link is to a fancy curve-fitting algorithm, in which models are used as basis functions for a hypothesis space, to which a training data set is fitted. Again, it says that over-fitting can be a problem, and it makes no claim that the spread of the ensemble is a valid way to calculate the uncertainty in the result.
The only way to determine the uncertainty of a model is by out-of-sample verification. You use the model to predict observations that are independent of the training set, and you compare the predictions to the observations and see how accurate they are. Model uncertainty can only be calculated from verification data. It cannot be calculated simply from the collection of model outputs, with no reference to reality.
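A minimal sketch of the distinction, with invented numbers: the spread of an ensemble and the uncertainty estimated from out-of-sample verification are two different quantities, and nothing forces them to agree.

```python
import numpy as np

# Hypothetical out-of-sample forecasts from four models, and the observations
# that later came in (all numbers invented for illustration).
forecasts = np.array([[0.30, 0.32, 0.35, 0.38],
                      [0.28, 0.31, 0.33, 0.36],
                      [0.25, 0.27, 0.30, 0.33],
                      [0.33, 0.36, 0.40, 0.44]])
observed = np.array([0.22, 0.20, 0.24, 0.21])

ensemble_mean = forecasts.mean(axis=0)

# Ensemble spread: how much the models disagree with each other.
spread = forecasts.std(axis=0).mean()

# Verification-based uncertainty: how far the ensemble mean was from reality.
rmse = np.sqrt(((ensemble_mean - observed) ** 2).mean())

print(spread, rmse)   # nothing forces these two numbers to agree
```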
People seem to be talking past one another. It’s not that Willis doesn’t know about ensemble forecasting – what he’s saying is that ensemble spread isn’t a valid way to determine the model uncertainty. And I don’t think it’s that William thinks the IPCC ensemble is a correctly validated forecast – he’s making a rather abstruse philosophical/theoretical point about the semantics, that has got everyone confused.
Is the IPCC method of taking ensemble spread as the forecast verification uncertainty right or not? Most people think the IPCC method is wrong, but Briggs’s reputation is such that now people are not quite so sure.
The IPCC method is not good enough because it isn’t physics. No physicist in his right mind is going to make an ensemble of Newtonian and Relativistic Mechanics with the idea that the outcome of that ensemble is better than the outcome of either Newtonian or Relativistic Mechanics.
Dav, do the Bayesian Networks then take a standard deviation from that average and try to justify the probability of their answer being right?
We all know why real climate science (TM) averages their models – because they all have to earn a crust. Worse, the ones that track the actual temperature are the least alarming and we can’t have that.
But the process of getting a 95% confidence from that average, and then using that to tell us what the confidence is in the average is just crazy.
If I make 100 predictions, each with a stated individual 95% confidence of +/- 20, the 95% confidence of the ensemble is not found by excluding the 5 predictions furthest from the mean. Yet that is how climate psyience works.
Mooloo & Nullius
As far as I know, there is no technique (other than observation) which will tell you how well a model will perform. If there was, why would anyone waste time testing?
IF your data on-hand is representative then a leave-one-out approach will give you a reasonable idea though. The catch is in “representative”.
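As a rough sketch of the leave-one-out idea (an ordinary least-squares fit on invented data; the caveat about “representative” applies here just as much):

```python
import numpy as np

# Invented (x, y) data standing in for whatever data are on hand.
x = np.arange(10.0)
y = 2.0 * x + 1.0 + np.random.default_rng(1).normal(0.0, 1.0, size=10)

errors = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i           # leave observation i out
    coeffs = np.polyfit(x[keep], y[keep], 1)
    pred = np.polyval(coeffs, x[i])         # predict the held-out point
    errors.append(pred - y[i])

loo_rmse = np.sqrt(np.mean(np.square(errors)))
print(loo_rmse)   # a rough out-of-sample error estimate, if the data are representative
```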
I think there’s confusion between the spread of the prediction and its supposed performance. Some (or maybe all) of the modelers (and others) may share this confusion. “confidence interval” is a badly used (if not bad per se) term. It usually (if not always) is about the setting of model parameters and not the model’s performance. (See the never-ending discussions here about p-value). Many people confuse parameter settings with predictive performance. It isn’t surprising that climate modeling has its share.
The GCM predictions have overstated what has occurred in the last 10+ years, which is where everyone should concentrate. How they arrive at their predictions is only of interest if the intention is fixing them; otherwise it’s totally irrelevant.
“I think there’s confusion between the spread of the prediction and its supposed performance.”
Yes. Exactly.
I’m no expert at modelling, but the way I see it is that you start with a training set of data to generate a model, you use it to make predictions about the probability of observations outside the training set (and far enough outside for any correlations not to carry over) to determine how accurate the model is (model verification) and that gives you some bound on your uncertainties. The more data you use to check, the more tightly the bounds approach the actual model uncertainty. And then you make your actual prediction, report the uncertainty bounds you’ve got, and compare those bounds against your accuracy requirement (validation) to confirm whether the prediction is usable.
If you want to build a model from an ensemble of other models, you can. But it’s still got to go through verification and validation, and the uncertainty is still determined from the comparison of past predictions with reality, *not* simply from the ensemble spread of the models you happened to chuck into the pool.
And this is what the IPCC does. It says “Oh, here are a bunch of models that some people use, and if you stick them on a spaghetti graph you can see the ensemble average and spread does something scary, so you’d better do something about it…” The models are not verified or validated. The ensemble spread is not the model uncertainty. And it’s the silliness of saying it is that Robert Brown is complaining about.
“The GCM predictions have overstated what has occurred in the last 10+ years which is where everyone should concentrate.”
Quite so. Since these GCMs were published around 10 years ago, the last 10 years is their out-of-sample verification trial. And they’ve done pretty badly. So you either have to declare the models falsified if they’re supposed to be more accurate than that, or you have to expand the uncertainty bands wide enough to encompass observation if they’re not.
Brown is saying that we probably ought to chuck out most of those ensemble members as falsified, stick broader bands on the rest, and try again. Given all the arguments about what the pause means, we can’t consider the models verified, let alone validated, even at this quite slack level, and we therefore shouldn’t be using them for making policy decisions.
Which I think (hope!) you and Briggs would agree with, but by saying Brown is hopelessly wrong about everything he said, Briggs has cast everyone into confusion. Is he saying you *can* determine the uncertainty from the ensemble spread?!
I think Briggs is actually trying to make a different point about the semantics: that you can propose unverified models with *hypothesised* uncertainty bounds that can then be tested, and that, from his quick glance at the chart, he thinks they probably fail the test. It’s a different interpretation of what the models are presumed to mean.
I can see what he’s saying, but I think that when such bounds are presented in a document for decisionmakers, it’s only reasonable to interpret them as verified prediction uncertainties, which they’re not.
“And this is what the IPCC does. It says “Oh, here are a bunch of models that some people use, and if you stick them on a spaghetti graph you can see the ensemble average and spread does something scary, so you’d better do something about it…” The models are not verified or validated. The ensemble spread is not the model uncertainty. And it’s the silliness of saying it is that Robert Brown is complaining about.”
And the silliness of your complaints is that you give no links or references to where the IPCC actually says this.
Nullius,
rgb said:
Sounds an awful lot like he is saying an ensemble average prediction makes no statistical sense. This is not correct.
Also, what is this business about differing by truly random variations? It sounds like he’s saying that’s required. If so, that’s not correct either. The models I average are nearly identical and the variations are hardly random.
If all he was complaining about was the use of the model spread as an indicator of performance why didn’t he just stop there?
—
you use it to make predictions about the probability of observations outside the training set (and far enough outside for any correlations not to carry over)
Verify with data not used in training, yes, but if there is no correlation how would it ever make a valid prediction? The idea is to test it with data where it hasn’t been given the answers during training. Hopefully, there’s some correlation with the training data.
Brown is saying that we probably ought to chuck out most of those ensemble members as falsified, stick broader bands on the rest, and try again.
Maybe so but the actual model is the ensemble average. As I said, and so has Briggs, how they arrived at their prediction is irrelevant. What’s important is if it holds. If it does then we can scratch our collective heads and wonder why. If it doesn’t then somebody needs to go back to the drawing board.
It’s beginning to appear that the latter course of action is in order.
Let’s not beat this to death.
In my mind, it doesn’t feel like you’re addressing two important points:
1. The ensemble climate forecasters appear to claim that the spread of the model forecasts is the spread of (real-world) natural variation. This is not true: it is simply the spread of the model forecasts.
As you point out, I can define a forecast that’s the mean of any set of measurements I want, and I can also define that forecast’s spread as the standard deviation of those measurements. The mean may turn out to be a better forecast than picking any one of the models’ results and declaring it “the winner”. But the spread does not reflect the variation of the real world, by construction.
2. Anyone with any sense, when taking a mean of some values, will care about those values. If I tell you I’m determining the temperature of my city by averaging a measurement in a park, and a measurement of the air coming out of the vent in my office, I think you’d rightly question my intelligence. Statistically, I can certainly do this, but that’s an error that’s similar to that made by p-value obsessors who equate “statistically significant” with “significant”.
Looking at the various spaghetti plots of the models in the ensemble, many of the models have problems I could describe with the technical term: “they suck”. I believe they usually cover that term in Chapter -2 of any statistics book: before you go performing statistics on data, you need to have an understanding of and confidence in your data — in this case model forecasts.
It’s fairly obvious that many of these they-suck models’ results are thrown into the ensemble not because of scientific value but for political purposes: a) so as not to offend a particular research center by excluding their model, and b) to include ridiculously-biased (high) results to broaden your intervals enough to make falsification more difficult and so you can point to these high upper limits and make your (high-biased) mean look modest.
“Sounds an awful lot like he is saying an ensemble average prediction makes no statistical sense.”
That depends. “Prediction” in what sense? Verified? Validated? And what statistical sense do you think it makes?
“Also, what is this business about differing by truly random variations?”
One possible argument for doing ensemble modelling is that you have a validated model of the transfer function from input to output, but you don’t know what the value of the input is. If you can guess at a distribution to express your uncertainty, then selecting an ensemble of inputs with this distribution and running them through the model gives you an approximate distribution for the outputs. You need to add further blurring to account for the uncertainty about the transfer function. The output is not generally a verified/validated prediction of the outcome (because you haven’t shown that your input distribution includes the true input value), but could be a verified/validated expression of your uncertainty, conditional on the input. (It’s kinda like the IPCC projection/prediction distinction.)
So an ensemble constructed such that the parameters/inputs in which they differ follows a specified distribution could be interpreted as an application of this method. You can then ask whether the distribution you use is justifiable, and how much uncertainty you should add to account for the model verification, but we can fit it in a conceptual framework of verified models. We know what bits are verified, and what bits are arbitrary, and what conditional conclusions we can draw.
However, if the distribution of inputs does not reflect any specific input uncertainty, but is an arbitrary or accidental selection, then the distribution of the outputs is likewise arbitrary or accidental. The IPCC selection is based on stuff like what groups got funding, passed their models on to others, whose ideas got adopted and adapted by other groups, and worst, by the extent to which the answers they gave conformed to the researchers’ preferences and expectations. Thus, the distribution of the outputs they give are more related to these accidental historical factors and biases than anything to do with what we should expect to happen to the weather.
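A minimal sketch of that contrast, with an invented transfer function: run the same (assumed validated) model over a justified input distribution and over an arbitrary handful of inputs, and the output spread in the second case is just an image of the arbitrary selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def transfer(x):
    # Stand-in for a validated input-to-output model (purely illustrative).
    return 1.5 * x + 0.1 * x ** 2

# Case 1: inputs drawn from a distribution chosen to express genuine input uncertainty.
justified_inputs = rng.normal(2.0, 0.3, size=100_000)

# Case 2: an "ensemble of opportunity" -- an arbitrary handful of input choices.
arbitrary_inputs = np.array([0.5, 1.0, 2.0, 3.5, 5.0])

print(np.std(transfer(justified_inputs)))   # spread reflects the stated input uncertainty
print(np.std(transfer(arbitrary_inputs)))   # spread reflects only the arbitrary selection
```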
“The models I average are nearly identical and the variations are hardly random.”
Oh dear! 🙂
“Verify with data not used in training, yes, but if there is no correlation how would it ever make a valid prediction?”
What I meant was that with stochastic time series, the value at one time is strongly correlated with the values at nearby times. The temperature this year is very similar to the temperature last year, with a small random offset up or down. So if last year was in the training set and this year in the verification set, you’re verifying with a data value that is strongly related to the training data – that is in essence the training data plus a small amount of noise. So it’s not sufficient that the data simply be distinct, they need to be far enough separated that one is not strongly dependent on the other because of the structural relationships induced by the underlying mechanism.
We would, of course, expect the input/output relationships to be correlated.
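A small sketch of the autocorrelation point, using a simulated AR(1) series (nothing here is temperature data): a held-out value one step away from the training data is nearly determined by it, while a distant one is a much fairer test.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a strongly autocorrelated series (AR(1) with coefficient 0.95).
n, phi = 2000, 0.95
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

def lag_corr(series, lag):
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

print(lag_corr(x, 1))    # roughly 0.95: the next value is almost "in" the training data
print(lag_corr(x, 50))   # much smaller: a genuinely separated test
```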
“Let’s not beat this to death.”
Indeed, but some clarification might be in order.
Sorry, but in my view you fail miserably to understand the issue and what was said, and have jumped on only the point about the mean of an ensemble.
To your example of the 3 forecasts, just add my 50 forecasts for tomorrow, with temperatures of -100°C, -99.8°C, -99.6°C, …, -90.2°C.
Now take the ensemble of the 53 forecasts: your 3 examples and the 50 I added. Does it make any sense? Or should you make some selection before taking the mean? What is the statistical sense of this mean?
Does the mean give any information about probability of future temperature?
Or maybe I have a flaw in my models?
Or should you make some selection before taking the mean?
Presumably the mean is being used because it isn’t clear which predictor(s) to select.
That depends. “Prediction” in what sense? Verified? Validated?
Think about that.
By logical rule, a conclusion cannot be reached from the above debate, for the various arguments that have been made in it are written in a language that contains polysemic terms, including “model,” “prediction” and “projection.”
William: Moronically, your post ignores the key question. The PURPOSE of the IPCC’s ensemble of models is to inform policymakers about the range of future climate that is consistent with our physical understanding of the climate system (assuming we follow a particular emissions scenario). Is the IPCC’s analysis of models suitable for THIS PURPOSE?
Possible areas of agreement: It doesn’t matter if the models have limitations, are wrong, or haven’t been “validated” by new observations. It doesn’t matter if they are used to project 20 years into the future or 200 (though the uncertainty range should increase in the latter case). Despite skepticism, these models may be the best method we have for translating our knowledge of physics and chemistry into future climate change. (Energy balance models constrained to fit observations may now be better. For example, Otto et al, Nature Geoscience 6, 415–416 (2013)) There is nothing wrong, in principle, with statistically combining the results of many such models, whether they are flawed or not. (A candid comparison between observations and the ensemble forecasts is needed.)
Statistical analysis of multiple runs with one model can partially address the problem of chaotic behavior.
Disagreement: The use of an ensemble of models CAN address the problem that AOGCMs contain several dozen parameters that are not precisely known and must be chosen (from a physically reasonable range). The ensemble of models was not chosen so that the full range of parameters consistent with laboratory experiments and present-day climate have been independently explored. This appears to be Brown’s main point. Instead, one set of parameters for each model has been selected by a dubious “tuning” process.
For example, the IPCC’s “ensemble of opportunity” (as it was called in AR4) only contains models with parameters that result in high sensitivity to aerosols and high sensitivity to GHGs OR low sensitivity to aerosols and low sensitivity to GHGs. Otherwise, the models chosen wouldn’t be able to reproduce 20th century warming and attribute most of it to man. They also used parameters that result in a high rate of diffusion of heat into the deep ocean creating a long lag between forcing and response. Otherwise an ECS of 3 would demand a 20th century warming of 1.5 degC. For all we know, natural variability – without man’s influence – could have made the 20th century into another MWP or LIA. Furthermore, spectral analysis of the historical temperature record suggests that the PDO may produce 60-year temperature oscillations similar to those associated with ENSO, only larger in amplitude. Nevertheless, every member of the IPCC’s ensemble showed little change without anthropogenic forcings and about 0.8 degC of warming with those forcings. Tuning models to match the historical record – when we don’t understand the role of natural variability in the historical record – is absurd. Using the same models for attribution is fraudulent. By randomly varying only 6 parameters associated with clouds and precipitation (out of 21 parameters in a simplified model), Stainforth et al found literally thousand of models that were equally good at representing current climate and that predicted a vastly wider range of future climates than the IPCC’s “ensemble of opportunity”. Therefore, the IPCC’s “ensemble of opportunity” has a range of warming that is too narrow.
(Some models are now trying to predict how changing precipitation will change surface vegetation, which has a strong influence on local precipitation and the carbon cycle. Higher CO2 also increases the rate at which some plants can carry out photosynthesis. The uncertainty in the parameters associated with these biological processes is likely far larger than the parameters involved with physical and chemical processes.)
Analysis of the spaghetti graphs also doesn’t take into account the uncertainty in the carbon cycle – our ability to predict what atmospheric level of CO2 and other GHGs will be present as we follow a particular emission scenario.
The models used in weather forecasts presumably contain parameters like those in climate models. Unlike the parameters used in AOGCMs, the parameters used in weather forecasting have been optimized by comparing numerous forecasts to observations. We don’t need to know the full range of weather a week from today that is consistent with our understanding of the chemistry and physics of the atmosphere, because experience tells us the forecast accuracy with a particular set of parameters by a particular model.
Frank (24 June 2013 at 12:14 am):
You state that the purpose of the IPCC’s ensemble of models is “…to inform policy makers…” To inform a policy maker is to provide him or her with information about the outcomes of climatological events. For IPCC climatology, though, there are no such events and thus there can be no such information. As the ensemble of models conveys no information to policy makers about the outcomes from their policy decisions, this ensemble fails to provide governments with a basis for regulation of CO2 emissions.
By the way, in addition to providing a policy maker with no information, an element of the IPCC’s ensemble is insusceptible to being validated. On the other hand, a model of the type that provides a policy maker with information is susceptible to being validated. Susceptibility to being validated is equivalent to the ability to provide information.
DAV on 23 June 2013 at 6:35 pm said:
“Presumably the mean is being used because it isn’t clear which predictor(s) to select.”
Which is exactly the point that rgbatduke has raised:
“So, should we take the mean of the ensemble of “physics based” models for the quantum electronic structure of atomic carbon and treat it as the best prediction of carbon’s quantum structure? Only if we are very stupid or insane or want to sell something.”
….
“Which of these is going to be the winner? LDF, of course. Why? Because the parameters are adjusted to give the best fit to the actual empirical spectrum of Carbon. All of the others are going to underestimate the correlation hole, and their errors will be systematically deviant from the correct spectrum.”
“What one would do in the real world is measure the spectrum of Carbon, compare it to the predictions of the models, and then hand out the ribbons to the winners! Not the other way around.”
DAV on 23 June 2013 at 6:35 pm said:
“That depends. “Prediction” in what sense? Verified? Validated?
Think about that.”
It does not make any sense to add to a mean a model whose results deviate from the data.
If an engineer took the result from a set of models that says the car will stop in x meters, when the models are biased to underestimate the braking distance by a constant 50% of x, he might get into serious trouble.
And so would the statistician who validated that mean. Think about it.
Terry Oldberg wrote (on 24 June 2013 at 10:17 am): “To inform a policy maker is to provide him or her with information about the outcomes of climatological events.” This is totally wrong: policymakers can be aided by being informed of projections as well as outcomes. A classic example was Einstein’s letter to FDR warning him that recently developed physics suggested that it might be possible to build an atomic bomb. That projection was based on the observation that spontaneous fission of uranium resulted in a loss of mass and presumably a release of energy according to E = mc2. Policymakers are also aided by projections of the future course of hurricanes, of what fiscal policies are likely to do to our economy, and of how much more traffic local roads will carry a decade from now. The reliability of projections is critical to their usefulness. We can’t evacuate the whole Gulf Coast every time a hurricane develops in the Gulf of Mexico. If economists can’t agree whether unemployment will rise 0.5% or 5% if an austerity plan is enacted, they aren’t likely to influence policymakers. If projections of traffic increase range from 5% to 50%, policymakers won’t know whether road improvements are needed.
Laboratory measurements of the optical properties of atmospheric gases and energy balance calculations indicate that doubling the concentration of CO2 in the atmosphere will reduce outgoing radiation (which cools the earth) by about 4 W/m2, which will eventually result in about 1 degC of warming if nothing else changes. (This physics is better validated today than the release of energy from spontaneous fission of uranium was when Einstein wrote his letter.) However, warming will cause other changes – feedbacks – which can’t be calculated without using AOGCMs. Since a projected warming of 1 degC isn’t likely to cause enough damage to prompt us to forego burning of cheap fossil fuels (and could even be a net benefit), the uncertainties in calculating feedback amplification of warming are critical to the case for restricting emission of GHGs. Thus my question for our host: Has the IPCC accurately reported the statistical uncertainty in their projections of warming?
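For reference, the usual back-of-the-envelope version of those numbers (the 5.35 forcing coefficient and the roughly 3.2 W m-2 K-1 Planck response are the commonly cited textbook values, not figures taken from the comment above) is:

\[
\Delta F \approx 5.35 \,\ln 2 \approx 3.7\ \mathrm{W\,m^{-2}},
\qquad
\Delta T_{\text{no-feedback}} \approx \frac{\Delta F}{\lambda_0} \approx \frac{3.7\ \mathrm{W\,m^{-2}}}{3.2\ \mathrm{W\,m^{-2}\,K^{-1}}} \approx 1.2\ \mathrm{K},
\]

which is broadly consistent with the “about 4 W/m2” and “about 1 degC” quoted above.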
plazaeme on 22 June 2013 at 6:49 am said: “Why don’t they discard the worst models, and advance with the less horrible ones? Does it have a statistical meaning not being able to select the best models?”
The IPCC practices “model democracy” and treats all serious models proposed by national governments (ie major efforts funded by those governments) as being equally valid. From a purely statistical point of view, chaotic behavior makes it difficult to conclusively show that one particular model performs differently from both other models and from observations. Furthermore, a single outlier won’t have much effect on the model mean and standard deviation.
From a cynical perspective, an AOGCM that was a real outlier would probably be re-tuned by adjusting parameters until it behaved like all the others. What would happen to your funding if your model was the only one that predicted a climate sensitivity of 6 or 1.5? Or didn’t attribute most recent warming to man? Stainforth has shown that many models constructed by random selection of model parameters (from a range consistent with laboratory experiments) provide accurate descriptions of today’s climate. His results suggest that tuning parameters one at a time is unlikely to find an overall optimum. Without clear direction, one might expect developers to unconsciously settle on parameters that produce expected outcomes.
Frank (June 24, 2013 at 4:16 pm):
Thank you for taking the time to reply. The entity that, in technical English, is called “information” is defined in terms of the outcomes of events. I gather that you agree with me that the IPCC general circulation models do not reference climatological events or the outcomes of these events and thus that these models provide a policy maker with no information regarding the outcomes from their policy decisions. However, you feel that by conveying projections from these models to policy makers one provides these policy makers with a kind of information.
Your feeling is, however, incorrect. It is incorrect because “information” is defined in terms of observables, and while the outcomes of events are observables, the projections of models are not observables in other than the trivial sense that the numbers which comprise a projection can be observed if printed.
Terry: You may be confusing “information” with “data”; data is a subset of information. See definitions below. IMO, Einstein’s letter, economic projections, hurricane path projections, and traffic projections all provide policymakers with valuable information, without conveying observations or data. You are certainly free to believe only in what has been observed, but you will be missing out on a lot of valuable information I want our policymakers to use.
In a court of law, “fact witnesses” are restricted to testifying about things they observed; “expert witnesses” are allowed to testify about “opinion and diagnosis”. In both cases, the jury is allowed to form their own opinion about the reliability of either kind of witness. The IPCC’s projections are an attempt by experts to diagnose what is likely to happen after certain emission scenarios. Unfortunately, IPCC experts claim that we must take action long before their projections can be validated by observing outcomes, an extremely awkward situation. On the other hand, projections about the likelihood and likely cost of a hurricane landfall near New Orleans would have been useful before Katrina. There are projections about the amount of damage that will occur in SF or LA when the next big earthquake hits the San Andreas, and there are building codes designed to prevent structures from collapsing under the projected stress.
Information:
1) : the communication or reception of knowledge or intelligence.
Data:
1) factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation
2) information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful
3) information in numerical form that can be digitally transmitted or processed
Projection:
9) an estimate of future possibilities based on a current trend
Observation:
3) : a judgment on or inference from what one has observed; broadly : remark, statement
‘An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements’ by John R. Taylor
With all due respect Briggs, you should buy this book, read it, and then apologize to Brown.
Legion,
Tell the truth: you haven’t bothered to read any of the material on ensemble forecasting, whereas I have read plenty on measurement error (including writing about it today, coincidentally).
But if you really think you’ve found the zinger which proves all that material false, put it down here and we’ll take a look.
Frank (24 June 2013 at 11:31 pm):
As “information” has a well known mathematical definition, one needs no other definition for it. This definition makes clear that “data” and “information” are independent concepts.
To have prior information about the outcomes of events is a necessity for one to have a chance at controlling a system. In the control of a system, one places this system in an initial state that is highly likely to evolve into the desired final state. Let A designate the initial state and let B designate the desired final state. At the time at which the system is placed into state A, A is observed but B is inferred. As it is inferred, that the final state is B is uncertain.
An inference has a unique measure. The unique measure of an inference is the information that is missing in it for a deductive conclusion about the final state, per event, the so-called “entropy.” If the entropy is high, there is little chance that the final state will be the desired state, B. Thus, the designer of a control system has reason to reduce the entropy to as low a level as possible.
While the IPCC climate models make projections, they make no inferences. Thus, to control the climate through the use of one of them would be impossible.
Dang … still not one single example of the use of model ensembles outside of climate science, despite Mr. Briggs repeatedly assuring us that it’s a valid and useful technique. How valid and useful can it be if no one uses it? Why hasn’t William (or anyone) given us an example we can actually discuss, a real situation we can sink our teeth into where the procedure actually works? If the ensemble method gives better answers than a single model, how come Boeing doesn’t use the average of two dozen different CFD models, with each CFD model using their own idiosyncratic selection of input variables like the global climate models use, to design its planes?
Plus, I’m still waiting for William (or anyone) to give the answer to my question—if 19 out of 20 gypsies say I should give them my money to be blessed, is that result significant at a p-value of 0.05 regarding whether in the real world I should give a gypsy my money to be blessed or not? Because in essence, that is what the IPCC is doing.
Next, let me point out a critical misunderstanding that is going on in this thread, one that I think is responsible for the majority of the confusion. The problem is that the words “ensemble” and “model ensemble” are being used for two very, very different situations.
The European Centre for Medium-Range Weather Forecasts (ECMWF), for example, uses what they call “ensemble forecasting”. In that procedure, they take one model made by one group, a model which has been repeatedly tested and has been shown to generate successful forecasts. They vary both the input data and the model parameters slightly, and they generate as many results from that one model as they think they need. They call this a model “ensemble”, and they use the mean and standard deviation of the results as a measure of the central tendency and forecast uncertainty of the likely real world outcomes.
I see this as a valid statistical procedure … but it’s not a model ensemble, it’s a Monte Carlo analysis.
The IPCC, on the other hand, also uses what they call model ensembles. But in the IPCC version, they take a couple dozen models made by different groups. Each group uses its own ideas about the physics involved, and each group uses its own different choice of forcings to use as inputs. You cannot generate a hundred or a thousand results, you have only as many results as there are actual different models. You know nothing about the validity or accuracy of any of the various models, or which ones are better than the others. The IPCC call this a model “ensemble”, and they use the mean and standard deviation of the results as a measure of the central tendency and forecast uncertainty of the likely real world outcomes.
I see this as a model ensemble … but it’s not a valid statistical procedure, it’s a bridge too far.
In terms of their relevance to the real world, the statistics of this IPCC type of “ensemble” are very different from the statistics of the ECMWF type of “ensemble”, for a number of reasons—in part because N can be varied at will in one but not the other, in part because the ECMWF realizations are statistically independent, in part because the climate models have failed when they have been tested “out of sample” on things like the recent pause in warming.
Many folks here, including Mr. Briggs, are talking entirely or at least in large part about the first type of “ensemble”, the Monte Carlo type used by the ECMWF. And if you follow Mr. Briggs’s advice and “read … the material on ensemble forecasting”, that’s mostly what you will find.
For me, however, the ECMWF are not using a “model ensemble” at all. Instead, they are just doing a “Monte Carlo” type of analysis, where one can generate any number of realizations, each one with slightly different random model parameters and starting states. I see that as a valid procedure, one that I use myself … but it’s not a “model ensemble” in the IPCC sense, the ECMWF are using only one well-tested model and can generate realizations at will.
More to the point, the ECMWF type of “ensemble” is not what either Robert Brown or I are discussing. We’re talking about the IPCC use of a model ensemble, in which you cannot generate any number of realizations you’d like, each model uses a different set of inputs, and you have no idea of the validity, convergence, accuracy, or stability of any of the models. No one has come up with a single example of the use of such a procedure outside of climate science.
To be fair, someone above did try, offering the predictions of financial experts as an example.
Actually, that’s an excellent example of why Mr. Briggs is wrong. Studies have shown that the financial experts do no better than chance at guessing the future. And as a result, the mean of their claims has no … well, the mean has no meaning in the real world.
If they could do what you think, if your averaging procedure were valid in all cases, a company could simply hire a dozen such experts, average their guesses, and make big money. Not only that, but the more experts they hired, the more accurate the mean of their guesses would be … but sadly for my dreams of infinite wealth, no: such a survey would tell us nothing but what the average stock market guru thinks. Just like the guesses of the climate modelers, the statistics (the mean and standard deviation) of the guesses of stock market experts tell us nothing better than a coin flip about the future state of the real world.
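A toy simulation makes the point, under the assumption (mine, for illustration only) that the experts share a common bias. Hiring more of them sharpens your estimate of the consensus, not of the truth:

```python
import random

# Toy sketch: averaging many forecasters whose guesses share a common bias.
# More forecasters gives a more precise estimate of the CONSENSUS, not of reality.
random.seed(2)
truth = 0.0     # the (unknown) real outcome
bias = 2.0      # the herd's shared misjudgement, invented for illustration

for n_experts in (10, 100, 10_000):
    guesses = [truth + bias + random.gauss(0, 1) for _ in range(n_experts)]
    average = sum(guesses) / n_experts
    print(f"{n_experts:>6} experts: average guess = {average:+.2f} (truth = {truth:+.2f})")
# The average settles ever more tightly around +2.0, the shared bias, no matter
# how many experts are polled.
```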
And this is the problem that William Briggs still hasn’t touched, the problem I pointed out about gypsy forecasts (or those of stock market experts). Yes, statistics about the guesses of gypsies and “stock market experts” and global climate models are indeed meaningful and valid, as Mr. Briggs argues … but only about the forecasts themselves. The statistics of their guesses say nothing about the real world, which is Robert Brown’s point, and one I echo strongly. In particular, the spread of the climate model forecasts is NOT an unbiased estimator of the chance that their forecast is meaningful, or of the chance that the actual outcome will fall within those bounds.
Again, my thanks to William Briggs for his excellent blog and for delineating and hosting this important discussion.
w.
Willis: Here’s a reference to the apparently-successful use of ensembles of models outside of climate science: web.duke.edu/methods/pdfs/EBMASingleSpaced.pdf
IMO, the proper statistical analysis of the output of an ensemble of climate models is a totally separate subject from the reliability of those models. We and the models don’t know why there has been a pause in warming, but let’s hypothesize that it is due to a 60-year cycle associated with the PDO, with a trough-to-peak amplitude of 1 degC (as discussed recently at Lucia’s). If climate sensitivity is 3.0 and we double CO2 over the next century, temperature should be about 3 degC higher 120 years from now (the same point in the cycle). Over the period 105-120 years from now there would have been a pause in warming (or perhaps cooling, if we’re running out of fossil fuels), but the period 85-105 years from now would have seen almost 1 degC of warming. So even if unknown oscillations, chaos, or other unknown factors (the causes of the LIA or MWP?) have made the predictions of climate models look really bad recently, it could still be uncomfortably warm in 2100 if the models correctly handle water vapor and cloud feedbacks (the main factors behind an ECS of 3). So it is important that the IPCC correctly describe the uncertainty in their ensemble predictions, even if those predictions look ridiculously bad now.
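That arithmetic can be sketched directly. The trend and cycle below simply restate the hypothetical above; the phase of the cycle (a peak near year 100) is an assumption of mine, chosen only to illustrate the point:

```python
from math import cos, pi

# Sketch of the hypothetical above: a linear trend of 3 degC per 120 years plus
# a 60-year cycle with 1 degC trough-to-peak amplitude, peaking near year 100.
# The phase of the cycle is an assumption made purely for illustration.
def temperature(year):
    trend = 3.0 * year / 120.0
    cycle = 0.5 * cos(2 * pi * (year - 100) / 60.0)
    return trend + cycle

print(f"warming, years  85-105: {temperature(105) - temperature(85):+.2f} degC")   # about +0.9
print(f"warming, years 105-120: {temperature(120) - temperature(105):+.2f} degC")  # about -0.3, a pause
print(f"warming, years   0-120: {temperature(120) - temperature(0):+.2f} degC")    # +3.0, same cycle phase
```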
Ensembles of models ARE a bit like doing Monte Carlo calculations. This is particularly true when models are initialized with different conditions to address the problem of chaotic behavior.
Unfortunately, we need to address more than uncertainty in the initialization conditions. We know a range, but not a precise value, for the several dozen parameters that are used to implement the known physics and chemistry all models use. What is the best value for the parameter(s) that control heat diffusion from the mixed layer into the deeper ocean? We know a lot of physics concerning cloud formation, but it can’t be implemented on large grid cells; the Hadley model reduces that physics to six parameters applied to huge grid cells. A few parameters in the models are “tuned” to match today’s climate, but Stainforth has shown that many choices of parameters can do a good job of reproducing today’s climate. The choices in the IPCC’s models appear arbitrary. Like a Monte Carlo calculation, a proper ensemble explores the range of futures compatible with known physics by exploring sets of parameters chosen randomly from ranges consistent with experiment. Unfortunately, only one value of each parameter is used in all of the IPCC’s model runs.
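In caricature, the kind of perturbed-parameter exploration described above looks like the sketch below. The parameter names, ranges, and toy “model” are all invented for illustration; they merely stand in for a real perturbed-physics experiment of the sort Stainforth ran:

```python
import random

# Toy sketch of a perturbed-physics ensemble: ONE model structure, with each
# uncertain parameter drawn at random from a range consistent with observation,
# rather than fixed at a single hand-tuned value for every run. The parameter
# names, ranges, and the "model" itself are invented for illustration.
PARAM_RANGES = {
    "ocean_heat_uptake": (0.5, 1.5),    # arbitrary illustrative units
    "cloud_feedback":    (-0.5, 1.0),
    "aerosol_forcing":   (-1.5, -0.5),
}

def toy_climate_model(params):
    """Stand-in response surface: not real physics, just a monotone function."""
    return (1.2
            + 1.5 * params["cloud_feedback"]
            - 0.4 * params["ocean_heat_uptake"]
            + 0.6 * params["aerosol_forcing"])

random.seed(3)
results = sorted(
    toy_climate_model({name: random.uniform(lo, hi)
                       for name, (lo, hi) in PARAM_RANGES.items()})
    for _ in range(5000)
)
low, high = results[len(results) // 20], results[-(len(results) // 20)]
print(f"5th to 95th percentile of projected warming: {low:.2f} to {high:.2f} degC")
```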
There is also uncertainty about what levels of GHGs will be present in the atmosphere if we follow certain emissions scenarios. For historical runs, there is uncertainty about past anthropogenic forcings, and the IPCC allowed each model to choose its preferred forcing history (misrepresenting the reliability with which the models can reproduce the historical record).
Frank on 26 June 2013 at 1:28 am said:
Thanks, Frank. That’s a horse of a different color. What the Duke guys are doing is using a weighted average of models, with weights based on each model’s proven past performance.
Sounds like it might have value … but sadly, what the IPCC is doing is an unweighted average of models with no proven past performance at all, while claiming that the standard deviation is the real-world uncertainty. In such a situation, statistics are useless.
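For contrast, here is a minimal sketch of skill-weighted averaging, with weights proportional to inverse mean-squared error on past out-of-sample forecasts. This is a generic illustration only, not the Duke paper’s EBMA machinery (which is Bayesian and considerably more elaborate), and every number in it is invented:

```python
# Toy sketch of performance-weighted model averaging: weights proportional to
# inverse mean-squared error on past, out-of-sample forecasts. All numbers are
# invented; this is a generic illustration, not the Duke paper's EBMA method.
past_errors = {                       # past forecast errors for three models
    "model_A": [0.1, -0.2, 0.15],
    "model_B": [1.0, 1.2, -0.9],
    "model_C": [0.4, 0.3, -0.5],
}
new_forecasts = {"model_A": 2.2, "model_B": 4.1, "model_C": 2.9}

mse = {m: sum(e * e for e in errs) / len(errs) for m, errs in past_errors.items()}
raw = {m: 1.0 / v for m, v in mse.items()}
weights = {m: w / sum(raw.values()) for m, w in raw.items()}

weighted = sum(weights[m] * new_forecasts[m] for m in new_forecasts)
naive = sum(new_forecasts.values()) / len(new_forecasts)
print("weights by past skill:", {m: round(w, 2) for m, w in weights.items()})
print(f"skill-weighted forecast = {weighted:.2f} degC, naive unweighted mean = {naive:.2f} degC")
```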
Yes, there are ways to use and combine information that has been proven to be reliable in the past.
The IPCC method of naive model ensembles isn’t one of them …
w.
Willis: Thanks for the reply. The IPCC relies on the big national AOGCMs that member governments tell them to use, and has little choice but to practice “model democracy” – at least until a sponsoring nation admits that a model it has funded generously is grossly inferior. IMO they are all performing badly during the pause, but the possibility that the pause is caused by the PDO or another long oscillation means there is no way to tell which model is better than another. If one model starts in an El Niño-like state, another starts in a La Niña-like state, and the planet starts in La Niña, the first model will look worse for at least a decade, no matter which one has the correct ECS.
Judy Curry has written that the IPCC should be saying that climate sensitivity for 2X CO2 is probably between 1 and 6 degC. I assume this comes from ensembles of models that do attempt to explore the full range of parameter space consistent with the known physics and chemistry of the atmosphere. A range that wide is pretty worthless for those advocating any particular policy except spending lots of money to “buy insurance” against the possibility that climate sensitivity might be high. Poor and developing countries can’t afford such “insurance”, and therefore will negate anything the developed nations choose to do.
Hopefully, attention will turn to energy balance models like the ones Lewis has written about. They are observationally based.
Frank:
There is no such thing as a “correct” ECS. This conclusion follows from the non-observability of the equilibrium temperature.
DAV on 20 June 2013 at 7:54 pm said:
Frank asked: How do we validate a model for the maximum load a bridge can carry?
DAV replied: Engineering projects DO undergo validation, Frank, but the actual answer to your question is: by over-design and by incorporating safety margins. Even then, some have fallen down, but usually for reasons other than improper load calculations. There is a litany of engineering failures where projects have strayed; the infamous Tacoma Narrows bridge from the last century is held up as an object lesson in inadequate design.
Not to mention that bridge design is a very old art. Climate science is in its infancy; world climate is poorly understood, and the current GCMs are simplifications of it.
Until a statistical model is validated it is merely a pretty toy. Why would you want to base your actions on what a toy tells you — particularly an infant’s toy? To do so is, indeed, using it as a veneer.
Frank replies: All engineering models – and AOGCMs – start with the laws of physics, material parameters measured in laboratories (tensile strength, shear modulus, optical properties of gases, heat capacity, etc.) and parameters measured in the field (soil compaction, wind, solar radiation, optical properties of clouds). AOGCMs are not purely statistical models that can be validated only by experience. From my perspective (and presumably RG Brown’s), the problem is that the IPCC hasn’t properly assessed how the uncertainties in the parameters they use in their models can combine to produce a much wider range of possible future climates than they currently claim. They pretend that the only uncertainty arises from chaos and modest model-to-model differences.
You also mention over-design and safety margins, but these always involve compromises that you would not have to make if you could trust your model. If you grossly over-design a plane for strength, it won’t be able to carry as many passengers. A bridge with a narrower span is stronger, but its foundations can sink into the soft soil at the edge of a river. There is a podcast from Stanford U with a talk by Schneider in which he admits the uncertainty in AOGCMs. He presents an analogy between reducing GHG emissions to cope with the POSSIBILITY of CAGW and buying homeowner’s insurance to cope with the possibility of fire. As long as he sees some possibility of CAGW, he wants us to cut emissions. Of course, he doesn’t mention that the costs of emissions cuts could be much greater than the cost of warming, that the developing world is unlikely to be willing to pay the cost of insuring against CAGW, or a host of other problems.
Terry wrote: There is no such thing as a “correct” ECS. This conclusion follows from the non-observability of the equilibrium temperature.
Frank adds: By that logic, molecules (or atoms or protons or quarks) don’t exist either, because they aren’t observable. Nor does temperature, which is proportional to the mean kinetic energy of molecules.
Imagine you are God and can make copies of the current earth, and that you monitor the temperature before and after you double, quadruple, and halve the CO2 concentration (keeping CO2 fixed at these new levels) on some of these copies. The average temperature rise after doubling will be the ECS. Quadrupling and halving will tell you whether temperature varies linearly with the log of the CO2 concentration.
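For concreteness, the textbook simplification behind that thought experiment is equilibrium warming proportional to the base-2 logarithm of the CO2 ratio; with an example sensitivity of 3 degC, doubling, quadrupling, and halving give +3, +6, and -3 degC at equilibrium:

```python
from math import log2

# The usual simplification behind the thought experiment: equilibrium warming
# proportional to the base-2 log of the CO2 ratio. S = 3 degC is only an example.
def equilibrium_warming(co2_ratio, sensitivity=3.0):
    return sensitivity * log2(co2_ratio)

for ratio in (2, 4, 0.5):
    print(f"CO2 x {ratio}: {equilibrium_warming(ratio):+.1f} degC at equilibrium")
# Doubling gives +3.0, quadrupling +6.0, halving -3.0: linear in log(CO2).
```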
Oops, you don’t even have to be God. In 50 years or so, you can probably experience TCR in person.
Frank:
It sounds as though you are attempting to refute my conclusion of the non-existence of the equilibrium temperature via a reductio ad absurdum. However, your argument doesn’t fly.
By the definition of terms, an “observable” is a variable whose numerical value can be observed.
* molecules, atoms, protons and quarks are not variables and can be observed.
* equilibrium temperatures are variables and cannot be observed.
* temperatures are variables and can be observed.
Thus, of all of the various entities that you use in your argument, only the temperatures are examples of observables. It follows that the implied absurdity does not exist in which I claim molecules, atoms, protons, quarks and temperatures not to exist. It is only the equilibrium temperature which, as a scientific matter, does not exist. It does not exist because, when a value is asserted for it, this assertion cannot be tested.
By the way, your definition of the ECS is faulty. The ECS is not the average temperature rise after doubling but rather the equilibrium temperature rise after doubling. (In the acronym “ECS” the “E” stands for “equilibrium.”)