Doug Keenan was asked by *The Economist* to have a gander at the statistics developed by the Berkeley Earth Surface Temperature (BEST) project. He did so.

We must resist extensive quotation, except to note that Keenan caught the ear of the man himself, Richard Muller, particularly over a dispute about what smoothing time series prior to analysis does to certainty (it increases it unduly). Muller says that there “are certainly statisticians who vigorously oppose this approach, but there have been top statisticians who support it.”

Vigorously oppose it I do. The reasons are laid out by Keenan and by me in the links Keenan provides. Except to agree with Keenan’s critiques—and they are many and fundamental—that’s all I’ll say about the matter here. Instead, I’ll provide my own commentary on the BEST paper “Berkeley Earth temperature averaging process.” My comments do not duplicate Keenan’s criticisms. I do not attempt simplicity or exhaustiveness.

**My critique**^{1}

I agree with Keenan: the general point estimate is probably roughly in the ballpark, more or less, plus or minus, but the uncertainty bounds are far too narrow.

The authors use the model:

T(x,t) = θ(t) + C(x) + W(x,t)

where x is a spatial location, t is time, θ() is a trend function, C() is spatial climate (integrated to 0 over the Earth’s surface), and W() is the departure from C() (integrated to 0 over the surface or time). It is applied only over land. Not all land, but most of it. The model becomes probabilistic by considering how an individual measurement d_{i}(t_{j}) relies on trend θ(), W(), a “baseline” measure, plus “error.” By this, they do not mean measurement error, but the standard model residual. Actual measurement error is not accounted for in this part of the model.

The model takes into account spatial correlation (described next) but ignores correlation in time. The model accounts for height above sea level but no other geographic features. In its time aspect, it is a singularly naive model. Correlation in time is, in real life, important and non-ignorable, so we already know that the results from the BEST model will be too sure. That is, the effect of ignoring correlation in time is to produce uncertainty bounds which are too narrow. How much too narrow depends on the nature of the time correlation. The stronger it is in reality, the more the model errs.
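To see the size of the effect, here is a toy simulation of my own (not BEST’s code; the AR(1) form and its parameters are illustrative assumptions). Even modest correlation in time makes the usual standard error of the mean far too small:

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, rho, rng):
    """AR(1) series with lag-one correlation rho and unit marginal variance."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

n, rho, trials = 200, 0.6, 2000
means = np.array([ar1(n, rho, rng).mean() for _ in range(trials)])

naive_se = 1 / np.sqrt(n)   # what you'd claim if you ignored time correlation
actual_se = means.std()     # how much the sample mean really varies

print(f"naive SE {naive_se:.3f}, actual SE {actual_se:.3f}")
```

The discrepancy grows as the correlation strengthens; with rho = 0.6 the honest standard error is roughly double the naive one.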

Kriging (a standard practice which comes with its own set of standard critiques with which I assume the reader is familiar) is used to model non-observed spatial locations. Its correlation function is a fourth-order polynomial of distance (eq. 14). A fourth-order, yes. Smells like some significant hunting took place to discover this strange creature. Its average fit to the mass of blue dots (Fig. 2) appears good enough. But stop to consider those dots. The uncertainty around the fit is huge. **Update: correction**: I meant to write the exponential of a fourth-order polynomial. The rest of the criticism stands.
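To make the objection concrete, here is a sketch with synthetic data of my own invention (none of this is BEST’s code; the “true” decay and the scatter are assumptions standing in for Fig. 2’s blue dots). The exponential-of-quartic form, constrained so the correlation is 1 at zero distance, is asked to summarize an enormous cloud with a single curve:

```python
import numpy as np

# Hypothetical correlation-vs-distance cloud standing in for Fig. 2's dots.
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 2000.0, 500)                      # station separations (km)
obs = np.clip(np.exp(-d / 800.0)                       # an assumed true decay
              + rng.normal(0.0, 0.25, d.size), -1, 1)  # plus enormous scatter

# The paper's form (per the correction above): R(d) = exp(P(d)), with P a
# fourth-order polynomial constrained so P(0) = 0, hence R(0) = 1.
keep = obs > 0.05                                      # log needs positives
D = np.vander(d[keep], 5)[:, :4]                       # columns d^4, d^3, d^2, d
coef, *_ = np.linalg.lstsq(D, np.log(obs[keep]), rcond=None)

def R(x):
    x = np.atleast_1d(x).astype(float)
    return np.exp(np.vander(x, 5)[:, :4] @ coef)

print("R(0) =", R(0)[0], " R(1000 km) =", round(R(1000)[0], 3))
```

A single fitted curve says nothing about the spread of the dots around it, which is the point of the criticism.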

This is important because the authors used a fixed correlation function with plug-in estimates. They say (p. 12) that “further refinements of the correlation function are likely to be a topic of future research.” The problem is that their over-certain estimate will cause the certainty in the final model results to be overstated. No Bayesian techniques were harmed during the creation of this model, but it would have been better if they had been.^{2} The uncertainty in this correlation absolutely needs to be accounted for. Since the mass of blue dots (Fig. 2) has such an enormous spread, this uncertainty is surely not insignificant. Stop and understand: this correlation function was assumed the same everywhere, at every point on the Earth’s surface, an assumption which is surely false.

**Update: correction**: I mean this last statement to be a major criticism. If you have a mine in which at various spots some mineral is found and you want to estimate the concentration of it in places where you have not yet searched, but which are places inside the boundaries of the places you have searched, kriging is just the thing. Your mine is likely to be somewhat homogeneous. The Earth’s land surface is not homogeneous. It is, at the least, broken by large gaps of water and by mountains and deserts and so forth. To apply the same kriging function everywhere is too much of a simplification (leading to over-certainty).

About measurement error (p.15), the authors repeat the common misconception, “The most widely discussed microclimate effect is the potential for ‘urban heat islands’ to cause spuriously large temperature trends at sites in regions that have undergone urban development.” This isn’t poor statistics, but bad physics. Assuming the equipment at the stations is functioning properly, these trends are not “spurious”. They indicate the actual temperature that is experienced. As such, these temperatures should not be “corrected.” See this series for an explanation.

To account for one aspect of *estimated* measurement error, the authors develop an approach on which they bestow a great name: the “scalpel.”

Our method has two components: 1) Break time series into independent fragments at times when there is evidence of abrupt discontinuities, and 2) Adjust the weights within the fitting equations to account for differences in reliability. The first step, cutting records at times of apparent discontinuities, is a natural extension of our fitting procedure that determines the relative offsets between stations, encapsulated by b̂_i, as an intrinsic part of our analysis.

It’s not clear how uncertainties in this process carry through the analysis (they don’t, as near as I can tell). But the breaking-apart step is less controversial than the “outlier” weighting technique. There are no such things as “outliers”: there is only real data and false data. A transposition error, for example, is false data. Inverting the sign for a temperature is false data. Very large or small observations in the data may or may not be false data. There are a huge number of records and all can’t be checked by hand without substantial cost. Some process that estimates the chance that a record is false is desirable: those points with high suspicion can be checked by hand. No process is perfect, of course, especially when that process is for historical temperature measurements.
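The cutting idea itself is easy to picture. Here is a crude sketch of my own (BEST’s actual cut criteria are more involved; the threshold rule and the data are invented for illustration):

```python
import numpy as np

def scalpel(series, threshold=5.0):
    """Cut a series into fragments wherever the one-step change exceeds
    `threshold` robust standard deviations. A crude sketch of the idea
    only; BEST's actual criteria are more involved."""
    diffs = np.diff(series)
    scale = 1.4826 * np.median(np.abs(diffs - np.median(diffs)))  # robust sd
    cuts = np.where(np.abs(diffs) > threshold * scale)[0] + 1
    return np.split(series, cuts)

rng = np.random.default_rng(1)
t = np.arange(120)
record = 0.01 * t + rng.normal(0.0, 0.1, t.size)  # slow trend plus noise
record[60:] += 2.0                                # a station move: 2-degree jump

fragments = scalpel(record)
print("fragment lengths:", [len(f) for f in fragments])
```

Cutting is the easy part; the question in the text is what happens to the uncertainty introduced by deciding where to cut.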

**Update** A change in a station siting does not introduce a “bias” in that station’s records. *It becomes a new station.* See the temperature homogenization series for more about this.

The authors did do some checking; e.g. they remove truly odd values (all zeros, etc.), but this cleaning appears minimal. They instead modeled temperature (as above) and checked the given observation against the model. Those observations that evinced large deviations from the model were then down-weighted and the model re-run. The potential for abuse here is obvious, and is the main reason for suspicion of the term “outlier.” If the data doesn’t fit the model, throw it out! In the end, you are left with only that data that fits, which—need I say it?—does not prove your model’s validity. No matter what, this procedure will narrow the final model’s uncertainty bounds. The authors claim that this down-weighting process was “expected” to affect about 1.25-2.9% of the data.
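A quick illustration of the narrowing, entirely my own construction (not BEST’s weighting scheme, which uses different weights and a different model): down-weight whatever disagrees with the fit, re-fit, repeat, and the apparent spread shrinks even when every observation is genuine.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 1000)   # perfectly good data; true sd is 1

weights = np.ones_like(data)
for _ in range(5):
    mu = np.average(data, weights=weights)
    resid = data - mu
    sd = np.sqrt(np.average(resid**2, weights=weights))
    weights = np.where(np.abs(resid) > 2 * sd, 0.1, 1.0)  # down-weight "outliers"

print(f"true sd 1.00, sd after down-weighting {sd:.2f}")
```

Nothing here was false data; the procedure manufactured the extra certainty on its own.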

The next step in “correcting” the data is more suspicious. They say, “In this case we assess the overall ‘reliability’ of the record by measuring each record’s average level of agreement with the expected field T̂(x̃, t) at the same location.” At least reliability is used with scare quotes. Once again, this has the direct effect of moving the actual observations towards the direction of the model, making the results too certain.

Do the results on pages 24 and 25 register all the actual changes? It’s not clear. Dear Authors: what percentage of data was affected, taking account of the raw data removal, scalpel, outlier down-weighting, and reliability down-weighting? 5%, 10%, more? And for what time periods was this most prevalent?

In Section 9, Statistical Uncertainty, I am at a loss. They take each station and randomly assign it to one of five groups, n = 1, 2, …, 5, and say “This leads to a set of new temperature time series θ̂_n(t_j)…. As each of these new time series is created from a completely independent station network, we are justified in treating their results as statistically independent.” I have no idea what this means. The five series are certainly not independent in the statistical sense (not in space or time or in sample).
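Why “completely independent station network” does not buy statistical independence is easy to demonstrate. In this sketch (my own invented numbers: a shared regional signal plus station noise), five disjoint station groups produce five nearly identical time series:

```python
import numpy as np

rng = np.random.default_rng(3)
n_stations, n_times = 100, 240
common = np.cumsum(rng.normal(0, 0.1, n_times))   # shared regional signal
stations = common + rng.normal(0, 0.5, (n_stations, n_times))

groups = rng.permutation(n_stations).reshape(5, -1)  # random 5-way split
series = np.array([stations[g].mean(axis=0) for g in groups])

corr = np.corrcoef(series)
off_diag = corr[np.triu_indices(5, k=1)]
print("mean correlation between 'independent' subnetworks:", off_diag.mean())
```

Disjoint stations, yes; independent series, no: they all measure the same underlying field, so any uncertainty estimate that treats them as independent will be too small.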

The procedure attempts to estimate the uncertainty of the **estimate** θ̂_n(t_j)—i.e. the parameter and not the actual temperature. Treating the samples as independent will cause this uncertainty to be underestimated. But leave all this aside and let’s move to what really counts, the uncertainty in the model’s final results.

Fig. 4b is slightly misleading—I’m happy to see Fig. 4a—in that, say, in 1950 85% of the Earth’s surface was not covered by thermometers. This is coverage in terms of model space, not physical space. This is proved by Fig. 4a which shows that physical space coverage has decreased. But let that pass. Fig. 5 is the key.

This is *not* a plot of the actual temperature and it is *not* a plot of the uncertainty of the actual temperature. It is instead a plot of the *parameter* θ̂_n(t_j) and the uncertainty given by the methods described above. Users of statistics have a bad, not to say notorious, habit of talking of parameters as if they were discussing the actual observables. Certainty of the value of a parameter does not translate into the certainty of the observable. Re-read that last sentence, please.
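The gap between the two kinds of certainty is not subtle. A stylized example of my own (numbers invented): with 400 observations, the interval for the parameter (the mean) is twenty times narrower than the interval for an actual temperature.

```python
import numpy as np

rng = np.random.default_rng(4)
obs = rng.normal(9.0, 1.0, 400)   # hypothetical temperatures, sd 1 C

n, mean, sd = obs.size, obs.mean(), obs.std(ddof=1)

ci_half = 1.96 * sd / np.sqrt(n)  # certainty about the *parameter* (the mean)
pi_half = 1.96 * sd               # certainty about an *observable* temperature

print(f"parameter: +/- {ci_half:.3f} C;  observable: +/- {pi_half:.3f} C")
```

The first interval shrinks toward zero as data accumulate; the second never does. Reporting the first as if it were the second is the habit complained of above.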

From p. 26:

Applying the methods described here, we find that the average land temperature from Jan 1950 to Dec 1959 was 8.849 +/- 0.033 C, and temperature average during the most recent decade (Jan 2000 to Dec 2009) was 9.760 +/- 0.041 C, an increase of 0.911 +/- 0.042 C. The trend line for the 20th century is calculated to be 0.733 +/- 0.096 C/century, well below the 2.76 +/- 0.16 C/century rate of global land-surface warming that we observe during the interval Jan 1970 to Aug 2011. (All uncertainties quoted here and below are 95% confidence intervals for the combined statistical and spatial uncertainty). [To avoid HTML discrepancies, I have re-coded the mathematical symbol “+/-” so that it remains readable.]

Note the use of the word “temperature” in “we find that the average land temperature…” etc., where they should have written “model parameter.” From 1950 to 1959 they estimate the parameter “8.849 +/- 0.033 C”. Question to authors: are you sure you didn’t mean 8.84*8* +/- 0.033 C? What is the point of such silly over precision? Anyway, from 2000 to 2009 they estimate the parameter as 9.760 +/- 0.041 C, “an increase of 0.911 +/- 0.042 C.” **Update**: this criticism will be unfamiliar to most, even to many statisticians. It is a major source of error (in interpretation); slow down to appreciate this. See this example: the grand finale.

Accept that for the moment. The question is then why choose the 1950s as the comparator and not the 1940s when it was *warmer*? Possible answer: because using the 1950s emphasizes the change. But let’s not start on the politics, so never mind, and also ignore the hyper precision. Concentrate instead on the “+/- 0.033 C”, which we already know is *not* the uncertainty in the actual temperature but that of a model parameter.

If all the sources of over-certainty which I (and Keenan) mentioned were taken into account, my guess is that this uncertainty bound would at least double. That would make it at least +/- 0.066 C. OK, so what? It’s still small compared to the 8.849 C (interval 8.783 – 8.915 C; and for 2000-2009 it’s 9.678 – 9.842 C). Still a jump.

But if we added to that the uncertainty in the *parameter* so that our uncertainty bounds are on the actual temperature, we’d again have to multiply the bounds by 5 to 7^{3}. This makes the 1950-1959 bound at least 0.33 C, and the 2000-2009 bound at least 0.41 C. The intervals are then 8.519 – 9.179 C for the ’50s and 9.350 – 10.170 C for the oughts. Still a change, but one which is now far less certain.
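The arithmetic behind those intervals, spelled out (the doubling and the factor of 5 are the guesses from the text, not quantities computed from the BEST data):

```python
# Widen the reported bounds as argued above: double for the over-certainty
# sources, then multiply by 5 (the low end of the 5-to-7 parameter-vs-
# observable factor). All factors are the post's guesses, not computations.
mean_50s, mean_00s = 8.849, 9.760
bound_50s, bound_00s = 0.033, 0.041

widened_50s = 2 * 5 * bound_50s
widened_00s = 2 * 5 * bound_00s

print(f"1950s: {mean_50s - widened_50s:.3f} - {mean_50s + widened_50s:.3f} C")
print(f"2000s: {mean_00s - widened_00s:.3f} - {mean_00s + widened_00s:.3f} C")
```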

Since the change is still “significant”, you might say “So what?” Glad you asked: Look at those bounds on the years *before* 1940, especially those prior to 1900. Applying the above changes pushes those bounds way out, which means we cannot tell with any level of certainty if we are warmer or cooler now than we were before 1940, and especially before 1900. Re-read that sentence, too, please.

And even if you want to be recalcitrant and insist on model perfection and you believe parameters are real, many of the uncertainty bounds before 1880 already cover many modern temperatures. The years around 1830 are already not “statistically different” than, say, 2008.

An easier way to look at this is in Fig. 9, which attempts to show the level of uncertainty through time. All the numbers in this plot should be multiplied by at least 5 to 10. And even after that, we still haven’t accounted for the largest source of uncertainty of all: the model itself.

Statisticians and those who use statistics never or rarely speak of model uncertainty (same with your more vocal sort of climatologist). The reason is simple: there aren’t cookbook recipes that give automatic measures of this uncertainty. There can’t be, either, because the truth of a model can only be ascertained externally.

Yet all statistical results are conditioned on the models’ truth. Experience with statistical models shows that they are often too sure, especially when they are complex, as the BEST model is (and which assumes that temperature varies so smoothly over geography). No, I can’t prove this. But I have given good reason to suspect it is true. You may continue to believe in the certainty of the model, but this would be yet another example of the triumph of hope over experience. What it means is that the uncertainty bounds should be widened further still. By how much, I don’t know.

**Update**: Neither the BEST paper nor my criticisms say word one about *why* temperatures have changed. Nobody anywhere disputes that they *have* changed. See this discussion.

——————————————————————————————

^{1}“Say, Briggs. You’re always negative. If you’re so smart, why don’t you do your own analysis and reveal it?” Good question. Unlike the BEST folks, and others like my pal Gav, I don’t have contacts with Big Green, nor do I have a secretary, junior colleagues, graduate students, IT people, fancy computer resources, printer, copier, office supplies, access to a library, funds for conference travel, money for page charges, multi-million dollar grants, multi-thousand dollar grants, nor even multi-dollar grants. All the work I’ve ever done in climatology has been *pro bono*. I just don’t have the time or resources to recreate months worth of effort.

^{2}The authors used only classical techniques, including the jackknife. They could have, following this philosophy, bootstrapped the results by resampling this correlation function.

^{3}This is what experience shows is the difference for many models. For the actual multiplier, we’d have to re-do the work. As to that, see Note 1.

The statistics on the BEST paper were designed with the assistance of the eminent (no sarcasm) David Brillinger, though he does not appear as a co-author. Charlotte Wickham was the statistician and was a student of Brillinger’s. “Charlotte Wickham is an Assistant Professor in the Department of Statistics at Oregon State University. She graduated with her PhD in Statistics from the University of California, Berkeley, in 2011.”

I’m not going to claim that I understand everything that was written above. Still, a few things really caught my eye.

When they say that they look for areas where there are discontinuities and then fit a new curve, this sounds an awful lot like machine learning. Gaussian mixture regression does the same thing, and support vector regression as well. In each case the machine learning algorithm attempts to describe the curve using as few ‘reference points’ (constants) as possible. What makes these methods different than what was discussed is that they are both well documented.

The other thing is the selection criteria for outliers. My question is this: why do it by hand? Why not create a set of rules and apply said rules to all values? My own data analysis tool does this automagically and is even nice enough to document the whole process. One could even make the argument that outliers should be left in as-is, and that the modelling process should be robust enough to deal with them.

It seems like many of these models are now in the land of voodoo; a land usually dominated by artificial intelligence and machine learning.

Why not just go all the way and use regression trees or neural networks? At least then the models would be far easier to describe and understood by a great many, and they could do nice simple tests like ‘forecast skill’ and ‘validation testing’. They could ditch significance and start reporting how much better the model predicts values versus ‘guessing the mean’ or some other ridiculously simple test. No p-values, but it does give you a percentage with 1 being the maximum and -infinity being the minimum.

Some machine learning techniques, like boosted regression trees and random forests, are capable of handling incomplete records: this would mean they could forget having to find typos and other such errors in the record.

There are some formatting problems and what appear to be missing figures. An example of formatting is:

==== The authors use the model:

====

==== nbsp; nbsp; nbsp; T(x,t) = Î¸(t) + C(x) + W(x,t)

====

==== where x is a vector of temperatures at spatial locations, t is time,

==== &theta() is a trend function …

The “nbsp; nbsp; nbsp” is undoubtedly an attempt at HTML formatting gone wrong. The ampersand-“theta()” is another. Theta is correctly displayed in the line with the “nbsp”s.

The following sentence is found near the end of the article:

==== An easier way to look at this is in Fig. 9, which attempts to show

==== the level of uncertainty through time.

Of all the figures (of which there should be at least nine), only figure 5 is visible.

I love it when they calculate the temperature to a thousandth of a degree. You know this is phoney precision. Many years ago I was working on a project studying RF propagation and we had to record the wet and dry bulb temperatures at the beginning and middle of our shift. They wanted us to record the temperatures to a tenth of a degree and we did, but it was just a guess. The thermometers weren’t big enough to have tenth of a degree graduations so we would estimate. We made jokes about having calibrated eyeballs.

Colonial,

Fixed some of the HTML errors; others depend on the browser. Look to the paper for the original notation and the figures; I only copy Fig. 5.

Matt

On the basis of your post, would it be true to say that you regard Brillinger as both eminent and way wrong wrt the statistical approach taken by BEST?

“About measurement error (p.15), the authors repeat the common misconception, ‘The most widely discussed microclimate effect is the potential for “urban heat islands” to cause spuriously large temperature trends at sites in regions that have undergone urban development.’ This isn’t poor statistics, but bad physics.” You’re dead-on right, Matt. It’s bad measurement physics, and typifies the blindness that seems universal among the compilers of surface temperature time series. Maybe the genes for poor measurement physics are dominantly linked to those for an urge to compile temperatures.

The BEST paper takes zero note of the (relatively few) papers that discuss the serious systematic measurement errors caused by local climate effects. The papers by Ken Hubbard and Xiamao Lin, University of Nebraska, Lincoln, are most definitive, but there’s also the corroborating works of H. Huwald, et al. (2009) “Albedo effect on radiative errors in air temperature measurements” Water Resources Research 45, W08431; 1-13; doi:10.1029/2008WR007600 and Genthon, et al. “Atmospheric temperature measurements biases on the Antarctic plateau” Journal of Atmospheric and Oceanic Technology 2011; doi: 10.1175/JTECH-D-11-00095.1

These systematic measurement errors are different from the urban heat island effect, and seriously impact the accuracy of the surface air temperature record. No one has ever done the calibration experiment to determine how this error distributes itself in time or space. And yet, it’s universally ignored, and studies proceed to their conclusions of certainty as though systematic error doesn’t exist.

The modeled precision of (+/-)0.001 C is more than just a joke, and more serious than the painful statistical irony you pointed out, Matt. It’s an indictment of scientific negligence against the BEST scientists.

A likely reason all those surface air temperature folks so consistently, persistently, and universally ignore systematic surface air temperature measurement error is that if they did pay attention to it, they’d have nothing left to talk about, nothing to argue over, and nothing to justify their livelihoods.

Fall-out effects of noticing the systematic errors of surface air temperature measurement include that climate modelers would have nothing against which to calibrate their models, and paleotemperature calculators (apart from being ascientific) would have to deal with errors bars of a size disallowing any of their (specious) conclusions about past vs present air temperatures.

The surface air temperature record is the lynch-pin of the whole apparatus of AGW climate so-called science.

Why do they use a 10 year moving average?

A simple moving average gives equal weight to all the data points. We live in the present; those points are the most important and likely the most accurate. Why don’t they use 5 years instead, or 3, or whatever? What is the justification of 10? The last few years have not been increasing in temperature. We have record CO2 levels (over the last 50-some years) but the global temp has been rather stable for the last 13 years or so. To me, using a 10-year moving average hides important data. Keenan touched on this lightly.
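The commenter’s point can be put precisely: an equal-weight w-year average takes w years to fully register any change, so the longer the window, the more recent behavior is diluted. A toy demonstration of my own (invented numbers, a unit step standing in for a change in regime):

```python
import numpy as np

def moving_average(x, w):
    """Equal-weight average over a trailing window of length w."""
    return np.convolve(x, np.ones(w) / w, mode="valid")

temps = np.concatenate([np.zeros(40), np.ones(10)])  # a step change at "year" 40

ma3, ma10 = moving_average(temps, 3), moving_average(temps, 10)
# How many output points fully reflect the new level?
full3 = int(np.isclose(ma3, 1.0).sum())
full10 = int(np.isclose(ma10, 1.0).sum())
print(f"3-yr window: {full3} points at the new level; 10-yr window: {full10}")
```

With a 10-year window, only the very last point has caught up to a change ten years old; the 3-year window registers it almost immediately.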

There is something I must admit I do not understand at all; they have two independent measurements, Tmin and Tmax. They add these two together and then divide by two. Why?

Do not the Tmin and Tmax contain differential information about the kinetics and thermodynamics of the atmosphere where they are measured?

I asked one of the medics I work with why they didn’t do the same transformation with blood pressure readings; why not average the systolic and diastolic pressures, I asked?

He thought I was mad. Yet Tmin and Tmax averaging is not just thought of as normal; it is not thought about at all.

Doc
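The transformation Doc questions takes two lines, and two invented days show exactly what it discards: the average alone cannot distinguish days with wildly different diurnal behavior.

```python
import numpy as np

# Two hypothetical days: identical averages, very different physics.
tmin = np.array([10.0, 2.0])
tmax = np.array([20.0, 28.0])

tavg = (tmin + tmax) / 2   # the conventional daily "temperature"
drange = tmax - tmin       # the information the average discards

print("daily averages:", tavg)
print("diurnal ranges:", drange)
```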

1) because it provides a value which can be compared to other similar values, and the models are based on it. it has a currency in climate circles because the relationship of minimum to maximum temperatures varies on a latitudinal and seasonal basis (think hours of daylight) and average temperature comparison helps minimize that variation.

2) yes, there is a lot of information contained in max and minimum temperatures. one of the fun things about the observed warming trend is that it is comprised more of increase in minimum temperature than in maximum (which doesn’t accord well with the modeled behavior of the climate due to increased atmospheric carbon dioxide). there are a variety of culprits posited for this phenomenon and a search for ‘diurnal temperature range’ in the climate literature would be useful if you are interested.

while minimum and maximum temperatures mean something, that doesn’t necessarily mean they are important. if the greater increase in minimum temperature is an indirect effect of the same cause as the increase in maximum temperature then an increase in average temperature is a better metric than either minimum temperature or maximum temperature on a climate scale. of course maximum and minimum could also signify different causes and using average temperature would be a fool’s errand. spend a couple of decades studying the physics of minimum and maximum temperatures and then let me know whether they are more important than average temperatures, otherwise we are both confined to studying other people’s guesswork.

Argh, should clarify that last sentence by pointing out that you would have to do the studying because no one previously has spent enough time examining the subject to have come up with a conclusive answer. there are people who know enough to make guesses, but they are still educated guesses.

yeah…the averaging to get t avg obscures most of the “science”…we do not want the “science” to get too real do we!

This paper seems too empirical, more appropriately, *ad hoc*, to me. However, I understand that the problems are quite challenging. Let’s assume the models in equations [1], [2], [4], and [22] are acceptable, and let’s forget all the band-aid solutions such as scalpel and outlier weighting, all with pitfalls but a good start, to the problems of poor data quality and stability of the computing algorithm for now. If the authors can first show that the two proposed modifications of Kriging stated in [12] and [14] (and therefore [23]) can actually produce good estimates/interpolations, then I’d say there might be statistical merit in this paper. I hope the authors will receive good suggestions from referees and improve the paper accordingly.

I am disturbed by the use, and apparent acceptance, of the scalpel in an attempt to derive clean trends (theta, in the above model) from the data.

My reasoning is based upon Fourier Analysis and Information content.

Global warming is a very low frequency signal, 0.5 to 10 cycles per century if you want to express warming as an average 0.15 deg C per decade.

The suture as described is a LOW-CUT, High Pass filter. It preserves high frequency, but eliminates low frequency from the signal.

I am forced to believe that a concerted process that fractures a temperature record into a bunch of pieces, which eliminates all low frequency information from the pieces, can somehow be sutured together such that magically trustworthy low-frequency information reappears in the Fourier spectrum.

Time Domain

original record “C” is split into two pieces “A” and “B”

Orig: CCCCCCCCCCCCCCCCCCCC

into: AAAAAAA

____________BBBBBBBBBBBBBBB

They can only be assembled by using a U function that says how A and B are to mate.

restored: A(t) + U(t) + B(t)

where U is how the USER thinks they should be sutured.

If you look at the fourier spectrum, the parts will be populated conceptually like this:

_UU

UUUB_BBBB

UUUBBBBAAB

UUUUABAAAA

UUUUUAAAAAB

UUUUUUAAAAA

The only low frequency that exists in the sutured, averaged signal, comes from the U component, not from either of the fragments.

Am I missing some aspect of their scalpel and suture process where by the low frequency of the data was actually preserved and not created whole cloth post suture? If so, I’m all ears to learn.

Thanks for your very prompt analysis and interesting comments. A Statistician to the Stars, indeed!

You and others will want to have a look at Steve McIntyre’s first cut at this,

http://climateaudit.org/2011/10/22/first-thoughts-on-best/

As always, SM has a very sharp eye.

In particular, both Roman Mureika and Carrick comment on the smoothing issue, at http://climateaudit.org/2011/10/22/first-thoughts-on-best/#comment-307249

and following. RomanM’s takeaway advice is, “Don’t try this at home without consulting competent help.”

My own statistical chops are too rusty to be helpful here, but I note that David Brillinger has done some well-regarded work on time series analysis — which he perhaps passed on to his student, Charlotte Wickham, who’s the statistician co-author on the BEST statistical report.

It will be most interesting to see how this plays out. And thanks for your many pro-bono efforts for statistical sanity over the years!

Cheers — Pete Tillman

—

“Gentlemen, you can’t fight in here — this is the War Room.”

— Dr. Strangelove (Stanley Kubrick)

P.S. to the above concerning Fourier Analysis.

A fundamental theorem holds that the lowest frequency in a digitized time signal is inversely proportional to the length of the signal.

Therefore, cutting a time series into shorter segments can only raise the lowest frequency remaining in the segmented data.

So when you suture the segments back together, are the lowest frequencies zero, or are they non-zero? And if so, where did they come from, if not as an artifact of the tool that did the suture?
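The theorem in the P.S. is easy to check numerically (a sketch of my own; monthly sampling is assumed). Each shortening of the record raises the lowest nonzero frequency the spectrum can represent:

```python
import numpy as np

fs = 12.0                                   # monthly sampling: 12 per year
lowest = []
for years in (50, 10, 2):                   # whole record vs scalpel fragments
    n = int(years * fs)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)  # frequency grid, cycles per year
    lowest.append(freqs[1])                 # smallest nonzero frequency
    print(f"{years:>2}-yr segment: lowest resolvable {freqs[1]:.3f} cycles/yr")
```

A 2-year fragment cannot, on its own, say anything about variations slower than half a cycle per year; that is the commenter’s worry in numbers.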

Say what? Whatever happened to a good old exponential function? It’s safe, dependable, a bit worn and frayed around the edges, but basically a good model. Why introduce multiple variables when you don’t require them?

Stephen,

The cuts are done because the error is believed to be a step function across the cut. Cutting the data here loses information, but most of what it loses is the error. When stitched back together, the low frequency data comes from what remains combined with the assumption that the underlying temperatures vary smoothly across the gap.

It’s like interpolation. In principle, you have no data from between the points, so there’s no way to tell what it’s value might be. It could be a spike or a brief wiggle or the complete works of William Shakespeare written very small. But if you have a good exogenous reason for thinking the data is suitably smooth, (i.e. nothing above the Nyquist frequency) then you can fill in the gap with an estimate. It’s not an artefact of the tool, it’s an artefact of the smoothness assumption.

The breaks constitute a set of step functions, and part of the error is assumed to be a linear combination of them. The observed values are the true values plus some member of this subspace. The part of the observed data perpendicular to that subspace is not affected by those errors, and contains the low frequency information presented in the final result.

Or to put it another way, the linear combination of step functions affects both low frequencies and high frequencies in a very constrained way – the same information is inherent in both; aliased, you could say, from low to high. By looking at the high frequencies (the sudden jumps in the data), you can partially reconstruct what effect they must have had on the low ones.
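That reconstruction argument can be demonstrated in a few lines (my own sketch, with invented numbers, and with the cut points taken as known, as if found by the scalpel): fit the trend and the step functions jointly, and the part of the data perpendicular to the step subspace recovers the slow signal.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.arange(n, dtype=float)
trend = 0.005 * t                                     # the slow "climate" signal
steps = np.where(t >= 70, 1.5, 0.0) + np.where(t >= 140, -0.8, 0.0)
obs = trend + steps + rng.normal(0.0, 0.05, n)        # offsets from station moves

# Fit intercept, slope, and the two step functions jointly by least squares.
X = np.column_stack([np.ones(n), t,
                     (t >= 70).astype(float),
                     (t >= 140).astype(float)])
coef, *_ = np.linalg.lstsq(X, obs, rcond=None)

print(f"true slope 0.0050, recovered {coef[1]:.4f}")
```

The low-frequency information was never destroyed; it lives in the component of the data that the assumed step functions cannot touch.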

I wonder if I could make what I believe is a serious point that I have been arguing on other blogs?

I have a feeling that statisticians are generally uncomfortable with signals and statistical signal processing – which is not meant as a criticism.

There has been considerable argument about the application of linear models to data that has been “smoothed”. Having worked in biomedical signal processing (and did my PhD in a bio-signal processing lab 30 years ago), this is controversy that has raged for years.

Why do you want to smooth a signal? In doing so, you are making a judgement about the significance of different frequency components in the signal. This depends on the hypothesis you wish to test. My reaction is that one should state, a priori, the hypothesis cast in signal processing terms and should then design the DSP operations so that the data are fit to test that hypothesis. Smoothing is, of course, a filter, and most climate scientists tend to use running averages. This is a filter with a very poor response, which again depends on what you are trying to achieve. If you smooth and then decimate the data, you can run into considerable problems by aliasing, irretrievably corrupting the time series and creating false harmonics in the data.

http://judithcurry.com/2011/10/18/does-the-aliasing-beast-feed-the-uncertainty-monster/
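The decimation hazard just described is easy to demonstrate (a sketch in Python with numpy; the 7-day cycle and the 10-day decimation interval are invented numbers chosen to make the alias land on an exact FFT bin):

```python
import numpy as np

days = np.arange(700)
x = np.sin(2 * np.pi * days / 7.0)         # a pure 7-day cycle, sampled daily

# Decimate to one sample every 10 days WITHOUT an anti-alias filter.
xd = x[::10]

# A 7-day cycle cannot be represented at 10-day sampling (the Nyquist
# period is 20 days), so it aliases into a false low-frequency cycle.
spec = np.abs(np.fft.rfft(xd))
freqs = np.fft.rfftfreq(xd.size, d=10.0)   # cycles per day
peak_period = 1.0 / freqs[np.argmax(spec[1:]) + 1]
print(peak_period)                         # about 23.3 days, not 7
```

The 7-day cycle has not merely been lost; it has reappeared as a spurious ~23-day “signal” that no amount of later analysis can distinguish from a real one.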

When faced with this problem, which I grant may not be important if one is looking at very low frequencies, many would respond with statements along the lines of: “I think this is unimportant”. My response is that you don’t know how important it is until you have analysed the problem.

As regards using tests on smoothed data, say you have two noisy (Gaussian) signals and you want to compare their means: there is a wide body of theory. Linear operations, i.e. filtering, will attenuate (or amplify) the power spectrum and hence the variance of the signal. The power is a chi-squared variable with 2 degrees of freedom at each harmonic, resulting in a negative exponential distribution at each harmonic. The equivalent degrees of freedom can be calculated, allowing one to use statistical measures on the filtered signal. This is of course equivalent to using the signal autocorrelation function. I don’t find the problem of statistical testing of parameters of filtered signals particularly disturbing as long as it has been correctly analysed. It becomes more difficult when the signal distribution is decidedly non-Gaussian; I have grappled with measuring signals from within the heart that have interference from x-ray machines, which has a horrible distribution and can be highly non-stationary. Development of an empirical distribution through simulation is helpful under these circumstances.

A related problem is Kriging, which makes assumptions about smoothness. I see Nullius in Verba has said what I was going to say, but I do think that analysis in the spatial frequency domain is helpful in understanding what it is doing to the data, because in its simplest form Kriging can be recast as a filter.

My impression is that the concepts of signal processing add to the analysis of time series. They do not invalidate what has been done, but they may give a better understanding of how the data has been manipulated.

I completely agree with you about parameters not being observables and the distinction is very important.

In my day job, we run econometric models on a daily basis and it is one of my tasks to specify those models (and to run estimates when the data warrants an update to existing equations).

Now, any decent econometrics textbook will tell you that, at the end of the day, your regression results reflect the greatest probability of the relationships in the data per specification and equation type: nothing more, nothing less. That favorite of beginners estimating data is r^2, which ideally is up in the .90s; but in the real world people should be happy when it’s greater than 0.6 or so, as that tells you, basically, that the equation you’ve specified is able to explain, to a better degree than simple accidental correlation, whatever it is you’re trying to explain.

If you take noisy time series and smooth them, be it by moving averages, trend components from seasonal adjustment, or other smoothing techniques, you will always end up with a vastly better r^2, but with vastly worsened ability to explain what you want to explain. Simply put, the relevance of what appears to be noise may, in fact, be data that falsifies what you are trying to say. It can only be considered a cardinal sin to use smoothed time series when running any sort of regression analysis.

In cases of really noisy data, specification becomes the critical task for econometricians, not estimation. Oh, and poor results from estimation attempts also tell you what something is probably not: this is one of the major driving forces behind folks trying to prove x when the raw data fail to support it cleanly and properly. For me, using smoothed time series just reeks of desperation to prove something in the face of data that doesn’t support the conclusions.

That said: why oh why hasn’t someone actually run seasonal adjustment – preferably X11, rather than X12, since the latter is too heavily dependent on VAR estimation parameters to be useful to anyone but a seasonal adjustment fanatic – on temperature data? That would eliminate outliers in a meaningful manner and provide, over time, a clearer picture of what temperature development may well be. I know a former colleague of mine invited ridicule for doing exactly that for 30 years’ worth of monthly data and generating a completely stationary time series (for the Canton of Basel, Switzerland).

Oh, and it just occurred to me why not: the standard package for seasonal adjustment from Canadian Census is limited to 30 years of data. But this is merely a technical constraint. The code is out there (albeit COBOL and C, both of which are tedious to decipher for those not properly introduced to the mysteries) and could be modified…
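For what it’s worth, here is a toy sketch of classical seasonal decomposition, the core idea that X11 elaborates (this is not X11 itself, and the monthly data are invented), in Python with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(360)                          # 30 years of monthly data
seasonal = 10 * np.sin(2 * np.pi * months / 12)  # known seasonal cycle
trend = 0.01 * months                            # slow warming
x = trend + seasonal + rng.normal(0, 1, months.size)

# Classical decomposition: estimate the trend with a centred 12-month
# moving average (half weights on the end taps), then average the
# detrended values by calendar month to get the seasonal component.
kernel = np.ones(13)
kernel[0] = kernel[-1] = 0.5
kernel /= 12.0
trend_hat = np.convolve(x, kernel, mode="same")
detrended = x - trend_hat
seasonal_hat = np.array([detrended[m::12].mean() for m in range(12)])
seasonal_hat -= seasonal_hat.mean()              # seasonal factors sum to zero

adjusted = x - np.tile(seasonal_hat, 30)         # the seasonally adjusted series
print(np.max(np.abs(seasonal_hat - seasonal[:12])))  # recovered to within the noise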

Nullius, thanks for the reply. However I do not buy it, and I ask that you reconsider a couple of points.

“By looking at the high frequencies (the sudden jumps in the data), you can partially reconstruct what effect they must have had on the low ones.” The sudden jumps in the data are near or above the Nyquist frequency, which means you cannot use their frequency content at all. In earthquake seismology, can you reconstruct the low-frequency Love and Rayleigh waves by looking at the P and S? No.

Suppose we take a simple saw-tooth function:

T(t) = A(t) + B*([(t+c) mod 365]) + Er(t)

Where T is temp, t is time in days,

A(t) is the GW signal, the Temperature anomaly if you will, very low frequency,

for simplicity of example, treat A(t) as a constant A,

B is a seasonal overprint, and Er(t) is random daily noise or error.

The expected value of T(t), looking at the entire wave as one piece, is A + B(365/2), with a trend theta = dT/dt of 0.

There is a discontinuity at t = 365-c. If a scalpel is to be used, it will be here. (Why anywhere else?). For each segment, the slope (theta) dT/dt = B.

It is a fundamental assumption in the BEST process that the absolute temperatures cannot be trusted, but the trends within a station and between stations are where to place your trust. But the use of the scalpel alters the trends. In the sawtooth example, the real trend is 0, the post scalpel trend is B.
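The sawtooth claim is easy to check numerically (a sketch in Python with numpy; B, c, and the record length are arbitrary, and the scalpel is assumed to cut exactly at the discontinuities):

```python
import numpy as np

B, c, years = 0.05, 100, 10
t = np.arange(365 * years)                 # daily time index
T = B * ((t + c) % 365)                    # the sawtooth, with A = 0 and no noise

# Trend of the whole record: essentially zero, as expected.
whole_slope = np.polyfit(t, T, 1)[0]

# "Scalpel" the record at its discontinuities and fit each segment separately.
breaks = np.where(np.diff(T) < 0)[0] + 1
segment_slopes = [np.polyfit(t[a:b], T[a:b], 1)[0]
                  for a, b in zip([0, *breaks], [*breaks, t.size])]

print(whole_slope)                         # ~0: the true long-term trend
print(segment_slopes[1])                   # = B: every post-scalpel segment trends up
```

Every within-segment trend is B, so any procedure that trusts within-segment trends alone will report warming where the full record shows none; the question is whether the re-assembly step recovers the lost level information.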

I agree with you that

the breaks constitute a set of step functions. I completely lose you at “The observed values are the true values plus some member of this subspace.” I think the problem is that the step contains not only error, but all the low-frequency data content, and it is impossible to tell what is error and what is signal.

RC Saumarez,

It is an excellent question and I’ll give you a sketch of an answer here. I love signal processing and am entirely comfortable with it. A filter is a measurement-error model. To have such a model means having knowledge of the probabilistic structure of the noise. This knowledge is often available, as it is in electronics, where there is good theory for why the noise is this and such.

An electronic signal or a temperature (barring error in the measurement apparatus) is what is actually experienced. In the case of an electronic signal, there is knowledge that it is corrupted by noise which has to be removed. You take the experience and manipulate it (filter it) to create a new experience. But where is the noise in the temperature? Objects feel the temperature as it is, and not what is “behind” it.

Now, if you think that there is some “generating process” that lurks behind the temperature and you know what that generating process looks like (physically, I mean), then you can fit a model to the actual temperature to discover things about that process. You can use this model to predict new temperatures. But the model cannot change what objects actually felt.

What is the process/signal for generating temperature? Well, the (disputed) physics of that are what make up GCMs. Filtering can be used on temperature to reveal phenomena in line with these models, of course, and as such is useful. But filtering that is purely statistical and not-physically based is suspect.

And it is still true that the more a time series is smoothed before any modeling is done on it, the more certain the results will appear. This is just a fact. Example: take two random sets of numbers, smooth them, and correlate them. The more smoothing, the higher the correlation.

This, as I said, is too sketchy (and too hurriedly written). I hope to clarify this soon.
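The smooth-then-correlate effect is easy to simulate (a sketch in Python with numpy, using running means as the smoother; window widths and series length are arbitrary). The mean of r stays at zero, but its distribution widens, so the typical magnitude of the correlation you observe grows with the smoothing:

```python
import numpy as np

rng = np.random.default_rng(42)

def smooth(x, w):
    """Running mean of width w (w = 1 means no smoothing)."""
    return np.convolve(x, np.ones(w) / w, mode="valid")

def mean_abs_corr(window, trials=500, n=200):
    """Average |r| between two freshly generated, unrelated noise series."""
    out = []
    for _ in range(trials):
        a, b = rng.normal(size=(2, n))
        out.append(abs(np.corrcoef(smooth(a, window), smooth(b, window))[0, 1]))
    return float(np.mean(out))

r1, r10, r40 = (mean_abs_corr(w) for w in (1, 10, 40))
print(r1, r10, r40)   # increasing: the more smoothing, the bigger the spurious r
```

Smoothing leaves fewer effectively independent samples, so the same formula for r is being fed far less information than its sample count suggests.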

Response to DocMartyn.

Weather stations do not measure the average temperature. Usually they measure the max and min temps, the extremes. The curve fitters take the median temperature and call it average, which it is obviously not.

I am delighted to learn that Dr. Briggs is experienced in signal processing. I used to work in the DSP section of a major corporation 30 years ago. I am a bit rusty and out of date, so I have a question for whoever can answer. Isn’t taking the monthly average of temperatures the same as using a transversal filter on the data? The transversal filter is described by its Z transform. In the old analog TV receivers, transversal filters were used to comb-filter the video signal to separate the luma and chroma signals. Wouldn’t there be a similar combing effect on the temperature data? Does it matter?

Mr. Briggs,

A filter is a measurement-error model? Care to take the challenge of explaining it to me?

Oh… I thought I understood the paper just fine. LOW-CUT? High-pass filter? Hmmm… ??? Here is my understanding of homogenization and the scalpel.

Let’s say that by law an office building thermostat is to be set at 68°F during the winter. Presumably, then, the mean temperature of the building is 68°F. The facility management would regularly collect temperature (temp) data to check whether the thermostat is working properly.

** Homogenization **

I have this forever-important seniority that gets me a small office with not only a nice window view but also a heating/AC vent right above my spacious desk. So my office tends to be warmer. Yes, I can feel the warmth physically.

Assume there is a systematic measurement error, e.g., the temp readings in my office are systematically 2°F above those measured in other offices because of the vent. The number “2” is estimated based on the temp readings from neighboring offices. In this case, subtracting 2°F from the temp data ascertained from my office may be more accurate.

The adjustment by a constant won’t affect the cyclical pattern. Of course, there is the issue of how to adjust for the systematic error… which can get complicated and may affect the cyclical pattern (if any)… and hence Mr. Stephen Rasey’s point may be relevant.

** Scalpel **

Once the facility management office discovers the problem with the temp readings in my office, instead of incorporating the data from my office as reliable data, they decide to downweight them. In other words, the temp data from my office are cut loose from the rest of the data when used to assess whether the thermostat is working correctly. No modification of data, just downweighting.

Hand-wavy explanations, but if you want rigorous ones, do check out the papers cited in the paper.

Oh… another hand-wavy explanation.

** Averaging reduces error **

Ask 30 people to measure the length of your foot (x). Due to whatever reason, some will obtain a length larger than x (positive error) and some smaller (negative error). Presumably, those positive and negative measurement errors will cancel each other out when calculating the average of those 30 measurements. So, the average will probably give you the most accurate measurement (i.e., less error) of your foot length.
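That hand-wavy explanation can be put in numbers (a toy simulation in Python with numpy; the foot length and error spread are invented). The standard deviation of the average of 30 independent measurements is the individual error divided by sqrt(30):

```python
import numpy as np

rng = np.random.default_rng(1)
true_length = 27.0                # cm: the foot's true length (invented)
sigma = 0.5                       # spread of an individual's measurement error

# Many rounds of "30 people measure the foot".
measurements = rng.normal(true_length, sigma, size=(10000, 30))
averages = measurements.mean(axis=1)

print(measurements.std())         # ~0.50: the error of a single measurement
print(averages.std())             # ~0.09: sigma / sqrt(30), the error of the average
```

Note the caveat baked into the simulation: the errors must be independent with mean zero. A systematic bias (everyone using the same mis-marked ruler) does not average away.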

Stephen,

The reconstruction is based on the data satisfying the error model. If you present it data that doesn’t satisfy the model, of course it will give the wrong answer. The error model here is a linear combination of step functions, with the steps shifted to the breakpoints between station moves or instrument changes.

Take as your data:

T(t) = A(t) + S(t) + E(t) + Sum_i k_i H(t – t_i)

where A(t) is your climate signal, say a slow rise or a sinusoid with multi-decadal period,

S(t) is the seasonal variation, a sinusoid with period 1 year,

E(t) is a (small) measurement error – uniform quantisation error for example,

k_i is a (large) constant offset for each step, positive or negative,

t_i are the times of the discontinuities, well separated,

H(t) is the step function at zero.

Now chop the data into segments at the times t_i, and then line each segment up with its neighbours to give a smooth line. Does the result look anything like A(t)+S(t)? Where could the low frequency information A(t) have come from, given that you chopped the data up into short segments?

If you’re trying to say that the procedure doesn’t give an identity – that it only works if the data happens to satisfy the smoothness assumptions and error model – I agree. If you’re trying to say that it is in principle impossible to genuinely reconstruct low frequency information from piecing together short segments, even when you can rely on additional constraints on the data, I don’t. I may be misunderstanding the point you’re trying to make.
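Here is a toy version of that splice-and-reconstruct exercise (Python with numpy). To be clear about the assumptions: the break times are taken as known exactly, and the jump at each break is estimated by linear extrapolation from the two preceding samples, which is a stand-in for illustration, not a description of BEST’s actual method:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(3650.0)                       # ten years, daily
A = 0.0005 * t                              # slow "climate" signal
S = 5 * np.sin(2 * np.pi * t / 365.25)      # seasonal cycle
E = rng.uniform(-0.05, 0.05, t.size)        # small quantisation-like error
t_i = [800, 1900, 2700]                     # well-separated break times (known)
k_i = [4.0, -6.0, 3.0]                      # large station-move offsets

obs = A + S + E
for ti, ki in zip(t_i, k_i):
    obs[ti:] += ki                          # each break shifts all later data

# Reconstruction: at each break, estimate the jump by linear extrapolation
# from the two preceding samples, then undo it from that point onward.
rec = obs.copy()
for ti in t_i:
    jump = rec[ti] - (2 * rec[ti - 1] - rec[ti - 2])
    rec[ti:] -= jump

print(np.max(np.abs(obs - (A + S + E))))    # up to 4: the raw data are badly offset
print(np.max(np.abs(rec - (A + S + E))))    # small: low-frequency content survives
```

The recovery works precisely because the data satisfy the stated constraints: smooth signal, small measurement error, large well-separated steps at known times. Each violated assumption degrades it, and the small per-break errors accumulate down the record, which is where the extra uncertainty should show up.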

Correction to 10:44am.

There are multiple discontinuities at t where ((t+c) mod 365) = 0. If a scalpel is to be used, it will be here. (Why anywhere else?) For each segment, the slope (theta) dT/dt = B.

wsbriggs

You comment as to why the data started in 1950 rather than the warmer 1940’s.

I have written several articles noting that as GISS started in 1880 it therefore measured from a trough, rather than the peak immediately prior to this date, which accentuated the rise. It’s a curious start date, as the number of stations at that time was not sufficient to give anything like global coverage, nor did they produce consistent data, since the Stephenson screen (for example) did not come into universal use until decades later.

One possible motive was that many US records started from that time, but a genuine global record could not be attempted before at least 1950. Why 1880?

tonyb

Nullius,

I think we are close to agreement. The key point is that the low-frequency content of the reconstructed signal has to come from additional constraints not present in the splices. That’s part of the U(t) user input, which must come from some objective data. What would that be except purposeful estimation of the discontinuities?

I think that is the step where the extra information is necessary. What is it? Where is it? Other than in the information the suture left on the cutting-room floor?

It would seem to me to be critical to analyze the distribution of the step:

Sum_i k_i H(t – t_i)

in the processed data. There may be a great deal of anthropogenic temperature effects in that data… temperature effects caused by the human wielding the scalpel.

Thank you for your reply. A filter is not necessarily a measurement-error model. One can use it to separate two wholly deterministic processes that operate at distinct frequencies.

If you are going to use it to reduce errors, it assumes that you know the distribution and harmonic structure of the noise. Using a low-pass filter assumes that the noise is high frequency, and by implication that the low frequencies are not noise. This depends on one’s model of the errors.

I’m sorry, I cut myself off. You are obviously correct in what you say, but my impression is that one has to consider the low-frequency content of possible noise in temperature signals as well as the physical model and how this might arise in the temperature record. One thing I would like to have seen in the BEST analysis is a more formal error analysis, in the engineering sense of modelling the error process, as opposed to assuming it, to see how this affects the error limits. This could be approached through simulation.

Having said this, I am not “denying” the BEST study, but I am not convinced that the error limits in the earlier part of the record are correct, because it does not seem to me that the sampling problem has been tied down. This part of the record is, in my view, important because it is central to the null hypothesis that the trends in the recent record are indistinguishable from those in the past.

Briggs,

I need to cut and paste this blog and its comments to a Word document so that I can read it over and over and use its content to teach young engineers about the importance of questioning your data.

Is that allowed?

RC Saumarez,

Just a comment from a practicing electrical engineer.

This is actually more intended to make me feel good than as a response to your comments.

Forgive me.

Thermal noise is your friend.

It is the most reliable signal you have.

If you input thermal noise into a system and do not get the expected result – find out why.

If you want to find “features” in your computer model – simulate noise and see if it comes out of your model as noise.

Bill S,

Without wanting to step on Briggs’ toes, yes, my understanding is you can pretty much do whatever you want with this post, as long as you give proper attribution and do not misrepresent what Briggs has said. By posting it on the web, Briggs has decided to make his post available, for free, to anyone, anytime, anyplace in the world. Further, due to the Wayback Machine, even if he later decides to change his post — too late — anyone can still find it. He has no expectation of preventing broad distribution and no expectation of profits from what he wrote (although I’m sure he’d love a huge tip, if anyone is inclined!). You can copy the entire post and critique or praise it, paragraph-by-paragraph, sentence-by-sentence, word-by-word. The key is to make sure you are attributing the writing to Briggs.

Edit note: “The authors did not some checking” seems incoherent.

About smoothing and messy signals, an interesting biological observation:

The brain likes messy. The reason digital tones (e.g., phone rings, beeps of all kinds, etc.) are hard to locate (even for cats, with their moveable ears) is that there’s not enough info. The brain detects arrival lags between ears by matching the messy harmonics etc. of individual waveforms. If they’re missing, so is directional data.

Here’s a de facto denial of the BEST study to chew on:

http://wattsupwiththat.com/2011/10/24/unadjusted-data-of-long-period-stations-in-giss-show-a-virtually-flat-century-scale-trend

Response to Ray:

Wow, a reference to transversal filters – that takes me back. I believe that a transversal filter is basically an FIR filter (finite impulse response). The ‘smoothers’ that these statistics folks talk about appear to be FIR filters with all coefficients equal to unity, or a “box car” filter.

The combing effect is due to the ‘zeroes’ in the frequency response of the filter. The FIR filter implements only zeroes (no useful poles).

Does applying a smoother to temperature data have a combing effect? Yes. Does it matter? Well, assume you have years of temperature data consisting of weekly samples. If you apply a smoother that averages 52 adjacent samples in time, the underlying yearly variations in temperature are completely removed from the output series. Also completely removed are any underlying cyclic variations with a repetition rate of twice a year, three times a year, four times a year, etc.

Assume instead you apply a smoother that averages 26 adjacent samples. The underlying yearly variations are still in the output data, except attenuated to about 2/3 of the input value.

And, of course, if each output sample is the average of the previous n input samples, you’ve introduced a time delay of n/2 samples, which would only matter when comparing to another time series, I guess. Or you could create the output sample as the average of the previous n/2 samples and the following n/2 samples to eliminate the time delay.
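The comb response described above can be computed directly from the standard magnitude response of an n-sample running mean (a sketch in Python with numpy; weekly sampling and a 52-sample yearly cycle as in the example):

```python
import numpy as np

def boxcar_gain(n, period):
    """Magnitude response of an n-sample running mean at a cycle of the
    given period (both measured in samples): |sin(pi f n) / (n sin(pi f))|."""
    f = 1.0 / period
    return abs(np.sin(np.pi * f * n) / (n * np.sin(np.pi * f)))

# Weekly samples: the yearly cycle is 52 samples long.
print(boxcar_gain(52, 52))   # 0: a 52-sample average removes the yearly cycle
print(boxcar_gain(52, 26))   # 0: and the twice-a-year cycle, and so on
print(boxcar_gain(26, 52))   # ~0.64: a 26-sample average leaves about 2/3 of it
```

The zeros at the period and all its harmonics are the comb; everything between the teeth is merely attenuated, and by different amounts at different frequencies, which is why the running mean makes such a poor deliberate filter.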

I think that unlike you, though, I am a bit disconcerted by the discussion drift into signal processing. I supposedly should be a lot more comfortable discussing signal processing than statistics, but about all it has done is confuse me. For example, Briggs says “Take two random sets of numbers, smooth them and correlate them. The more smoothing, the higher the correlation.” This is not at all intuitive to me. For example, if I start with two independent, uncorrelated white noise sampled-time series, and smooth each of them into two independent pink noise series, the cross correlation between the two output series will still be zero, it seems to me. Hopefully I’m getting hung up on a difference in terminology and that my signal processing intuition isn’t that badly broken. (I do see that the auto-correlation of each output series taken individually will change, widening from the single non-zero value at zero delay to some peaked shape centered at zero delay, but I don’t think this is what Briggs was talking about – widening the auto-correlation peak is simply a restatement of the definition of smoothing.)

@ Briggs.

I’ve been cogitating about the problem overnight.

1) In the temperature signal, there is the actual temperature of the air at the thermometer, and there are errors. These errors are due to miscalibration, quantisation (i.e. nearest deg F), and drift (which may include changing the instrument).

2) The temperature itself is variable. This is a property of the system, as you point out.

3) Averaging data in this context is a statement that the short-term variability is unimportant. It also assumes that the low-frequency component is independent of the high-frequency variability. This is a statement that we can treat the processes underlying temperature as a linear combination of a trend imposed on the short-term variability. If we average the signal, we have a slowly varying signal with excursions. As you point out, this is a property of the actual temperature, but it becomes treated as an “error” and gets lumped in with true errors. Averaging makes this “error” normally distributed. I do not think that this is a correct way of looking at the problem, but it makes the problem far more tractable in terms of using linear statistical models.

4) Further errors arise if the variability is incorrectly represented. My instinct is that this is more important in the spatial domain than in the time domain, and aliasing may be (almost certainly is) present, especially in the earlier part of the reconstruction. This is important in the BEST model because the spatial weighting function is assumed to be calculable from the cross-correlation between time-domain signals. There is no guarantee that this is physically correct because of a) sampling problems; b) it is assumed that the temperature stations capture all processes driving temperature; and c) it is not obvious why one would expect a linear relationship between the temperatures at physically distant stations. (A cold front passes over one station and then may strengthen or dissipate as it passes over other stations, which is highly non-linear. “That doesn’t matter, it will all come out in the wash.” I don’t know if it matters – it needs to be analysed as to whether it is linearisable.)

5) The assumption behind a mean temperature is that this is a measure of the total energy in the system. It is a construction and a formalisation that does not capture the underlying process and is thermodynamically incorrect unless the system is at equilibrium. As you say, one would ideally have a true mathematical model of the process generating temperature. In this case, one might not even use mean temperature as a description of that model, because the climate is not even in a steady state, and is an intrinsic variable. This reflects what I regard as an important problem: if you write an equation using a variable in a system, you have made an assumption about how the system works, and the way you analyse the data should reflect those assumptions. In other words does the analysis actually make sense in terms of a physical model?

6) Is there a halfway stage between the globally averaged temperature and a fully accurate mathematical model? The key, I think, is to characterise the temporo-spatial variability in the data, rather than reducing it to “noise” or “error”. The spatial variability of US temperature trends suggests that this is an important component of the temperature signal. I would be inclined, if I were active in this field, to make a probabilistic model that exposes global spatio-temporal variability as a continuous function, although the data may be insufficient to do this. Simulation of different station densities and of the analysis process might give a more nuanced understanding of the errors in calculating mean temperatures.

@ Milton Hathaway.

Your comments about the zeros of an averager (i.e. a filter that has a rectangular impulse response) are well made. Hence my comments about averaging a signal being a lousy filter – any filter applied to a signal should be designed for the purpose of a specific analysis.

The statistical effects of smoothing can be illustrated as follows. You have two time series of broadband, Gaussian noise with the same variances. You wish to test whether their means are different. Quite reasonably, you could do a t-test. You then low-pass filter these signals. If you calculate the distribution of the t-statistic, it will be wrong. This is because the samples are no longer independent. You have performed a convolution, so that each sample will be a weighted sum of the samples surrounding it.
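A quick simulation of that t-test point (Python with numpy; the window length and sample sizes are arbitrary, and a normal-approximation cutoff of 1.96 stands in for the exact t quantile):

```python
import numpy as np

rng = np.random.default_rng(3)

def reject_rate(window, trials=2000, n=300):
    """Fraction of nominal-5% two-sample t-tests (samples treated as
    independent, normal-approximation cutoff) that reject, for two
    UNRELATED noise series smoothed by a running mean of the given width."""
    rejections = 0
    for _ in range(trials):
        a, b = rng.normal(size=(2, n))
        if window > 1:
            k = np.ones(window) / window
            a = np.convolve(a, k, mode="valid")
            b = np.convolve(b, k, mode="valid")
        m = a.size
        t = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / m + b.var(ddof=1) / m)
        rejections += abs(t) > 1.96
    return rejections / trials

r_raw = reject_rate(1)       # ~0.05: the test keeps its advertised size
r_smooth = reject_rate(20)   # far above 0.05: the t distribution is wrong
print(r_raw, r_smooth)
```

Smoothing shrinks the sample variance much faster than it shrinks the variance of the mean, so the denominator of t is too small and the test rejects a true null far more often than the nominal 5%.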

If you were to correlate these signals, you would expect a CC of zero, which you would get with an infinitely long sequence. In practice, r would be distributed about zero. As you reduce the bandwidth of a LPF on these signals, the distribution of r will become wider, while still having a mean of zero. Intuitively, this is because the variation of each sample has become restrained. Suppose you limited the signal to one harmonic. The correlation between these two harmonics would range from +1 (in phase) to -1 (in antiphase). This is merely a statement that a random signal ideally has a flat amplitude spectrum and a random phase distribution, and therefore the correlation coefficient must vary between +1 and -1; while in a broadband signal, the deviation depends on the vector sum (i.e. amplitude and phase) of all the harmonics, which should ideally be zero, but for a signal of finite length will be close to it.

As a comment, one has to think carefully about phase in the statistical analysis of signals.

Mr. Briggs,

Have you considered publishing your comments? Together with Mr. Keenan perhaps, or independently from him? It looks like the BEST project is going to be used in a lot of places as an argument in the climate mitigation debate. A valid criticism of the certainty levels of the results would be an important contribution.

Grzegorz Staniak,

Good question. I notified the BEST people of this critique and they have already acknowledged it.

It would be difficult to publish this critique because the paper which is involved hasn’t been published either, and is just available on a web page (like this).

Hi, R C Saumarez,

A correlation coefficient falls between -1 and +1 simply because of its ingenious mathematical definition. ^_^

I don’t need more explanations on spectrum/Fourier analysis and errors-in-variables. Not at all!!! However, would you please point out why the low/high-pass filtering is relevant to the models and methods employed in this paper? It annoys me because I fail to see the connection.

Dear Mr. Briggs,

The paper has stated that their sampling methods tend to underestimate the true errors. The authors already pointed out many pitfalls in their methods. I think what the authors want is constructive suggestions, not for us to reiterate those problems. Don’t you agree?

I recall that BEST’s four papers have been submitted to GRL.

It is perhaps too much to hope that the editors of this journal have the open-mindedness to invite a certain WM Briggs to be a reviewer.

Is it worth suggesting to BEST (if not too late) that you’re happy for your name to be put forward in this capacity?

JH,

(1) Which of my criticisms would you say is not constructive?

(2) What do you say of Muller’s claim that global warming is thus proved by BEST’s results, in the sense that the physical theories underlying man-made temperature change are thus true? (See my previous day’s essay.)

@JH. I’m sorry, having read your post, I thought you were asking why filtering affected correlation. Having re-read it, it still appears to be asking that.

If you have ordered data in space or time and one performs a linear operation on them that involves interaction between adjacent samples, this can be treated as a signal processing operation that has a particular mathematical framework. In general, filtering reduces the number of statistical degrees of freedom when one is considering the data in terms of a distribution. Therefore there is an intimate relation between signal processing operations and the statistical properties of the data. With paired data, one can perform a correlation between them. If they are normally distributed independent observations, one can interpret this. If the independence between samples is lost, by filtering, the variance of the distribution of a correlation coefficient between two signals is increased. This will naturally reflect itself in the ACFs and CCF, since the correlation coefficient is a power-normalised version of the CCF. This is easily demonstrated analytically for signals with a Gaussian distribution.
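The loss of degrees of freedom can be quantified with the usual rule of thumb N_eff = N / (1 + 2·sum of autocorrelations) (a sketch in Python with numpy; truncating the autocorrelation sum at the first negative lag is one common convention, not the only one):

```python
import numpy as np

rng = np.random.default_rng(11)

def effective_n(x, max_lag=50):
    """Effective sample size: N / (1 + 2 * sum of leading autocorrelations)."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:] / np.dot(x, x)
    rho = acf[1:max_lag + 1]
    if np.any(rho < 0):
        rho = rho[:np.argmax(rho < 0)]     # truncate at the first negative lag
    return x.size / (1 + 2 * rho.sum())

white = rng.normal(size=2000)
smoothed = np.convolve(white, np.ones(10) / 10, mode="valid")

n_white = effective_n(white)
n_smooth = effective_n(smoothed)
print(n_white, n_smooth)   # ~2000 vs ~200: a 10-point running mean costs ~10x the DoF
```

For a width-10 running mean the theoretical autocorrelations sum to 4.5, giving exactly a factor-of-10 reduction; any test that uses the raw sample count instead of N_eff is overstating its evidence by about that factor.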

Kriging is a method of spatial interpolation that is potentially aliased and must reduce the number of DoF in the spatial domain. It can also be shown to be a spatial low pass filter. Given the uncertainties in the data, although Kriging is a linear technique, the transformation between the true spatial data set and the “Kriged” version may be unpredictable and non-linear. The argument is about how this affects the confidence limits of the estimate in temperature.

I agree that the BEST authors acknowledge that sampling problems increase uncertainty. The question is the reliability that this places on statistical estimates and how you look at the problem.

If you are going to consider the statistical properties of varying functions that have been manipulated by signal processing operations such as averaging and interpolation, it is therefore natural to ask how this affects statistical inference. In many fields, this has been studied extensively, but in this application the effects are less clear. Nevertheless, the question of inference from the “error limits” of the reconstruction is very important in comparing different epochs.

Mr. Briggs,

Well, I think the authors would welcome any suggestions! This paper is interesting because it tries to offer solutions to problems. Although there are many issues that need to be further investigated, as admitted by the authors themselves, that makes it even more interesting!

My main interest has always been how and why a solution works. Not… what has Muller said? Is it important, or just some very-ignorable hot air?

RC Saumarez,

Thanks for your answer. How the modified method of Kriging works is yet to be studied.

BTW, the correlation used in Kriging is modeled as a function of distance; no computation of correlations between two smoothed series is involved.

Though I asked for an explanation of the statement “a filter is a measurement-error model,” I really wasn’t looking for answers. I was being mean to Mr. Briggs! ^_^

Thank you, you clearly want to make an argument for the sake of having an argument.

Muller wrote this in the WSJ:

“Global warming is real. Perhaps our results will help cool this portion of the climate debate. How much of the warming is due to humans and what will be the likely effects? We made no independent assessment of that.”

Briggs even quoted this, piecemeal, in his article on the WSJ editorial.

So I don’t know what Briggs is talking about when he cites “Muller’s claim that global warming is thus proved by BEST’s results, in his sense the physical theories underlying man-made temperature change are thus true?”

Muller said nothing about ‘man-made temperature change’ other than that the degree to which the temperature change is man-made was not addressed by the BEST study.

Talk about wanting to make an argument for argument’s sake!

Dear RC Saumarez,

No, no argument for the sake of having an argument when it comes to statistics. Sorry, you are wrong, just as you are wrong in saying that I asked how filtering affects correlation!

Instead of immediately saying that “a filter is a measurement error model” is nonsense, I was sending a message to Mr. Briggs that I am willing to hear his explanations… only because I know there is a lot to learn, even though I have done research in measurement error/errors-in-variables models for many years.

I guess being polite is not necessarily a good thing.

William Briggs

I would be interested in your evaluation of how well the BEST papers address not just the Type A (statistical) errors, but also the Type B (bias) errors. I am curious why there appears to be so little use of the international guidelines for reporting uncertainty, especially separately reporting the Type A and Type B errors and the expanded uncertainty. Is that only applied at the national lab level?

See:

NIST Technical Note 1297 1994 Edition, Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results,

Barry N. Taylor and Chris E. Kuyatt

http://www.nist.gov/pml/pubs/tn1297/index.cfm

http://physics.nist.gov/Pubs/guidelines/TN1297/tn1297s.pdf

@JH. What you said was very specific: You could not see how filtering a signal from white to pink noise increased correlation. You hoped that your signal processing intuition wasn’t broken.

I do not agree with the statement that filtering a pair of random signals increases the correlation. It increases the variance of the estimated value of “r”.

The idea that it must increase certainty ignores the fact that the estimated power spectra of white noise and an impulse are identical (think about their ACFs) and that their difference lies in the phase spectra. Correlation is a phase-dependent effect, which resides in the CCF.
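The amplitude-versus-phase point can be checked numerically: the amplitude spectrum of a unit impulse is exactly flat, and the averaged periodogram of unit-variance white noise converges to the same flat level, so the two processes are distinguished only by their phase spectra (a quick sketch of my own, with arbitrary lengths and seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Unit impulse: its amplitude spectrum is exactly flat (all ones)
impulse = np.zeros(n)
impulse[0] = 1.0
amp_impulse = np.abs(np.fft.fft(impulse))

# White noise: the *expected* periodogram is flat too; averaging over
# many realisations makes this visible
pgrams = [np.abs(np.fft.fft(rng.normal(size=n))) ** 2 / n
          for _ in range(2000)]
mean_pgram = np.mean(pgrams, axis=0)

print(amp_impulse.min(), amp_impulse.max())  # exactly 1.0 at every bin
print(mean_pgram.mean())                     # close to 1.0
```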

I simply provided a simple explanation for this.

Mr. Briggs,

What do you think of Grant Foster’s critique of Keenan’s comments:

http://tamino.wordpress.com/2011/10/23/fake-skeptic-criticism-of-decadal-variations-in-the-global-atmospheric-land-temperatures/

and his assessment of one of the BEST papers:

http://tamino.wordpress.com/2011/10/24/decadal-variations-and-amo-part-i/

in the light of your article above?

@JH,

I apologise. I responded to Milton Hathaway, and your response seemed to be in my response, and I thought (through pure stupidity on my part) that I was having a conversation with MH.

JH

Following our cross-threaded conversation. Here is a different response, it also impinges on Briggs’ comments.

I agree that a filter can be viewed as a signal-error model. The improvement obviously depends on the relative spectra of the noise and the signal. If they overlap, you will affect both the noise and the signal. If you have multiple records with a known fiducial marker for the signal, noise can be reduced selectively by coherent averaging, which can be viewed as a noise-specific filter.
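Coherent averaging is easy to sketch: align N noisy records on the fiducial marker and average; the signal adds coherently while independent noise cancels as roughly 1/√N (a toy illustration of my own, with an assumed Gaussian pulse for the signal):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
signal = np.exp(-((t - 0.5) ** 2) / 0.005)  # pulse aligned to the marker

# 64 records of the same signal buried in independent unit-variance noise
records = signal + rng.normal(0.0, 1.0, size=(64, 200))
avg = records.mean(axis=0)

# Residual noise shrinks roughly as 1/sqrt(64) = 1/8 of a single record's
print(np.std(records[0] - signal), np.std(avg - signal))
```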

Personally, I wouldn’t describe a filter in that way. A (linear) filter is a system that has gain and phase at a particular frequency and there is a wide application of filter theory that doesn’t deal with noise, although practically noise may raise its ugly head, depending on what one is trying to do. I think the confusion may arise from the autocorrelation function which doesn’t have phase, in which case the concept of a signal error model is reasonable.

I do not agree that smoothing increases correlation. It increases the variance of the correlation and does not shift the mean to give one spuriously increased certainty in the estimation of correlation. (I had a bit of a double-take at this comment and actually did a simulation to convince myself that this is the case.)

I think this misconception may arise, again, from the use of ACFs. If you simply look at these, one might conclude that smoothing invariably increases correlation. However, the cross-correlation spectrum is the quotient of the individual spectra multiplied by the cross-phase spectrum. Therefore you have to think of correlations of time series as complex functions when applying filters to the signals. As I pointed out earlier, if you, in principle, smooth two random signals down to a single harmonic, the correlation coefficient between them will vary between +1 and -1, depending on their relative phases.

Following from this, if you filter two signals with different phase responses, you will completely mangle the correlation between them.
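That kind of simulation is easy to reproduce (a minimal Monte Carlo sketch of my own; the moving-average window, series length, and trial count are arbitrary choices): applying the same smoother to two independent white-noise series leaves the mean of the estimated correlation near zero but inflates its spread considerably.

```python
import numpy as np

rng = np.random.default_rng(42)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

n, trials, k = 1000, 2000, 25  # series length, Monte Carlo trials, MA window
kernel = np.ones(k) / k

r_raw, r_smooth = [], []
for _ in range(trials):
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    r_raw.append(corr(x, y))
    # identical moving-average filter applied to both independent series
    xs = np.convolve(x, kernel, mode="valid")
    ys = np.convolve(y, kernel, mode="valid")
    r_smooth.append(corr(xs, ys))

r_raw, r_smooth = np.array(r_raw), np.array(r_smooth)
print(r_raw.mean(), r_smooth.mean())  # both near zero
print(r_raw.std(), r_smooth.std())    # spread is much larger after smoothing
```

Fewer effectively independent samples survive the smoothing, so any single estimate of r is far less reliable even though it is not biased; treating it as if it had the raw series' precision is exactly the over-certainty at issue.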

Apologies for the cross threaded response.

Dear RC Saumarez,

You are right. Smoothing doesn’t necessarily increase the correlation! A counterexample can easily be obtained by simulation.

Sorry, busy day today.

Do they really try to estimate daily temperature by averaging the min and max, giving equal weight to each? What if it is hot for one hour in the afternoon and cold for the remaining 23 hours? Does this create the same amount of world famine as if the temp were hot for 12 hours? Sheesh.
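The worry is easy to make concrete: for an asymmetric diurnal profile, the min/max average can sit far from the true 24-hour mean (a hypothetical hourly profile with made-up numbers, purely for illustration):

```python
import numpy as np

# Hypothetical diurnal cycle: cool nearly all day, one hot afternoon hour
temps = np.full(24, 10.0)  # 10 degrees C baseline for 23 hours
temps[15] = 30.0           # a single 30-degree hour mid-afternoon

true_mean = temps.mean()                       # true 24-hour average
minmax_mean = (temps.min() + temps.max()) / 2  # the min/max estimate

print(true_mean, minmax_mean)  # about 10.8 vs exactly 20.0
```

The min/max estimator reads almost ten degrees warm here, because it gives the one-hour spike the same weight as the other 23 hours.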

Reanalysis of the global temperature statistics is interesting. But how about statisticians thinking a bit harder about the core of the problem in the first place:

* if one wanted to measure “global temperature”, where should one measure (are airports good places?)

* what sort of global thermometer grid might we use to be valid (should it have biased coverage over land)

* should we measure air temperature or solid/liquid surface temperature, or something else (are “land surface” temperatures in air comparable to “sea surface” temperatures under water)

What we have presently is a reanalysis of a shambles. What we need is a properly considered scientific and statistical approach to the whole subject.

@B Louis

You cannot measure “global temperature”, leaving aside for a moment its definition. But the question of the reliability of stations due to their locations and distribution has been addressed a number of times, from Hansen to BEST. And the main temperature series use algorithms which reduce the influence of “bad” stations in the long term. It has been seen on a few occasions that, on the one hand, raw data show the same pattern as the adjusted series, and on the other hand, that “good” stations actually show more warming than the bad ones. Also, satellite measurements show a rising trend too, even if a smaller one.

As for sea surface temperatures, they’re closely and directly related to the temperatures of the air masses above them. And more useful in research.