Statistics

Do not calculate correlations after smoothing data

This subject comes up so often and in so many places, and so many people ask me about it, that I thought a short explanation would be appropriate. You may also search for “running mean” (on this site) for more examples.

Specifically, several readers asked me to comment on this post at Climate Audit, in which appears an analysis whereby, loosely, two time series were smoothed and the correlation between them was computed. It was found that this correlation was large and, it was thought, significant.

I want to give you what I hope is a simple explanation of why you should not apply smoothing before taking correlation. What I won't dwell on is that, if you do smooth first, you face the burden of carrying the uncertainty of that smoothing through to the estimated correlations, which will be far less certain than when computed for unsmoothed data. I mean, any classical statistical test you do on the smoothed correlations will give you p-values that are too small, confidence intervals that are too narrow, and so on. In short, you can easily be misled.

Here is an easy way to think of it: Suppose you take 100 made-up numbers; the knowledge of any of them is irrelevant towards knowing the value of any of the others. The only thing we do know about these numbers is that we can describe our uncertainty in their values by using the standard normal distribution (the classical way to say this is “generate 100 random normals”). Call these numbers C. Take another set of “random normals” and call them T.

I hope everybody can see that the correlation between T and C will be close to 0. The theoretical value is 0, because, of course, the numbers are just made up. (I won’t talk about what correlation is or how to compute it here: but higher correlations mean that T and C are more related.)

The following explanation holds for any smoother, not just running means. Now let's apply an "eight-year running mean" smoothing filter to both T and C. This means, roughly, taking the 15th number in the T series and replacing it by the average of the 8th, 9th, 10th, …, and 15th numbers. The idea is that observation number 15 is "noisy" by itself, but we can "see it better" if we average out some of the noise. We obviously smooth each of the numbers, not just the 15th.
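For concreteness, here is a minimal R sketch of that kind of running mean; the eight-point window matches the description above, and the loop, seed, and names are only for illustration:

    # 100 made-up "random normals", as in the example
    set.seed(1)               # fixed seed only so the sketch is reproducible
    T = rnorm(100)

    # 8-point trailing running mean: the smoothed 15th value is the
    # average of the 8th through 15th original values
    width = 8
    T_smooth = rep(NA, length(T))
    for (i in width:length(T)) {
      T_smooth[i] = mean(T[(i - width + 1):i])
    }

    T[15]          # the original, "noisy" 15th value
    T_smooth[15]   # the average of values 8 through 15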

Don’t forget that we made these numbers up: if we take the mean of all the numbers in T and C we should get numbers close to 0 for both series; again, theoretically, the means are 0. Since each of the numbers, in either series, is independent of its neighbors, the smoothing will tend to bring the numbers closer to their actual mean. And the more “years” we take in our running mean, the closer each of the numbers will be to the overall mean of T and C.

Now let T' = 0,0,0,...,0 and C' = 0,0,0,...,0. What can we say about each of these series? They are identical, of course, and so are perfectly correlated. So any process which tends to take the original series T and C and make them look like T' and C' will tend to increase the correlation between them.

In other words, smoothing induces spurious correlations.
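To see both effects at once, the shrinking toward the mean and the growing correlation, here is a small R sketch; stats::filter() with a trailing window stands in for the running mean, and the helper name, widths, seed, and sample size are just illustrative (any single run will wobble, but the pattern is typical):

    set.seed(2)
    T = rnorm(100)
    C = rnorm(100)

    # trailing running mean of a given width; the first (width - 1)
    # values come out NA and are dropped in the calls below
    run_mean = function(x, width) {
      as.numeric(stats::filter(x, rep(1 / width, width), sides = 1))
    }

    for (width in c(1, 4, 8, 16)) {
      Ts = run_mean(T, width)
      Cs = run_mean(C, width)
      cat(sprintf("width %2d: sd = %.2f, correlation = %+.2f\n",
                  width, sd(Ts, na.rm = TRUE),
                  cor(Ts, Cs, use = "complete.obs")))
    }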

Technical notes: in classical statistics any attempt to calculate the ordinary correlation between T' and C' fails because that philosophy cannot compute an estimate of the standard deviation of each series (the standard deviation of a constant series is zero, so the correlation formula divides by zero). Again, any smoothing method will work this magic, not just running means. In order to "carry through" the uncertainty, you need a carefully described model of the smoother and the original series, fixing distributions for all parameters, etc., etc. The whole thing also works if T and C are time series, i.e. when the individual values of each series are not independent. I'm sure I've forgotten something, but I'm sure that many polite readers will supply a list of my faults.
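As a quick check on that last point about time series, here is a minimal sketch with two independent AR(1) series in place of the independent noise; the AR coefficient, window width, and sample size are arbitrary choices for illustration, and the smoothed correlation will typically come out larger in magnitude than the raw one:

    set.seed(3)
    # two independent AR(1) series: each value depends on its own past,
    # but the two series have nothing to do with each other
    T = as.numeric(arima.sim(list(ar = 0.6), n = 200))
    C = as.numeric(arima.sim(list(ar = 0.6), n = 200))

    smooth8 = function(x) as.numeric(stats::filter(x, rep(1 / 8, 8), sides = 1))

    cor(T, C)                                          # raw series
    cor(smooth8(T), smooth8(C), use = "complete.obs")  # smoothed series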



  1. Matt:
    Your explanation makes perfect sense to me and is very clear.

    If I put your comments together with those of other posters at CA, it seems as though two simple rules apply: (a) you can smooth or apply a filter when your smoothing/filter embodies a hypothesized and defined underlying process, e.g., seasonal adjustment; (b) you can smooth/filter for narrowly construed display purposes but not for calculation purposes. In measuring relationships between T and CO2, (a) amounts to adding a variable or set of variables to a regression equation. If you do not define these variables a priori, then you are essentially forcing into the equation variables that can spuriously increase the relationship.

    Is this about right?

  2. If you start with two independent series T & C, then smooth each individually, you get correlations within each series but not between them, because the values of T and C are still independent of each other – you’ve done nothing to relate them.

    Moving each closer to its mean doesn't make any difference. Your example T = 0,0,0,0 and C = 0,0,0,0 has no correlation. Remember that the cross-correlation has the standard deviations of T and C as denominators.

    The post on Climate Audit was a different issue – the smoothing in T & C yielded what looked like a clear peak in cross-correlation at a specific lag. The post made it clear that this happens with uncorrelated data too, so the issue is what test to apply to a cross-correlation of smoothed data, which, even if it existed, would be a worse test than the appropriate one on unsmoothed data. On this I agree with you: smoothing before statistics is stupid.

  3. William,

    You're right about the standard deviations in the denominator, which is why I mentioned that you can't calculate them in classical statistics. Put all statistical theory out of your mind, and anybody would agree that the two sets of numbers, T' and C', are certainly strongly related; in fact, they are the same, and therefore "correlated" in the plain English sense of the word.

    Anyway, in Bayesian probability, this can be formally calculated.

    Also, did you notice, in the Climate Audit post, that the peak correlations always appeared just around (a little before) the number of years used to smooth the data? No coincidence, that.
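    A minimal R sketch of that lag behaviour, with made-up independent series and an eight-point running mean standing in for the eight-year smoother (sizes and seed are only illustrative):

    set.seed(4)
    T = rnorm(200)
    C = rnorm(200)
    w = 8   # window width of the running mean

    smooth_w = function(x) as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
    Ts = smooth_w(T)[w:200]   # drop the NA padding at the start
    Cs = smooth_w(C)[w:200]

    # with independent inputs, the smoothing alone produces broad lumps of
    # "significant"-looking cross-correlation spanning lags on the order of w
    ccf(Ts, Cs, lag.max = 30)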

    Briggs

  4. As I recall, every filter applied to data introduces, in the frequency domain, a frequency component which is characteristic of that particular filter. If the same filter is applied to two different data sets, both "smoothed" data sets will now have a common frequency component.

  5. William,

    I disagree with your example. When there is no variability no meaningful correlation is possible. You cannot correlate two flat lines with any degree of certainty. The more ups and downs there are in the data that are correlated, the more meaningful a correlation is.

    Now, in the example at Climate Audit there just happens to be a correlation which IS meaningful (CO2 really does lag temperature), and the way to find this is to divide the data into many independent consecutive segments and then correlate these with the same lag. If the correlation is physical and not a product of smoothing, then the peak correlation should be at approximately the same lag time for every segment. I.e., the correlations intercorrelate.

    Also, I am fairly certain that it is possible to make some kind of smoothing correlation limit theorem, which states something in the vicinity of the following:

    Given N independent smoothed data points in two data sets there exists some lagged correlation c below which it is impossible to say whether the correlation is real or a statistical artifact.

    Thus, if you have sufficiently many independent (post-smoothed) data points, a sufficiently high correlation at some lag will yield a true relationship, not a spurious correlation.

    For instance, if you had 1 million monthly CO2 and temperature data points, and applied a 10-month running mean to both sets, you would still have 100,000 independent data points to correlate, and a high lagged correlation between the two (e.g. corr = 0.9) would necessarily NOT be spurious, because over such large data sets the autocorrelation caused by smoothing will even out and cancel.

    I am using an absurdly large data set here to illustrate my point. Correlating smoothed means introduces more uncertainties, but that does not mean that no information can be derived from them.

  6. Onar,

    I would think that anybody not having any training in classical statistics would look at two processes T and C and, finding that they always gave the same output to the same input, would say that they are correlated.

    This is the physical definition of (perfect) correlation, and so is the one used in objective Bayesian or logical probability views of statistics.

    It is also true, like I said, that you cannot calculate variability from constant data (or with, say, one or zero data points) in frequentist statistics. But you can in Bayesian statistics.

    You can see we are getting into deep waters with currents that will take us far from our main point.

    So do not take the original T and C all the way to their limits T' and C'; instead just smooth them: no matter what, the smoothed T and C will be closer to T' and C' than the originals, and so will show greater correlation, even classically.

    Here is some free and open source R code that can be used to simulate what I mean. The "12" in the running function is the window width of the running mean. The line library(gregmisc) need only be run once a session (make sure this package is installed first: install.packages('gregmisc')).


    library(gregmisc)     # provides running(); install.packages('gregmisc') first if needed
    T = rnorm(100)        # 100 made-up "random normals"
    C = rnorm(100)        # a second, independent set
    cor(T, C)             # close to 0
    # 12-point running means; pad=TRUE keeps the length at 100 by filling
    # the start of each smoothed series with NAs
    Tp = running(T, width=12, pad=TRUE, fun=mean)
    Cp = running(C, width=12, pad=TRUE, fun=mean)
    cor(Tp, Cp, use="complete.obs")   # drop the NA padding; typically much larger in magnitude

    Briggs

  7. As an engineer with modest statistics background, my rule is: Avoid smoothing data unless you really, really know what you are doing.

    Speaking of which, Dr. Briggs, could you email me? (Email in sig line.) One of my blog readers suggested I contact you so we could look at a few statistics. I suspect he can tell I'm puzzled about what to do with my 0-D climate "model", which is nicknamed "lumpy" on CA.

  8. William,

    I don't dispute that smoothing increases correlation, but the size of the smoothing window relative to the size of the data set must be of importance. Consider the ice core CO2 data, which has been naturally, physically smoothed by something close to a (weighted) running mean. The smoothing window is somewhere between a few decades and a few centuries. In my book, that means the ice core CO2 data is useless for short-term correlation studies, but fine for correlations ranging over hundreds of thousands of years. If smoothing should be avoided at all costs, then the ice core CO2 data should be thrown out.

  9. This post is both theoretically incorrect and empirically falsifiable. Generate 1000 random samples of iid Gaussian data each with 1000 points, then look at the sample correlation for various moving windows. They will average to 0, as they should. Proving this is correct is left as an exercise to the reader (hint: it takes 45 seconds and an undergraduate understanding of statistics).

  10. Name,

    Apparently you haven’t even tried your own suggestion. But I will grant you have an undergraduate understanding of statistics.
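    For anyone who does want to try it, here is a minimal sketch (sizes trimmed from 1000 for speed; the seed and 12-point window are just illustrative). The correlations do average to about zero, but smoothing makes their spread several times wider, which is exactly the problem:

    set.seed(5)
    smooth12 = function(x) as.numeric(stats::filter(x, rep(1 / 12, 12), sides = 1))

    raw_cors      = numeric(500)
    smoothed_cors = numeric(500)
    for (i in 1:500) {
      T = rnorm(200)
      C = rnorm(200)
      raw_cors[i]      = cor(T, C)
      smoothed_cors[i] = cor(smooth12(T), smooth12(C), use = "complete.obs")
    }

    c(mean(raw_cors), mean(smoothed_cors))   # both near zero, as claimed
    c(sd(raw_cors), sd(smoothed_cors))       # but the smoothed spread is much wider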
