Do not calculate correlations after smoothing data
This subject comes up so often and in so many places, and so many people ask me about it, that I thought a short explanation would be appropriate. You may also search for “running mean” (on this site) for more examples.
Specifically, several readers asked me to comment on this post at Climate Audit, in which appears an analysis whereby, loosely, two time series were smoothed and the correlation between them was computed. It was found that this correlation was large and, it was thought, significant.
I want to give you, what I hope is, a simple explanation of why you should not apply smoothing before taking correlation. What I don’t want to discuss is that if you do smooth first, you face the burden of carrying through the uncertainty of that smoothing to the estimated correlations, which will be far less certain than when computed for unsmoothed data. I mean, any classical statistical test you do on the smoothed correlations will give you p-values that are too small, confidence intervals too narrow, etc. In short, you can be easily misled.
Here is an easy way to think of it: Suppose you take 100 made-up numbers; the knowledge of any of them is irrelevant towards knowing the value of any of the others. The only thing we do know about these numbers is that we can describe our uncertainty in their values by using the standard normal distribution (the classical way to say this is “generate 100 random normals”). Call these numbers C. Take another set of “random normals” and call them T.
I hope everybody can see that the correlation between T and C will be close to 0. The theoretical value is 0, because, of course, the numbers are just made up. (I won’t talk about what correlation is or how to compute it here: but higher correlations mean that T and C are more related.)
The following explanation holds for any smoother and not just running means. Now let’s apply an “eight-year running mean” smoothing filter to both T and C. This means, roughly, take the 15th number in the T series and replace it by an average of the 8th and 9th and 10th and … and 15th. The idea is, that observation number 15 is “noisy” by itself, but we can “see it better” if we average out some of the noise. We obviously smooth each of the numbers and not just the 15th.
Don’t forget that we made these numbers up: if we take the mean of all the numbers in T and C we should get numbers close to 0 for both series; again, theoretically, the means are 0. Since each of the numbers, in either series, is independent of its neighbors, the smoothing will tend to bring the numbers closer to their actual mean. And the more “years” we take in our running mean, the closer each of the numbers will be to the overall mean of T and C.
Now let T' = 0,0,0,...,0 and C' = 0,0,0,...,0. What can we say about each of these series? They are identical, of course, and so are perfectly correlated. So any process which tends to take the original series T and C and make them look like T' and C' will tend to increase the correlation between them.
In other words, smoothing induces spurious correlations.
Technical notes: in classical statistics any attempt to calculate the ordinary correlation between T' and C' fails because that philosophy cannot compute an estimate of the standard deviation of each series. Again, any smoothing method will work this magic, not just running means. In order to “carry through” the uncertainty, you need a carefully described model of the smoother and the original series, fixing distributions for all parameters, etc. etc. The whole also works if T and C are time series; i.e. the individual values of each series are not independent. I’m sure I’ve forgotten something, but I’m sure that many polite readers will supply a list of my faults.
11 comments February 14th, 2008