September 6, 2008 | 81 Comments
21 October 2011: Welcome Register fans! Comments on this site are always close after 8 days to control spam. To see more about BEST, read this post.
The advice which forms the title of this post would be how Don Rickles, if he were a statistician, would explain how not to conduct times series analysis. Judging by the methods I regularly see applied to data of this sort, Don’s rebuke is sorely needed.
The advice is particularly relevant now because there is a new hockey stick controversy brewing. Mann and others have published a new study melding together lots of data and they claim to have again shown that the here and now is hotter than the then and there. Go to climateaudit.org and read all about it. I can’t do a better job than Steve, so I won’t try. What I can do is to show you what not to do. I’m going to shout it, too, because I want to be sure you hear.
Mann includes at this site a large number of temperature proxy data series. Here is one of them called wy026.ppd (I just grabbed one out of the bunch). Here is the picture of this data:
The various black lines are the actual data! The red-line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data. Mann used a “low pass filter” different than the running mean to produce his fictional data, but a smoother is a smoother and what I’m about to say changes not one whit depending on what smoother you use.
Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.
If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals—signals that look real to other analytical methods. No matter what you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true—whether there really is a dramatic increase in temperature lately—it is guaranteed that they are now too certain of their conclusion.
There. Sorry for shouting, but I just had to get this off my chest.
Now for some specifics, in no particular order.
- A probability model should be used for only one thing: to quantify the uncertainty of data not yet seen. I go on and on and on about this because this simple fact, for reasons God only knows, is difficult to remember.
- The corollary to this truth is the data in a time series analysis is the data. This tautology is there to make you think. The data is the data! The data is not some model of it. The real, actual data is the real, actual data. There is no secret, hidden “underlying process” that you can tease out with some statistical method, and which will show you the “genuine data”. We already know the data and there it is. We do not smooth it to tell us what it “really is” because we already know what it “really is.”
- Thus, there are only two reasons (excepting measurement error) to ever model time series data:
- To associate the time series with external factors. This is the standard paradigm for 99% of all statistical analysis. Take several variables and try to quantify their correlation, etc., but only with a mind to do the next step.
- To predict future data. We do not need to predict the data we already have. Let me repeat that for ease of memorization: Notice that we do not need to predict the data we already have. We can only predict what we do not know, which is future data. Thus, we do not need to predict the tree ring proxy data because we already know it.
- The tree ring data is not temperature (say that out loud). This is why it is called a proxy. It is a perfect proxy? Was that last question a rhetorical one? Was that one, too? Because it is a proxy, the uncertainty of its ability to predict temperature must be taken into account in the final results. Did Mann do this? And just what is a rhetorical question?
- There are hundreds of time series analysis methods, most with the purpose of trying to understand the uncertainty of the process so that future data can be predicted, and the uncertainty of those predictions can be quantified (this is a huge area of study in, for example, financial markets, for good reason). This is a legitimate use of smoothing and modeling.
- We certainly should model the relationship of the proxy and temperature, taking into account the changing nature of proxy through time, the differing physical processes that will cause the proxy to change regardless of temperature or how temperature exacerbates or quashes them, and on and on. But we should not stop, as everybody has stopped, with saying something about the parameters of the probability models used to quantify these relationships. Doing so makes use, once again, far too certain of the final results. We do not care how the proxy predicts the mean temperature, we do care how the proxy predicts temperature.
- We do not need a statistical test to say whether a particular time series has increased since some time point. Why? If you do not know, go back and read these points from the beginning. It’s because all we have to do is look at the data: if it has increased, we are allowed to say “It increased.” If it did not increase or it decreased, then we are not allowed to say “It increased.” It really is as simple as that.
- You will now say to me “OK Mr Smarty Pants. What if we had several different time series from different locations? How can we tell if there is a general increase across all of them? We certainly need statistics and p-values and Monte Carol routines to tell us that they increased or that the ‘null hypothesis’ of no increase is true.” First, nobody has called me “Mr Smarty Pants” for a long time, so you’d better watch your language. Second, weren’t you paying attention? If you want to say that 52 out 413 times series increased since some time point, then just go and look at the time series and count! If 52 out of 413 times series increased then you can say “52 out of 413 time series increased.” If more or less than 52 out of 413 times series increased, then you cannot say that “52 out of 413 time series increased.” Well, you can say it, but you would be lying. There is absolutely no need whatsoever to chatter about null hypotheses etc.
If the points—it really is just one point—I am making seem tedious to you, then I will have succeeded. The only fair way to talk about past, known data in statistics is just by looking at it. It is true that looking at massive data sets is difficult and still somewhat of an art. But looking is looking and it’s utterly evenhanded. If you want to say how your data was related with other data, then again, all you have to do is look.
The only reason to create a statistical model is to predict data you have not seen. In the case of the proxy/temperature data, we have the proxies but we do not have temperature, so we can certainly use a probability model to quantify our uncertainty in the unseen temperatures. But we can only create these models when we have simultaneous measures of the proxies and temperature. After these models are created, we then go back to where we do not have temperature and we can predict it (remembering to predict not its mean but the actual values; you also have to take into account how the temperature/proxy relationship might have been different in the past, and how the other conditions extant would have modified this relationship, and on and on).
What you can not, or should not, do is to first model/smooth the proxy data to produce fictional data and then try to model the fictional data and temperature. This trick will always—simply always—make you too certain of yourself and will lead you astray. Notice how the read fictional data looks a hell of a lot more structured than the real data and you’ll get the idea.
Next step is to start playing with the proxy data itself and see what is to see. As soon as I am granted my wish to have each day filled with 48 hours, I’ll be able to do it.
Thanks to Gabe Thornhill of Thornhill Securities for reminding me to write about this.