Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

September 7, 2008 | 6 Comments

Still a few days left to guess who will win Presidential race

If you haven’t already, please guess who will win the 2008 Presidential race. If you have voted, please do not do so again.

We’ve been running the poll for a couple of days now and have over 500 guesses!

It would be nice to see more diversity in the voting, so if you have friends or colleagues who think the opposite of you, send them this link:

I tried posting this on DemocraticUnderground.com, and I was able to initially, but after one of their members visited my main site, I was banned from making future posts. I was able to post on LiberalForum.org. If anybody knows of other similar places, either post the link or let me know.

Thanks again everybody!

September 6, 2008 | 81 Comments

Do not smooth time series, you hockey puck!

21 October 2011: Welcome Register fans! Comments on this site are always closed after 8 days to control spam. To see more about BEST, read this post.

The advice which forms the title of this post would be how Don Rickles, if he were a statistician, would explain how not to conduct time series analysis. Judging by the methods I regularly see applied to data of this sort, Don’s rebuke is sorely needed.

The advice is particularly relevant now because there is a new hockey stick controversy brewing. Mann and others have published a new study melding together lots of data and they claim to have again shown that the here and now is hotter than the then and there. Go to climateaudit.org and read all about it. I can’t do a better job than Steve, so I won’t try. What I can do is to show you what not to do. I’m going to shout it, too, because I want to be sure you hear.

Mann includes at this site a large number of temperature proxy data series. Here is one of them called wy026.ppd (I just grabbed one out of the bunch). Here is the picture of this data:
[Figure: wy026.ppd proxy series]

The various black lines are the actual data! The red line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data. Mann used a “low pass filter” different from the running mean to produce his fictional data, but a smoother is a smoother and what I’m about to say changes not one whit depending on what smoother you use.

Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.

If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals: signals that look real to other analytical methods. No matter what, you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true (whether there really is a dramatic increase in temperature lately), it is guaranteed that they are now too certain of their conclusion.
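To see the spurious-signal effect for yourself, here is a minimal sketch in standard-library Python. It is not Mann's procedure or data: it assumes made-up, independent Gaussian noise series, a plain trailing running mean, and hypothetical helper names (`running_mean`, `correlation`, `mean_abs_correlation`) chosen only for this illustration. It compares how correlated two unrelated series look before and after each is smoothed separately.

```python
import random
import statistics


def running_mean(series, window):
    """Trailing running mean; the output is shorter than the input by window - 1."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_abs_correlation(n_trials=200, length=200, window=10, seed=1):
    """Average |correlation| between pairs of unrelated noise series,
    before and after smoothing each series separately."""
    rng = random.Random(seed)
    raw_total = 0.0
    smooth_total = 0.0
    for _ in range(n_trials):
        # Two series that have, by construction, nothing to do with each other.
        x = [rng.gauss(0, 1) for _ in range(length)]
        y = [rng.gauss(0, 1) for _ in range(length)]
        raw_total += abs(correlation(x, y))
        smooth_total += abs(
            correlation(running_mean(x, window), running_mean(y, window))
        )
    return raw_total / n_trials, smooth_total / n_trials


raw, smoothed = mean_abs_correlation()
print(f"mean |r|, raw series:      {raw:.3f}")
print(f"mean |r|, smoothed series: {smoothed:.3f}")
```

Under these assumptions the smoothed pairs should show a noticeably larger typical correlation than the raw pairs, even though no relationship exists. Any method that takes the smoothed series as input inherits that false structure.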

There. Sorry for shouting, but I just had to get this off my chest.

Now for some specifics, in no particular order.

  • A probability model should be used for only one thing: to quantify the uncertainty of data not yet seen. I go on and on and on about this because this simple fact, for reasons God only knows, is difficult to remember.
  • The corollary to this truth is that the data in a time series analysis is the data. This tautology is there to make you think. The data is the data! The data is not some model of it. The real, actual data is the real, actual data. There is no secret, hidden “underlying process” that you can tease out with some statistical method, and which will show you the “genuine data”. We already know the data and there it is. We do not smooth it to tell us what it “really is” because we already know what it “really is.”
  • Thus, there are only two reasons (excepting measurement error) to ever model time series data:
    1. To associate the time series with external factors. This is the standard paradigm for 99% of all statistical analysis. Take several variables and try to quantify their correlation, etc., but only with a mind to do the next step.
    2. To predict future data. We do not need to predict the data we already have. Let me repeat that for ease of memorization: Notice that we do not need to predict the data we already have. We can only predict what we do not know, which is future data. Thus, we do not need to predict the tree ring proxy data because we already know it.
  • The tree ring data is not temperature (say that out loud). This is why it is called a proxy. Is it a perfect proxy? Was that last question a rhetorical one? Was that one, too? Because it is a proxy, the uncertainty of its ability to predict temperature must be taken into account in the final results. Did Mann do this? And just what is a rhetorical question?
  • There are hundreds of time series analysis methods, most with the purpose of trying to understand the uncertainty of the process so that future data can be predicted, and the uncertainty of those predictions can be quantified (this is a huge area of study in, for example, financial markets, for good reason). This is a legitimate use of smoothing and modeling.
  • We certainly should model the relationship of the proxy and temperature, taking into account the changing nature of the proxy through time, the differing physical processes that will cause the proxy to change regardless of temperature or how temperature exacerbates or quashes them, and on and on. But we should not stop, as everybody has stopped, with saying something about the parameters of the probability models used to quantify these relationships. Doing so makes us, once again, far too certain of the final results. We do not care how the proxy predicts the mean temperature, we do care how the proxy predicts temperature.
  • We do not need a statistical test to say whether a particular time series has increased since some time point. Why? If you do not know, go back and read these points from the beginning. It’s because all we have to do is look at the data: if it has increased, we are allowed to say “It increased.” If it did not increase or it decreased, then we are not allowed to say “It increased.” It really is as simple as that.
  • You will now say to me “OK Mr Smarty Pants. What if we had several different time series from different locations? How can we tell if there is a general increase across all of them? We certainly need statistics and p-values and Monte Carlo routines to tell us that they increased or that the ‘null hypothesis’ of no increase is true.” First, nobody has called me “Mr Smarty Pants” for a long time, so you’d better watch your language. Second, weren’t you paying attention? If you want to say that 52 out of 413 time series increased since some time point, then just go and look at the time series and count! If 52 out of 413 time series increased then you can say “52 out of 413 time series increased.” If more or fewer than 52 out of 413 time series increased, then you cannot say that “52 out of 413 time series increased.” Well, you can say it, but you would be lying. There is absolutely no need whatsoever to chatter about null hypotheses etc.
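The "just look and count" advice in the last two points can be sketched in a few lines of Python. The numbers below are made up for illustration, the function name `count_increased` is hypothetical, and "increased since some time point" is assumed here to mean that the last value is larger than the value at that time point. Other definitions are possible, and you should state whichever one you use.

```python
def count_increased(series_list, start_index):
    """Count series whose final value exceeds the value at start_index."""
    return sum(1 for s in series_list if s[-1] > s[start_index])


# Made-up illustrative numbers, not real proxy data.
series_list = [
    [1.0, 1.2, 0.9, 1.5],  # ends higher than it started
    [2.0, 1.8, 1.7, 1.6],  # ends lower
    [0.5, 0.5, 0.6, 0.5],  # ends where it started, so it did not increase
]

increased = count_increased(series_list, start_index=0)
print(f"{increased} out of {len(series_list)} time series increased.")
```

With these made-up series the script reports that 1 out of 3 time series increased. No p-value is involved because nothing is uncertain: the data are already known.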

If the points—it really is just one point—I am making seem tedious to you, then I will have succeeded. The only fair way to talk about past, known data in statistics is just by looking at it. It is true that looking at massive data sets is difficult and still somewhat of an art. But looking is looking and it’s utterly evenhanded. If you want to say how your data was related with other data, then again, all you have to do is look.

The only reason to create a statistical model is to predict data you have not seen. In the case of the proxy/temperature data, we have the proxies but we do not have temperature, so we can certainly use a probability model to quantify our uncertainty in the unseen temperatures. But we can only create these models when we have simultaneous measures of the proxies and temperature. After these models are created, we then go back to where we do not have temperature and we can predict it (remembering to predict not its mean but the actual values; you also have to take into account how the temperature/proxy relationship might have been different in the past, and how the other conditions extant would have modified this relationship, and on and on).
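The difference between predicting the mean and predicting the actual values can be illustrated with a small sketch. It assumes a classical straight-line regression fitted to made-up calibration data; the data, the coefficients, and the function names `fit_line` and `standard_errors` are hypothetical, and nothing here reproduces any published proxy analysis. It also ignores the further complications listed above, such as the relationship changing over time.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b, s, sxx, x_mean)."""
    n = len(x)
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = (residual_ss / (n - 2)) ** 0.5  # residual standard deviation
    return a, b, s, sxx, x_mean


def standard_errors(x0, n, s, sxx, x_mean):
    """Standard error of the fitted mean at x0, and of a single new value at x0."""
    leverage = 1 / n + (x0 - x_mean) ** 2 / sxx
    se_mean = s * leverage ** 0.5
    se_new_value = s * (1 + leverage) ** 0.5
    return se_mean, se_new_value


rng = random.Random(0)
# Calibration period: made-up proxy and temperature values observed together.
proxy = [rng.gauss(0, 1) for _ in range(50)]
temperature = [10 + 0.5 * p + rng.gauss(0, 1) for p in proxy]

a, b, s, sxx, x_mean = fit_line(proxy, temperature)
se_mean, se_new_value = standard_errors(1.0, len(proxy), s, sxx, x_mean)
print(f"standard error of the mean temperature at proxy = 1.0: {se_mean:.2f}")
print(f"standard error of an actual temperature at proxy = 1.0: {se_new_value:.2f}")
```

Under this model the uncertainty in an actual temperature is always larger than the uncertainty in the mean temperature, because it adds the scatter of individual values around the fitted line. Reporting only the narrower of the two understates how uncertain the unseen temperatures are.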

What you cannot, or should not, do is to first model/smooth the proxy data to produce fictional data and then try to model the fictional data and temperature. This trick will always, simply always, make you too certain of yourself and will lead you astray. Notice how the red fictional data looks a hell of a lot more structured than the real data and you’ll get the idea.

The next step is to start playing with the proxy data itself and see what there is to see. As soon as I am granted my wish to have each day filled with 48 hours, I’ll be able to do it.

Thanks to Gabe Thornhill of Thornhill Securities for reminding me to write about this.

September 5, 2008 | 21 Comments

Predict who will win the US Presidential Race

When you have a chance, please log on to

and guess who will win the election this year.

This poll closes at 11:50 pm on 14 September 2008. No guessing will take place after that.

I am testing the ability of people to guess elections at a point where the amount of information known about each candidate is roughly the same. The site is completely anonymous.

Please do not try to stuff the ballot box by voting more than once. I will not release any results until after the election is over. (I will also remove duplicate records.)

Please tell everybody you know, of every political background, liberal or conservative. Email them the link above. If you can, link to this page on other blogs so that we get as large a sample as possible.

Please, pretty please answer the 6 questions honestly.

Once the election is over, the analysis will appear at this web site.

Thank you very much!

September 4, 2008 | 14 Comments

On journalists and Governor Palin

What is the one thing that will anger a journalist faster than anything else?

Telling him that he is not important.

Last night Governor Sarah Palin said, “I am not going to Washington to seek [journalists’] good opinion.” No line could be more calculated to set off a flurry of fluster and flummery among the elite media. This means war.

She should have done what the other guy did and coddled reporters, sweet-talked them, given them the precious gift of “access”.

Obama was more savvy. And lo, He gathered them—every major “non-biased” journalist in the country—and brought them on his victory tour of Europe. He gave them then and gives them now minute-by-minute access to his Grand Personage.

Obama’s master move, however, was to tell the media exactly what it wants to hear: “You guys are smart. You know what is right. Your ideas are important.”

See, what happens is something like this. A newly fledged reporter starts covering events. She writes down what has happened at some function so that others can read about it. The events and functions are important, so the reporter begins to feel that she is important. As time passes and more events are covered, our journalist begins to second-guess the actions of those on whom she reports. She supports some of those actions, and disapproves of others. The temptation to interpose herself between the truly important people and her audience becomes overwhelming and she gives in. She begins to editorialize, to selectively include and exclude, and finally to advocate.

Because reporters cover weighty, influential, and serious matters they come to believe that they themselves are weighty, influential, and serious.

The fallacy is obvious.

The media is now so apoplectic in its uncivilized, sexist, and ridiculous attacks on Governor Palin because of just one thing. Petulance.

The mainstream media is having a tantrum. They want to be told again that they are as important as they think they are. They are livid that anybody could not see this and they won’t stop screaming until they get their way.

Is it any wonder, then, that more and more people are switching them off and turning to alternatives?