
Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

September 8, 2008 | 47 Comments

Demonstration of how smoothing causes inflated certainty (and egos?)

I’ve had a number of requests to show how smoothing inflates certainty, so I’ve created a couple of easy simulations that you can try in the privacy of your own home. The computer code is below, which I’ll explain later.

The idea is simple.

  1. I am going to simulate two time series, each of 64 “years.” The two series have absolutely nothing to do with one another; they are just made-up, wholly fictional numbers. Any association between these two series would be a coincidence (which we can quantify; more later).
  2. I am then going to smooth these series using off-the-shelf smoothers. I am going to use two kinds:
    1. A k-year running mean; the bigger k is, the more smoothing there is.
    2. A simple low-pass filter with k coefficients; again the bigger k is, the more smoothing there is.
  3. I am going to let k = 2 for the first simulation, k = 3 for the second, and so on, until k = 12. This will show that increasing smoothing dramatically increases confidence.
  4. I am going to repeat the entire simulation 500 times for each k (and for each smoother) and look at the results of all of them (if we did just one, it probably wouldn’t be interesting).
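
The steps above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code used to make the plots in this post: the made-up numbers are assumed to be Gaussian white noise, only the running mean is shown, and the p-value comes from the Fisher z approximation to the classical correlation test. The function names are mine.

```python
import math
import random
from statistics import NormalDist


def running_mean(x, k):
    """k-point running mean; the result is k - 1 values shorter than x."""
    return [sum(x[i:i + k]) / k for i in range(len(x) - k + 1)]


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def correlation_p_value(a, b):
    """Two-sided p-value using the Fisher z approximation (an assumption)."""
    z = math.atanh(pearson_r(a, b)) * math.sqrt(len(a) - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def fraction_significant(k, n_years=64, n_sims=500, alpha=0.05, seed=0):
    """Share of simulations whose smoothed series look 'significantly' related."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Two unrelated series (Gaussian white noise is an assumption).
        x0 = [rng.gauss(0, 1) for _ in range(n_years)]
        x1 = [rng.gauss(0, 1) for _ in range(n_years)]
        s0, s1 = running_mean(x0, k), running_mean(x1, k)
        if correlation_p_value(s0, s1) < alpha:
            hits += 1
    return hits / n_sims


results = {k: fraction_significant(k) for k in range(2, 13)}
for k, frac in results.items():
    print(f"k = {k:2d}: {frac:.1%} of p-values below 0.05")
```

If the test were honest, every row would sit near 5%. In this sketch the share should climb as k grows.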

Neither of the smoothers I use is in any way complicated. Fancier smoothers would just make the data smoother anyway, so we’ll start with the simplest. Make sense? Then let’s go!

Here, just so you can see what is happening, are the first two series, x0 and x1, plotted together (just one simulation out of the 500). On top of each is the 12-year running mean. You can see the smoother really does smooth the bumps out of the data, right? The last panel of the plot shows the two smoothed series, now called s0 and s1, next to each other. They are shorter because you have to sacrifice some years when smoothing.

smoother 1 series

The thing to notice is that the two smoothed series eerily look like they are related! The red line looks like it trails after the black one. Could the black line be some physical process that is driving the red line? No! Remember, these numbers are utterly unrelated. Any relationship we see is in our heads, or was caused by us through poor statistical methodology, and not in the data. How can we quantify this? Through this picture:

smoother 1 p-values

This shows boxplots of the classical p-values in a test of correlation between the two smoothed series. Notice the log-10 y-axis. A dotted line has been drawn to show the magic value of 0.05. P-values less than this wondrous number are said to be publishable, and fame and fortune await you if you can get one of these. Boxplots show the range of the data: the solid line in the middle of the box says 50% of the 500 simulations gave p-values less than this number, and 50% gave p-values higher. The upper and lower part of the box designate that 25% of the 500 simulations have p-values greater than (upper) and 25% less than (lower) this number. The outermost top line says 5% of the p-values were greater than this; while the bottommost line indicates that 5% of the p-values were less than this. Think about this before you read on. The colors of the boxplots have been chosen to please Don Cherry.

Now, since we did the test 500 times, we’d expect about 5% of the p-values to fall below the magic number of 0.05. That means that the bottommost line of the boxplots should be somewhere near the horizontal line. If any part of the boxplot sticks below the dotted line, then the conclusion you make based on the p-value is too certain.
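
The 5% benchmark can be checked on series that have not been smoothed at all. The sketch below is an illustration under assumptions (Gaussian white noise, and a Fisher z approximation in place of the exact classical test). It computes the same five summary points the boxplots show and asks where the 5% line falls.

```python
import math
import random
from statistics import NormalDist, quantiles


def correlation_p_value(a, b):
    """Two-sided p-value for zero correlation (Fisher z approximation)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    r = sab / math.sqrt(saa * sbb)
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
p_values = []
for _ in range(500):
    # Two unrelated, unsmoothed series of 64 "years" each.
    x0 = [rng.gauss(0, 1) for _ in range(64)]
    x1 = [rng.gauss(0, 1) for _ in range(64)]
    p_values.append(correlation_p_value(x0, x1))

# quantiles(..., n=20) returns the 5%, 10%, ..., 95% cut points.
cuts = quantiles(p_values, n=20)
q05, q25, median, q75, q95 = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
print(f"5%: {q05:.3f}  25%: {q25:.3f}  50%: {median:.3f}  "
      f"75%: {q75:.3f}  95%: {q95:.3f}")
```

With no smoothing, the 5% point should land close to 0.05 and the median close to 0.5, which is what a well-behaved test looks like.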

Are we too certain here? Yes! Right from the start, at the smallest lags, and hence with almost no smoothing, we are already way too sure of ourselves. By the time we reach a 10-year lag—a commonly used choice in actual data—we are finding spurious “statistically significant” results 50% of the time! The p-values are awful small, too, which many people incorrectly use as a measure of the “strength” of the significance. Well, we can leave that error for another day. The bottom line, however, is clear: smooth, and you are way too sure of yourself.

Now for the low-pass filter. We start with a data plot and then overlay the smoothed data on top. Then we show the two series (just 1 out of the 500, of course) on top of each other. They look like they could be related too, don’t they? Don’t lie. They surely do.

smoother 2 series

And to prove it, here are the boxplots again. About the same results as for the running mean.

smoother 2 p-values
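
For readers who want to try a low-pass filter themselves, here is a minimal sketch. The post does not say which k coefficients were used, so the binomial weights below are purely an assumption, chosen because they are about the simplest low-pass weights there are. The function names are hypothetical.

```python
import math
import random
from statistics import pvariance


def binomial_weights(k):
    """k low-pass coefficients proportional to binomial coefficients; they sum to 1."""
    return [math.comb(k - 1, j) / 2 ** (k - 1) for j in range(k)]


def low_pass(x, weights):
    """Weighted moving average; the output is len(weights) - 1 values shorter than x."""
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, x[i:i + k]))
            for i in range(len(x) - k + 1)]


rng = random.Random(0)
x0 = [rng.gauss(0, 1) for _ in range(64)]  # one made-up series of 64 "years"

for k in (2, 6, 12):
    s0 = low_pass(x0, binomial_weights(k))
    print(f"k = {k:2d}: {len(s0)} years left, variance {pvariance(s0):.3f} "
          f"(raw series: {pvariance(x0):.3f})")
```

As with the running mean, a bigger k means more smoothing and fewer years left over.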

What can we conclude from this?

The obvious.

BORING DETAILS FOLLOW

September 7, 2008 | 6 Comments

Still a few days left to guess who will win Presidential race

If you haven’t already, please guess who will win the 2008 Presidential race. If you have voted, please do not do so again.

We’ve been running the poll for a couple of days now and have over 500 guesses!

It would be nice to see more diversity in the voting, so if you have friends or colleagues who think the opposite of you, send them this link:

I tried posting this on DemocraticUnderground.com, and I was able to initially, but after one of their members visited my main site, I was banned from making future posts. I was able to post on LiberalForum.org. If anybody knows of other similar places, either post the link or let me know.

Thanks again everybody!

September 6, 2008 | 81 Comments

Do not smooth times series, you hockey puck!

21 October 2011: Welcome Register fans! Comments on this site are always closed after 8 days to control spam. To see more about BEST, read this post.

The advice which forms the title of this post would be how Don Rickles, if he were a statistician, would explain how not to conduct time series analysis. Judging by the methods I regularly see applied to data of this sort, Don’s rebuke is sorely needed.

The advice is particularly relevant now because there is a new hockey stick controversy brewing. Mann and others have published a new study melding together lots of data and they claim to have again shown that the here and now is hotter than the then and there. Go to climateaudit.org and read all about it. I can’t do a better job than Steve, so I won’t try. What I can do is to show you what not to do. I’m going to shout it, too, because I want to be sure you hear.

Mann includes at this site a large number of temperature proxy data series. Here is one of them called wy026.ppd (I just grabbed one out of the bunch). Here is the picture of this data:
wy026.ppd proxy series

The various black lines are the actual data! The red-line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data. Mann used a “low pass filter” different than the running mean to produce his fictional data, but a smoother is a smoother and what I’m about to say changes not one whit depending on what smoother you use.

Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.

If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals: signals that look real to other analytical methods. No matter what, you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true (whether there really is a dramatic increase in temperature lately), it is guaranteed that they are now too certain of their conclusion.
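
One way to see where the spurious signal comes from: a k-year running mean of pure noise is no longer noise from one year to the next, because neighboring smoothed values share k - 1 of their k inputs. For independent noise the lag-1 autocorrelation of the smoothed series is (k - 1)/k in theory. The sketch below checks this on simulated Gaussian white noise. The series length, the seed, and the function names are my own choices for illustration.

```python
import random


def running_mean(x, k):
    """k-point running mean; the result is k - 1 values shorter than x."""
    return [sum(x[i:i + k]) / k for i in range(len(x) - k + 1)]


def lag1_autocorrelation(x):
    """Sample correlation between each value and the one that follows it."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den


rng = random.Random(1)
# A long series, so the estimates settle down near their theoretical values.
noise = [rng.gauss(0, 1) for _ in range(20000)]

for k in (1, 2, 5, 10):
    smoothed = running_mean(noise, k)
    print(f"k = {k:2d}: lag-1 autocorrelation {lag1_autocorrelation(smoothed):.3f}, "
          f"theory {(k - 1) / k:.3f}")
```

A classical test of correlation assumes each year is a fresh, independent piece of evidence. After smoothing, the years are nearly copies of their neighbors, so the test counts far more evidence than is there.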

There. Sorry for shouting, but I just had to get this off my chest.

Now for some specifics, in no particular order.

  • A probability model should be used for only one thing: to quantify the uncertainty of data not yet seen. I go on and on and on about this because this simple fact, for reasons God only knows, is difficult to remember.
  • The corollary to this truth is that the data in a time series analysis is the data. This tautology is there to make you think. The data is the data! The data is not some model of it. The real, actual data is the real, actual data. There is no secret, hidden “underlying process” that you can tease out with some statistical method, and which will show you the “genuine data”. We already know the data and there it is. We do not smooth it to tell us what it “really is” because we already know what it “really is.”
  • Thus, there are only two reasons (excepting measurement error) to ever model time series data:
    1. To associate the time series with external factors. This is the standard paradigm for 99% of all statistical analysis. Take several variables and try to quantify their correlation, etc., but only with a mind to do the next step.
    2. To predict future data. We do not need to predict the data we already have. Let me repeat that for ease of memorization: Notice that we do not need to predict the data we already have. We can only predict what we do not know, which is future data. Thus, we do not need to predict the tree ring proxy data because we already know it.
  • The tree ring data is not temperature (say that out loud). This is why it is called a proxy. Is it a perfect proxy? Was that last question a rhetorical one? Was that one, too? Because it is a proxy, the uncertainty of its ability to predict temperature must be taken into account in the final results. Did Mann do this? And just what is a rhetorical question?
  • There are hundreds of time series analysis methods, most with the purpose of trying to understand the uncertainty of the process so that future data can be predicted, and the uncertainty of those predictions can be quantified (this is a huge area of study in, for example, financial markets, for good reason). This is a legitimate use of smoothing and modeling.
  • We certainly should model the relationship of the proxy and temperature, taking into account the changing nature of the proxy through time, the differing physical processes that will cause the proxy to change regardless of temperature or how temperature exacerbates or quashes them, and on and on. But we should not stop, as everybody has stopped, with saying something about the parameters of the probability models used to quantify these relationships. Doing so makes us, once again, far too certain of the final results. We do not care how the proxy predicts the mean temperature; we do care how the proxy predicts temperature.
  • We do not need a statistical test to say whether a particular time series has increased since some time point. Why? If you do not know, go back and read these points from the beginning. It’s because all we have to do is look at the data: if it has increased, we are allowed to say “It increased.” If it did not increase or it decreased, then we are not allowed to say “It increased.” It really is as simple as that.
  • You will now say to me “OK Mr Smarty Pants. What if we had several different time series from different locations? How can we tell if there is a general increase across all of them? We certainly need statistics and p-values and Monte Carlo routines to tell us that they increased or that the ‘null hypothesis’ of no increase is true.” First, nobody has called me “Mr Smarty Pants” for a long time, so you’d better watch your language. Second, weren’t you paying attention? If you want to say that 52 out of 413 time series increased since some time point, then just go and look at the time series and count! If 52 out of 413 time series increased then you can say “52 out of 413 time series increased.” If more or fewer than 52 out of 413 time series increased, then you cannot say that “52 out of 413 time series increased.” Well, you can say it, but you would be lying. There is absolutely no need whatsoever to chatter about null hypotheses etc.
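
The counting in the last point really is as small as it sounds. In the sketch below the three short series are made up, and “increased since some time point” is assumed to mean that the last value is larger than the value at that time point. Other readings are possible, and the function name is hypothetical.

```python
def count_increased(series_list, start_index):
    """Count the series whose last value exceeds the value at start_index."""
    return sum(1 for s in series_list if s[-1] > s[start_index])


# Three made-up series, used only to show the counting.
series = [
    [1.0, 1.2, 0.9, 1.5],  # ends higher than it started
    [2.0, 1.8, 1.7, 1.6],  # ends lower
    [0.5, 0.5, 0.6, 0.5],  # ends where it started
]

n_up = count_increased(series, start_index=0)
print(f"{n_up} out of {len(series)} time series increased.")
```

No model, no p-value, no null hypothesis: just a comparison and a count.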

If the points—it really is just one point—I am making seem tedious to you, then I will have succeeded. The only fair way to talk about past, known data in statistics is just by looking at it. It is true that looking at massive data sets is difficult and still somewhat of an art. But looking is looking and it’s utterly evenhanded. If you want to say how your data was related with other data, then again, all you have to do is look.

The only reason to create a statistical model is to predict data you have not seen. In the case of the proxy/temperature data, we have the proxies but we do not have temperature, so we can certainly use a probability model to quantify our uncertainty in the unseen temperatures. But we can only create these models when we have simultaneous measures of the proxies and temperature. After these models are created, we then go back to where we do not have temperature and we can predict it (remembering to predict not its mean but the actual values; you also have to take into account how the temperature/proxy relationship might have been different in the past, and how the other conditions extant would have modified this relationship, and on and on).
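
The difference between predicting the mean and predicting the actual values can be made concrete with ordinary straight-line regression. What follows is a sketch on simulated calibration data, using the textbook classical formulas. The slope of 0.5, the noise level, and the variable names are all assumptions for illustration, and it ignores the other complications listed above (a proxy/temperature relationship that changes through time, and so on).

```python
import math
import random

rng = random.Random(42)
# Hypothetical calibration period: proxy and temperature measured together.
proxy = [rng.gauss(0, 1) for _ in range(50)]
temp = [0.5 * p + rng.gauss(0, 1) for p in proxy]

n = len(proxy)
xbar = sum(proxy) / n
ybar = sum(temp) / n
sxx = sum((x - xbar) ** 2 for x in proxy)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(proxy, temp))
slope = sxy / sxx
intercept = ybar - slope * xbar
# Residual variance: the scatter of actual temperatures around the fitted line.
resid_var = sum((y - (intercept + slope * x)) ** 2
                for x, y in zip(proxy, temp)) / (n - 2)


def standard_errors(x_new):
    """Standard error of the fitted mean and of a new, actual value at x_new."""
    leverage = 1 / n + (x_new - xbar) ** 2 / sxx
    se_mean = math.sqrt(resid_var * leverage)
    se_pred = math.sqrt(resid_var * (1 + leverage))
    return se_mean, se_pred


se_mean, se_pred = standard_errors(1.0)
print(f"uncertainty in the mean temperature:   {se_mean:.3f}")
print(f"uncertainty in the actual temperature: {se_pred:.3f}")
```

The second number is always the larger one, because it keeps the year-to-year scatter that the first one averages away. Reporting only the first is one way of ending up too certain.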

What you cannot, or should not, do is to first model/smooth the proxy data to produce fictional data and then try to model the fictional data and temperature. This trick will always, simply always, make you too certain of yourself and will lead you astray. Notice how the red fictional data looks a hell of a lot more structured than the real data and you’ll get the idea.

Next step is to start playing with the proxy data itself and see what there is to see. As soon as I am granted my wish to have each day filled with 48 hours, I’ll be able to do it.

Thanks to Gabe Thornhill of Thornhill Securities for reminding me to write about this.

September 5, 2008 | 21 Comments

Predict who will win the US Presidential Race

When you have a chance, please log on to

and guess who will win the election this year.

This poll closes at 11:50 pm on 14 September 2008. No guessing will take place after that.

I am testing the ability of people to guess elections at a point where the amount of information known about each candidate is roughly the same. The site is completely anonymous.

Please do not try and stuff the ballot box by voting more than once. I will not release any results until after the election is over. (I will also remove duplicate records.)

Please tell everybody you know, of every political background, liberal or conservative. Email them the link above. If you can, link to this page on other blogs so that we get as large a sample as possible.

Please, pretty please answer the 6 questions honestly.

Once the election is over, the analysis will appear at this web site.

Thank you very much!