September 9, 2008 | 24 Comments

Some more reasons why I should be in charge

I want to mourn, too

The man who can rightly be called the Father of Statistics, Ronald Aylmer Fisher, while an incredibly bright man, showed that all of us are imperfect when he repeatedly touted a ridiculously dull idea. Eugenics. He figured that you could breed the idiocy out of people by selectively culling the less desirable. Since Fisher also has a strong claim on the title Father of Modern Genetics, others (all with advanced degrees and high education) agreed with him. We now recognize that his musing was stupid. But we can sympathize with him, can we not? For example, Fisher might have had in mind people like those in the video below, sent to us by the folks at The Chilling Effect.

Star Trek Food

Have you ever seen Star Trek, the original series I mean? Every now and then Kirk would hunger (not for flesh) and he would eat: little bright cubes of what could only be called nutrition. Food would be far too strong a word. The cubes looked like packages of solidified, neon-colored Jello. No doubt Fleet Academy scientists worked very hard at creating the optimal balanced diet to preserve health and vigor. They even gave you the choice of Soylent Green, or Soylent Orange… no, wait, that’s another movie where scientists created nutrition for the rest of us.

Even outside of television and movies, the scientists and physicians are at it. They are, even now, constantly looking out for our health. Elizabeth M. Whelan, over at the New York Post, yesterday wrote an article describing a plea from two New York City Busybody Physicians that the government regulate food consumption. The plea came in the form of a peer-reviewed paper in the Journal of the American Medical Association. The zealous authors Lynn Silver, MD, MPH, and Mary T. Bassett, MD, MPH (to instill awe, note the letters after their names) open their manifesto with

[T]he most rapidly growing food-related threat to health today is not microbes, but overconsumption of calories, sugar, salt, and unhealthy fat.

According to Whelan (who runs the site HealthFactsAndFears.com)

Specifically, the doctors call on government to take immediate emergency action to force the food industry to make “healthier” food, including placing hefty taxes on fare they deem unhealthy – thus contributing to the already soaring price of food.

They reject government guidelines and education as “relatively weak interventions” and argue that “stronger actions are needed immediately to reduce obesity, hypertension, heart disease and other chronic ills.”

…They strictly divide “healthy” from “unhealthy” foods and suggest we follow Britain’s lead by placing green symbols on healthy food and red ones on the “bad” stuff….They argue that “the ubiquity of food [has become] treacherous” and that food should be regulated like alcohol and cigarettes, “putting reasonable limits on where and how [food] can be sold . . . amending zoning [to] limit the number or density of locations selling unhealthy foods in restaurants, vending machines and other outlets.” (emphasis on “treacherous” mine)

I really don’t know what Whelan is carping about. Scientists have MDs and PhDs, they have more education than you do, their peers regularly review each other’s work so they make no mistakes, they are smarter than you and they know more than you possibly could. This is all the reason we need, is it not, for them to exercise authority over us? Since I am one of these eminences, I can see no good reason why I should not be placed in charge immediately.

September 8, 2008 | 47 Comments

Demonstration of how smoothing causes inflated certainty (and egos?)

I’ve had a number of requests to show how smoothing inflates certainty, so I’ve created a couple of easy simulations that you can try in the privacy of your own home. The computer code is below, which I’ll explain later.

The idea is simple.

  1. I am going to simulate two time series, each of 64 “years.” The two series have absolutely nothing to do with one another, they are just made up, wholly fictional numbers. Any association between these two series would be a coincidence (which we can quantify; more later).
  2. I am then going to smooth these series using off-the-shelf smoothers. I am going to use two kinds:
    1. A k-year running mean; the bigger k is, the more smoothing there is.
    2. A simple low-pass filter with k coefficients; again the bigger k is, the more smoothing there is.
  3. I am going to let k = 2 for the first simulation, k = 3 for the second, and so on, until k = 12. This will show that increasing smoothing dramatically increases confidence.
  4. I am going to repeat the entire simulation 500 times for each k (and for each smoother) and look at the results of all of them (if we did just one, it probably wouldn’t be interesting).

Neither of the smoothers I use is in any way complicated. Fancier smoothers would just make the data smoother anyway, so we’ll start with the simplest. Make sense? Then let’s go!

Here, just so you can see what is happening, are the first two series, x0 and x1, plotted together (just one simulation out of the 500). On top of each is the 12-year running mean. You can see the smoother really does smooth the bumps out of the data, right? The last panel of the plot shows the two smoothed series, now called s0 and s1, next to each other. They are shorter because you have to sacrifice some years when smoothing.

smoother 1 series

The thing to notice is that the two smoothed series eerily look like they are related! The red line looks like it trails after the black one. Could the black line be some physical process that is driving the red line? No! Remember, these numbers are utterly unrelated. Any relationship we see is in our heads, or was caused by us through poor statistical methodology, and not in the data. How can we quantify this? Through this picture:

smoother 1 p-values

This shows boxplots of the classical p-values in a test of correlation between the two smoothed series. Notice the log-10 y-axis. A dotted line has been drawn to show the magic value of 0.05. P-values less than this wondrous number are said to be publishable, and fame and fortune await you if you can get one of these. Boxplots show the range of the data: the solid line in the middle of the box says 50% of the 500 simulations gave p-values less than this number, and 50% gave p-values higher. The upper and lower part of the box designate that 25% of the 500 simulations have p-values greater than (upper) and 25% less than (lower) this number. The outermost top line says 5% of the p-values were greater than this; while the bottommost line indicates that 5% of the p-values were less than this. Think about this before you read on. The colors of the boxplots have been chosen to please Don Cherry.

Now, since we did the test 500 times, we’d expect that we should get about 5% of the p-values less than the magic number of 0.05. That means that the bottommost line of the boxplots should be somewhere near the horizontal line. If any part of the boxplot sticks below the dotted line, then the conclusion you make based on the p-value is too certain.

Are we too certain here? Yes! Right from the start, at the smallest lags, and hence with almost no smoothing, we are already way too sure of ourselves. By the time we reach a 10-year lag—a commonly used choice in actual data—we are finding spurious “statistically significant” results 50% of the time! The p-values are awful small, too, which many people incorrectly use as a measure of the “strength” of the significance. Well, we can leave that error for another day. The bottom line, however, is clear: smooth, and you are way too sure of yourself.

Now for the low-pass filter. We start with a data plot and then overlay the smoothed data on top. Then we show the two series (just 1 out of the 500, of course) on top of each other. They look like they could be related too, don’t they? Don’t lie. They surely do.

smoother 2 series

And to prove it, here’s the boxplots again. About the same results as for the running mean.
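For the curious, one way to build a simple low-pass filter with k coefficients is sketched below in Python. The binomial weights are an assumption: the exact coefficients used for the plots are not given here, and the function names are invented for illustration.

```python
import math


def binomial_weights(k):
    """k low-pass coefficients taken from a row of Pascal's triangle, summing to 1."""
    row = [math.comb(k - 1, j) for j in range(k)]
    total = sum(row)
    return [w / total for w in row]


def low_pass(x, k):
    """Weighted moving average of x; the result is k - 1 points shorter than x."""
    w = binomial_weights(k)
    return [sum(wj * x[i + j] for j, wj in enumerate(w))
            for i in range(len(x) - k + 1)]


print(binomial_weights(5))               # weights 1, 4, 6, 4, 1 divided by 16
print(len(low_pass(list(range(64)), 12)))  # 64 years shrink to 53
```

Unlike the running mean, which weights every year in the window equally, these weights favor the middle of the window. It is still a smoother, and it still costs you k - 1 years.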

smoother 2 p-values

What can we conclude from this?

The obvious.

BORING DETAILS FOLLOW

September 7, 2008 | 6 Comments

Still a few days left to guess who will win Presidential race

If you haven’t already, please guess who will win the 2008 Presidential race. If you have voted, please do not do so again.

We’ve been running the poll for a couple of days now and have over 500 guesses!

It would be nice to see more diversity in the voting, so if you have friends or colleagues who think the opposite of you, send them this link:

I tried posting this on DemocraticUnderground.com, and I was able to initially, but after one of their members visited my main site, I was banned from making future posts. I was able to post on LiberalForum.org. If anybody knows of other similar places, either post the link or let me know.

Thanks again everybody!

September 6, 2008 | 81 Comments

Do not smooth times series, you hockey puck!

21 October 2011: Welcome Register fans! Comments on this site are always closed after 8 days to control spam. To see more about BEST, read this post.

The advice which forms the title of this post would be how Don Rickles, if he were a statistician, would explain how not to conduct time series analysis. Judging by the methods I regularly see applied to data of this sort, Don’s rebuke is sorely needed.

The advice is particularly relevant now because there is a new hockey stick controversy brewing. Mann and others have published a new study melding together lots of data and they claim to have again shown that the here and now is hotter than the then and there. Go to climateaudit.org and read all about it. I can’t do a better job than Steve, so I won’t try. What I can do is to show you what not to do. I’m going to shout it, too, because I want to be sure you hear.

Mann includes at this site a large number of temperature proxy data series. Here is one of them called wy026.ppd (I just grabbed one out of the bunch). Here is the picture of this data:
wy026.ppd proxy series

The various black lines are the actual data! The red-line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data. Mann used a “low pass filter” different than the running mean to produce his fictional data, but a smoother is a smoother and what I’m about to say changes not one whit depending on what smoother you use.

Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.

If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals—signals that look real to other analytical methods. No matter what you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true—whether there really is a dramatic increase in temperature lately—it is guaranteed that they are now too certain of their conclusion.

There. Sorry for shouting, but I just had to get this off my chest.

Now for some specifics, in no particular order.

  • A probability model should be used for only one thing: to quantify the uncertainty of data not yet seen. I go on and on and on about this because this simple fact, for reasons God only knows, is difficult to remember.
  • The corollary to this truth is the data in a time series analysis is the data. This tautology is there to make you think. The data is the data! The data is not some model of it. The real, actual data is the real, actual data. There is no secret, hidden “underlying process” that you can tease out with some statistical method, and which will show you the “genuine data”. We already know the data and there it is. We do not smooth it to tell us what it “really is” because we already know what it “really is.”
  • Thus, there are only two reasons (excepting measurement error) to ever model time series data:
    1. To associate the time series with external factors. This is the standard paradigm for 99% of all statistical analysis. Take several variables and try to quantify their correlation, etc., but only with a mind to do the next step.
    2. To predict future data. We do not need to predict the data we already have. Let me repeat that for ease of memorization: Notice that we do not need to predict the data we already have. We can only predict what we do not know, which is future data. Thus, we do not need to predict the tree ring proxy data because we already know it.
  • The tree ring data is not temperature (say that out loud). This is why it is called a proxy. Is it a perfect proxy? Was that last question a rhetorical one? Was that one, too? Because it is a proxy, the uncertainty of its ability to predict temperature must be taken into account in the final results. Did Mann do this? And just what is a rhetorical question?
  • There are hundreds of time series analysis methods, most with the purpose of trying to understand the uncertainty of the process so that future data can be predicted, and the uncertainty of those predictions can be quantified (this is a huge area of study in, for example, financial markets, for good reason). This is a legitimate use of smoothing and modeling.
  • We certainly should model the relationship of the proxy and temperature, taking into account the changing nature of the proxy through time, the differing physical processes that will cause the proxy to change regardless of temperature or how temperature exacerbates or quashes them, and on and on. But we should not stop, as everybody has stopped, with saying something about the parameters of the probability models used to quantify these relationships. Doing so makes us, once again, far too certain of the final results. We do not care how the proxy predicts the mean temperature, we do care how the proxy predicts temperature.
  • We do not need a statistical test to say whether a particular time series has increased since some time point. Why? If you do not know, go back and read these points from the beginning. It’s because all we have to do is look at the data: if it has increased, we are allowed to say “It increased.” If it did not increase or it decreased, then we are not allowed to say “It increased.” It really is as simple as that.
  • You will now say to me “OK Mr Smarty Pants. What if we had several different time series from different locations? How can we tell if there is a general increase across all of them? We certainly need statistics and p-values and Monte Carlo routines to tell us that they increased or that the ‘null hypothesis’ of no increase is true.” First, nobody has called me “Mr Smarty Pants” for a long time, so you’d better watch your language. Second, weren’t you paying attention? If you want to say that 52 out of 413 time series increased since some time point, then just go and look at the time series and count! If 52 out of 413 time series increased then you can say “52 out of 413 time series increased.” If more or fewer than 52 out of 413 time series increased, then you cannot say that “52 out of 413 time series increased.” Well, you can say it, but you would be lying. There is absolutely no need whatsoever to chatter about null hypotheses etc.
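The "look and count" advice in the last bullet needs no statistics at all, as this small Python sketch shows. The three short series are made up, the function names are invented, and the definition of "increased" (the last value is larger than the value at the chosen time point) is an assumption; pick whatever definition you mean, then count.

```python
def increased_since(series, start):
    """True if the last value is larger than the value at index `start`."""
    return series[-1] > series[start]


def count_increased(all_series, start):
    """Number of series that increased since index `start`."""
    return sum(increased_since(s, start) for s in all_series)


# Three made-up series, purely for illustration.
all_series = [
    [1.0, 2.0, 3.0],   # went up
    [3.0, 2.0, 1.0],   # went down
    [0.0, 5.0, 4.0],   # up overall since the first point
]

n_up = count_increased(all_series, start=0)
print(f"{n_up} out of {len(all_series)} time series increased")  # prints "2 out of 3 ..."
```

No null hypothesis was harmed in the making of that sentence.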

If the points—it really is just one point—I am making seem tedious to you, then I will have succeeded. The only fair way to talk about past, known data in statistics is just by looking at it. It is true that looking at massive data sets is difficult and still somewhat of an art. But looking is looking and it’s utterly evenhanded. If you want to say how your data was related with other data, then again, all you have to do is look.

The only reason to create a statistical model is to predict data you have not seen. In the case of the proxy/temperature data, we have the proxies but we do not have temperature, so we can certainly use a probability model to quantify our uncertainty in the unseen temperatures. But we can only create these models when we have simultaneous measures of the proxies and temperature. After these models are created, we then go back to where we do not have temperature and we can predict it (remembering to predict not its mean but the actual values; you also have to take into account how the temperature/proxy relationship might have been different in the past, and how the other conditions extant would have modified this relationship, and on and on).

What you can not, or should not, do is to first model/smooth the proxy data to produce fictional data and then try to model the fictional data and temperature. This trick will always (simply always) make you too certain of yourself and will lead you astray. Notice how the red fictional data looks a hell of a lot more structured than the real data and you’ll get the idea.

Next step is to start playing with the proxy data itself and see what is to see. As soon as I am granted my wish to have each day filled with 48 hours, I’ll be able to do it.

Thanks to Gabe Thornhill of Thornhill Securities for reminding me to write about this.