Example of how easy it is to mislead yourself: stepwise regression

I am, of course, a statistician. So perhaps it will seem unusual to you when I say I wish there were fewer statistics done. And by that I mean that I’d like to see less statistical modeling done. I am happy to have more data collected, but am far less sanguine about the proliferation of studies based on statistical methods.

There are lots of reasons for this, which I will detail from time to time, but one of the main ones is how easy it is to mislead yourself, particularly if you use statistical procedures in a cookbook fashion. It takes more than a recipe to make an edible cake.

Among the worst offenders are methods like data mining, sometimes called knowledge discovery, neural networks, and other methods that “automatically” find “significant” relationships between sets of data. In theory, there is nothing wrong with any of these methods. They are not, by themselves, evil. But they become pernicious when used without a true understanding of the data and the possible causal relationships that exist.

However, these methods are in continuous use and are highly touted. An oft-quoted success of data mining was the time a grocery store noticed that unaccompanied men who bought diapers also bought beer. A relationship between data which, we are told, would have gone unnoticed were it not for “powerful computer models.”

I don’t want to appear too negative: these methods can work and they are often used wisely. They can uncover previously unsuspected relationships that can be confirmed or disconfirmed upon collecting new data. Things only go sour when this second step, verifying the relationships with independent data, is ignored. Unfortunately, the temptation to forgo the all-important second step is usually overwhelming. Pressures such as cost of collecting new data, the desire to publish quickly, an inflated sense of certainty, and so on, all contribute to this prematurity.

Stepwise

Stepwise regression is a procedure to find the “best” model to predict y given a set of x’s. The y might be the item most likely bought (like beer) given a set of possible explanatory variables x, like x1 sex, x2 total amount spent, x3 diapers purchased or not, and on and on. The y might instead be total amount spent at a mall, or the probability of defaulting on a loan, or any other response you want to predict. The possibilities for the explanatory variables, the x’s, are limited only by your imagination and ability to collect data.

A regression takes the y and tries to find a multi-dimensional straight-line fit between it and the x’s (with two x’s, for example, the fitted “line” is a plane). Not all of the x’s will be “statistically significant”1; those that are not are eliminated from the final equation. We only want to keep those x’s that are helpful in explaining y. In order to do that, we need some measure of model “goodness”. The best measure of model goodness is one which measures how well the model predicts independent data, which is data that was in no way used to fit the model. But obviously, we do not always have such data at hand, so we need another measure. One that is often picked is the Akaike Information Criterion (AIC), which measures how well the model fits the data that was used to build it.

Confusing? You don’t actually need to know anything about the AIC other than that lower numbers are better. Besides, the computer does the work for you, so you never have to actually learn about the AIC. What happens is that many combinations of x’s are tried, one by one, an AIC is computed for that combination, and the combination that has the lowest AIC becomes the “best” model. For example, combination 1 might contain (x2, x17, x22), while combination 2 might contain (x1, x3). When the number of x’s is large, the number of possible combinations is huge, so some sort of automatic process is needed to find the best model.
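To make the idea concrete, here is a minimal R sketch (the data frame, the variable names, and the two combinations are invented for illustration) showing how the AIC of two candidate combinations is compared:

set.seed(1)
# Made-up data: a response y and a handful of candidate x's, all random noise here
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                  x3 = rnorm(100), x17 = rnorm(100), x22 = rnorm(100))

fit1 <- lm(y ~ x2 + x17 + x22, data = dat)   # combination 1
fit2 <- lm(y ~ x1 + x3, data = dat)          # combination 2

AIC(fit1)   # the combination with the lower AIC is declared the "best" model
AIC(fit2)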

A summary: all your data is fed into a computer, and you want to model a response based on a large number of possible explanatory variables. The computer sorts through all the possible combinations of these explanatory variables, rates them by a model goodness criterion, and picks the one that is best. What could go wrong?

To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations of a y and of 50 x’s (each x, of course, also with 100 observations). All of the observations were just made-up numbers, each giving no information about the others. There are no relationships between the x’s and the y2. The computer, then, should tell me that the best model is no model at all.
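Here is a minimal R sketch of that sort of simulation (my own code, not the exact run reported below; which x’s the procedure keeps will vary with the seed):

set.seed(42)                                   # any seed will do; results vary run to run
n <- 100; p <- 50
X <- as.data.frame(matrix(rnorm(n * p), n, p)) # 50 columns of pure noise
names(X) <- paste0("x", 1:p)
X$y <- rnorm(n)                                # the response: more pure noise

full <- lm(y ~ ., data = X)                    # regression on all 50 x's
best <- step(full, direction = "both", trace = 0)  # stepwise search using AIC
summary(best)                                  # several x's will typically look "significant"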

But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usual criterion for being kept in a model (explained below), so my final model is this one:

explan.   p-value   Pr(beta_x > 0 | data)
x7        0.0053    0.991
x21       0.046     0.976
x27       0.00045   0.996
x43       0.0063    0.996

In classical statistics, an explanatory variable is kept in the model if it has a p-value < 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don’t worry if you don’t understand what any of that means; just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R2 of 0.26, which is considered excellent in many fields (like marketing or sociology; R2 is a number between 0 and 1, and higher numbers are better).

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7=”married or not”, x21=”number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and money spent on travel.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.

1I will explain this unfortunate term later.
2I first did a “univariate analysis” and only fed into the stepwise routine those x’s which singly had p-values < 0.1. This is done to ease the computational burden of checking all models by first eliminating those x's which are unlikely to be "important." This is also a distressingly common procedure.
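For concreteness, a small R sketch of that screening step (again my own illustration, using the same kind of made-up data as in the sketch above):

set.seed(42)
n <- 100; p <- 50
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
X$y <- rnorm(n)

# One-at-a-time ("univariate") regressions; keep only x's with p-value < 0.1
pvals <- sapply(paste0("x", 1:p), function(v)
  summary(lm(reformulate(v, "y"), data = X))$coefficients[2, 4])
keep <- names(pvals)[pvals < 0.1]

# Only the survivors are fed into the stepwise routine
best <- step(lm(reformulate(keep, "y"), data = X), direction = "both", trace = 0)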

The B.S. octopus

Jonathan Bate, of Standpoint, recently wrote an essay "The wrong idea of a university": It used to work like this. Dr Bloggs, the brilliant scholar who had solved the problem…

Suicides increase due to reading atrocious global warming research papers

I had the knife at my throat after reading a paper by Preti, Lentini, and Maugeri in the Journal of Affective Disorders (2007, 102, pp. 19-25; thanks to Marc Morano for the link to World Climate Report, where this work was originally reported). The study had me so depressed that I seriously thought of ending it all.

Before I tell you what the title of their paper is, take a look at these two pictures:

[Figure: yearly mean temperature in Italy, 1974 to 2003]
[Figure: number of suicides in Italy, 1974 to 2003 (men top, women bottom)]

The first is the yearly mean temperature from 1974 to 2003 in Italy: perhaps a slight decrease to 1980-ish, increasing after that. The second shows the suicide rates for men (top) and women (bottom) over the same time period. Ignore the solid line on the suicide plots for a moment and answer this question: what do these two sets of numbers, temperature and suicide, have to do with one another?

If you answered “nothing,” then you are not qualified to be a peer-reviewed researcher in the all-important field of global warming risk research. By failing to see any correlation, you have proven yourself unimaginative and politically naive.

Crack researchers Preti and his pals, on the other hand, were able to look at this same data and proclaim nothing less than “Global warming possibly linked to an enhanced risk of suicide.” (Thanks to BufordP at FreeRepublic for the link to the on-line version of the paper.)

How did they do it, you ask? How, when the data look absolutely unrelated, were they able to show a concatenation? Simple: by cheating. I’m going to tell you how they did it later, but how—and why—they got away with it is another matter. It is the fact that they didn’t get caught which fills me with despair and gives rise to my suicidal thoughts.

Why were they allowed to publish? People—and journal editors are in that class—are evidently so hungry for a fright, so eager to learn that their worst fears of global warming are being realized, that they will accept nearly any evidence which corroborates this desire, even if this evidence is transparently ridiculous, as it is here. Every generation has its fads and fallacies, and the evil supposed to be caused by global warming is our fixation.

Below is how they cheated. The subject is somewhat technical, so don’t bother unless you want particulars. I will go into some detail because it is important to understand just how bad something can be but still pass for “peer-reviewed scientific research.” Let me say first that if one of my students tried handing in a paper like Preti et alia’s, I’d gently ask, “Weren’t you listening to anything I said the entire semester?”

Science is decided by committee

Scientists still do not appear to understand sufficiently that all earth sciences must contribute evidence toward unveiling the state of our planet in earlier times, and that the truth of…

New Arcsine Climate Forecast: Hot and Cold!

If you weren’t worried before, then take a look at this shocking new climate forecast!
[Figure: arcsine climate forecast]

No, only kidding. This is the real forecast:
[Figure: arcsine climate forecast]

Sorry. Can’t help myself. Here are four more “forecasts”.

[Figure: four more arcsine climate forecasts]

Each of the “forecasts” were generated by what is called a “random walk.” Here is what that is. Grab a coin and go out and stand on a corner of some sidewalk that stretches for a long way in both directions. Call one direction “positive” and the other “negative”. The corner you start at will be called “zero”. Flip your coin: If it is heads, then take one step forward toward positive; if tails, then take one backward toward negative. Keep doing this for a long time and soon you will find…that your neighbors think you are crazy.

But that’s a random walk. If you do the coin flips and steps for a long enough time, you’ll find that you spend a heck of a lot more time than you might have guessed on either the positive or the negative side. You will probably find that, when you quit, you are way up along the positive side, or way down along the negative. This is true even though the average of those coin flips, the +1s and -1s that make up your steps, is pretty near 0; and even though the average goes to 0 the longer you flip the coin.
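If you would rather skip the sidewalk, here is a minimal R sketch of the same coin-flipping walk (the number of flips is arbitrary):

set.seed(7)
steps <- sample(c(-1, 1), 10000, replace = TRUE)  # 10,000 coin flips: +1 or -1
walk  <- cumsum(steps)                            # your position after each flip
mean(steps)      # very near 0: heads and tails nearly balance
tail(walk, 1)    # yet you typically end up far from the corner
mean(walk > 0)   # and the share of time spent on the positive side is usually lopsided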

The “climate model forecasts” above were generated by reference to a paper by A.H. Gordon, available here, called “Global Warming as a Manifestation of a Random Walk”. It is a very readable paper that bears attention.

Gordon proposed that a climate could be made by generating random “shocks” to a climate system. What’s that? Well, imagine the climate is going along peacefully, maintaining its temperature and minding its own business, when suddenly—bam!—some external force causes it to change its temperature up or down. An external force might be a change in the Earth’s orbit, or a shifting in cloud cover, or a flock of birds flying this way or that, or anything. This shock persists in the system for some time; little shocks build up and over the course of a year the climate—the mean temperature—changes. It is just as likely for this random-walk climate’s temperature to go up as it is to go down.

Random walks have some surprising properties which, by virtue of being surprising, are not intuitive. The first is that we’d expect that adding random ups and downs (+1s and -1s) together would get us a bunch of no changes (or 0s). We don’t get 0s, but numbers which travel far from 0 as time goes on. In fact, it can be shown—via something called the arcsine law—that it’s more probable that this climate will be at an extreme value whenever the series stops, and will not be near 0. The pictures show this.
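You can see the arcsine law for yourself by repeating the walk many times and recording the fraction of steps spent on the positive side; the histogram piles up near 0 and 1, not near one half (a sketch with arbitrary settings):

set.seed(8)
frac_positive <- replicate(2000, {
  w <- cumsum(sample(c(-1, 1), 1000, replace = TRUE))
  mean(w > 0)                 # fraction of this walk spent above zero
})
hist(frac_positive, breaks = 20,
     main = "Fraction of time on the positive side", xlab = "fraction")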

What about the real climate, the one we actually live in? It’s certainly true that the real climate experiences external shocks of every kind. Gordon found (over the period he looked at and with one particular, often used data set) that temperatures went up about as many times as they went down, just like what would be expected in a random walk climate. He found that the value of the temperature at the end of the series he had was an extreme one, just as we would expect in a random walk climate. He made a lot of pictures, like we have, and noticed that a lot of them look just like our real climate.

The pictures that make up our and Gordon’s “arcsine forecast” aren’t, for technical reasons, built from +1s and -1s, but from numbers simulated from a normal distribution with a central parameter of 0. This means the numbers are equally likely to be above 0 as below 0, just like in the -1/+1 random walk, but here they can be any number greater or less than 0. (The standard deviation parameter, for those who know of such things, is set at 0.12, which is the same as the estimated standard deviation parameter for actual global mean temperature; see Gordon’s paper for a fuller description.)
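To draw your own picture like the ones above, here is a minimal R sketch along the lines of that description (the length of the series is arbitrary; the 0.12 standard deviation is the value quoted above):

set.seed(9)
years  <- 150                                 # arbitrary length of the "forecast"
shocks <- rnorm(years, mean = 0, sd = 0.12)   # normal shocks, central parameter 0
temp   <- cumsum(shocks)                      # the shocks accumulate: a random walk
plot(temp, type = "l", xlab = "year", ylab = "temperature anomaly",
     main = "Arcsine climate forecast")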

What does all this say about the real climate? That it happens to look just like a bunch of random numbers. Gordon cautions, “That is not to say that the temperature record is a random walk, but that it does possess similar features.” The surface temperature record, he writes, “also exhibits properties of the arc-sine law. It is concluded that the global series could have arisen from random fluctuations and could therefore be analogous to arc-sine law governed by random walks.”

This means the climate we have might be less controllable than we thought it was (controllable negatively or positively through man’s activities).

He ends with some sage advice:

It is important to examine all ways and means by which the observed data series develop trends before facing hard and fast conclusions that any particular activity is the one and only responsible agent.

Below is the code with which you can generate your own arcsine climate model forecast in R: