I am, of course, a statistician. So perhaps it will seem unusual to you when I say I wish there were fewer statistics done. And by that I mean that I’d like to see less statistical *modeling* done. I am happy to have more data collected, but am far less sanguine about the proliferation of studies based on statistical methods.

There are lots of reasons for this, which I will detail from time to time, but one of the main ones is how easy it is to mislead yourself, particularly if you use statistical procedures in a cookbook fashion. It takes more than a recipe to make an eatable cake.

Among the worst offenders are methods like *data mining*, sometimes called *knowledge discovery*, *neural networks*, and other methods that “automatically” find “significant” relationships between sets of data. In theory, there is *nothing* wrong with any of these methods. They are not, by themselves, evil. But they become pernicious when used without a true understanding of the data and the possible causal relationships that exist.

However, these methods are in continuous use and are highly touted. An oft-quoted success of data mining was the time a grocery store noticed that unaccompanied men who bought diapers also bought beer. A relationship between data which, we are told, would have gone unnoticed were it not for “powerful computer models.”

I don’t want to appear too negative: these methods can work and they are often used wisely. They can uncover previously unsuspected relationships that can be confirmed or disconfirmed upon collecting new data. Things only go sour when this second step, verifying the relationships with independent data, is ignored. Unfortunately, the temptation to forgo the all-important second step is usually overwhelming. Pressures such as cost of collecting new data, the desire to publish quickly, an inflated sense of certainty, and so on, all contribute to this prematurity.

#### Stepwise

Stepwise regression is a procedure to find the “best” model to predict y given a set of x’s. The y might be the item most likely bought (like beer) given a set of possible explanatory variables x, like x_{1} sex, x_{2} total amount spent, x_{3} diapers purchased or not, and on and on. The y might instead be total amount spent at a mall, or the probability of defaulting on a loan, or any other response you want to predict. The possibilities for the explanatory variables, the x’s, are limited only to your imagination and ability to collect data.

A regression takes the y and tried to find a multi-dimensional straight line fit between itself and the x’s (e.g., a two-dimensional straight line is a plane). Not all of the x’s will be “statistically significant^{1}“; those that are not are eliminated from the final equation. We only want to keep those x’s that are helpful in explaining y. In order to do that, we need to have some measure of model “goodness”. The **best** measure of model goodness is one which measures how well that model does predicting independent data, which is data that in *no way* was used to fit the model. But obviously, we do not always have such data at hand, so we need another measure. One that is often picked is the Akaike Information Criterion (AIC), which measures how well the model fits the data that was used to fit the model.

Confusing? You don’t actually need to know anything about the AIC other than that *lower numbers are better*. Besides, the computer does the work for you, so you never have to actually learn about the AIC. What happens is that many combinations of x’s are tried, one by one, an AIC is computed for that combination, and the combination that has the lowest AIC becomes the “best” model. For example, combination 1 might contain (x_{2}, x_{17}, x_{22}), while combination 2 might contain (x_{1}, x_{3}). When the number of x’s is large, the number of possible combinations is huge, so some sort of automatic process is needed to find the best model.

A summary: all your data is fed into a computer, and you want to model a response based on a large number of possible explanatory variables. The computer sorts through all the possible combinations of these explanatory variables, rates them by a model goodness criterion, and picks the one that is best. What could go wrong?

To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations for y’s and 50 x’s (each of 100 observations of course). All of the observations were just made up numbers, each giving no information about the other. There are *no* relationships between the x’s and the y^{2}. The computer, then, should tell me that the best model is no model at all.

But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usually criterion for being kept in a model (explained below), so my final model is this one:

explan. | p-value | Pr(beta x| data)>0 |
---|---|---|

x_{7} |
0.0053 | 0.991 |

x_{21} |
0.046 | 0.976 |

x_{27} |
0.00045 | 0.996 |

x_{43} |
0.0063 | 0.996 |

In classical statistics, an explanatory variable is kept in the model if it has a *p-value*< 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don't worry if you don't understand what any of that means---just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R^{2} of 0.26, which is considered excellent in many fields (like marketing or sociology; R^{2} is a number between 0 and 1, higher numbers are better).

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x_{7}=”married or not”, x_{21}=”number of kids”, and so on. It is just too easy to concoct a reasonable story *after the fact* to say, “Of course, x_{7} should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the *Journal of Hospitality Trends* showing “statistically significant” relationships between being married and travel model spent.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of *new* data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.

^{1}I will explain this unfortunate term later.

^{2}I first did a “univariate analysis” and only fed into the stepwise routine those x’s which singly had p-values < 0.1. This is done to ease the computational burden of checking all models by first eliminating those x's which are unlikely to be "important." This is also a distressingly common procedure.

Here is the simulation code, to be run in the free and open source R statistical software:

library(MASS); # need be run only once per session n<-100; y<-rnorm(n) # "random" response X<-matrix(rnorm(n*n/2),n,n/2) # "random" x's f<-0 for (i in 1:(n/2)){ # univariate analysis; f stores p-values f[i]<-anova(lm(y~X[,i]))$Pr[1] } i<-(f<.1) # only keep x's with p-values < 0.1 w<-data.frame(y,X[,i]) # w is just those x's with p<0.1 and y fit<-lm(y~.,data=w) # model object to feed into stepwise fit.aic <- stepAIC(fit) # stepwise summary(fit.aic) # final model summary

February 25, 2008 at 8:28 am

Thanks very much for the information, William. Combining these and similar methods for modeling spurious correlations, with insider knowledge of government grants, should provide any aspiring researcher with a lifetime of profitable scientific publishing.

We scientists can travel the world as part of our jobs, making presentations and collecting data. If our data are concocted and our conclusions spurious, at least our grants are secure and our statistical models are significant.

February 25, 2008 at 8:37 am

I don’t agree with the points you made here. Tradtiionally here in data mining / machine learning, we use the cross-validation method to validate the model we learned… I don’t see any possible problem of such a statistical significance test, since we were not assuming any information from the left 10% data (say, if it’s a ten-fold CV), when we used the remaining 90% for training…

February 25, 2008 at 9:54 am

Nice post. I’m curious what you think of fitting a model to half (or some other proportion) of a data set and testing it against the other half. Obviously, this could be done repeatedly with new (re)samples for the fitting and testing. Wouldn’t this play essentially the same role as validating a model against a newly collected data set?

Maybe I’ll do this with your R code and see what happens…

February 25, 2008 at 10:12 am

I just noticed that this post is filed under ‘General Statistics, Bad Statistics, Global warming’.

Why global warming? Is stepwise regression particularly common in global warming research?

Also, as an aside, it seems very odd to me when Bayesian statistical means are used for frequentist ends, as in the case of calculating p(beta x | data) > 0 to decide whether to keep x in a model. Bayesian stats seem useful (and interesting) to me because they provide information about (posterior)

distributionsof parameters (and because they incorporate prior knowledge explicitly).February 25, 2008 at 10:16 am

Derek,

Cross validation is somewhat helpful, but no panacea because you’re still using all of the data that you have to fit your current model. None of it is truly independent. This is because people do not just fit one model, test it one time on the hold-out sample, and then quit. They test several models on the hold-out sample, and then pick the best (or one of the best). (Other forms such as k-fold cross validation always use all of the data.)

I don’t, and did not say, that data mining etc. techniques should not be used, or that some people do not use them well. I think that most do not: and by most I mean the vast multitude of non-academics who use statistical software on a daily basis.

Noah, I think this might have answered your cross-validation question, too.

Briggs

February 25, 2008 at 10:22 am

Noah,

I don’t think stepwise procedures are often used in climatology per se, but they are certainly used downstream of it, in the host of models that purport to show the effects of global warming, particularly those studies examining human behavior.

Briggs

February 25, 2008 at 6:44 pm

As far as I can tell most of the multi-proxy temperature reconstructions use sophisticated variants of this method to tell us what temperatures were in the past. They do do validation tests, but then ignore the best validation test of all, namely new data. Indeed when, for example, new data are available for, say tree ring widths which do not show the expected increase in the late 20th century instead of saying “oh we got it wrong. These tree species do not in fact correlate with temperature”, they christen the phenomenon “the divergence problem” and speculate that it is due to hitherto unsuspected and unprecedented influences of human activity. For extensive documentation see http://www.climateaudit.org.

February 26, 2008 at 6:31 am

mikep, that’ll be “sophisticated variants” in the sense of preposterous piffle, eh?

February 26, 2008 at 8:50 am

Briggs,

I agree with you in some part. However, the problem you posed in your R code is essentially not a problem of model selection, which was usually one of the main theme in machine learning. Here, lots of people “assume” that the data can be nicely split into two parts, by using “some kind of” classifier, or it “may follow” some distribution, however, none of these prior knowledge is known or is possible to known. Even you demonstrate that the code actually “learns” some polynomial classifier from a pair that is actually independent. It merely means nothing because it was you who did the feature selection work…

Anyway, I don’t think there is a panacea to solve this problem. Using one data set for training and another data set for testing also raises the question of i.i.d. assumption. We still cannot firmly believe that the training data set and the test data set follows the same distribution…So…you know what I mean. This is a problem that long exists, and will exist forever…

Derek

February 26, 2008 at 9:35 am

My impression (based on decade old discussions with a friend who did research in the field) is that if you look at bankruptcy prediction literature, you typically will find that the authors maintain a holdout sample on which to test their model. That is, they use half the available data (e.g., half of all NYSE listed companies) to build the model, then test it on the other half of the data set (the rest of the NYSE companies). These are probabal;istic models that seek to estimate the probabilty of bankruptcy.

More broadly I was confused by reference to Bayesians in the post. I took a course with Robert Schlaiffer, a Bayesian statistician. He too was firmly opposed to step-wsie regressions. But he never counselled dropping variables from a model if their t-stat

February 26, 2008 at 9:17 pm

Years ago, when I was young and single, I used to frequent the racetrack. Poring over old racing forms, I developed handicapping systems that were very reliable and profitable. Unfortunately, when I went to the race track to cash in on my systems, I learned about the hazards of data mining -A. McIntire

February 28, 2008 at 5:31 pm

Good point, but there are other, stricter, criteria, like Schwartz’s and the various Message length, as well as the various dodges the Leo Breiman developed.

May 4, 2008 at 9:44 pm

mbh strikes me as data-miney. Feed a mess of series into a hopper, correlated with global temps (not local)…see what you get. Do a data divisiion, but with strong trending occuring, for validation. And not out of sample validation.

May 5, 2008 at 6:10 pm

I actually like what I learned in DOE class, which was to use a manual program that let you fiddle with adding factors, (and squared factors and interactions). That made you think about what factors might be important first. That said to keep lots of degrees of freedom. That said to get the major importance…over nothing…with few factors…and not to emphasize trying to polish the apple too much with adding more and more…as overfitting danger increased.

May 8, 2008 at 2:42 pm

Nice entry ! The gene-related data that are currently being gathered in biostatistics seem to ve bery prone to this issue. There you also have way more variables than sample units.

It reminded me of an old paper of Freedman (A note on screening regression equations, The American Statistician, 1983), where he did something similar as you described.

I tend to roughly summarize this issue as a warning to be carefull to not end up with statistical models where you are predicting noise based upon other independent noise…but perhaps that’s a bit overexagerated š