Example of how easy it is to mislead yourself: stepwise regression

I am, of course, a statistician. So perhaps it will seem unusual to you when I say I wish there were fewer statistics done. And by that I mean that I’d like to see less statistical modeling done. I am happy to have more data collected, but am far less sanguine about the proliferation of studies based on statistical methods.

There are lots of reasons for this, which I will detail from time to time, but one of the main ones is how easy it is to mislead yourself, particularly if you use statistical procedures in a cookbook fashion. It takes more than a recipe to make an eatable cake.

Among the worst offenders are methods like data mining, sometimes called knowledge discovery, neural networks, and other methods that “automatically” find “significant” relationships between sets of data. In theory, there is nothing wrong with any of these methods. They are not, by themselves, evil. But they become pernicious when used without a true understanding of the data and the possible causal relationships that exist.

However, these methods are in continuous use and are highly touted. An oft-quoted success of data mining was the time a grocery store noticed that unaccompanied men who bought diapers also bought beer, a relationship which, we are told, would have gone unnoticed were it not for “powerful computer models.”

I don’t want to appear too negative: these methods can work and they are often used wisely. They can uncover previously unsuspected relationships that can be confirmed or disconfirmed upon collecting new data. Things only go sour when this second step, verifying the relationships with independent data, is ignored. Unfortunately, the temptation to forgo the all-important second step is usually overwhelming. Pressures such as cost of collecting new data, the desire to publish quickly, an inflated sense of certainty, and so on, all contribute to this prematurity.

Stepwise

Stepwise regression is a procedure to find the “best” model to predict y given a set of x’s. The y might be the item most likely bought (like beer) given a set of possible explanatory variables x, like x1 sex, x2 total amount spent, x3 diapers purchased or not, and on and on. The y might instead be total amount spent at a mall, or the probability of defaulting on a loan, or any other response you want to predict. The possibilities for the explanatory variables, the x’s, are limited only by your imagination and ability to collect data.

A regression takes the y and tries to find a multi-dimensional straight-line fit between it and the x’s (e.g., a two-dimensional straight line is a plane). Not all of the x’s will be “statistically significant”[1]; those that are not are eliminated from the final equation. We only want to keep those x’s that are helpful in explaining y. In order to do that, we need to have some measure of model “goodness”. The best measure of model goodness is one which measures how well that model does predicting independent data, which is data that in no way was used to fit the model. But obviously, we do not always have such data at hand, so we need another measure. One that is often picked is the Akaike Information Criterion (AIC), which measures how well the model fits the data that was used to fit the model, with a penalty for each additional x.

Confusing? You don’t actually need to know anything about the AIC other than that lower numbers are better. Besides, the computer does the work for you, so you never have to actually learn about the AIC. What happens is that many combinations of x’s are tried, one by one, an AIC is computed for that combination, and the combination that has the lowest AIC becomes the “best” model. For example, combination 1 might contain (x2, x17, x22), while combination 2 might contain (x1, x3). When the number of x’s is large, the number of possible combinations is huge, so some sort of automatic process is needed to find the best model.
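For the curious, here is a minimal sketch in Python of what “compute an AIC for each combination and keep the lowest” looks like. It is not the software used for this post. It assumes an ordinary linear regression with normally distributed errors, for which the AIC is, up to a constant, n log(RSS/n) + 2k. The toy data, the helper names ols_rss and aic, and the choice to try every subset of just three x’s are all illustrative assumptions.

```python
import itertools
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares from a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def aic(columns, y):
    """Gaussian AIC up to an additive constant; lower is better."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + error variance
    return n * math.log(ols_rss(columns, y) / n) + 2 * k

rng = random.Random(0)
n = 100
xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
# Toy truth (an assumption for this sketch): y depends only on the first x.
y = [2.0 * xs[0][i] + rng.gauss(0, 1) for i in range(n)]

# Score every combination of the three x's, including the empty model.
scores = {}
for size in range(len(xs) + 1):
    for combo in itertools.combinations(range(len(xs)), size):
        scores[combo] = aic([xs[j] for j in combo], y)

best = min(scores, key=scores.get)
for combo, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(combo, round(score, 1))
print("lowest AIC:", best)  # indices are 0-based, so 0 is the first x
```

With three x’s there are only 8 combinations to score. With 50 x’s there are 2^50, which is why stepwise routines search greedily instead of trying them all.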

A summary: all your data is fed into a computer, and you want to model a response based on a large number of possible explanatory variables. The computer sorts through all the possible combinations of these explanatory variables, rates them by a model goodness criterion, and picks the one that is best. What could go wrong?

To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations for y’s and 50 x’s (each of 100 observations of course). All of the observations were just made up numbers, each giving no information about the other. There are no relationships between the x’s and the y[2]. The computer, then, should tell me that the best model is no model at all.

But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usual criterion for being kept in a model (explained below), so my final model is this one:

explan.   p-value   Pr(beta_x | data) > 0
x7        0.0053    0.991
x21       0.046     0.976
x27       0.00045   0.996
x43       0.0063    0.996

In classical statistics, an explanatory variable is kept in the model if it has a p-value < 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don't worry if you don't understand what any of that means---just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R² of 0.26, which is considered excellent in many fields (like marketing or sociology; R² is a number between 0 and 1, higher numbers are better).

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7 = “married or not”, x21 = “number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and money spent on travel.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.
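The check being asked for can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis behind the table above: the data are pure noise, the four x’s are chosen simply by being the most correlated with y (a crude stand-in for stepwise selection), and the “next year” data are 1,000 fresh made-up observations. All helper names are hypothetical.

```python
import random

def fit_ols(columns, y):
    """Least-squares coefficients [intercept, slopes...] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[r][k] / A[r][r] for r in range(k)]

def predict(beta, columns):
    return [beta[0] + sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i in range(len(columns[0]))]

def r_squared(y, fitted):
    """1 - SSE/SST; can be negative when predictions are worse than the mean of y."""
    mean = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, fitted))
    sst = sum((a - mean) ** 2 for a in y)
    return 1.0 - sse / sst

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def noise(rng, n, p):
    """Made-up y and p made-up x's with no relationship between them."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    return y, xs

rng = random.Random(1)
y_old, xs_old = noise(rng, 100, 50)

# Crude stand-in for stepwise: keep the 4 x's most correlated with y.
keep = sorted(range(50), key=lambda j: abs(corr(xs_old[j], y_old)), reverse=True)[:4]
beta = fit_ols([xs_old[j] for j in keep], y_old)
r2_in = r_squared(y_old, predict(beta, [xs_old[j] for j in keep]))

# "Next year's" data: new noise, scored with the coefficients fitted above.
y_new, xs_new = noise(rng, 1000, 50)
r2_out = r_squared(y_new, predict(beta, [xs_new[j] for j in keep]))

print("R^2 on the data used to build the model:", round(r2_in, 3))
print("R^2 on independent data:", round(r2_out, 3))
```

An R² near or below zero on the independent data means the model predicts new observations no better than simply guessing the average.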

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.
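A rough sketch of this kind of simulation in Python follows. It is not the code used for the results above, and several details are assumptions: the made-up numbers are standard normal draws, the univariate screen uses a normal approximation to the t distribution (reasonable at 100 observations), the stepwise routine is plain forward selection by AIC, and the whole thing is repeated 30 times to see how often noise survives. Helper names are hypothetical.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares from a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def aic(columns, y):
    """Gaussian AIC up to an additive constant; lower is better."""
    n = len(y)
    return n * math.log(ols_rss(columns, y) / n) + 2 * (len(columns) + 2)

def screen_p_value(x, y):
    """Two-sided p-value for a one-variable regression slope (normal approximation)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

def stepwise_on_noise(rng, n=100, p=50):
    """Generate unrelated y and x's, screen at p < 0.1, then forward-select by AIC."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    screened = [j for j in range(p) if screen_p_value(xs[j], y) < 0.1]
    selected, best = [], aic([], y)
    while True:
        trials = [(aic([xs[k] for k in selected + [j]], y), j)
                  for j in screened if j not in selected]
        if not trials or min(trials)[0] >= best:
            break  # no remaining x lowers the AIC
        best, j = min(trials)
        selected.append(j)
    return screened, selected

rng = random.Random(2008)
runs = [stepwise_on_noise(rng) for _ in range(30)]
nonempty = sum(1 for _, selected in runs if selected)
print("runs keeping at least one noise x:", nonempty, "of", len(runs))
print("x's kept in the first run (0-based):", sorted(runs[0][1]))
```

Any x that passes the screen is also strong enough to lower the AIC, so under these assumptions a run comes back empty only when all 50 noise x’s fail the screen, which has a chance of roughly 0.9^50, about 1 in 200.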

[1] I will explain this unfortunate term later.
[2] I first did a “univariate analysis” and only fed into the stepwise routine those x’s which singly had p-values < 0.1. This is done to ease the computational burden of checking all models by first eliminating those x's which are unlikely to be "important." This is also a distressingly common procedure.

You cannot measure a mean

I often say---it is even the main theme of this blog---that people are too certain. This is especially true when people report results from classical statistics, or use classical methods…

Stats 101: Chapter 3

Three is ready to go. I should re-emphasize one of the goals of this book. It is meant to be for that large host of unfortunates who are forced---I mean…

Stats 101: Chapter 4

Chapter 4 is ready to go.

This is where it starts to get weird. The first part of the chapter introduces the standard notation of “random” variables, and then works through a binomial example, which is simple enough.

Then come the so-called normals. However, they are anything but. For probably most people, it will be the first time that they hear about the strange creatures called continuous numbers. It will be more surprising to learn that not all mathematicians like these things or agree with their necessity, particularly in problems like quantifying probability for real observable things.

I use the word “real” in its everyday, English sense of something that is tangible or that exists. This is because mathematicians have co-opted the word “real” to mean “continuous”, which in an infinite amount of cases means “not real” or “not tangible” or even “not observable or computable.” Why use these kinds of numbers? Strange as it might seem, using continuous numbers makes the math work out easier!

Again, what is below is a teaser for the book. The equations and pictures don’t come across well, and neither do the footnotes. For the complete treatment, download the actual Chapter.

Distributions

1. Variables

Recall that random means unknown. Suppose x represents the number of times the Central Michigan University football team wins next year. Nobody knows what this number will be, though we can, of course, guess. Further suppose that the chance that CMU wins any individual game is 2 out of 3, and that (somewhat unrealistically), a win or loss in any one game is irrelevant to the chance that they win or lose any other game. We also know that there will be 12 games. Lastly, suppose that this is all we know. Label this evidence E. That is, we will ignore all information about who the future teams are, what the coach has leaked to the press, how often the band has practiced their pep songs, what students will fail their statistics course and will thus be booted from the team, and so on. What, then, can we say about x?

We know that x can equal 0, or 1, or any number up to 12. It’s unlikely that CMU will lose or win every game, but they’ll probably win, say, somewhere around 2/3s, or 6-10, of them. Again, the exact value of x is random, that is, unknown.

Now, if last chapter you weren’t distracted by texting messages about how great this book is, this situation might feel a little familiar. If we let x (instead of k---remember these letters are placeholders, so whichever one we use does not matter) represent the number of classmates you drive home, where the chance that you take any of them is 10%, we know we can figure out the answer using the binomial formula. Our evidence then was E_B. And so it is here, too, when x represents the number of games won. We’ve already seen the binomial formula written in two ways, but yet another (and final) way to write it is this:

x | n, p, E_B ~ Binomial(n, p).

This (mathematical) sentence reads “Our uncertainty in x, the number of games the football team will win next year, is best represented by the Binomial formula, where we know n, p, and our information is E_B.” The “~” symbol has a technical definition: “is distributed as.” So another way to read this sentence is “Our uncertainty in x is distributed as Binomial where we know n, etc.” The “is distributed as” is longhand for “quantified.” Some people leave out the “Our uncertainty in”, which is OK if you remember it is there, but is bad news otherwise. This is because people have a habit of imbuing x itself with some mystical properties, as if “x” itself had a “random” life. Never forget, however, that it is just a placeholder for the statement X = “The team will win x games”, and that this statement may be true or false, and it’s up to us to quantify the probability of it being true.

In classic terms, x is called a “random variable”. To us, who do not need the vague mysticism associated with the word random, x is just an unknown number, though there is little harm in calling it a “variable,” because it can vary over a range of numbers. However, all classical, and even much Bayesian, statistical theory uses the term “random variable”, so we must learn to work with it.

Above, we guessed that the team would win about 6-10 games. Where do these numbers come from? Obviously, they are based on the knowledge that the chance of winning any game was 2/3 and there’d be twelve games. But let’s ask more specific questions. What is the probability of winning no games, or X = “The team will win x = 0 games”; that is, what is Pr(x = 0 | n, p, E_B)? That’s easy: from our binomial formula, this is (see the book) ≈ 2 in a million. We don’t need to calculate n choose 0 because we know it’s 1; likewise, we don’t need to worry about 0.67^0 because we know that’s 1, too. What is the chance the team wins all its games? Just Pr(x = 12 | n, p, E_B). From the binomial, this is (see the book) ≈ 0.008 (check this). Not very good!
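For readers without the book at hand, the two calculations can be written out as follows, assuming the binomial formula referred to is the standard one and taking n = 12 and p = 2/3 from the football example:

```latex
\begin{aligned}
\Pr(x \mid n, p, E_B) &= \binom{n}{x}\, p^{x} (1-p)^{n-x}, \\
\Pr(x = 0 \mid n, p, E_B) &= \binom{12}{0} \left(\tfrac{2}{3}\right)^{0} \left(\tfrac{1}{3}\right)^{12}
  = \left(\tfrac{1}{3}\right)^{12} \approx 1.9 \times 10^{-6}, \\
\Pr(x = 12 \mid n, p, E_B) &= \binom{12}{12} \left(\tfrac{2}{3}\right)^{12} \left(\tfrac{1}{3}\right)^{0}
  = \left(\tfrac{2}{3}\right)^{12} \approx 0.0077.
\end{aligned}
```

The first is roughly 2 in a million and the second rounds to 0.008.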

Recall we know that x can take any value from zero to twelve. The most natural question is: what number of games is CMU most likely to win? Well, that’s the value of x that makes (see the book) the largest, i.e. the most probable. This is easy for a computer to do (you’ll learn how next Chapter). It turns out to be 8 games, which has about a one in four chance of happening. We could go on and calculate the rest of the probabilities, for each possible x, just as easily.
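Until the next chapter shows how the book does it, here is one way a computer might find the most likely number of wins. This is a small Python sketch, not the book’s own code; it assumes the standard binomial formula with n = 12 and p = 2/3.

```python
from math import comb

n, p = 12, 2 / 3

# Pr(x | n, p, E_B) for every possible number of wins x = 0, 1, ..., 12.
probs = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

# The most likely x is the one with the largest probability.
most_likely = max(range(n + 1), key=lambda x: probs[x])

for x, pr in enumerate(probs):
    print(f"{x:2d} wins: {pr:.4f}")
print("most likely number of wins:", most_likely)
```

The largest probability lands on 8 wins, at about 0.24, which is the “one in four” chance quoted above.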

Which number of games the team is most likely to win is the most natural question for us, but in pre-computer classical statistics, there turns out to be a different natural question, and this has something to do with creatures called expected values. That term turns out to be a terrible misnomer, because we often do not, and cannot, expect any of the values that the “expected value” calculations give us. The reason expected values are of interest has to do with some mathematics that are not of especial interest here; however, we will have to take a look at them because it is expected of one to do so.
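As a preview of why the name misleads, here is a short Python sketch. It assumes the usual definition of an expected value (each possible value weighted by its probability, then summed). The second call uses p = 0.6, a hypothetical chance of winning that does not come from the text.

```python
from math import comb

def expected_value(n, p):
    """Sum of x * Pr(x | n, p) over every possible number of wins x."""
    return sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

print(round(expected_value(12, 2 / 3), 6))  # the CMU example from the text
print(round(expected_value(12, 0.6), 6))    # hypothetical p, for contrast
```

For the CMU numbers the expected value is 8 games, which happens to be a value the team could actually reach. With p = 0.6 it is 7.2 games, a number of wins that no team can ever record.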