This is the start of a series of reference articles explaining analytical techniques, with a focus on philosophy, understanding, and common mistakes rather than mathematics.
Regression, a.k.a. “linear regression”, is the most-used analysis technique, responsible for nearly all headlines which begin, “New research shows…” Its misuse is also the biggest reason scientists, a.k.a. “researchers”, believe and promulgate nonsense.
The technique is fairly easy to grasp, at least in the sense that its implementation is trivial. And that is the problem. It’s too easy; or, rather, cheap software combined with a bit of magical thinking (decision by wee p-values) makes using regression painless. “Results” are free for the asking, and you know the saying about getting what you pay for. Which is why there is a flood of papers gushing out of academia “proving” anything researchers want to believe.
So what is regression? Let’s first discuss what it should be in broad outline form, eschewing all technical details, which we’ll come to later. Let’s not worry about how it works—no distributions, parameters, or p-values this time—but what it means. All along I’ll give tips on how it is misused and misinterpreted.
Start simple. You have some thing, which is represented by a number, and you want to express the uncertainty that the thing takes certain values. It is customary and a great convenience to call this thing “y”. Y might be a grade point average, an amount of money, tomorrow’s high temperature, an answer to an arbitrary question (“On a scale of 1 to 5…”), and so on endlessly.
Sociologists, who form the largest group of abusers of statistical methods, are great ones for inventing questions and imbuing them with terrible meaning. Typically, they create a questionnaire the answers of which are coded numerically. From these a “scale” is derived, i.e. a number which is a function of the answers. This scale usually ranges from 1 to 5, or from 1 to 9, or something like that. It is always given a hopeful name, like “The Conscientious Index”, “Openness to Change,” or “General Health”. (More on this in another post.)
To fix ideas, use the fictional “Hate Scale” for our y, which consists of the single question, “On a scale of 1 to 10, how much do you despise those who disagree with you politically?” Apt for our current political milieu. We want to understand the uncertainty of a person answering this question. With what probability will he answer 1? 2? And so on. This is what regression is meant to tell us. Never mind now how regression assigns probabilities, just keep in mind that it does.
Now we might also measure a person’s biological sex, speculating that males and females will answer the question differently. Or perhaps older or younger people answer differently, so we measure age. Education might play a role. And so forth. There is no limit to the number of things which might cause a person to choose his answer, and indeed something (or things) causes each person to pick his answer. But regression is not (or not usually) a causal discovery model. It is merely correlative. The idea is to measure just those characteristics (call them x’s) which change our minds about y. That is, if the probability y takes a certain value changes knowing a person has this rather than that value of a characteristic, then that characteristic is important to understanding y. If the probability of y does not change as the characteristic varies, then the characteristic isn’t important.
Regression is supposed to be this: given a particular value of each of the x’s in our “regression model”, regression gives us the probability y takes each of the values it can take. That’s it; that’s all regression is, or that’s all it should be. Statements of results should concentrate on how much, if at all, each of the x’s changes the uncertainty in the y. Causative language should be minimal and cautious.
Before we get to our main example (next time), here is what regression is meant to be. Suppose all we had in our model was sex, male or female. Y can take the values 1, 2, …, 10. Given a set of observed data and knowing a person is a male, the regression should tell us the probability this male answers y = 1, y = 2, … and y = 10. Then knowing a person is a female, the regression should again tell us the probability this female answers y = 1, y = 2, … and y = 10. If these two sets of probabilities are exactly the same, then knowing a person’s sex is irrelevant to knowing their Hate score. If the two sets differ for any of the levels of y, then something about sex, or something associated with it, is relevant to understanding the uncertainty in the score.
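For readers who like to see the idea in code, here is a rough sketch of the sex-only comparison. Everything in it is an assumption made for illustration: the eight observations are invented, the names `score_probabilities` and `sex_is_relevant` are hypothetical, and plain relative frequencies stand in for whatever a real regression would use to assign its probabilities (which we have deliberately not discussed yet).

```python
from collections import Counter

# Invented toy data, for illustration only: (sex, Hate score) pairs.
observations = [
    ("male", 3), ("male", 7), ("male", 7), ("male", 10),
    ("female", 3), ("female", 3), ("female", 7), ("female", 8),
]

SCORES = range(1, 11)  # y can take the values 1, 2, ..., 10


def score_probabilities(data, sex):
    """Relative frequency of each score among people of the given sex.

    A crude stand-in for the probabilities a regression would supply.
    """
    scores = [y for s, y in data if s == sex]
    counts = Counter(scores)
    return {y: counts[y] / len(scores) for y in SCORES}


def sex_is_relevant(data):
    """True if the two sets of probabilities differ at any level of y."""
    male = score_probabilities(data, "male")
    female = score_probabilities(data, "female")
    return any(male[y] != female[y] for y in SCORES)


male_probs = score_probabilities(observations, "male")
female_probs = score_probabilities(observations, "female")
print(male_probs)
print(female_probs)
print(sex_is_relevant(observations))
```

In this toy set the two tables differ (half the males answer 7, half the females answer 3), so sex, or something associated with it, would count as relevant. The sketch says nothing about why they differ.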
Regression is thus a prediction. It tells us the probability of “events” not yet seen. Consider: we do not need regression, or any statistical method, to tell us about the data we have observed, because (can you guess?) we have observed that data. Except in those instances where data is measured with error, we know everything there is to know about that data. If we want to know how many men (of any kind) had Hate scores greater than 5, all we have to do is count. We don’t need to “estimate” anything, except for that which is still hidden from us, like the future.
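The counting really is this simple. A minimal sketch, again with invented observations and a hypothetical variable name (`men_above_5`), neither of which comes from any real data set:

```python
# Invented toy data, for illustration only: (sex, Hate score) pairs.
observed = [
    ("male", 3), ("male", 7), ("male", 7), ("male", 10),
    ("female", 3), ("female", 8),
]

# No model, no estimate: just count the men whose score exceeds 5.
men_above_5 = sum(1 for sex, score in observed if sex == "male" and score > 5)
print(men_above_5)
```

The count is a fact about data already in hand. Only a question about a man not yet measured calls for a probability.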
Simple as that. So why are regression results never stated in this form?
Next time: examples.