Saw this tweet this fine morning:
and downloaded the paper at the landing site, co-authored by Judea Pearl, he of Causality fame (a recommended book). Pearl and a co-author (grad student?) took six economics books from the shelves and pondered what those books said about causality. In the non-headline of the week, they discovered the books don’t agree about what regression and structural equations mean regarding causality. And so they wrote that paper (who said there were too many papers?).
I blame us statisticians for the enormous misunderstanding of this topic. We teach this subject badly; too much focus on the math and not nearly enough on the interpretation. Here’s what Chen and Pearl open with:
Assuming that the relationship between the variables is linear, the structural equation is written Y = βX + ε. Additionally, if X is statistically independent of ε, often called exogeneity, linear regression can be used to estimate the value of β, the “effect coefficient”…
If the assumptions underlying the model are correct, the model is capable of answering all causally related queries, including questions of prospective and introspective counterfactuals.
Ignore the jargon if you don’t know it; it’s not important for us. Key is the “structural equation model” (SEM), also called a linear regression (they expand the simple form to the matrix form, which doesn’t change what follows one whit). Here’s the equation again: Y = βX + ε, where Y is the thing we want to know about (“dependent” variable, outcome, etc.) and X is a thing thought to be associated with Y (“independent variable”, “regressor”, the “data”, etc.).
First thing to notice: this equation is wrong. Yes, and in a sad way, too. Oh, sure, it’s right in the sense that it’s shorthand for the proper equation, and math is filled with shorthand notations. Problem is, people who use this equation quickly forget it is shorthand and start thinking it’s the whole thing. And that’s when people become confused about causality and everything else.
The real, full-blown equation (in its simplest complete form) is this:
Y ~ Normal(μ, σ)
μ = βX
σ = f(X; ε) (update fixed typo here)
We start with some real, observable thing: a Y. We announce to the world “My uncertainty in Y, the values it will take, is quantified by a normal distribution indexed by the parameters μ and σ.” Those parameters themselves are modeled by the Xs in the manner shown (there are various ways of handling σ, which are not of interest to us, hence the fudge of writing it as a function of X). I’m assuming you know what a normal distribution is and what it looks like (many other distributions are possible).
In other words, the regression is modeling how the uncertainty in Y changes when X changes. The regression equation is not telling us how Y changes when X changes: I repeat that it tells us how the uncertainty in Y changes when X changes. When X changes, μ and σ change; in effect, we draw new normal distributions to represent our uncertainty in Y—not in Y itself, but I emphasize yet again, the uncertainty in Y—for every new possible value of X.
Just as a for example: if X = 0, then our uncertainty in Y is modeled by a normal distribution centered at 0 with some plus or minus specified by σ. When X = 1, then we shift our uncertainty in Y by β (to the right if β is positive, to the left if β is negative). Got it?
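To make the X = 0 versus X = 1 example concrete, here is a toy sketch in Python. The values of β and σ are made up for illustration, and σ is assumed constant (just one of the various ways of handling it); the helper names are hypothetical. All it does is build the normal distribution that represents our uncertainty in Y at each X.

```python
from statistics import NormalDist

BETA = 2.0   # assumed slope, chosen only for illustration
SIGMA = 1.5  # assumed constant spread (one simple way to handle sigma)

def uncertainty_in_y(x, beta=BETA, sigma=SIGMA):
    """Normal distribution quantifying our uncertainty in Y when X = x."""
    return NormalDist(mu=beta * x, sigma=sigma)

def interval(dist, prob=0.95):
    """Central interval holding Y with probability `prob` under the model."""
    tail = (1 - prob) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

at_0 = uncertainty_in_y(0)
at_1 = uncertainty_in_y(1)

for x, dist in ((0, at_0), (1, at_1)):
    low, high = interval(dist)
    print(f"X = {x}: centre {dist.mean:.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

Moving X from 0 to 1 slides the whole distribution over by β. Nothing here says what Y will be, only where our uncertainty in Y sits.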
Now one notion of causality suggests that if you pull lever X, gauge Y changes by an exactly specified amount: every time X is in position x, Y is in y. Every time. If X = x, Y = y; Y never equals y + ε, nor does it equal anything else but y. In this sense, the regression/SEM is not a causal model. How could it be? Notice how I repeat ad nauseam that the regression specifies our uncertainty in Y? If the value of Y when X = x is at all uncertain (if σ > 0), then the model cannot be causal. X might move the bulk of Y (X might be an incomplete part of the causal chain), but there is something else (who knows what) causing Y to depart from y.
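The lever-and-gauge distinction can be shown with another toy sketch, again with invented values of β and σ and hypothetical function names. The "lever" returns the same Y every time X = x. The regression, read as a data generator purely for illustration, does not.

```python
import random

BETA = 2.0   # assumed slope, for illustration only
SIGMA = 0.5  # assumed spread; any value above 0 makes the point

def lever(x):
    """First notion of causality: X = x puts Y at exactly one value."""
    return BETA * x

def regression_draw(x, rng):
    """The regression: Y at X = x is uncertain, normal around BETA * x."""
    return rng.gauss(BETA * x, SIGMA)

rng = random.Random(0)  # seeded so the run is repeatable

# Pull the lever to the same position 1000 times and collect distinct Ys.
lever_values = {lever(1.0) for _ in range(1000)}
regression_values = {regression_draw(1.0, rng) for _ in range(1000)}

print(len(lever_values))       # one distinct value of Y
print(len(regression_values))  # many distinct values of Y, since SIGMA > 0
```

Same X, different Ys: whatever is moving Y around at fixed X is not in the model.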
A second notion of causality insists that when X = x, Y is caused to be y, but then “external forces” cause Y to depart subtly from y. This notion includes measurement error. Y really is y, but our imperfect ways of measuring make it appear Y shifts a tiny bit from y. In other words, something (the process of measurement) causes Y to leak away from y. Notice we do not in this instance have direct empirical evidence that Y = y; we merely insist it is (in an act of faith?).
The second kind of causality is (almost necessarily) the kind one finds in real-life things: chemical models, behavior of photons, and so forth. It shows up when we have good additional external evidence that Y should be y when X = x. It is this external information which insists that our uncertainty in Y takes a normal distribution (or whatever).
This notion is sometimes misapplied in models where the amount of uncertainty is larger, where the additional external evidence is weak, incomplete, or is itself wrong or uncertain. Climate models, medical models, and so forth are good examples. Here the models are nearly entirely statistical: rules of thumb set to equations. There is no complete set of premises from which we can deduce that our uncertainty in Y is represented by a normal distribution. We just say it is (from habit, or whatever). It is here that we find the deadly sin of reification.
The first kind of causality is found inside computers. I press Y on the keyboard and see a Y on screen.
Some statistics textbooks get the equations above right, but they usually abandon the right way for the shorthand, because the shorthand is compact. Who the hell wants to write “the uncertainty of Y” every time instead of just “Y”? Problem is, once you drop it, reality becomes fuzzy, and you suddenly find yourself speaking of “residuals” as if they were real creatures. About those, another day.