Saw this tweet this fine morning:

Regression & Causation in Econometrics Textbooks – Dave Giles http://t.co/RaenETSW

— Mark Thoma (@MarkThoma) December 6, 2012

and downloaded the paper at the landing site, co-authored by Judea Pearl, he of *Causality* fame (a recommended book). Pearl and a co-author (grad student?) took six economics books from the shelves and pondered what those books said about causality. In the non-headline of the week, they discovered the books don’t agree about what regression and structural equations mean regarding causality. And so they wrote that paper (who said there were too many papers?).

I blame us statisticians for the enormous misunderstanding of this topic. We teach this subject badly; too much focus on the math and not nearly enough on the interpretation. Here’s what Chen and Pearl open with:

Assuming that the relationship between the variables is linear, the structural equation is written Y = βX + ε. Additionally, if X is statistically independent of ε, often called exogeneity, linear regression can be used to estimate the value of β, the “effect coefficient”…

If the assumptions underlying the model are correct, the model is capable of answering all causally related queries, including questions of prospective and introspective counterfactuals.

Ignore the jargon if you don’t know it; it’s not important for us. Key is the “structural equation model” (SEM), also called a linear regression (they expand the simple to the matrix form, which doesn’t change what follows one whit). Here’s the equation again: Y = βX + ε, where Y is the thing we want to know about (“dependent” variable, outcome, etc.) and X is a thing thought to be associated with Y (“independent variable”, “regressor”, the “data”, etc.).

First thing to notice: **this equation is wrong**. Yes, and in a sad way, too. Oh, sure, it’s right in the sense that it’s shorthand for the proper equation, and math is filled with shorthand notations. Problem is, people who use this equation quickly forget it is shorthand and start thinking it’s the whole thing. And that’s when people become confused about causality and everything else.

The real, full-blown equation (in its simplest complete form) is this:

Y ~ Normal(μ, σ)

μ = βX

σ = f(X; ε) (**update** fixed typo here)

We start with some real, observable thing: a Y. We announce to the world “My uncertainty in Y, the values it will take, is quantified by a normal distribution indexed by the parameters μ and σ.” Those *parameters* themselves are modeled by the Xs in the manner shown (there are various ways of handling σ, which are not of interest to us, hence the fudge of writing it as a function of X). I’m assuming you know what a normal distribution is and what it looks like (many other distributions are possible).

In other words, the regression is modeling how the uncertainty in Y changes when X changes. The regression equation is *not* telling us how Y changes when X changes: I repeat that it tells us how the *uncertainty* in Y changes when X changes. When X changes, μ and σ change; in effect, we draw new normal distributions to represent our uncertainty in Y—*not* in Y itself, but I emphasize yet again, the *uncertainty* in Y—for every new possible value of X.

Just as a for example: if X = 0, then our uncertainty in Y is modeled by a normal distribution centered at 0 with some plus or minus specified by σ. When X = 1, then we shift our uncertainty in Y by β (to the right if β is positive, to the left if β is negative). Got it?
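To make the shifting concrete, here is a minimal Python sketch. The values of β and σ are made up purely for illustration, σ is held constant (the simplest possible choice for f), and the function name `uncertainty_in_y` is hypothetical.

```python
from statistics import NormalDist

def uncertainty_in_y(x, beta, sigma):
    """Return the normal distribution quantifying our uncertainty in Y at X = x."""
    return NormalDist(mu=beta * x, sigma=sigma)

beta, sigma = 2.0, 1.5  # hypothetical values, chosen only for illustration

for x in (0, 1, 2):
    dist = uncertainty_in_y(x, beta, sigma)
    # Central 95% interval of the normal that represents our uncertainty in Y
    lo, hi = dist.inv_cdf(0.025), dist.inv_cdf(0.975)
    print(f"X = {x}: Y ~ Normal({dist.mean:.1f}, {dist.stdev:.1f}); "
          f"95% interval ({lo:.2f}, {hi:.2f})")
```

Each new value of X produces a new distribution, not a new value of Y: the center moves by β per unit of X, and the plus or minus stays put because σ was assumed constant here.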

Now one notion of causality suggests that if you pull lever X, gauge Y changes by an exactly specified amount: *every time* X is in position x, Y is in y. Every time. If X = x, Y = y; Y never equals y + ε, nor does it equal anything else but y. In this sense, the regression/SEM is *not* a causal model. How could it be? Notice how I repeat *ad nauseam* that the regression specifies our *uncertainty* in Y? If the value of Y when X = x is at all uncertain (if σ > 0), then the model cannot be causal. X might move the bulk of Y—X might be an incomplete part of the causal chain—but there is *something else* (who knows what) *causing* Y to depart from y.

A second notion of causality insists that when X = x Y is caused to be y, but then “external forces” *cause* Y to depart subtly from y. This notion includes measurement error. Y really is y, but our imperfect ways of measuring make it appear Y shifts a tiny bit from y. In other words, something (the process of measurement) *causes* Y to leak away from y. Notice we do not in this instance have direct empirical evidence that Y = y; we merely insist it is (in an act of faith?).

The second kind of causality is (almost necessarily) the kind one finds in real-life things: chemical models, behavior of photons, and so forth. It shows up when we have good additional external evidence that Y should be y when X = x. It is this external information which insists that our uncertainty in Y takes a normal distribution (or whatever).

This notion is sometimes *misapplied* in models where the amount of uncertainty is larger, where the additional external evidence is weak, incomplete, or is itself wrong or uncertain. Climate models, medical models, and so forth are good examples. Here the models are nearly entirely statistical: rules of thumb set to equations. There is no complete set of premises from which we can deduce that our uncertainty in Y is represented by a normal distribution. We just *say* it is (from habit, or whatever). It is here that we find the deadly sin of reification. The first kind of causality is found inside computers. I press Y on the keyboard and see a Y on screen.

Some statistics textbooks get the equations above right, but they usually abandon the right way for the shorthand, because the shorthand is compact. Who the hell wants to write “the uncertainty of Y” every time instead of just “Y”? Problem is, once you drop it, reality becomes fuzzy, and you suddenly find yourself speaking of “residuals” as if they were real creatures. About those, another day.

6 December 2012 at 12:01 pm

You make your guess, spin the wheel, and where she stops nobody knows until she stops. Even then, you don’t know where she stopped or why. Oh but you do have the numbers, the equations, the endless calculations, and a mountain of unstated assumptions to be able to fill paper after paper after paper after … each paper being the cause of the next. Reality? We don’t need no stinking reality. We have PAPERS!

6 December 2012 at 12:32 pm

Forgive the really stupid question but, given what you’ve said, what’s the point of regression?

6 December 2012 at 12:52 pm

Jonathan,

It’s a good question. Short answer is: to quantify uncertainty. Not all uncertainty should be quantified, but suppose you want to. Suppose we’re interested in quantifying the uncertainty of the temperature of July afternoons in beautiful Charlevoix, Michigan. We might use a normal distribution with guesses for the two parameters. If we accept these guesses are error free, then we could say things like, “The temperature of July afternoons is likely to be between this and that number.” Of course, this is only a crude approximation, bearing in mind also that no real thing is “distributed” normally. But close enough to play with.

Enter the regression. We might think cloudy versus sunny days are different, temperatures wise. So X = “Is it cloudy?” Then we would have two normal distributions to quantify our uncertainty in temperature, one for sunny and one for cloudy days.
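A small Python sketch of the two-distribution idea follows. The temperatures are invented for illustration (they are not Charlevoix records), the guessed parameters are simply the sample mean and standard deviation of each group, and each group is allowed its own σ.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical July afternoon temperatures (°F); invented for illustration only.
temps = {
    "sunny":  [78, 81, 84, 79, 83, 80],
    "cloudy": [70, 74, 72, 69, 73, 71],
}

# With a single yes/no X, the model reduces to one normal per group
# (here each group also gets its own sigma).
models = {sky: NormalDist(mean(t), stdev(t)) for sky, t in temps.items()}

# The "beta" for X = "Is it cloudy?" is the shift between the two centers.
beta = models["cloudy"].mean - models["sunny"].mean

for sky, m in models.items():
    lo, hi = m.inv_cdf(0.05), m.inv_cdf(0.95)
    print(f"{sky}: likely between {lo:.1f} and {hi:.1f} "
          f"(90% interval, treating the guessed parameters as error free)")
print(f"shift in the center when cloudy: {beta:.1f}")
```

The intervals printed are statements about the temperature itself, and they are only as good as the assumption that the guessed parameters are right.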

Problems arise when people forget that the parameters are guesses and not observables, that certainty in the values of the parameters does NOT equate to certainty in the actual Ys (the temperature here), and other things.

But regression can be useful. And here, too, it can say something (NOT everything) about causality, but only because we have external evidence about how clouds and temperature work.

6 December 2012 at 2:07 pm

Excellent explanation of regression. I became involved in curve fitting in the 1970s and always looked at it as an optimization problem, not a statistics problem. For example, suppose you want to fit a straight line, Y=AX+B, to some X,Y data points. You want to select A and B to minimize the error in the curve fit in some sense. Each data point will have an associated error, Ei=Yi-(AXi+B), so you have an error vector (E1,E2,…,En) and you want to select A and B to minimize a norm of the error vector. This is an optimization problem. In optimization you are trying to minimize a penalty function. Statisticians believe the Euclidean norm is what should be minimized. It was my experience that minimizing the absolute error gives a better curve fit to really noisy data, minimizing the squared error is good for most data, but if you have really good low-dispersion data, a minimax error criterion gives the best fit. This is what I saw (eyeball observation) and I have no theoretical justification for this.
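The three penalty functions can be compared with a short Python sketch. This is only an illustration under stated assumptions: the data are made up, the search is a crude brute-force grid (not how a real fitting package works), and the names `fit_line`, `sum_abs`, `sum_sq`, and `max_abs` are hypothetical.

```python
def residuals(a, b, xs, ys):
    """Errors E_i = y_i - (a*x_i + b) for the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Three penalty functions (norms of the error vector).
def sum_abs(errs):   # L1: total absolute error
    return sum(abs(e) for e in errs)

def sum_sq(errs):    # squared L2: least squares
    return sum(e * e for e in errs)

def max_abs(errs):   # L-infinity: minimax criterion
    return max(abs(e) for e in errs)

def fit_line(xs, ys, penalty, step=0.05):
    """Brute-force search over slopes in [0, 5] and intercepts in [-5, 5]."""
    n = round(1 / step)
    grid = ((i / n, j / n)
            for i in range(0, 5 * n + 1)
            for j in range(-5 * n, 5 * n + 1))
    return min(grid, key=lambda ab: penalty(residuals(ab[0], ab[1], xs, ys)))

# Hypothetical data: y = 2x + 1 exactly, except for one wild last point.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 21]

for name, penalty in [("L1", sum_abs), ("L2", sum_sq), ("minimax", max_abs)]:
    a, b = fit_line(xs, ys, penalty)
    print(f"{name:8s} slope = {a:.2f}, intercept = {b:.2f}")
```

On this made-up data the absolute-error fit stays on the five collinear points, while the squared-error fit is pulled toward the outlier. That is consistent with the eyeball observation about noisy data, though one toy data set proves nothing in general.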

6 December 2012 at 4:47 pm

Suppose X and Y are normally distributed random variables with mean 0 and 1 respectively. Both have a variance of 1.

Saying my uncertainty in my expectations of Y is greater than my uncertainty in my expectations of X doesn’t sound right. My expectations are different, but the degree of uncertainty is about the same.

Okay, thought number 2…

X and Y are random variables as described above. Further suppose that Y is correlated with X and the correlation coefficient = 0.5

Then can I not say that:

Y = 0.5*X + Z where Z is a normally distributed random variable, independent of X, with mean 1 and variance 3/4.
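The arithmetic behind this can be checked directly: Var(Y) = 0.25·Var(X) + Var(Z) = 0.25 + 0.75 = 1, E(Y) = 0 + 1 = 1, and Cov(X, Y) = 0.5, so the correlation is 0.5. As a rough numerical check, here is a Python simulation sketch; the sample size and seed are arbitrary choices, and Z is drawn independently of X.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 50_000

# X: mean 0, variance 1.  Z: mean 1, variance 3/4, drawn independently of X.
xs = [random.gauss(0, 1) for _ in range(n)]
zs = [random.gauss(1, 0.75 ** 0.5) for _ in range(n)]
ys = [0.5 * x + z for x, z in zip(xs, zs)]

def avg(v):
    return sum(v) / len(v)

def var(v):
    m = avg(v)
    return sum((p - m) ** 2 for p in v) / len(v)

def correlation(a, b):
    """Pearson correlation computed from first principles."""
    ma, mb = avg(a), avg(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)
    return cov / (var(a) * var(b)) ** 0.5

print(f"mean(Y) = {avg(ys):.3f}, var(Y) = {var(ys):.3f}, "
      f"corr(X, Y) = {correlation(xs, ys):.3f}")
```

The simulated figures should land near 1, 1, and 0.5, up to sampling noise.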

Finally,

If you run a linear regression and your data shows that

σ ~ f(X; ε)

Then you have a problem…or shall I say, if your residuals are not themselves normally distributed, and independent from X and Y, then the parameters of your least squared error linear regression calculations are not unbiased estimates of your parameters of uncertainty (uncertainty in your uncertainty).

6 December 2012 at 5:05 pm

Doug M,

First, and importantly (and which I hope does not sound pedantic), it is improper to say (though everybody does say) something like “X and Y are normally distributed random variables with mean 0 and 1 respectively. Both have a variance of 1.” Nothing in the universe “is normally distributed.” What we can say is that “My uncertainty in the values X and Y will take is quantified with a normal with parameters 0 and 1.”

You lose me with point 1. Unless you mean to claim your uncertainty in both X and Y is the same, and then I agree.

Thought 2: another thing never to say is “random variables”. Since random only means unknown, we can’t have “unknown variables.” We can have variables whose values we do not know, but for which we can quantify the uncertainty that they take certain values.

About your regression equation, it’s really no different than what’s described in the main text, except where instead of “estimating” the ε, you claim to know all about Z, which is fine.

And you caught me in a typo. How the heck did a tilde get in where an equal sign was supposed to be!!?? The correct equation, which I’ll fix in the main text, is

σ = f(X;ε)

Equal sign, not “my uncertainty is characterized by” (the tilde).

6 December 2012 at 7:03 pm

Random variables (normally distributed or not) are perfectly legitimate as mathematical objects, and are used to derive the theory of prob and stat. Whether that describes your data is something else.

Regarding,

σ = f(X;ε)

My point is that if your estimate of sigma is a function of X, then the formula in your Excel regression package will lead you toward a false sense of accuracy in your parameters.

6 December 2012 at 7:28 pm

Doug M,

I would only change that to read “Variables are perfectly legitimate, etc. etc.” The “randomness” is not needed and tends to confuse and to improperly invoke a sort of mysticism.

6 December 2012 at 9:28 pm

“In other words, something (the process of measurement) causes Y to leak away from y. Notice we do not in this instance have direct empirical evidence that Y = y; we merely insist it is (in an act of faith?).”

And yet, (some time ago) when y was a perfect triangle universal and by insistence Y was its imperfect instantiation, no leap of faith was being made.

“Reality? We don’t need no stinking reality.”

Indeed. All we really know are our models of reality some of which seem to be “good enough” approximations.

7 December 2012 at 5:07 am

Okay, I find myself moving on slightly but I’m so mixed up about this (and I’m a teacher of stats to pre-university students). In the courses I teach, regression is presented as curve fitting. For instance, consumption is suggested to be a linear function of income. So, you get a bunch of data to work out something like C=2.3Y+4.3. Okay, now we recognise that this is a sample of data and that we can do tests on the values 2.3 and 4.3 to get confidence intervals (my students don’t have to get that far – just find the line of best fit – phew). Excel or SPSS will knock out these figures easy peasy.

7 December 2012 at 5:12 am

Sorry ran out of space. Anyway, it seems to me that regression is presented like this higher up the education system. You plug in a bunch of data that regression fits to a curve, so this gives you a nice equation to predict stuff. But, from my own poor grasp of this (how do you know what type of curve to use?) and from this discussion, it seems that this is much harder than that.

Folks, sorry to ask such basic stuff but this is really helping me get a handle on this. Thanks

7 December 2012 at 9:22 am

“About those, another day” – will that be the same day you do a “new statistics” calculation showing all the steps end-to-end?

7 December 2012 at 9:30 am

Rich,

A lot (all?) of that stuff is in my (free!) book, and code for simple examples is on the book page.

7 December 2012 at 11:24 am

Ah yes. I followed all that right up to the ‘scenarios’ where I got lost.

What I’m really asking is that you put up a blog post that contains a degree course in statistics that’s really easy to follow.

It’s OK. I can wait….

7 December 2012 at 12:05 pm

Any comments on this new study, “Detecting Causality in Complex Ecosystems”, which proposes a new approach to detect causality (http://www.sciencemag.org/content/338/6106/496.abstract)?

7 December 2012 at 12:20 pm

Johnathan Andrews,

I agree that that is how regression is taught in school, and largely how it is used in industry.

But then the next question is why are you fitting this curve to the data? Usually because you intend to build some kind of model. Not only do I believe that X has an influence on Y, but I can observe X, and I cannot so easily observe Y. I will build a model that will predict X based upon Y.

Great. Regression allows you to calibrate this model.

Once you have done the regression and built the model, the question is, is this model any good? Will it be useful? That is where things get tricky.

Are the regression co-efficients sufficiently different from zero?

Is the “R squared” sufficiently different from zero?

How big is the error?

Create a scatter plot, does it look funny? Are there outliers?

If you decide to exclude an outlier, think long and hard about why you are doing it.

Plot the error… is the size of the errors correlated with your data ser

7 December 2012 at 12:31 pm

strange, browser went funny on me…

are the errors correlated with your data. If so, you are likely missing something.

Which model to use… apply Occam’s Razor. Linear is better than quadratic. Single variable is better than multi-variable.

Always start with a premise. Never find a correlation and try to rationalise why it might be. If you run enough regressions, you will eventually find a spurious correlation.

More data is not necessarily better.

Okay, so you have a model, you think it is reasonable. Next step is to check it with out of sample data.

Okay, last piece, your model is just a model, don’t get it confused with reality. Your model will never be as good of a predictor as it seemed to be in the lab.
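The out-of-sample check can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data are simulated from a noisy straight line, the split is a simple hold-back of the last third, and the helper names `fit_ols` and `rmse` are hypothetical.

```python
import random

def fit_ols(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def rmse(a, b, xs, ys):
    """Root-mean-square prediction error of the line on the given data."""
    return (sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

random.seed(1)
# Hypothetical data: a noisy straight line, invented for illustration.
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [1.5 * x + 4 + random.gauss(0, 2) for x in xs]

# Hold back the last third; the model never sees it during fitting.
train_x, test_x = xs[:40], xs[40:]
train_y, test_y = ys[:40], ys[40:]

a, b = fit_ols(train_x, train_y)
print(f"fit: y = {a:.2f}x + {b:.2f}")
print(f"in-sample RMSE:     {rmse(a, b, train_x, train_y):.2f}")
print(f"out-of-sample RMSE: {rmse(a, b, test_x, test_y):.2f}")
```

The number to watch is the out-of-sample error, since that is the only one computed on data the fit has not already been tuned to.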

7 December 2012 at 12:34 pm

Typo in my post from 12:20…

Not only do I believe that X has an influence on Y, but I can observe X, and I cannot so easily observe Y. I will build a model that will predict Y based upon X.

7 December 2012 at 5:32 pm

A couple of points:

In Econometrics, we always make a complete statement of the regression model as you note, and draw a diagram or two showing how the estimated mean value of y changes when x changes, and its distributions at the various values of X. But of course this is not continually repeated, and no doubt the students do not fully internalize it.

You say:

“If the value of Y when X = x is at all uncertain (if σ > 0), then the model cannot be causal. X might move the bulk of Y—X might be an incomplete part of the causal chain—but there is something else (who knows what) causing Y to depart from y.”

In Econometrics, we try to have a theoretical reason that tells us the causal relationship between X and Y. E.g., for a demand function, quantity demanded depends on price, income, and other specified variables affecting decision making. Do you not mean that the regression captures the estimated change in the MEAN of Y when X changes? It can still be causal if the error term reflects only random influences and if we are focused on mean responses. Further of course, we specify that the variance of Y is constant throughout the relevant and measured range of X. If it is not, then we acknowledge that a different specification is needed, one that also captures the systematic changes in the error term.

7 December 2012 at 6:00 pm

Matt, thanks for stating clearly that regression is about predicting the uncertainty of Y based on the given value of X. Regression is just one more descriptive modeling tool that can be applied to any kind of data, including completely fictional data, so of course it cannot say anything about causality regarding X and Y. For causality determination, you have to understand the biochemistry, physics, or insurance law etc. underlying the data, as Matt pointed out in one of his replies above.

Regarding Ray’s remarks above, regression can indeed be seen as an optimization process, where some measure of the spread of Y around the regression curve is to be minimized. In least-squares regression, it is the average variance of Y over all values of X that is being minimized. Other cost functions are considered by statisticians for regression as well.

Notwithstanding the statistician’s disapproval of ‘random’ in ‘random variable’, mathematicians define ‘random variable’ to mean ‘a measurable function from a probability space into the real numbers’. The ‘random’ part refers to the probability measure on the domain of the function — it does not have anything to do with mysticism or folk notions of randomness, just some very lovely math.

7 December 2012 at 6:38 pm

you and giles and pearl are all an example of what is wrong with math and statistics and economics.

It’s a simple linear equation; after you get it, you have to do some experiment to show the mechanisms that explain the relation.

This idea that you can get something valuable by data dredging is why econ and stats are not important in the real world

less math, more doing

13 December 2012 at 2:21 pm

All monotonic time series that occur at the same time are correlated, positively or negatively. If you allow time shifting (e.g. temperature rises before CO2) or time scaling (retail prices of gasoline fall slower than they rise in response to crude oil prices) then (roughly speaking) anything is correlated with anything. The notion that causality is involved in any way by data manipulation is ridiculous.

Regards,

Bill Drissel

Grand Prairie, TX
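The point about monotonic series is easy to illustrate with a Python sketch. The three series below are invented and have no causal connection to each other; they only share the property of moving steadily in one direction over the same period.

```python
def correlation(a, b):
    """Pearson correlation computed from first principles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    return cov / (sa * sb)

years = range(30)
# Hypothetical, causally unrelated series that all happen to be monotonic.
series_a = [100 + 3 * t for t in years]          # steady linear growth
series_b = [5 * 1.04 ** t for t in years]        # 4% compound growth
series_c = [50 - 0.5 * t ** 1.5 for t in years]  # steady decline

print(f"corr(a, b) = {correlation(series_a, series_b):.3f}")
print(f"corr(a, c) = {correlation(series_a, series_c):.3f}")
```

Both correlations come out strong (one positive, one negative) even though nothing connects the series except the passage of time.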

15 December 2012 at 4:02 pm

WMB,

Given the paper you’re explicitly referring to makes a point to review major statistics textbooks, I am very curious to know how you weigh in on that debate. Is there a textbook you use or recommend that would be fitting for the social sciences? Thanks in advance.

18 December 2012 at 1:39 am

Dear William,

A colleague brought your blog to my attention, together with your post on my paper with Bryant Chen, which reviews six econometric textbooks, see http://ftp.cs.ucla.edu/pub/stat_ser/r395.pdf

You say that “The equation Y = beta x + epsilon is WRONG,” “and in a sad way, too.” because it is a shorthand for the full-blown bi-variate distribution of X and Y.

I think that the difference between your interpretation of the equation and the interpretation that economists attribute to it is much deeper than short-hand versus full specification.

Economists, since the time of Haavelmo (1943) have taken the structural equation Y = beta x + epsilon to mean something totally different from regression, and this something has nothing to do with the distribution of X and Y. And I literally mean NOTHING; structural equations in economics are distinct mathematical objects that convey totally different information about the population and, in general, do not even constrain the regression equation describing the same population.

I discuss it in this paper: http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf

I am curious to know if Haavelmo’s distinction is common knowledge or comes as a surprise to readers of your blog.

18 December 2012 at 10:46 am

judea pearl,

Thanks very much for stopping by! Also for the link to the paper, which I’ll take a look at. Though sir, if the model means “NOTHING” then why use it? I take it you mean “NOTHING” in some different sense? I’ll have to see the paper, of course.

18 December 2012 at 5:37 pm

William,

The model says NOTHING about the joint distribution, but it says a lot about the phenomenon that underlies the distribution. It so happens that policy makers are more interested in the latter.

18 December 2012 at 6:07 pm

judea pearl,

The joint “distribution” is another thing which is, I agree, not of much interest. And of course I agree completely that the model quantifies uncertainty in the Y, given the X, old observations and assuming the model true, and how that uncertainty in Y changes when X (or the Xs) change. Whether this quantification accords with reality is, as you know much better than I, a different matter.