The Biggest Error In Regression


I know, I know: we’re sick to death of regression, but we have to cover the biggest error, which is how the Deadly Sin of Reification happens.

You have a “y”, an outcome, some number the uncertainty in which you want to quantify with a normal distribution. We do not say “y” is “normally distributed.” That’s the beginning of the sin. It is our uncertainty in the “y”, and not the “y” itself, that the normal represents.

The normal has two parameters, a central m and spread s. The central parameter m is modeled as a function of “x”s, like this:

     m = b0 + b1x1 + … + bpxp.

Now most textbooks get this equation right. And then immediately forget it. Almost directly after the equation is introduced, the author—and perforce, the students—substitute this:

     y = b0 + b1x1 + … + bpxp.

The central parameter is forgotten and the equation is said to represent the observable itself. The model has become the reality. The Sin has been committed, because why? Because if this second equation holds, it is causative. It says, “When I change xp by such-and-such amount, y itself is caused to change by bp.”

This is false. This is not so. This is wrong. This is foolish. Plus, it’s not right.

Probability models are not causal. They only represent uncertainty. They do not say what happens to the “y”, only what happens to our uncertainty in the “y”.
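
As a minimal sketch of that distinction, the snippet below uses Python's standard-library `statistics.NormalDist`. The coefficients, the spread, and the value of “x” are made-up numbers for illustration, not estimates from any data. The first equation hands back a whole distribution for “y”; the reified second equation pretends to hand back “y” itself.

```python
from statistics import NormalDist

# Hypothetical values, chosen only for illustration.
b0, b1 = 2.0, 0.5   # coefficients in m = b0 + b1*x
s = 1.5             # spread parameter of the normal
x = 4.0             # one observed "x"

# First equation: x fixes the central parameter m, nothing more.
m = b0 + b1 * x
uncertainty = NormalDist(mu=m, sigma=s)

# What the model actually delivers: a range of plausible "y" values.
low, high = uncertainty.inv_cdf(0.05), uncertainty.inv_cdf(0.95)
print(f"m = {m:.2f}; 90% of the probability for y lies in ({low:.2f}, {high:.2f})")

# The reified reading (second equation) would claim y equals m.
# Under the model, the probability that y lands within 0.01 of m is small.
p_near_m = uncertainty.cdf(m + 0.01) - uncertainty.cdf(m - 0.01)
print(f"Pr(|y - m| < 0.01) = {p_near_m:.4f}")
```

Nothing in the output says what “y” will be, or what made it so. It only says where, given “x” and the model, we should expect to find it.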

The second equation is often given in the form you see it; I mean, it is written out like that. But often it is given just in the words. Suppose “y” is “white blood count” and I want to quantify my uncertainty in its value using a normal distribution. I should use the first equation. One of the “x”s might be a person’s “body temperature.”

If it turns out the “b” associated with “body temperature” is positive, it means that, given the other “x”s, the central parameter of “white blood count” increases by “b”—if it is 100% certain “b” is positive. Of that, we typically don’t know. But that is a problem for another day. For now, assume we do know “b” with certainty. In this case, we now say the probability for higher “white blood counts” has increased.
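
To put numbers on “the probability for higher white blood counts has increased”, here is a hedged sketch. Every figure in it (the coefficients, the spread, the two temperatures, the threshold) is invented for illustration and has no clinical meaning; the “b” is assumed known with certainty, as in the paragraph above.

```python
from statistics import NormalDist

# Invented numbers; not clinical values.
b0, b_temp = -30.0, 1.0   # m = b0 + b_temp * temperature, "b" assumed known
s = 2.0                   # spread of the normal
threshold = 9.0           # a "high" white blood count, hypothetical units

def prob_count_above(temperature: float) -> float:
    """Pr(white blood count > threshold | temperature, model)."""
    m = b0 + b_temp * temperature
    return 1.0 - NormalDist(mu=m, sigma=s).cdf(threshold)

p_37 = prob_count_above(37.0)
p_38 = prob_count_above(38.0)
print(f"Pr(count > {threshold} | temp 37.0) = {p_37:.3f}")
print(f"Pr(count > {threshold} | temp 38.0) = {p_38:.3f}")
```

The only statement the model supports is the comparison of the two probabilities. It is silent on whether any one person's count would rise if that person were warmed up.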

Yet if I were to follow standard procedure, I would immediately say that higher “body temperatures” cause higher “white blood counts.” I would say (wrongly), “Increasing body temperature drives up white blood counts.” I might add the hopeful escape clause “on average”, in an attempt to alleviate the Sin. This does not fix the mistake. The reification has happened, and cannot be removed.

Our sample data includes many individuals, some with high “body temperatures”, some with low; some with high “white blood counts”, some with low. What caused each individual’s actual “body temperature” and “white blood count”? Too many things to count. A body temperature is a complex configuration of bones, muscles, blood, energy use, and on and on. Same thing for “white blood count.” There may be some disease or diseases or malfunctions in some of the people which, through various mechanisms, cause higher “white blood counts.” But “body temperature” is not likely to be one of these causes.
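
A toy simulation can make the point concrete. In the sketch below everything is invented: a hypothetical unobserved illness raises both “body temperature” and “white blood count”, and temperature never enters the formula for the count. The function names and all the numbers are assumptions for illustration only. The regression still finds a clearly positive “b”, yet setting temperature by hand leaves the counts where they were.

```python
import random

random.seed(1)

def simulate_person(forced_temp=None):
    """One hypothetical person; a hidden illness raises both measurements."""
    ill = random.random() < 0.3
    temp = 37.0 + (1.0 if ill else 0.0) + random.gauss(0, 0.3)
    if forced_temp is not None:
        temp = forced_temp  # intervention: temperature is set by hand
    # Temperature does not appear here: it is not a cause of the count.
    wbc = 7.0 + (4.0 if ill else 0.0) + random.gauss(0, 1.0)
    return temp, wbc

def slope(pairs):
    """Least-squares b1 for the central parameter m = b0 + b1*x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

# Observational data: just watch people and fit the regression.
observed = [simulate_person() for _ in range(5000)]
b_observed = slope(observed)

# Experiment: force temperature to 37 or 38 and compare average counts.
cool = [simulate_person(forced_temp=37.0)[1] for _ in range(5000)]
warm = [simulate_person(forced_temp=38.0)[1] for _ in range(5000)]
shift = sum(warm) / len(warm) - sum(cool) / len(cool)

print(f"fitted b for temperature: {b_observed:.2f}")
print(f"change in average count when temperature is forced up: {shift:.2f}")
```

The fitted “b” is useful for quantifying uncertainty in the count of a person whose temperature we have observed. It says nothing about what would happen to that count if the temperature were changed.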

The Reification continues in ascribing a theory which “explains” the positive “b”. This theory may be true, or likely true, and sometimes even is, especially in situations where a highly controlled experiment has been run, where every possible (known) causal factor has been accounted for.

But usually the theory is a raw guess.

Take this all-too-typical headline: “Restaurant rage: Living in an area with lots of fast food stores can make you impatient and unable to savour things, researchers warn”.

Researchers recruited people online, asked them in which zip code they lived, looked up in a book how many fast food restaurants were in that zip code (ignoring that the land area of zip codes varies dramatically), asked the participants some questions, and ran a model which showed the answers to the questions and the number of restaurants (actually, the ratio of fast food to “normal” restaurants) were “correlated.”

The researchers immediately committed the Deadly Sin of Reification. They said their work shows “that as pervasive symbols of impatience, fast food can inhibit savoring, producing negative consequences for how we experience pleasurable events.”

Now this theory might be true, but in no way has it been proven, or even close to proven. It is also absurd on its face. All other possible explanations of the data were denied, as felt natural after the Deadly Sin had been committed.

Typos are a two-for-one special today only.


  1. I saw something similar yesterday which appears to be another misuse:

    They “discovered” that autism is caused by someone spraying pesticides within a certain kilometer radius. Never mind that the populations of those radii may be different…

  2. So, when can we say “x” caused “y”? Never?

    Being shot in the head often, but not always (and maybe not even on average), leads to death. It certainly seems to be one of the causes of death so can’t we say it is a cause? Would that be Reification?

  3. I think it would be better to describe it in terms of a random vector (X_1,..,X_p,Y) assuming that E(Y | X_1=x_1,…X_p=x_p) = b_0 + b_1x_1 + … + b_px_p,
    V(Y | X_1=x_1,…X_p=x_p) = s and that the conditional probability density function of Y given X_1=x_1,…X_p=x_p is Normal.

  4. Way back when I did this kind of stuff, the model itself was often a linear, or linearized, relation between x’s and y. The assumption was that the measurement errors had a normal distribution, which was reasonable because most distributions look like the Normal distribution when you are close to the average.

    So to me the y = sum bi. xi doesn’t look particularly weird.

  5. I thought the equation yielded m only when the mean values of the X’s were used — provided the equation was linear. Perhaps in industry, having no scientists to confuse matters, we always settled the correlation/cause business by deliberately changing the key X several times and observing the response. For example, casual observation once noted that conversion presses experiencing many feed jams were fed by shell presses with lower temperatures on their coating guns. The correlation was tested by raising the temperatures on the “cold” guns. The jams downstream decreased. But the superintendent, having been schooled in random variation knew to continue for several days, to see if it was a fluke. (Even “bad” presses had had “good” days.) Then he told the lead operator to lower the temperature on the gun. “But the jams will come back,” said that worthy. “I hope so,” said the supervisor, “because if they don’t, it means raising the temperature is not what made them go away.” They did so, and the jams increased. They ran for a while that way. (“But, boss, we’re getting all these jams!” “How long did we have the jamming problem?” “Six months!” “What will a couple more days matter?”) Then they raised the temperature a second time, and the jams decreased a second time. My old boss Ed Schrock used to say, “Turn the problem on and off several times to be sure you’ve identified a causal factor.”

  6. If it is okay to say:
    m = b0 + b1x1 + … + bpxp
    But a sin to say:
    y = b0 + b1x1 + … + bpxp

    Can we say…
    y = m + z
    Where z is the net effect of every factor on y that has not been captured by our model?
    And would it be a stretch to call z the error of our model?

  7. Does anybody do probability studies of what causes a piston, in a normally functioning internal combustion engine, to go down?

  8. @Jersey McJones

    International readership here. What’s a Republican? Are they important?

  9. Doug M and Sander van der Wal,

    Mathematical theories dictate that, in the context of basic linear regression with the assumption of normality, the model can be written as such an additive error equation. With proper diagnostic tools, one can check whether a normal assumption is appropriate.

    However, in more advanced generalized linear models, the likelihood of P(Y|X,\beta) is postulated directly based on the data structures. ( \beta is a parameter vector… just trying to stick to the tradition of using Greek letters to represent parameters.)

  10. Mr. Briggs,

    If it turns out the “b” associated with “body temperature” is positive, it means that, given the other “x”s, the central parameter of “white blood count” increases by “b”—if it is 100% certain “b” is positive.

    Using your notation and in the context of normal linear regression model, the specific coefficient “b” associated with “body temperature” indicates that, holding all other Xs at fixed values, the linear predictor or regression function m changes by “b” when the “body temperature” changes by an additional unit.

    Simple math.

    For example, m = 1 + 2*x1 - 1*x2. The coefficient “b” associated with x1 is +2. Fixing x2 at a given value, the regression function m changes by +2 (i.e., increases by 2) for each unit increase in x1.

    Not a hell-going mistake though.

  11. Now most textbooks get this equation right. And then immediately forget it. Almost directly after the equation is introduced, the author—and perforce, the students—substitute this:

    y = b0 + b1x1 + … + bpxp.

    I found some examples in some posts and comments here. I have quite a few textbooks, and I couldn’t find one with such a grave error. So, examples of such textbooks, please.
