Regression Examples. Normal Distributions Often Stink. Ithaca Teaching Journal, Day 10

Today, just one example, and the simplest kind, to show you that using regular regression, with its assumptions of “normality”, can quickly lead to absurdities—absurdities which will pass unnoticed using classical methods.

This is a real example, using real data, an actual regression used operationally. The data could not be simpler: time to a certain event for one of five departments (here labeled A – E; the source of the data is confidential). Times to events, of course, must be greater than 0. The question is whether different departments have different times; the hope was that the two presented below would not.

This example is typical of one carried out everywhere, every day. I am not justifying the use of regression here (better models exist), but I am saying what would happen if it were used, as it was. Similar models are used routinely.

An examination of the boxplots of the times for each department would show roughly “normal” data. There is nothing untoward in the data (to the eye of the typical user). This is classical analysis of variance, a.k.a. ANOVA, which is equivalent to linear regression. Here’s the table of coefficient estimates from the R glm() function:

             Estimate  Std. Error   t value   Pr(>|t|)
(Intercept)   3.4e+01     9.7e-01   3.5e+01   5.9e-131
DepartmentB  -2.6e+01     3.5e+00  -7.3e+00    1.8e-12
DepartmentC  -8.4e+00     1.6e+00  -5.2e+00    3.3e-07
DepartmentD  -3.1e+01     5.9e+00  -5.1e+00    4.2e-07
DepartmentE  -3.1e+01     3.8e+00  -8.1e+00    5.6e-15
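
To read the table, it may help to turn the coefficients back into fitted mean times. The sketch below is in Python rather than R, and it assumes the default treatment coding with department A as the reference level (the table has no DepartmentA row, which is consistent with that, but the post does not say so). The variable names are mine, and the inputs are the rounded values printed above.

```python
# Coefficients as printed in the table (rounded to two significant figures).
intercept = 34.0  # fitted mean time for the assumed reference department, A
offsets = {"B": -26.0, "C": -8.4, "D": -31.0, "E": -31.0}

# Under treatment (dummy) coding, each department's fitted mean is the
# intercept plus that department's coefficient.
fitted_means = {"A": intercept}
for dept, offset in offsets.items():
    fitted_means[dept] = intercept + offset

for dept, mean in fitted_means.items():
    print(f"Department {dept}: fitted mean time = {mean:.1f}")
```

Under that assumption, each small p-value says only that the department's fitted mean differs from department A's, not that any two other departments differ from each other.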

We needn’t say much about this except that each p-value is pleasingly small, i.e. publishable, i.e. less than the magic number. Null hypotheses aplenty were rejected. The person writing up the results said that there were “statistically significant differences” between the departments. That there were differences we already knew: for one, the departments are different! For another, we could have just looked at the data we took and seen there were differences. We didn’t need a model for that.

Now look at this:

[Figure: Predictive Posterior]

The first five plots are the posterior distributions of the parameters of the model. Each has a probability near 1 that the parameter is far from 0, i.e. high probability that the parameter “belongs” in the model. Thus, whether frequentist or Bayesian, one would say that, yes, clear differences in “mean” times could be seen between departments.

However, gaze at the bottom right picture. It shows two distributions (actually “densities”) for our uncertainty in future times for the departments B (solid line) and C (dashed line) given the old data and assuming the model we used is true.

Assuming a true model and old data, we can calculate the probability that future (not yet seen) values of time in department C will be greater than times in department B: this probability is 78%. If our uncertainty in the values of times for both departments were the same, the probability that times in department C would be greater than times in department B would be 50%. Thus, just like the classical analysis, the modern would indicate that knowledge of department is relevant to our knowledge of the uncertainty in times. So all is well.
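
A rough plug-in version of this calculation can be sketched in Python. It is not the posterior predictive calculation used for the picture: it ignores uncertainty in the parameters, and the common standard deviation (assumed_sd) is a hypothetical value, since the post does not report one. The means come from the rounded table for departments C and B (the two shown in the picture), assuming A is the reference level, so the result is not expected to reproduce the 78%.

```python
from math import sqrt
from statistics import NormalDist


def prob_first_exceeds_second(mean_1, mean_2, sd):
    """P(X1 > X2) for independent normals that share one standard deviation."""
    # X1 - X2 is normal with mean (mean_1 - mean_2) and std dev sd * sqrt(2).
    difference = NormalDist(mu=mean_1 - mean_2, sigma=sd * sqrt(2))
    return 1.0 - difference.cdf(0.0)


mean_c = 34.0 - 8.4   # intercept + DepartmentC coefficient (rounded table values)
mean_b = 34.0 - 26.0  # intercept + DepartmentB coefficient (rounded table values)
assumed_sd = 30.0     # HYPOTHETICAL: the post does not give the residual spread

p_c_gt_b = prob_first_exceeds_second(mean_c, mean_b, assumed_sd)
p_same = prob_first_exceeds_second(mean_b, mean_b, assumed_sd)

print(f"P(C > B) under the assumed normal model: {p_c_gt_b:.2f}")
print(f"P(C > B) if both departments were alike:  {p_same:.2f}")
```

The second figure is the 50% baseline mentioned above: with identical uncertainty for both departments, neither is more likely to produce the larger time.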

Except take a closer look. Our uncertainty in future values of times in department B indicates there is about a 40% chance for times less than 0, i.e. values which are absurd. This is a huge error, but one commonly seen.
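
The size of this error is easy to sketch. Any normal distribution puts some probability on negative values, and the share grows as the spread gets large relative to the mean. The Python snippet below uses department B's fitted mean from the rounded table (assuming A is the reference level) and a few hypothetical standard deviations. The post does not report the actual spread, so these figures illustrate the mechanism and are not a reconstruction of the 40%.

```python
from statistics import NormalDist

mean_b = 34.0 - 26.0  # intercept + DepartmentB coefficient (rounded table values)

# HYPOTHETICAL spreads: the post does not report the residual standard deviation.
assumed_sds = [5.0, 15.0, 30.0]

prob_negative = {}
for sd in assumed_sds:
    # Probability the normal model assigns to impossible (negative) times.
    prob_negative[sd] = NormalDist(mu=mean_b, sigma=sd).cdf(0.0)
    print(f"sd = {sd:>4}: P(time < 0) = {prob_negative[sd]:.3f}")
```

None of these probabilities is ever exactly zero, which is the point: the normal model cannot respect the constraint that times are positive.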

The fault lies in assuming the times were “normal.” They were not. No thing is. What we assumed was that our uncertainty in the times could be characterized (or quantified) by a normal distribution, the central parameter of which was allowed to vary between departments: that’s our model. A logical consequence of this assumption is that bottom-right picture.

Understand, what that picture shows is true assuming our premises, i.e. our model, are true. We cannot learn from that picture that the model is absurd unless we look outside the model, as we did when we recalled that times less than 0 are impossible.

You might object, “Look, we wanted to say whether there were differences. We have both small p-values and high posterior probabilities. So what if the predictive distribution is ridiculous? I have what I wanted.”

But those p-values and posteriors are also conditional on the model’s truth. Since, upon accepting the premise that times less than 0 are impossible, we know the model is false and not even close to a good approximation, we have good evidence that those p-values and posteriors are also bad, i.e. misleading, i.e. false. You are too sure of your conclusions.

Besides, if all you wanted was to say there were differences, all you had to do was look: yes, the box plots of times for C and B were different. Accepting this, the probability they were different was 1, i.e. 100%, i.e. it is true there were differences. What better evidence could you want?

“But how could I know whether those differences arose by chance? That’s why I had to use regression. P-values and posteriors can tell me whether the differences I saw were real or were due to chance.”

Did you think the differences you saw were unreal? Something caused the differences we saw. What? “Chance” isn’t alive, Chance isn’t a mystical entity, small-c-chance isn’t a cause. But if we could identify the actual causes, then, once again, we would know with certainty not only there were differences but why those differences arose. It’s only because we don’t know the causes that we had to resort to characterizing our uncertainty using a probability model.

Since we did see differences, the only question is whether those differences will persist (if we continue observing data). And we can’t know that unless we characterize our uncertainty in the times. We did this using regression, where we allowed the central parameter of a normal distribution to vary based on department. Assuming this model is true, we believe there is a 78% chance the differences would persist; but we also had solid evidence that the model is false. What we should do is begin again and better characterize our uncertainty in the times, i.e. come up with a better model. And that is all we can say with sufficient certainty for now.

Incidentally, I did not have to ask what is the probability that times in C would be greater than B: I could have asked any question that was important to me. Anyway, everything I know about future values of C and B is shown in that picture.

This is only one example. I have by no means exhausted the ways in which the classical methods lead to over-certainty and mistakes.


  1. Anyway, everything I know about future values of C and B is shown in that picture.
    Are you happy saying you know something you’ve shown is absurd?

  2. World’s Worst Statistician: “Department C, we are [1 minus 1.8e-12]% times 78% sure that you are slackers. This warrants a 5% pay reduction. Department B, we are [1 minus 3.3e-7]% times 40% sure you can bend the laws of space and time. This warrants a 5% pay increase. Now get back to work, both of you!”

  3. In general I agree with what you’re saying, but to be fair, the classical approach you present here isn’t a very thoughtful one. The textbook approach would be to transform the data so that they look more “normal”: the fact that the data are positive is a red flag.

    One approach I’ve used successfully in this type of situation is a randomization test: the null hypothesis is that the distribution of times is the same for all departments. If H0 is true, we can randomly reshuffle the data many times, re-compute the parameter of interest (such as a contrast) for each sample, and thereby compute a P-value for the observed value of the parameter. This is simple to implement in R, generally easy for people to grasp, and usually has adequate power even for modest sample sizes.

  4. To be fair also, the simple point that poor modeling assumptions can lead to wrong conclusions, i.e., garbage-in-garbage-out, holds for both MLE/classical and Bayesian methods.

  5. I am having a tough time understanding the problem. If I were to see the data, it would be obvious as to why the normal assumption is a bad one. But, if the data is confidential, we are at an impasse.

    If non-positive t is an impossibility, my first thought would be to model uncertainty as log-normally distributed. That may not be a valid assumption, but the math is easy.

  6. All who commented today,

    Please come back tomorrow (you too, JH; especially you Rich and Jeff).

  7. I only come here for the humorous stuff, not the stats.

    But as Mann continues to show by using upside down Tiljander data series, the sign doesn’t matter.


  8. Doug M,

    You’re of course welcome to try out this example on your own data; the code is on the “book” portion of the site. How to do it is in my class notes (which are free).
