# Statistical Models CANNOT Show Cause, But EVERYBODY Thinks They Can. Hence the Replication CRISIS

*Please pass this on to ANY researcher who uses statistics. Pretty please. With sugar on top. Like I say below, it’s far far far far far past time to cease using statistics to “prove” cause. Statistical methods are killing science. Notice the CAPITALIZED words in the title to show how SERIOUS I am.*

Statistical models cannot discern or judge cause, but everybody who uses statistics thinks models can. I *prove*—where by *prove* I mean *prove*—this inability in my new book *Uncertainty: The Soul of Modeling, Probability & Statistics*, but here I hope to demonstrate it to you (or intrigue you) in short form using an example from the paper “Emotional Judges and Unlucky Juveniles” by Ozkan Eren and Naci Mocan at the National Bureau of Economic Research.

Now everybody—*rightfully*—giggles when shown plots of spurious correlations, like those at Tyler Virgen’s site. A favorite is per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets. Two entangled lines on a deadly increase! Perhaps even more worrying is the positive correlation between the number of letters in the winning words at the Scripps National Spelling Bee and the number of people killed by venomous spiders. *V-e-n-o-m-o-u-s*: 8.

All of Virgen’s silly graphs are “statistically significant”; i.e. they evince wee p-values. Therefore, if statistical models show cause, or point to “links”, then *all* of his graphs *must*—as in *must*—warn of real phenomena. Read that again. And again.

All of Virgen’s correlations would give large Bayes Factors. Therefore, if Bayesian statistical methods show cause, or point to “links”, then all of these graphs *must*—I must insist on *must*—prove real links or actual cause.

All of Virgen’s data would even make high-probability predictions, using the kind of predictive statistical methods I recommend, or using any “machine learning” algorithm. Therefore, if predictive or “machine learning” methods show cause, or point to “links”, then all of his graphs *must*—pay attention to *must*—prove cause or show real links.

I insist upon: *must*. If any kind of probability model shows cause or highlights links, then any “significant” finding *must* prove cause or links. Any data fed into a model which shows significance (or a large Bayes factor, or a high-probability prediction) *must* be identifying real causes.

Since that conclusion is true given the premises, and since the conclusion is absurd, there must be something wrong with the premises. And what is wrong is the assumption probability models can identify cause.

There is no way to know, using only sets of data and a probability model, if any cause is present. If you disagree, then you *must* insist that every one of Virgen’s examples is a true cause, previously unknown to science.

Here’s an even better example. Two pieces of data, Q and W, are given, and Q modeled on W, or vice versa, gives statistical joy, i.e. these two mystery data give wee p-values, large Bayes factors, high-probability predictions. Every statistician, *even without knowing* what Q and W are, *must* say Q causes W, or vice versa, or Q is “linked” to W, or vice versa. Do you see? *Do you see*? If not, hang on.
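To see how cheaply statistical joy can be had, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the series `q` and `w` are synthetic (they are not Virgen’s data), each is nothing more than an upward drift plus its own independent noise, and the permutation test simply stands in for whichever significance test one prefers.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of random re-orderings of ys that correlate with xs
    at least as strongly as the observed ordering does."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical mystery data: ten "yearly" values each, generated
# independently. The only thing they share is that both drift upward.
rng = random.Random(42)
years = range(10)
q = [10.0 + 0.4 * t + rng.gauss(0.0, 0.2) for t in years]
w = [300.0 + 50.0 * t + rng.gauss(0.0, 30.0) for t in years]

r = pearson_r(q, w)
p_value = permutation_p_value(q, w)
print(f"r = {r:.2f}, permutation p-value = {p_value:.4f}")
```

Nothing in the code knows, or could know, what `q` and `w` are. The correlation comes out strong and the p-value wee even though, by construction, neither series has anything to do with the other.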

How do we know Virgen’s examples are absurd? They pass every statistical test that says they are not. Just as the flood, the tsunami, the coronal mass ejection of papers coming out of academia pass every statistical test. There is no difference, in statistical handling, between Virgen’s examples and official certified scientific research. What gives?

Nothing. *Nothing* *gives*. It is nothing more than I have been saying: probability models cannot identify cause.

We *know* Virgen’s examples are absurd because knowledge of cause isn’t statistical. Knowledge of cause, and knowledge of lack of cause, is *outside* statistics. Knowledge of cause (and its lack) comes from identifying the nature or essence and powers of the elements under consideration. What nature and so on are, I don’t here have space to explain. But you come equipped with a vague notion, which is good enough here. The book *Uncertainty* goes into this at length.

You know the last paragraph is true because if presented with the statistical “significance” of Q and W no statistician or researcher would say there was cause *until* they knew what Q and W were.

The ability to tell a story about observed correlations is not enough to prove cause. We could easily invent a story about per capita cheese consumption and bedsheet death. We know this correlation isn’t a cause because we know the nature of cheese consumption, and we have some idea of the powers needed to strangle somebody with a sheet, and that the twain never meet. *Much* more than a story is needed.

Also, if we *know* Q causes W, or vice versa, or that Q is in W’s causal path, or vice versa, then it doesn’t matter what any statistical test says: Q still causes W, etc.

We’re finally at the paper. From the abstract:

Employing the universe of juvenile court decisions in a U.S. state between 1996 and 2012, we analyze the effects of emotional shocks associated with unexpected outcomes of football games played by a prominent college team in the state…We find that unexpected losses increase disposition (sentence) lengths assigned by judges during the week following the game. Unexpected wins, or losses that were expected to be close contests ex-ante, have no impact. The effects of these emotional shocks are asymmetrically borne by black defendants.

You read it right. Somehow all judges, whether they watch or care about college football games and point-spreads, let the outcomes of the football games, which they might not have watched or cared about, influence their sentencing, with women and children suffering most. Wait. No. Blacks suffered most. Never mind.

Wee p-values “confirmed” the *causative* effect or association, the authors claimed.

But it’s asinine. The researchers fed a bunch of weird data into an algorithm, got out wee p-values, and then told a long (57 pages!), complicated story, which convinced them (but not us) that they have found a cause.

What happened here happens *everywhere* and *everywhen*. It’s far far far far far past the time to dump classical statistics into the scrap heap of bad philosophy.

*I beg the pardon of the alert reader who pointed me to this article. I forgot who sent it to me.*

Say it, Brother Briggs!

I think that what you are saying is this: If x causes (or hinders) y, then it is reasonable for y to be positively (or negatively) correlated to x. But, as logic says, we cannot validly conclude the converse. What you are pointing out is that this inferential fallacy is very widespread.

If I am mistaken, please correct me.

“Knowledge of cause, and knowledge of lack of cause, is outside statistics.”

Well, yeah, but it’s also outside of any tool used in determining causality, including logic.

If you perform an experiment, all you’re doing is creating a new variable, which injects more data to analyze by observing the correlations between the variables. Looks like statistics to me.

Agreed trying to do this with only two variables is silly — especially when abusing a test that was meant to eliminate hypotheses instead of confirming them.

Seems to me that the problem is the misuse of the Hypothesis Test and not that statistics can’t be used in determining cause. Saying it can’t be used is shooting yourself in the foot.

I used to tell people that causality is a deterministic process and statistics deals with random processes, processes where there is more than one outcome and the outcome is unpredictable. Epidemiologists of course claim you can prove causality with statistics. They even have, so called, causal criteria which are mostly logical fallacies.

Can an a/b test identify cause?

You’ve misspelled VIGEN’s name throughout the article. I believe this is statistically significant.

Praise God, Hallelujah! Let me hear it again!

+Rob — a/b tests are both highly useful and highly wicked at the same time. They can be good, but only if you get very strong indications in one direction. 60/40 is not a strong indication. 99/1 might be.

A/B tests are stacked on each other. Find fine enough a grain and you may be able to keep the bias out, but the grain is never fine enough because individuals are never quite grains.

It doesn’t take too many 99/1 A/B tests to end up with 1 in 10 people getting bad results.

.99^n = .9 ==> log .99^n = log .9 ==> n log .99 = log .9 ==> n = log .9/log .99

=> 11 A/B tests.

Which absolutely proves nothing because bad result is sort of badly defined here.
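For anyone who wants to check the arithmetic above, a small Python sketch follows. It assumes, as the calculation in the comment implicitly does, that the stacked tests are independent, so that the chance of escaping a bad result in all of them is simply .99 multiplied by itself n times. The variable names are mine, not the commenter’s.

```python
import math

success_rate = 0.99  # each test gets it right for 99 people in 100
target = 0.90        # level at which 1 person in 10 has had a bad result

# Solve success_rate ** n = target for n, then round up to whole tests.
n_exact = math.log(target) / math.log(success_rate)
n_tests = math.ceil(n_exact)

print(f"exact n = {n_exact:.2f}, so {n_tests} stacked tests")
```

The exact solution is a little under 10.5, which is why it takes 11 whole tests to fall below the 90% mark. Ten tests still leave the figure slightly above it.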

“Epidemiologists of course claim you can prove causality with statistics. They even have, so called, causal criteria which are mostly logical fallacies.”

– Ray, above

THAT is a good example of the sweeping generalization fallacy (“on steroids”). The above quote is false (along with [for the most part] Briggs’ title to this essay).

Anyone reviewing any of a number of credible references on the topic of epidemiology (presumably the Bradford-Hill criteria) discovers with minimal effort that the “criteria” are not criteria at all, merely guidelines around which further analysis should be conducted (the same is observed in other disciplines). For example:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1513293/

http://www.healthknowledge.org.uk/e-learning/epidemiology/practitioners/causation-epidemiology-association-causation

http://ocw.jhsph.edu/courses/fundepiii/PDFs/Lecture19.pdf

http://www.who.int/bulletin/volumes/83/10/792.pdf

http://www.meatinstitute.org/index.php?ht=a/GetDocumentAction/i/11058

“Statistical methods CANNOT establish proof of a causal relationship in an association. The causal significance of an association is a matter of judgment which goes beyond any statement of statistical probability. To judge or evaluate the causal significance of the association between the attribute or agent and the disease, or effect upon health, a number of criteria must be utilized, no one of which is an all-sufficient basis for judgment.” [EMPHASIS added]

— 1964 Surgeon General Report

The issue — the real issue — is that some researchers either don’t, or won’t, apply statistics correctly, often because they have an incentive to abuse & lie about results (not that they don’t know the tools can’t be properly applied to identify cause(s)).

Academic “publish or perish” is one such motivator, along with a dearth of controls to prevent abuse…which, by creating a situation susceptible to abuse, will further encourage those with dubious moral character to exploit the situation out of opportunistic self-interest.

The real underlying issue is integrity — a lack of integrity enabled by structural deficiencies that make abuse easy.

By asserting a cause-effect linkage (ignorance of the proper use of statistical tools leads to conclusions of cause that the tools do not really provide), Briggs asserts a solution (education about how to apply the tools correctly). This is ironic because, in addressing a cause-effect fallacy, his problem/solution is itself based on the same kind of cause-effect fallacy:

Given the overwhelming documentation in fields such as epidemiology establishing, at great length, that the statistical tool cannot provide the conclusion of cause, the problem of mis-attribution of causes is clearly not based on a lack of education. The limits of the tool are discussed at great length, everywhere. They’re hard to miss.

While it’s likely some are still somehow not getting that message and are genuinely ignorant about how much causality statistical tools provide, the dominant issue stems from similarly well-known integrity issues and structural shortcomings known to facilitate abuse.

Note that sweeping generalizations such as the title and mindset Briggs applied, and those made by Ray, address a legitimate issue while at the same time stifling critical logical thinking about other fundamentally relevant, arguably much more relevant, matters.

DAV: A lot of statistical “proofs” involve only two variables. In Virgen’s examples, there is still a null hypothesis—that A does not cause B. There was a wee p value found, and that is what is used to show “cause” in many, many areas of science.

The obvious answer is the researcher already suspects cause and uses that belief to choose the variables and get a significant result. People are bothered when the same technique is used on ideas we pretty much know are not causal. If you can’t interchange variables, there is a problem with the method. If a wee p-value shows the null hypothesis is to be rejected, it should work in all cases, not just ones that a person thinks it should. This indicates there is much more to the whole “experiment” than just the statistical wee p-value.

From the andrewgelman.com blog on this subject, a commenter wrote:

The p-value does not tell you if the result was due to chance. It tells you whether the results are consistent with being due to chance. That is not the same thing at all.

“A lot of statistical ‘proofs’ involve only two variables. In Virgen’s examples, there is still a null hypothesis—that A does not cause B.”

Well, yes and no. The idea behind the Hypothesis Test is that if X causes Y then X & Y must be correlated. No correlation means X cannot cause Y.

This was the original intent of the Hypothesis Test: to rule out nonviable hypotheses.

Somehow, because it was used to filter publications, perceptions about what it does and its purpose have changed. This is really what Briggs is talking about.

It’s literally impossible to determine causality by observing the interaction of only two variables.

The p-value in the Hypothesis Test tells you nothing about the hypothesis. What it is telling you is the likelihood that the model parameter value in question (usually the slope, which when non-zero indicates correlation) would not disappear in subsequent trials. I think you more or less said this but I repeat it here to be sure.

Finding a correlation, of course, only means you can’t rule out that X causes Y (or the reverse).

“The p-value does not tell you if the result was due to chance. It tells you whether the results are consistent with being due to chance. That is not the same thing at all.”

Yeah. Frequentists have to dance around calling it a probability. Then there’s that Chance guy meddling again.

A better way to look at it is that it is a measure of the likelihood the results are repeatable with a 1-p probability (assuming, of course, normally distributed tests). In most studies this is a ridiculous measure of quality.

“The idea behind the Hypothesis Test is that if X causes Y then X & Y must be correlated. No correlation means X cannot cause Y.”

Or else it means that your X data does not have a wide enough range.

Or else it means that there is a Lurking Third Variable Z.

I ran into an example in which there was no correlation between tablet weight and tablet potency, which defied chemistry and physics. The reason was that the tablets had been pressed from two different bulk batches, and the two batches had been mixed with slightly different amounts of the active ingredient. Hence, tablets of the same weight would have different potencies, depending on which batch they had been pressed from. Heavier tablets from one batch would be less potent than lighter tablets from the other.
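The tablet story can be imitated with a toy simulation. What follows is only a sketch under invented assumptions: the batch sizes, mean weights, concentrations and assay noise are all hypothetical numbers, chosen so that the within-batch and between-batch effects roughly cancel. None of them come from the actual case.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)


def press_batch(n, mean_weight_mg, concentration):
    """Hypothetical batch: potency is concentration times weight,
    plus a little assay noise."""
    weights = [rng.gauss(mean_weight_mg, 10.0) for _ in range(n)]
    potencies = [concentration * w + rng.gauss(0.0, 0.5) for w in weights]
    return weights, potencies


# Batch A: slightly heavier tablets, weaker mix.
# Batch B: slightly lighter tablets, stronger mix.
w_a, p_a = press_batch(2000, 502.3, 0.10)
w_b, p_b = press_batch(2000, 497.7, 0.12)

r_a = pearson_r(w_a, p_a)
r_b = pearson_r(w_b, p_b)
r_pooled = pearson_r(w_a + w_b, p_a + p_b)

print(f"batch A r = {r_a:.2f}, batch B r = {r_b:.2f}, pooled r = {r_pooled:.2f}")
```

Within each batch, weight and potency move together strongly, as chemistry says they should. Pool the two batches without recording which tablet came from which, and the correlation all but vanishes. The batch label is the lurking Z.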

There’s more to a two-sample t-test than running the numbers through a computer.

http://www1.cmc.edu/pages/faculty/MONeill/Math152/Handouts/Joiner.pdf

Ken,

To use as evidence official admonitions that probability cannot show cause as proof that people believe and act as if probability cannot show cause, is like Hillary pointing to the existence of federal election laws barring collusion as proof of her innocence. What people say and what they do are not always the same.

But I can see I have to say more, though I don’t want to leave it in a comment.

Hello,

I can agree that classical statistical methods, such as statistical hypothesis testing using P-values, cannot identify causes. It’s a straight up fallacy to say that because P(E|~H)P(E|~C). But we intuitively know that this Bayes factor isn’t good enough to establish the truth of P. But this example doesn’t show that probability models are flawed at discovering causes. Rather, we can use Bayes’ theorem to show precisely why the cheese example is absurd. It is absurd because the prior probability of C — P(C) — is very low. So P(C)/P(~C) << P(E|C)/P(E|~C). So Bayes factors are not enough. We also need to consider prior probabilities. Is this what you mean when you say we need to understand the powers of the elements under consideration? GIVEN what we know about the powers of cheese, we know that P(C) is low.
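The odds form of Bayes’ theorem behind this argument says that posterior odds equal prior odds times the Bayes factor. A rough sketch in Python, with numbers that are purely invented for illustration (nobody has measured the prior odds that cheese strangles people, and the function name is mine):

```python
def posterior_probability(prior_odds, bayes_factor):
    """Odds form of Bayes' theorem: posterior odds = prior odds * Bayes factor,
    converted back to a probability."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical inputs: a causal story judged a million-to-one against
# on background knowledge, and data that favour it a thousand to one.
prior_odds = 1e-6
bayes_factor = 1000.0

p_cause = posterior_probability(prior_odds, bayes_factor)
print(f"posterior probability of cause = {p_cause:.6f}")
```

Even a Bayes factor of a thousand leaves the posterior probability at roughly one in a thousand when the starting odds are that poor. Whether those starting odds can be written down as a number at all is a separate question.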

But surely statistics has at least something to say about the powers and characteristics of things. We gain our understanding of what sorts of things cheese can and cannot do through our experience with cheese. And experience is frequency data. Through our frequency data (i.e., experience) we discover the powers and characteristics of the things we encounter in our life. Of course, there are non-empirical factors that we also use when choosing between hypotheses (e.g., simplicity), but that doesn't mean that statistics plays no role in helping us identify causes.

I get the impression that p-values etc are not held in high esteem hereabouts. That they are widely mis-used and mis-understood does not strike me as sufficient reason to throw them out of the toolkit. On the other hand, the hours I spent getting my head round them, and getting them into a sensible perspective, make me feel quite attached to them.

Ron,

Nope. There is no such thing as “P(C)”. It doesn’t exist. No unconditional probability exists. You must have some P(C|E) given some evidence.

Now that’s closer, because we can have all kinds of evidence that says the causal connection is absurd, which indeed we do have. But that doesn’t mean P(C|E) is quantifiable.

It’s not that Bayes is broke (which is impossible), and indeed it is precisely P(C|E) we use to point out the absurdity. But if some researcher used a “flat” or “noninformative” (an impossibility) “prior” on some continuous parameter, he’d get a large Bayes factor. That’s the kind of thinking that is common.

The result is we must look outside statistics, not (technically) that we don’t use probability.

John Shade,

P-values have no good uses. Toss them on the scrap heap of Bad Science. (I prove this in the book, too.)

Briggs, you grievously misrepresented that story about the judges and the ball games. Of course people are affected by certain things in their environment, of course most all judges have higher educations and of course a large number of them would be college ball fans and of course their teams winning or losing could affect them. But that ridiculous “all” argument. Just had to go to the nth with it, huh?

You *can* get closer to cause with statistical models, and you certainly shouldn’t just ignore them out of hand. This whole thing just sounds like more of your pro-pollution apologetic, as pollution problems turn out to be really happening but we still can’t pin that simple-minded C on the entire system.

And once again, with your own arguments, you show that you should in no way be a religious man. It makes no sense, by your logic, for there to be a God. He is the ultimate wee p-value abuse.

JMJ

Read the posts and the comments, and as a statistical lightweight that has learned a lot at this site, I have one question (maybe I missed it). My background is more sciency and less statisticacy (neologism), and I was always taught that graphs need units.

What are the units on the “disposition length”? If percent, I’m thinking that if someone gets one day or two days (100%), whoops, so what? If absolute, I’m thinking that if someone gets one day or 100 days, then it is a big deal.

JMJ: Congratulations. I never would have believed anyone could take a discussion on statistics and turn it into a dissing of conservatism based on false premises. Reality really isn’t a realm you’ve even visited, is it?

(Once again, you apply logic to all aspects of life when no one ever said God existed based on logic—at least no rational person ever said that and I remember Briggs being clear on that point. Again, you’ve never even visited reality, have you? Yet you complain about the “fantasy” of religion. Check the mirror.)

DAV: I don’t see where p values have anything to do with correlation. Perhaps where there are multiple causes assumed, but in most cases I’ve seen, if there’s a significant p value involved, one factor is elevated to “the cause” (I think of this in medicine, where “smoking causes lung cancer” yet fewer than 10% of smokers actually get lung cancer. Obviously, correlation is lacking at what most would consider a significant level, but medicine calls it significant and declares a cause. Same for global warming.)

Yes, I am agreeing that the p-value tells one the likelihood that a model parameter would not disappear in a subsequent trial. However, that is not how p-values appear to be used anymore. Journal articles proclaim cause based on them and fail to redo the study many times to check for validity.

People seem very attached to the p-value. If it were being used only to show the likelihood a parameter would/would not disappear in subsequent trials, I can’t see why the attachment would exist. Perhaps the reason for removing it would be that people are simply too attached to the improper use to re-educate them. Removal of the p-value would be quicker and less painful in the long run than re-education.

This is silly. Methinks you don’t understand Bayesian factors. How do YOU know that the correlations are spurious? It’s not the model’s fault you withheld relevant info from it. The Bayesian methods obviously give results from the data you feed them, not from the relevant data you withhold from them.

(It’s not the serious capitals it’s the creeping italics you should be worried about.)

Those naughty judges with their gavels. Some RESEARCHERS are like small boys.

“Give a small boy a hammer, and he will find that everything he encounters needs pounding.”

True:

Two small boys + two hammers = Two dead chickens and one dead piano.

Ah poor chickens. I’m not laughing and the ivory was from real elephants. IT’S NOT FUNNY!

Vlad: That’s the point. The correlations apply only to the exact variables fed in. You don’t “withhold”, you miss some variables. And researchers are always trying to extrapolate beyond the EXACT sample they used in the study. In spite of Briggs’ statements to the contrary, this is what I was taught in statistics when in college (which, admittedly, was back in the days of rational thought and common sense). The result applies only to what you input. All extrapolation is done at your own risk and cannot be justified without a new study using the additional variables. Causality was not proven, only that the result occurred more often than “random chance” would predict.