Manipulating the Alpha Level Cannot Cure Significance Testing

Nothing can cure significance testing. Except a bullet to the p-value.

(That sound you heard was from readers pretending to swoon.)

The paper is out and official (and free!): “Manipulating the Alpha Level Cannot Cure Significance Testing”. I am one (and a minor one) of the (count ’em!) fifty-eight authors.

We argue that making accept/reject decisions on scientific hypotheses, including a recent call for changing the canonical alpha level from p = 0.05 to p = 0.005, is deleterious for the finding of new discoveries and the progress of science. Given that blanket and variable alpha levels both are problematic, it is sensible to dispense with significance testing altogether. There are alternatives that address study design and sample size much more directly than significance testing does; but none of the statistical tools should be taken as the new magic method giving clear-cut mechanical answers. Inference should not be based on single studies at all, but on cumulative evidence from multiple independent studies. When evaluating the strength of the evidence, we should consider, for example, auxiliary assumptions, the strength of the experimental design, and implications for applications. To boil all this down to a binary decision based on a p-value threshold of 0.05, 0.01, 0.005, or anything else, is not acceptable.

My friends, this is peer-reviewed, therefore according to everything we hear from our betters, you have no choice but to believe each and every word. Criticizing the work makes you a science denier. You will also be reported to the relevant authorities for your attitude if you dare cast any doubt.

I mean it. Peer review is everything, a guarantor of truth. Is it not?

Or do we allow the possibility of error? And, if we do, if we are allowed to question this article, are we not allowed to question every article? That sounds mighty close to Science heresy, so we’ll leave off and concentrate on the paper.

Now I am with my co-authors a lot of the way. Except, as regular readers know, I would impose my belief that null hypothesis significance testing be banished forevermore. Just as the “There is some good in p-values if properly used” folks would impose their belief that there is some good in p-values. Which there is not.

Another matter is “effect size”, which almost always means a statement about a point estimate of a parameter inside an ad hoc model. These are not plain-English effect sizes, which imply causality: how much effect x has on y. But statistical models can’t tell you that. They can, when used in a predictive sense, say how much the uncertainty of y changes when x does. So “effect size” is, or should be, thought of in an entirely probabilistic way.
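To make the predictive reading concrete, here is a minimal sketch, not anything from the paper. It assumes a plain normal linear model with a flat reference prior, made-up data, and a hypothetical helper called `predictive_prob`. Instead of reporting a slope, it estimates by simulation the probability that a new y exceeds some threshold of interest at two different values of x.

```python
import random


def predictive_prob(xs, ys, x_new, threshold, draws=20000, seed=0):
    """Monte Carlo estimate of Pr(y_new > threshold | x_new, data).

    Assumption (not from the post): y is normal around a straight line in x,
    with the reference prior p(intercept, slope, sigma^2) proportional to
    1/sigma^2. The function name and arguments are illustrative only.
    """
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid_ss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    dof = n - 2

    mean_new = intercept + slope * x_new
    # Extra spread from not knowing the line exactly, plus the noise itself.
    spread = 1 + 1 / n + (x_new - xbar) ** 2 / sxx

    hits = 0
    for _ in range(draws):
        # Posterior draw of sigma^2: resid_ss divided by a chi-square(dof) draw.
        chi2 = rng.gammavariate(dof / 2, 2)
        sigma2 = resid_ss / chi2
        # Draw a new observation given that sigma^2 (line uncertainty included).
        y_new = rng.gauss(mean_new, (sigma2 * spread) ** 0.5)
        if y_new > threshold:
            hits += 1
    return hits / draws


# Made-up data, purely for illustration.
data_rng = random.Random(1)
xs = [i / 3 for i in range(31)]
ys = [2 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

p_low = predictive_prob(xs, ys, x_new=2.0, threshold=5.0)
p_high = predictive_prob(xs, ys, x_new=8.0, threshold=5.0)
print(f"Pr(y > 5 | x = 2) is about {p_low:.2f}")
print(f"Pr(y > 5 | x = 8) is about {p_high:.2f}")
```

The two printed numbers are statements about an observable y, conditional on the model and the data, and they can be checked against future observations. Nothing here says x causes y; it only says how the uncertainty in y shifts when x does.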

The conclusion we can all agree with:

It seems appropriate to conclude with the basic issue that has been with us from the beginning. Should p-values and p-value thresholds, or any other statistical tool, be used as the main criterion for making publication decisions, or decisions on accepting or rejecting hypotheses? The mere fact that researchers are concerned with replication, however it is conceptualized, indicates an appreciation that single studies are rarely definitive and rarely justify a final decision. When evaluating the strength of the evidence, sophisticated researchers consider, in an admittedly subjective way, theoretical considerations such as scope, explanatory breadth, and predictive power; the worth of the auxiliary assumptions connecting nonobservational terms in theories to observational terms in empirical hypotheses; the strength of the experimental design; and implications for applications. To boil all this down to a binary decision based on a p-value threshold of 0.05, 0.01, 0.005, or anything else, is not acceptable.

Bonus: Disguising p-values as “magnitude-based inference” won’t help, either, as this amusing story details. Gist: some guys tout massaged p-values as innovation, are exposed as silly frauds, and cry victim, a cry which convinces some.

Moral: The best probability is probability, and not some ad hoc conflation of probability with decision, which is what all “hypothesis tests” are.

3 thoughts on “Manipulating the Alpha Level Cannot Cure Significance Testing”

  1. I am one (and a minor one) of the—Count ’em!—fifty-eight authors.

    Yeah, but number nine in the list is not so bad if ranking means anything.

    Or do we allow the possibility of error? And, if we do, if we are allowed to question this article, are we not allowed to question every article?

    And if we question every article, is that not an indictment of all peer review? And if we question all peer review, is that not a severe criticism of the whole scientific publishing enterprise? And if we condemn scientific publishing, is that not a condemnation of all Science? Well, sir, I am not going to let you condemn all of Science! 😉

  2. First, I got my laugh of the day with the comment about how to cure significance testing!

    I’m pleased to see this was published in a psych journal. Psychologists just love that p-value. Makes everything so sciency. Hopefully, a few medical researchers will also see it. I have a distinct dislike for what the p-value has done to science and how it has elevated pseudoscience. When so many things that were “significant” in test A turn out to be not so significant when the study is replicated, researchers should have wondered. Instead, they just stopped duplicating studies. I really miss science and reality sometimes…

    Great paper.

  3. Medical people, particularly the actual practitioners of the art, have long been aware of the limits and misuse of p-values. Some 3 decades ago, they started formally incorporating Bayesian inference. As Briggs says, even that has limits.

    The practicing medico has to make decisions, and the person the practicing medico is advising has to make decisions, and both are doing so under uncertainty. Unfortunately, the legal people didn’t get the memo. Let’s not bring up jury people, at least in the USA. Let me say that having been called for jury duty and gone through voir dire left me speechless.