Today’s title is lifted directly from the paper of E. J. Masicampo & Daniel R. Lalande, published in The Quarterly Journal of Experimental Psychology. The paper is here, and is free to download. The abstract says it all:
In null hypothesis significance testing (NHST), p values are judged relative to an arbitrary threshold for significance (.05). The present work examined whether that standard influences the distribution of p values reported in the psychology literature. We examined a large subset of papers from three highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals. We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
File our reaction under “Captain Renault, Shocked.” Regular readers will know that the cult of the p-value practically guarantees Masicampo & Lalande’s result (p < 0.0001). We will not be surprised if this finding is duplicated in other journals.
Here’s what happened: Our stalwart pair thumbed through back issues of three psychology journals and tabulated the appearance of 3,627 p-values, then plotted them:
Though perhaps hard to see in this snapshot, there are unexpected bumps in the distribution of p-values at the magic value, the value below which life is grand, the number above which there is weeping and gnashing of teeth. Understand that these are the p-values scattered throughout papers, and not just “the” p-values which “prove” that the authors’ preconceptions are “true,” i.e. the p-values of the main hypotheses.
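To make the bump concrete, here is a minimal sketch of the kind of tally one could run on a list of extracted p-values (not the authors’ actual analysis, and with purely invented numbers), comparing the bin just below .05 with its neighbours:

```python
# Hedged sketch only: `reported_p` stands for a hypothetical list of p-values
# pulled from papers; the 0.005-wide bins and the fake data are mine.
from collections import Counter
import random

def bin_index(p, width=0.005):
    return int(p // width)

def excess_below_alpha(reported_p, alpha=0.05, width=0.005):
    """Count of p-values in [alpha - width, alpha) versus the mean count
    in the three bins on either side of it."""
    counts = Counter(bin_index(p, width) for p in reported_p if p < 0.25)
    target = bin_index(alpha - width / 2, width)             # the [.045, .05) bin
    neighbours = [counts[target + k] for k in (-3, -2, -1, 1, 2, 3)]
    return counts[target], sum(neighbours) / len(neighbours)

random.seed(1)
fake_p = [random.uniform(0, 0.25) for _ in range(500)] + [0.048] * 25
print(excess_below_alpha(fake_p))   # the .045-.05 bin stands well above its neighbours
```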
Masicampo and Lalande rightly conclude:
This anomaly is consistent with the proposal that researchers, reviewers, and editors may place undue emphasis on statistical significance to determine the value of scientific results. Biases linked to achieving statistical significance appear to have a measurable impact on the research publication process.
The only thing wrong with the first sentence is the word “may”, which can be deleted; the only thing wrong with the second is “appear”, which should go the same way.
Why p-values? Why are they so beloved? Why, given their known flaws and their ease of abuse, are they tolerated? Well, they are a form of freedom. P-values make the decision for you: thinking is not necessary. A number less than the magic threshold is seen as conclusive, end of story. Plug your data into automatic software and out pops the answer, ready for publishing.
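To see how little thinking the ritual requires, here is a caricature of that workflow, with invented data and scipy’s two-sample t-test standing in for the “automatic software”:

```python
# A caricature of the "let the p-value decide" workflow; the data and
# group names are invented, and scipy is assumed to be installed.
from scipy import stats

treatment = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9]
control   = [4.7, 4.9, 4.6, 5.0, 4.8, 4.5]

t_stat, p_value = stats.ttest_ind(treatment, control)

# Thinking ends here: the arbitrary threshold does the deciding.
verdict = "significant: write it up" if p_value < 0.05 else "not significant: file drawer"
print(f"p = {p_value:.3f}, verdict: {verdict}")
```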
But this reliance “exposes an overemphasis on statistical significance, which statisticians have long argued is hurtful to the field (Cohen, 1994; Schmidt, 1996) due in part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009; Rozeboom, 1960).”
I left those references in so you can see that it is not just Yours Truly who despairs over the use of p-values. One of these references is “Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49, 997–1003.” This is a well-known paper, written by a non-statistician, which I encourage you to seek out and read.
The real finding is the subtle confirmation bias that seeps into peer-reviewed papers. The conscious-or-not shading of results in the direction of the authors’ hope. Everybody thinks confirmation bias happens to the other guy. Nobody can see his own fingers slip.
Everybody also assumes that the other fellows publishing papers are “basically honest.” And they are, basically. More or less.
Update: Reader Koene Van Dijk notes the paper is no longer available for free, but gives us the email addresses of the authors: masicaej@wfu.edu or lalande.danielr@gmail.com.
———————————————————-
Thanks to the many readers who sent this one in, including Dean Dardno, Bob Ludwick, Gary Boden; plus @medskep at Twitter from whom I first learnt of this paper.
Seems complaints about p-values are as effective as complaints about the weather.
DAV,
Ain’t it the truth?
In quality control practice, a single event beyond 2-sigma was not regarded as a trigger to investigate further. This was because, unlike scientific research, there was always more data coming when dealing with manufacturing processes. What did trigger an investigation was two out of three consecutive samples falling beyond 2-sigma on the same side. Unlike scientists, quality engineers did not declare the discovery of some new scientific law, but only a point at which it was economically worthwhile to investigate for an assignable cause.
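For readers who have not met it, here is a rough sketch of that two-out-of-three rule, one of the classic control-chart run rules; the process mean, sigma, and readings below are all invented:

```python
# Hedged sketch of the 2-of-3-beyond-2-sigma rule described above;
# limits and sample readings are made up for illustration.
def two_of_three_rule(samples, mean, sigma):
    """Return the index at which two of three consecutive samples fall
    beyond 2*sigma on the same side of the mean, else None."""
    for i in range(2, len(samples)):
        window = samples[i - 2:i + 1]
        high = sum(x > mean + 2 * sigma for x in window)
        low = sum(x < mean - 2 * sigma for x in window)
        if high >= 2 or low >= 2:
            return i        # worth looking for an assignable cause
    return None             # keep running; more data is always coming

readings = [10.1, 9.8, 10.0, 10.6, 10.7, 9.9, 10.2]
print(two_of_three_rule(readings, mean=10.0, sigma=0.25))   # flags index 4
```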
P-values can be useful when properly used. Sometimes you will detect a statistically significant difference—yet it will be too small to be of practical importance to your business. Sometimes you cannot claim a difference is statistically significant, yet the observed difference is of importance to your business.
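To put invented numbers to that distinction: with enough data a trivially small difference will clear the .05 bar, while a difference large enough to matter can fail to clear it on a small sample.

```python
# Illustration only, with made-up parameters; numpy and scipy assumed available.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Tiny difference (0.02 units), huge samples: p is usually tiny, effect is trivial.
a = rng.normal(100.00, 1.0, 50_000)
b = rng.normal(100.02, 1.0, 50_000)
print(stats.ttest_ind(a, b).pvalue)    # typically well below .05

# Large difference (2 units), tiny samples: p often lands above .05 anyway.
c = rng.normal(100.0, 3.0, 5)
d = rng.normal(102.0, 3.0, 5)
print(stats.ttest_ind(c, d).pvalue)    # often above .05
```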
The search for p values less than .05 is like the search for the holy grail. Obtain the “correct” p value and you receive publication. In academia, it’s publish or perish.