Nature magazine reports “‘One-size-fits-all’ threshold for P values under fire: Scientists hit back at a proposal to make it tougher to call findings statistically significant.”
Researchers are at odds over when to dub a discovery ‘significant’. In July, 72 researchers took aim at the P value, calling for a lower threshold for the popular but much-maligned statistic. In a response published on 18 September, a group of 88 researchers have responded, saying that a better solution would be to make academics justify their use of specific P values, rather than adopt another arbitrary threshold.
P values have been used as measures of significance for decades, but academics have become increasingly aware of their shortcomings and the potential for abuse. In 2015, one psychology journal banned P values entirely.
The statistic is used to test a ‘null hypothesis’, a default state positing that there is no relationship between the phenomena being measured. The smaller the P value, the less likely it is that the results are due to chance — presuming that the null hypothesis is true. Results have typically been deemed ‘statistically significant’ — and the null hypothesis dismissed — when P values are below 0.05.
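For readers who have never actually run one, here is a minimal sketch of the ritual being described, in Python with invented data and a hypothetical two-group comparison: compute a test statistic assuming the null hypothesis, read off the P value, and compare it to 0.05.

```python
# Minimal sketch of null hypothesis significance testing (invented data):
# two groups, a t-test of "no difference", and the customary 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, 40)   # hypothetical measurements, group A
group_b = rng.normal(11.0, 2.0, 40)   # hypothetical measurements, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
verdict = "statistically significant" if p_value < 0.05 else "not significant"
print(f"t = {t_stat:.2f}, P = {p_value:.4f} -> {verdict}")
```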
It is past time to eliminate p-values entirely, as I argue in Uncertainty. See most recently “P-values vs. Bayes Is A False Dichotomy”, and the book page with links to many articles. There is a third way besides Bayes and P.
Especially if you are a p-value supporter, re-read these two lines:
The statistic is used to test a ‘null hypothesis’, a default state positing that there is no relationship between the phenomena being measured. The smaller the P value, the less likely it is that the results are due to chance — presuming that the null hypothesis is true.
Now, without using causal language, because neither p-values nor any other probability model can discover cause, explain what “no relationship” means. Then, again without using causal language, explain “results are due to chance”. (Don’t forget to do this.)
I assert, and prove in Uncertainty, that you cannot explain “no relationship” or “due to chance” without using causal language or by assuming probability exists. I mean exists in the proper metaphysical sense, the same sense as saying the screen on which you are reading this exists (or paper, if some generous soul has printed it out for you).
Chance does not exist. That which does not exist cannot cause anything to happen. Probability does not exist. That which does not exist cannot cause anything to happen. Nothing can therefore be “due to” chance. Probability cannot establish a relationship in any ontic sense.
Probability is epistemic. It is an epistemological measure, not necessarily quantitative, between a set of premises (or assumptions, measurements, etc.) and a proposition of interest. That, and nothing more. (This is no different than what logic is, of course.)
That simple statement is the third way. Eliminate p-values entirely, along with Bayesian inference of non-observable parameters, and concentrate on probability. In science, which centers around observables, given a model, make probabilistic predictions of never-before-seen-in-any-way observables. And then check those predictions against reality. This is what civil engineers do when building bridges, and it is what solid-state physicists do when creating circuits.
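What the third way looks like in practice is easiest to see with a toy example. Here is a minimal sketch with invented numbers and a deliberately simple model (a Beta-Binomial with a flat prior, so the unobservable parameter drops out in closed form): state the probability of a new observable, then score that prediction against what actually happens.

```python
# Sketch of the predictive ("third") way, using invented data: given a
# model and old observations, give the probability of new, never-before-
# seen observables, then check the prediction against reality.
import numpy as np

rng = np.random.default_rng(42)

# Old observations: 80 successes in 100 trials (invented numbers).
successes, trials = 80, 100

# Beta-Binomial with a flat prior, parameter integrated out analytically:
# predictive probability that the NEXT observable is a success.
p_next = (successes + 1) / (trials + 2)

# New observations against which to check the prediction (here simulated,
# standing in for reality).
new_obs = rng.binomial(1, 0.8, size=50)

# Verification: Brier score of the prediction (lower is better).
brier = np.mean((p_next - new_obs) ** 2)
print(f"Predictive Pr(success) = {p_next:.3f}, Brier score = {brier:.3f}")
```

The same recipe works with any model: whatever is unobservable is integrated away, and what remains is a checkable probability of something that can be measured.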
Why not do the same for psychology, medicine, sociology, and other statistics-relying fields?
Answer: why not, indeed.
Setting specific thresholds for standards of evidence is “bad for science”, says Ronald Wasserstein, executive director of the American Statistical Association, which last year took the unusual step of releasing explicit recommendations on the use of P values for the first time in its 177-year history. Next month, the society will hold a symposium on statistical inference, which follows on from its recommendations.
Wasserstein says he hasn’t yet taken a position on the current debate over P value thresholds, but adds that “we shouldn’t be surprised that there isn’t a single magic number”.
There isn’t, though the vast majority of users of p-values think there is. The threshold picked is mesmerizing. The number 0.04999 brings joy, 0.05001 tears. This happens.
I’m not a member of the American Statistical Association (or any other organization), so won’t be at the meeting Wasserstein mentions. I have a small paper coming out soon (I thought it would be out by now) in the Journal of the American Statistical Association, in answer to a discussion on p-values, detailing the third, i.e. predictive, way. I don’t guess it will show before the conference, which is in a couple of weeks.
I only heard of the conference after it was set, so I’ll miss that opportunity to spread the word (in an official talk). But if you’re going, or go, let us know about it in the comments below.
Thanks to Marcel Crok for notifying us of this article.
Excellent news! I’ve been slowly trying to ease my colleagues at my University over to Probabilistic Logic (making logic statements about the data and calculating probabilities based on the data) and to making predictions using the models constructed. Slow process for sure, but some are waking up! I try to show how much more clearly these methods communicate results and how they actually get at what we are interested in finding. I don’t diss p-values in front of them, knowing I’ll get backlash for it, so I tiptoe around it. Looking forward to reading your paper!
Why not do the same for psychology, medicine, sociology, and other statistics-relying fields?
What works in science may not work in voodoo?
In quality assurance work, we used the p-value to determine whether a particular rock was worth rolling over to see what was underneath. IOW, it was a first step in the search for cause. Since “due to chance” is simply a shorthand for “due to one of a great many small causes, none of which are economical to break down further,” the p-value is a useful filter between wild goose chases and chases in which there is likely to be a goose at the end.
Defending the use of 3-sigma limits on control charts, Walter Shewhart said that there was no statistical justification for any such limit. The justification was practical. When fluctuations in the process went beyond three-sigma and a diligent search was made for the assignable cause, one was typically found; whereas when fluctuations remained within the limits, no such cause would be found.
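For the curious, a minimal sketch of that screening step, with invented measurements and the plain standard deviation standing in for the usual range-based estimate of sigma: points beyond the three-sigma limits are flagged as worth a diligent search, points inside are left alone.

```python
# Minimal sketch of a Shewhart-style individuals chart (invented data).
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(10.0, 0.5, 100)        # historical, in-control measurements
new_points = [10.2, 9.8, 10.4, 12.3, 10.1]   # hypothetical new observations

center = baseline.mean()
sigma = baseline.std(ddof=1)                  # stand-in for a range-based estimate
ucl, lcl = center + 3 * sigma, center - 3 * sigma

for i, x in enumerate(new_points, 1):
    action = "search for an assignable cause" if (x > ucl or x < lcl) else "leave it alone"
    print(f"point {i}: {x:5.1f} -> {action}")
```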
During a seminar on applied statistical engineering, the late great Dorian Shainin put up an acetate slide — remember those? — listing “confidence intervals”: 90%, 95%, 99%, 99.9%, and 99.99%. How do you decide which to use, he was asked.
Simple, he declared. 95% is the standard default level. But when there is no assignable cause, there is a 5% risk of a wild goose chase. Now, if the value of the goose is high, but the cost of the chase is low, it’s worth rolling over more rocks in hope of finding one, so you’d use 90%. However, if the cost of chasing geese is high, then you would use 99%. Someone asked: When do you use 99.9%? That’s when you’re dealing with problems like engines falling off jet airplanes.
Someone took the bait. But then when do you use 99.99%?
“When I’m on the airplane!”
IOW it came down to the relative comfort level between getting a longer list of rocks to roll over and a more focused list of rocks more likely to have something under them.
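One way to make that comfort level concrete, with entirely made-up numbers: when the process is stable, a fraction alpha of plotted points will still signal falsely, so the expected yearly spend on wild goose chases scales directly with the confidence level chosen.

```python
# Made-up numbers only: expected yearly cost of chasing false alarms
# at different confidence levels, assuming the process is actually stable.
cost_of_chase = 2_000      # hypothetical cost of one investigation ($)
points_per_year = 5_000    # hypothetical number of points plotted per year

for conf in (0.90, 0.95, 0.99, 0.999, 0.9999):
    alpha = 1 - conf
    wasted = alpha * points_per_year * cost_of_chase
    print(f"confidence {conf:.4f}: ~${wasted:>11,.0f} per year on wild goose chases")
```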
Nice one, YOS. In medicine, there are always multiple causes and one or more ‘final common pathways’. Signs and symptoms overlap often. Differential diagnosis is the art applied. When asked “What are my chances, Doc?”, one must be cautious. Epidemiological results don’t map one-to-one to living individuals. Humility and honesty are the best practices. So the Doc should say: “I’m not God, so I don’t know with certainty. Given 1000 people like you, 500 will not make it x (arbitrary number from flawed studies) length of time and 500 will make it at least that long.”
Acetate slides, oh yes I remember them. I’ve seen thousands of these over 50 years, though not as many in the last 20. My parents had boxes of them ;).
Is it possible that a psychology journal was the first to ban publication of papers with p-values included!? See: http://www.nature.com/news/psychology-journal-bans-p-values-1.17001
From the 9 March 2015 publication:
“…the editors of Basic and Applied Social Psychology (BASP) announced that the journal would no longer publish papers containing P values because the statistics were too often used to support lower-quality research.”
“Authors are still free to submit papers to BASP with P values and other statistical measures that form part of ‘null hypothesis significance testing’ (NHST), but the numbers will be removed before publication.”
“In an editorial explaining the new policy, editor David Trafimow and associate editor Michael Marks, who are psychologists at New Mexico State University in Las Cruces, say that P values have become a crutch for scientists dealing with weak data. ‘We believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research,’ they write.”
@Ken: Right on, as we used to say.
Here’s a thought experiment: imagine testing ten different treatments against ten different diseases and using a p-value threshold of only 0.01. Suppose further that (unbeknownst to you) none of the treatments is effective against any of the diseases. IOW, there is no relationship whatsoever. With 100 different experiments being run at p < 0.01, you should still expect about one of them to come out ‘significant’ purely by chance (the probability of at least one false positive is 1 - 0.99^100, about 63%).
Since there are a great many researchers investigating a great many conditions, it is no wonder that an experiment will eventually throw off a false positive. Since the one false positive gets published and the 99 no-results do not, we find that “most published research is wrong.” Though perhaps the proposed ban on p-values is intended to make replication less rigorous…
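A quick simulation of the thought experiment, with invented data in which none of the “treatments” does anything, shows how reliably the machinery throws off a winner:

```python
# Simulate 100 null experiments (no treatment has any effect), each
# tested at the p < 0.01 threshold from the thought experiment above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_per_arm = 100, 30
hits = 0

for _ in range(n_experiments):
    treated = rng.normal(0, 1, n_per_arm)   # treatment does nothing
    control = rng.normal(0, 1, n_per_arm)
    _, p = stats.ttest_ind(treated, control)
    hits += p < 0.01

print(f"{hits} of {n_experiments} null experiments came out 'significant'")
print(f"Pr(at least one false positive) = {1 - 0.99**100:.2f}")
```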
@cdquarles
Multiple causes, for sure. I wonder that medical researchers do not use fault tree analysis as do aerospace and others. Or perhaps they do. I know that the financial services industry had never even heard of FMEA.
Hark! There is a gulf between the place where the treatment happens and the place where the research is done!
Clinicians have been complaining about this for years. Way before the ‘studies that show’ phenomenon which has finally reached every person’s awareness.
….
Prognosis is what some above are talking about, not differential diagnosis.
Diagnosis is indeed an art, and there are no statistics involved whatsoever.
Clinical experience and simple logic, along with intuition, which involves listening, remembering the unusual, and assiduous checking, all make for accurate diagnosis. Differential diagnosis is just part of the diagnostic process. Diagnostic tests and machines do not replace decisions, and neither do statistical tools. It is a balance-of-evidence decision every time.
Not one percent or number comes into the process. What is being referred to may be such things as the threshold for a positive test, but this does not reach the clinic. The test, in haematology, for example, already shows normal values, so high, low, and normal are somewhat already decided. Still, an abnormal function on a blood result alone is not enough for a diagnosis if the patient has no symptoms or signs, which has happened! Radiology gives the most spectacular examples of this.
What cdquarles is referring to is prognostics, which is interesting to a researcher, perhaps, especially if they’re the one proving their treatment outcome.
Prognosis given in the clinic is always known to the clinician for what it is. Only some of the numerically minded think these things can be known.
There is a staggering amount of research carried out for the sake of research. Or research to keep the researcher as far from patients as possible! It’s all wrong. There’s enough money if it were just redistributed, with known treatments as the priority, unless a good case is made otherwise. In this country the money comes from the public purse and from rich, self-serving charities. It’s my guess that there will be resistance to changing the threshold if it makes it more difficult for people to prove what isn’t true to start with.
It’s a little distressing when even YOS writes as if there were such a thing as ‘the’ p-value (from which inferences may be drawn), when Matt already years ago showed calculated ‘p-values’ varying in a range between 0.87 and 0.08 for the same data, depending on the test statistic deployed – and with absolutely no mathematical grounds to prefer one test statistic over another. And accordingly, arguments that ‘the’ p-value somehow shows the chances of something-or-other happening or not happening with ‘repeated’ experiments must be at least as flimsy, if not more so.
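To illustrate that point with invented data (not Matt’s original example): push the very same two samples through several common test statistics and you get noticeably different p-values, with nothing in the mathematics to say which is the right one.

```python
# Same two (invented) samples, different test statistics, different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 25)
b = rng.normal(0.4, 2.0, 25)

print("t-test (pooled)     p =", round(stats.ttest_ind(a, b).pvalue, 3))
print("t-test (Welch)      p =", round(stats.ttest_ind(a, b, equal_var=False).pvalue, 3))
print("Mann-Whitney U      p =", round(stats.mannwhitneyu(a, b).pvalue, 3))
print("Kolmogorov-Smirnov  p =", round(stats.ks_2samp(a, b).pvalue, 3))
```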
Of course, it is much more alarming that the current crop of statisticians (let alone the millions of others who use ‘statistics’) show little, if any, willingness to understand the “p-value crisis” as a window into the larger issue: that p-values and all the rest of it never had more than merely heuristic value, in the now Ancient Days when back-of-the-envelope calculations were nearly all that could be dreamed of, and when the prospect of “integrating out” all parameters seemed realistically impossible.
Does the entire edifice of “modern statistics” need to continue, despite the provable fact that it is no longer necessary, that it unnecessarily misleads, to no good purpose, and that even the clunkiest computer today can “integrate out” parameters and reveal actual probabilities, given whatever model and evidence? Apparently even the question cannot yet be asked, let alone squarely faced. Nor, apparently, can a related issue be entertained, even in the bosoms of those statisticians who regard themselves as the most ‘skeptical’ and au courant; namely, that the humility of predictive statistics is the only real way forward. And that is… disappointing.
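For what it is worth, here is a sketch of what “integrating out” looks like on even a clunky computer, with invented measurements, a normal model with an assumed known spread, and a flat prior over a grid for the unknown mean. The output is a plain probability of a new observable, not a p-value.

```python
# Integrate out the unknown mean over a grid and state the probability
# of a new observable (invented data; normal model, known sigma, flat prior).
import numpy as np
from scipy import stats

data = np.array([1.2, 0.7, 1.9, 1.1, 0.4, 1.6])   # hypothetical measurements
sigma = 1.0                                        # assumed known spread

mu_grid = np.linspace(-5, 5, 2001)                 # grid over the unknown mean
log_like = stats.norm.logpdf(data[:, None], mu_grid, sigma).sum(axis=0)
post = np.exp(log_like - log_like.max())
post /= post.sum()                                 # posterior over the grid

# Probability the NEXT measurement exceeds 2, with the mean integrated out.
p_next_gt_2 = np.sum(post * stats.norm.sf(2.0, mu_grid, sigma))
print(f"Pr(next observation > 2 | data, model) = {p_next_gt_2:.3f}")
```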
The only inference I draw is that it may signal when you are wasting your time looking for an assignable cause and when the search may be fruitful. It will never on its own indicate an assignable cause.
Example: an analysis of string breaks on an eight-spool string applicator showed that spools with higher numbers experienced more frequent breaks than one might suppose assuming a Poisson model. (Individual events relatively rare; but opportunities for breaks effectively continuous.) This did not mean that renumbering the spools to give them lower numbers, as might be suggested by a soft scientist, would reduce string breakage. But it did indicate that the tracks and bolt-eyes through which the strings ran on those spool lines might bear close examination. Lo! It was so. The wax-impregnated strings (which strengthen the paperboard wrappers) were found to have worn grooves in some of the bolt-eyes, creating burrs on which the string might snag and break. These conditions were rectified and preventive inspection of bolt wear implemented to anticipate future problems.
But it was the visual inspection and the mechanical engineering that identified and solved the cause. The statistics only identified a rock worth rolling over. And we would not have done so on a measly 0.05 alpha risk. In on-going manufacturing processes, the consequences of false alarms or wild goose chases are unacceptable.
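Something like that screening step might look like this in code, with purely hypothetical counts rather than the actual plant data: break counts per spool position over the same run, checked against the hypothesis of one common rate for all eight.

```python
# Hypothetical break counts per spool position (invented numbers).
import numpy as np
from scipy import stats

breaks = np.array([3, 4, 2, 5, 4, 12, 14, 16])   # spools 1 through 8

# Chi-square test of "every spool position breaks at the same rate".
chi2, p = stats.chisquare(breaks)
print(f"chi-square = {chi2:.1f}, p = {p:.5f}")
# A very small p only says which rock to roll over (the high-numbered
# spool lines); it says nothing about why the strings break.
```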
” In science, which centers around observables, given a model, make probabilistic predictions of never-before-seen-in-any-way observables.”
But how does that work in non-science applications, such as when we say this horse has a 20% chance of winning?
What does it actually mean? For science deals with repeatable events, but this particular event, this horse running in this race with these competitors, is a non-repeatable event.
@YOS
About Fault Tree analysis, I don’t know. I’m not in that business now. What I do remember from my pathology lab days decades ago was that the fad of the month was Root Cause Analysis. Oh well.
Alas, if only more people did them correctly! But many folks, attracted by the outward forms of success, imitate only the form and not the substance, and wind up with mere ritual.