Update On The Death Of P-Values & Statistical “Significance”

Received this email from Valentin Amrhein et al. about the efforts to jettison for all time “significance”. And replace it (say I) with this. If only I could get the powers that be to read it, we’d be in business! You know how receptive people are to ideas that say all their old ideas are wrong.

Also see this paper, in which Amrhein, a slew of other true thinkers, and even Yours Truly discuss similar things. I edited the URLs below into links.

Dear Colleague,

We are writing with a brief update on events following the Nature comment “Retire Statistical Significance”. In the eight months since publication of the comment and of the special issue of The American Statistician, we are glad to see a rich discussion on internet blogs and in scholarly publications and popular media.

One important indication of change is that since March numerous scientific journals have published editorials or revised their author guidelines. We have selected eight editorials that not only discuss statistics reform but give concrete new guidelines to authors. As you will see, the journals differ in how far they want to go with the reform (all but one of the following links are open access).

1) The New England Journal of Medicine, “New Guidelines for Statistical Reporting in the Journal”
(link)

2) Pediatric Anesthesia, “Embracing uncertainty: The days of statistical significance are numbered”
(link)

3) Journal of Obstetric, Gynecologic & Neonatal Nursing, “The Push to Move Health Care Science Beyond p < .05”
(link)

4) Brain and Neuroscience Advances, “Promoting and supporting credibility in neuroscience”
(link)

5) Journal of Wildlife Management, “Vexing Vocabulary in Submissions to the Journal of Wildlife Management”
(link)

6) Demographic Research, “P-values, theory, replicability, and rigour”
(link)

7) Journal of Bone and Mineral Research, “New Guidelines for Data Reporting and Statistical Analysis: Helping Authors With Transparency and Rigor in Research”
(link)

8) Significance, “The S word … and what to do about it”
(link)

Further, some of you took part in a survey by Tom Hardwicke and John Ioannidis that was published in the European Journal of Clinical Investigation along with editorials by Andrew Gelman and Deborah Mayo:
(link)

We replied with a short commentary in that journal, “Statistical Significance Gives Bias a Free Pass”
(link)

And finally, joining with the American Statistical Association (ASA), the National Institute of Statistical Sciences (NISS) in the United States has also taken up the reform issue:
(link)

With kind regards,
Valentin, Sander, Blake

To support this site and its wholly independent host using credit card or PayPal (in any amount) click here

27 Comments

  1. Sheri

    Interesting. Change is always slow. Now, if we can kill p-values and do open publishing of studies, I might even be able to tell whether the MSM and medical community made up their idiot guidelines based on a study rivaled by the 19th-century “guess” method or whether they actually did research. All this faked and wrong garbage that substitutes for reality today is so very annoying. Currently, if there is no verifiable study, I IGNORE the reports. Truth would be reported with documentation and open to questioning. Lies, on the other hand, are, by definition, not documented or questioned.

    Nice listing. I will save it for future reference.

  2. Congratulations! You’re having a positive effect upon the real world.
    Keep pushing! The tower of ignorance will eventually topple.

    Science stopped being a search for truth and became a quest for funding generations ago.

  3. Bill_R

    And how many of those are pozzed? Subjective priors let you make the result as “fair” and as “socially just” as you want.

  4. DAV

    https://journals.sagepub.com/home/bna (picked at random) hasn’t quite given up on p-values but has turned to “reproducibility” with the same data. IOW: whether anyone else can use the data to get the same answer. Nice check on the math, but it sort of misses the point.

    They do make noises about “replicability” but couch it in p-value terms:

    Why 0.9 power and not the more traditional 0.8? Both are completely arbitrary values. But let us look at this choice from the perspective of replicability: ‘Studies are often designed or claimed to have 80% power against a key alternative when using a 0.05 significance level, although in execution often have less power due to unanticipated problems such as low subject recruitment. Thus, if the alternative is correct and the actual power of two studies is 80%, the chance that the studies will both show P ≤ 0.05 will at best be only 0.80(0.80) = 64%; furthermore, the chance that one study shows P ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)(0.20) = 32% or about 1 chance in 3’ (Greenland et al., 2016).
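
    A quick check of that arithmetic (a minimal sketch in Python; the 0.80 power and 0.05 level are simply the values assumed in the quote):

        # Chance that two independent studies, each with true power 0.80, both
        # reach P <= 0.05, versus the chance they appear to "conflict"
        # (one significant, one not), per the Greenland et al. arithmetic.
        power = 0.80

        both_significant = power * power            # 0.80 * 0.80 = 0.64
        conflicting = 2 * power * (1 - power)       # 2 * 0.80 * 0.20 = 0.32

        print(f"both studies significant: {both_significant:.0%}")   # 64%
        print(f"apparently conflicting results: {conflicting:.0%}")  # 32%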

    As for “reliability”:

    Three fundamental markers of credibility are the reproducibility, replicability and reliability of neuroscience research. We mainly refer to reproducibility and replicability in this editorial

    I would have thought “reliability” would have the highest priority. It’s as if no one cares whether a result is reliable, so long as the same calculation results can be reproduced.

  5. DAV

    Is it me or is the double quote at the beginning of a block quote barely discernible? I can see it if I look hard for it but it doesn’t stand out until I use a “High Contrast” app in Chrome.

    Lacking an increase in contrast, maybe indenting the block quote more than one space might make it stand out more as a quote.

  6. Bill_R

    @DAV
    In practice, you want a result to hold up in multiple studies. As in :

    It is the *consistency* of the P-value of the series, under a wide variety of conditions, and not the smallness of any one *P* value by itself that determines a basis for action … Statistical “significance” by itself is not a rational basis for action

    from Deming. Notice the emphasis on action, rather than, say, publications. Fisher and Deming were both working in situations where analysis led to real decisions and actions, not “advancing science.”

    The ability to “reproduce” a single data analysis is useful in auditing circumstances, but not particularly interesting when you also have the scripts and data files. Modern GUIs make it easy to hide a variety of shady practices…

  7. Yonason

    I tried blockquote and didn’t think it worked.

    TESTING

    blockquote

    blockquote

    blockquote^2

    blockquote

    blockquote^2

    blockquote^3

    It works here
    https://htmledit.squarefree.com/
    But that doesn’t mean it will work on this blog. Let’s see what happens.

  8. Yonason

    OK, so if you double up on the brackets, it will noticeably indent. A bit of a pain, but you will see it. Tripling isn’t necessary.

  9. Dean Ericson

    Our host notes:

    “You know how receptive people are to ideas that say all their old ideas are wrong.”

    Indeed. But then, how in the devil was the Revolution so successful? Rhetorical question.

    Anyway, happy to hear how your relentless effort beavering away at the rotten foundation of the Revolution is having some success. It’s like you took a page from the termites.

  10. DAV

    Bill_R,

    It is the *consistency* of the P-value of the series, under a wide variety of conditions, and not the smallness of any one *P* value …

    The p-value is effectively a correlation coefficient between current X and Y. It doesn’t do anything for predictive value. I can repeat the study and get the same p-value for different slopes. What would that tell me?
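
    Here’s a minimal sketch of that point, with made-up numbers and scipy.stats.linregress as the assumed tool; rescaling Y rescales the slope and the noise together, so the p-value doesn’t budge:

        # Two data sets with very different slopes can give the same p-value,
        # so a p-value alone says nothing about the size of the effect.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)
        x = np.linspace(0, 10, 20)
        y1 = 1.0 * x + rng.normal(0, 3, size=x.size)   # slope near 1
        y2 = 5.0 * y1                                  # slope near 5, same signal-to-noise

        for label, y in [("slope near 1", y1), ("slope near 5", y2)]:
            fit = stats.linregress(x, y)
            print(f"{label}: slope = {fit.slope:.2f}, p-value = {fit.pvalue:.4f}")
        # Both p-values come out identical; only the slopes differ.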

  11. DAV

    Yonason,

    Yeah, but it would only work if everybody does it. Better if the color of the double quote or surrounding lines were changed from very faint gray to a darker gray.

  12. Yonason

    DAV

    Yes to both.

  13. Interesting that scientists are still using p-values and stat-sig language in Nature and ASA publications where their (hit) pieces were published. (Not to mention Nobel winners and other research.)

    Let’s see, what were the effects after BASP banned the use of inferential statistics in 2015? Did science improve? Ricker et al. in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban” write:

    “In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”

    Also, please read “So you banned p-values, how’s that working out for you?” by Lakens.

    Justin

  14. DAV

    statistical significance

    A meaningless term.
    Here’s a (not necessarily rigorous) explanation of statistical significance:
    https://towardsdatascience.com/statistical-significance-hypothesis-testing-the-normal-curve-and-p-values-93274fa32687

    Like with most technical concepts, statistical significance is built on a few simple ideas: hypothesis testing, the normal distribution, and p values.

    IOW: it’s a concept built on everything that can be easily abused.

    His explanation of “hypothesis testing” is actually wrong, as the Hypothesis Test doesn’t actually test one’s hypothesis but whether the parameter of interest is “significant,” whatever that means.

    The “normal distribution” doesn’t actually exist in the real world, which is finite. It can be used, but with extreme care.

    “p-value” — well, a lot about p-value has been mentioned here which doesn’t warrant repeating.

    “In this article … one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”
    “So you banned p-values, how’s that working out for you?” by Lakens.

    Clearly, many contributors to BASP are as inept in statistics as they are in science.

  15. Bill_R

    @DAV,

    The p-value is effectively a correlation coefficient between current X and Y. It doesn’t do anything for predictive value. I can repeat the study and get the same p-value for different slopes. What would that tell me?

    Nope, not a correlation coefficient at all. It’s a normalized rank statistic that compares a function of sample values to a user-specified reference set. It was never supposed to be a predictive value. If you’ve set the reference set up right (see Kolmogorov), you can call it a probability, if you want. Matt favors Cox’s derivation, which apparently does away with the need to specify reference sets (IIRC Kolmogorov is a special case). I personally dislike using the word probability when there is no variability (the data is observed; it isn’t changing).

    Your p-value example suggests an alternative understanding of p-values. The sample p-value is not (usually) one-to-one with a sample slope. See above about reference sets. Different reference sets, different meaning to a p-value. For real-world comparisons across different data sets, the reference sets have different elements and are only loosely comparable. (Perhaps you are using asymptotic approximations?) It’s always nice to specify your units and reference sets for empirical work.
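
    If it helps, here’s a minimal sketch of that rank reading, using a permutation reference set and invented numbers:

        # A p-value as a normalized rank: the observed statistic is ranked
        # against a user-specified reference set (here, group-label permutations).
        import numpy as np

        rng = np.random.default_rng(1)
        a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])   # made-up group A
        b = np.array([4.2, 4.9, 4.4, 5.0, 4.6])   # made-up group B

        observed = a.mean() - b.mean()
        pooled = np.concatenate([a, b])

        # Reference set: mean differences under random relabelling of the observations
        ref = []
        for _ in range(10_000):
            perm = rng.permutation(pooled)
            ref.append(perm[:a.size].mean() - perm[a.size:].mean())
        ref = np.array(ref)

        # Normalized rank of the observed value within the reference set (one-sided)
        p = (np.sum(ref >= observed) + 1) / (ref.size + 1)
        print(f"observed difference = {observed:.2f}, permutation p-value = {p:.4f}")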

  16. DAV

    not a correlation coefficient at all.

    Used as one so it is.

    It’s a normalized rank statistic
    Why is this relevant?

  17. Bill_R

    @DAV

    I can use a screwdriver as a hammer or a pry bar, too.

    It is relevant if you wish to understand the meaning and limits of a p-value, rather than, say, pretending it is a correlation or a predictive value. It’s a specific tool that addresses the oddity of an observed value relative to your expectations. Hence, a rank statistic.

  18. DAV

    I can use a screwdriver as a hammer or a pry bar, too.
    Congrats. Always good to have a job skill.

    BR:It’s a normalized rank statistic
    DV:Why is this relevant?
    BR:It is relevant if you wish to understand the meaning and limits of a p-value
    So you don’t know why it is relevant. OK.

    pretending it is a predictive value
    The exact negation actually. Pay attention.

    I realize your mind was crammed full of crap and stat-speak in that one stat class you were required to take alongside your major intro (Psych-101), so asking the following is probably unfair:

    What do Hypothesis Tests and Chi-Squared Tests do qualitatively? Hint: the same thing, which can be summed up in one sentence. What falls out of these tests which can be the difference between entering Academic Nirvana and being relegated to continued slogging? For extra credit: in a few sentences (five or fewer; you only need one), why are they done?

    If you can answer those then you should understand what I’m saying. The last is probably the most important. But I understand your journey may be long and arduous.

  19. Yonason

    Couple of items…

    National Propaganda Radio weighs in…
    https://www.npr.org/sections/health-shots/2019/03/20/705191851/statisticians-call-to-arms-reject-significance-and-embrace-uncertainty

    Interesting survey of opinions that wee-p must or must-not go, but with only superficial treatments of why (at least some of them wrong). Typical NPR word fog.

    And here, ASA tosses wee-p a Bayesian life raft.
    https://amstat.tandfonline.com/doi/full/10.1080/00031305.2019.1699443#.XeoWzdV7mM8
    I don’t know enough to evaluate it, but thought it might be of interest. Based on what I read here, they may be over-reaching?

    Looks like a long battle ahead. Probably what would work best to kill wee-p would be one or more spectacular failures of predictions it had “validated” with great public fanfare. Preferably they would be greatly embarrassing but not harmful.

  20. Bill_R

    @DAV

    Congrats. Always good to have a job skill.

    Thanks, but I don’t need another one. I’m comfortably retired. But please do keep up your payments into Social Security and Medicare. There are plenty of needy old Boomers out there.

    you were required to take alongside your major intro (Psych-101)

    Nice guess. However, Psych was optional for math and science majors. Took Poly-Sci instead. It was that or “Christian Marriage,” taught by a Jesuit.

  21. Bill_R

    @Yonason

    This comes around every couple of decades or so. This cycle appears to have really kicked off when Ioannidis discovered that some studies don’t replicate. (Horrors! zzz) The newer Bayesians were all over that one, now that they had their shiny new MCMC toys. It will eventually peter out as new topics come up (Machine Learning? Algorithmic Bias?).

    Meanwhile the companies, the QC departments, the commercial development outfits, and the scientists will keep on doing their thing, making go/no-go decisions from data.

  22. Kalif

    @DAV
    Could you please clarify what you meant by “…What do Hypothesis Tests and Chi-Squared Tests do qualitatively…”?
    (Reminds me of “Africa and other countries” by Bush Jr.)

    Chi-square is just one of many hypothesis tests, of which there are a dime a dozen; it is used when both the independent and dependent variables are categorical.

    Looks like many on this blog are confusing hypothesis testing, which is what real science is based on, with NHST (null hypothesis statistical testing), which is bad because there is no such thing as a null hypothesis (no effect). That’s why p-values, although mathematically correct, are useless: they are used to test against the null (which, again, doesn’t exist).

  23. DAV

    Kalif,

    I’m not always precise in my language. Think of it as a beauty mark like the one on Marilyn Monroe’s face. 🙂

    The answer I was looking for was that they are tests for correlation. The why is that correlation is a requirement for causality, and the p-value is used as the magic determiner of correlation (as in, yes, yes, that barely perceptible slope is REAL!).

    The actual null hypothesis, regardless of how it is stated, is no correlation. I think the Hypothesis Test was originally intended as a quick-and-dirty check to see if further investigation might be warranted but has morphed into the Holy Grail.

    The reason why p-values are useless is that one would expect that the models falling out of a hypothesis test have some validity — meaning they have predictive value. P-values don’t lend themselves to that determination. All one learns is that X & Y sets are correlated (or not) given the current study. If even that, given the propensity for p-values to be small with large amounts of data.
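
    A minimal sketch of that last point, with an invented and practically negligible slope (scipy.stats.linregress is the assumed tool):

        # A tiny effect becomes "statistically significant" once n is large enough;
        # the slope stays tiny, only the p-value shrinks.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        true_slope = 0.02   # made-up, practically useless effect

        for n in (100, 1_000, 100_000):
            x = rng.normal(size=n)
            y = true_slope * x + rng.normal(size=n)
            fit = stats.linregress(x, y)
            print(f"n = {n:>7}: slope = {fit.slope:+.4f}, p-value = {fit.pvalue:.3g}")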

    Lip service is given to “correlation does not necessarily imply causation” by using words such as “linked” (meaning “correlated”) but few act as if they really believe that.

  24. Bill_R

    @DAV

    I’m not always precise in my language… The actual null hypothesis – regardless of how stated — is no correlation

    Your comment to Kalif clarifies a lot. If by “correlation” you mean “comparison” then yes, a hypothesis test is a comparison of data and some prior belief or specification about how it should behave. Since the observed statistic is typically a single number and it’s being compared to a range of values, the comparison is some form of rank statistic, a p-value.

    Most, if not all, of statistics is comparative. That comparison might be in the interests of causality or it might not. That depends on the practical application. Sometimes it’s just for a handy one line summary (e.g. “the median load at failure in the sample was 127 lbs, 8 lbs. below our pre-specified cut-off”) that leads to an action or gets saved for future reference. In this example I’m not interested in causality at all. I’m interested in whether or not the incoming lot is acceptable. I don’t care why it failed. (The example is also one-sample, one-sided, and not a “normal” curve.)
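
    For the curious, a minimal sketch of that lot-acceptance example as a one-sided sign test against the cut-off; the loads and the cut-off are invented, and scipy.stats.binomtest is just one way to frame it:

        # One-sample, one-sided check of failure loads against a pre-specified cut-off.
        import numpy as np
        from scipy import stats

        cutoff = 135.0   # pre-specified acceptance cut-off (lbs)
        loads = np.array([119, 121, 124, 125, 127, 128, 130, 133, 140], dtype=float)

        median = np.median(loads)
        print(f"sample median = {median:.0f} lbs ({cutoff - median:.0f} lbs below the cut-off)")

        # Sign test: if the true median sat at the cut-off, loads would fall below it
        # about half the time; count how many actually do.
        below = int(np.sum(loads < cutoff))
        result = stats.binomtest(below, n=loads.size, p=0.5, alternative="greater")
        print(f"{below}/{loads.size} below the cut-off, one-sided p-value = {result.pvalue:.3f}")
        # The decision (accept or reject the lot) rests on the cut-off, not on any causal story.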

  25. DAV

    Bill R,

    I had in mind examples such as these from Wikipedia:
    https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Use_and_importance

    All of them are tests for correlation. For instance:
    1) sex is correlated to nightmares
    2) the documents can be correlated to said author
    3) the full moon is correlated to behavior
    4) distance is correlated to insect detection
    5) etc.

    It’s also what Briggs’s blog posts are all about.

    Even for MTBF, the test is a correlation between Decision Cutoff and Failures.
    https://www.afit.edu/stat/statcoe_files/Reliability_Test_Planning_for_Mean_Time_Between_Failures2.pdf
    (see figure 1 at the top of page 3).
    This isn’t a comparison. It is a prediction.

  26. Yonason

    Bill R

    “It will eventually peter out, as new topics come up (Machine Learning? Algorithmic Bias?) “

    Algorithmic Bias, you say?

    At the end of the day algorithmic bias is a human problem, not a technical one, and the real solution is to start removing bias in every aspect of our personal and social lives. This means endorsing diversity in employment, education, politics and more. If we want to fix our algorithms, we should start by fixing ourselves. (https://bdtechtalks.com/2018/03/26/racist-sexist-ai-deep-learning-algorithms/)

    I can hardly wait.

    (Yeah, DAV, it is best when everyone uses it. But I don’t have another century or two to wait.)

  27. DAV

    Algorithmic bias is inevitable (at least for now) because bias is inherent in the data. For example, a model built on a dataset containing 1,000 variants of apples and a dozen kiwi fruit will likely call the kiwi fruit apples, since apple is the majority class — a safe bet. There are ways around this, but they have to be built into the design, which is not always easy to do. Why anyone would believe otherwise is probably part of the wishful thinking rampant in our current society (see the quote calling for more diversity).
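
    A minimal sketch of the apples-and-kiwis point, with an invented feature and a stock classifier (sklearn’s LogisticRegression; all numbers made up):

        # With 1,000 apples and 12 kiwis, a classifier that leans toward the majority
        # class is already ~99% "accurate", so most kiwis get called apples.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        apples = rng.normal(7.5, 1.0, size=1000)   # made-up feature, e.g. diameter (cm)
        kiwis = rng.normal(6.0, 1.0, size=12)      # overlaps heavily with the apples

        X = np.concatenate([apples, kiwis]).reshape(-1, 1)
        y = np.array([0] * 1000 + [1] * 12)        # 0 = apple, 1 = kiwi

        model = LogisticRegression().fit(X, y)
        found = model.predict(X[y == 1]).sum()
        print(f"kiwis recognized as kiwi: {found} of 12")
        # Designing around the bias (e.g. class_weight="balanced") has to be deliberate.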

    The one main advantage of ML algorithms is that nearly all make attempts at model generalization and specifically test for it. Unfortunately, they can’t be any better than the available data. This is unlike the practice in fields like psychology and epidemiology, where the study is published without any attempt at verification. Apparently this is beginning to be noticed, given this blog post.
