Good news and bad news. The good news is that there is a growing number of people who are aware that p-values are sinking science, and they want to do something, really do something, to fix the situation.

Bad news: the proposed fixes are only tweaks, like requiring wee p-values be weer (wee-er?) or using Bayes factors.

There is also terrific news. We can fix much—but *not* all, never all—of what is broken by eliminating p-values, hypothesis tests, and Bayes factors altogether. Banish them! Bar them! Blacklist, ban, and boycott them! They do not mean what you think they do.

The main critique comes in a new paper co-authored by a blizzard of people: “Redefine Statistical Significance”. The lead author is Daniel J. Benjamin, and the second is Jim Berger, who is well known. To quote: “One Sentence Summary: We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.”

There is at least one blog devoted to the reproducibility or replication crisis in science (thanks to Gary Boden for finding this one). There is growing recognition of severe problems, which led one fellow to tell us “Why most of psychology is statistically unfalsifiable”. And see this: “The new astrology: By fetishising mathematical models, economists turned economics into a highly paid pseudoscience.”

**What are P-values?**

Despite what frequentist theory says, the great bulk of hypothesis test users believe wee p-values (and large Bayes factors) have proved, or given great weight to, causal relations. When a p-value is wee, they say X and Y are “linked” or “associated”, and by that they always mean, even if they protest they do not mean, a causal relationship.

A wee p-value means only one thing: the probability of seeing an *ad hoc* statistic larger than the one you did see is small given a model you do not believe. This number is as near to useless as any number ever invented, for it tells you *nothing* about the model you don’t believe, nor does it even whisper anything about the model you do believe.
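To make that limited sense concrete, here is a minimal simulation sketch (the observed statistic and the null model are invented for illustration): a p-value is nothing but the probability, computed under a model you do not believe, of a statistic at least as large as the one you saw.

```python
import random

# Sketch: a p-value is Pr(statistic >= observed | null model), and nothing more.
# We simulate the null model directly. All numbers here are illustrative.
random.seed(1)

observed_stat = 1.7  # the ad hoc statistic actually seen (invented)

def null_statistic():
    # Null model: mean of 10 standard-normal draws, rescaled so the
    # statistic is approximately standard normal under the null.
    draws = [random.gauss(0.0, 1.0) for _ in range(10)]
    return (sum(draws) / len(draws)) * 10 ** 0.5

sims = [null_statistic() for _ in range(100_000)]
p_value = sum(s >= observed_stat for s in sims) / len(sims)
print(round(p_value, 3))  # roughly 0.045: a tail probability under a disbelieved model
```

Note what the number does not do: it says nothing about the model you actually entertain, nor about the probability Y is true.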

Every use of the p-value, except in the limited sense just mentioned, involves a fallacy. I prove—as in *prove*—this in my award-eligible book *Uncertainty: The Soul of Modeling, Probability & Statistics*. How embarrassing not to own a copy!

Also see this blog’s Book Page, which has links to many articles on relevant topics.

**The replacement**

I have an upcoming *JASA* paper in discussion of Blakeley McShane and David Gal’s also-upcoming “Statistical Significance and the Dichotomization of Evidence”, in which I outline the replacement for p-values. Academic publishing is lethargic, so look for this in August or even December.

Meanwhile, here are elements of a sketch of a condensation of an abbreviation of the outline. The full thing is in *Uncertainty*. I will answer below any *new* criticism that I have not already answered in *Uncertainty*—meaning, if I don’t answer you here, it means I already have in the book.

We have interest in proposition Y, which might be “This patient gets better”. We want the probability Y is true given we know X_0 = “The patient will be treated by the usual protocol” or X_1 = “The patient will be treated by the *New & Improved!* protocol”. We have a collection of observations D detailing where patients improved or not and which protocol they received. We want

Pr(Y | X_i D).

This could be *deduced* in many cases using finite, discrete probability, but that’s hard work; instead, a probability model relating D and X to Y is proposed. This model will be parameterized with continuous-valued parameters. Since all observations are finite and discrete, this model will be an approximation, though it can be an excellent one. The parameters are, of course, of no interest whatsoever to man or beast; they serve only to make the model function. They are a nuisance and no help in answering the question of interest, so they are “integrated out”. The end result is this:

(1) Pr(Y | X_i D M),

where M is a complicated (compound) proposition that gives details about the model proposed by the statistician. This is recognized as the predictive posterior distribution given M. M thus also contains assumptions made about the approximate parameters; i.e. whether to use “flat” priors and so on.

This form has enormous benefits. It is in plain language; specialized training isn’t needed to grasp model statements, though advanced work (or better software) is needed to implement it. Everything is put in terms of *observables*. The model is also made prominent, in the sense that it is plain there is a specific probability model with definite assumptions in use, and thus it is clear that answers *will* be different if a different model or different assumptions about that model are used (“maxent” priors versus “flat”, say).

*Anybody* can check (1)’s predictions, even if they do not know D or M’s details. Given M and D, authors might claim there is a 55% chance Y is true under the new protocol. Any reader can verify whether this prediction is useful for him or not, whether the predictions are calibrated, etc. We do not have to take authors at their word about what they discovered. Note: because finding wee p-values is trivial, many “novel” theories will vanish under (1) (because probabilistic predictions made using and not using the “novel” theory will not differ much; p-values wildly exaggerate differences).
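Checking (1)’s predictions requires nothing but the stated probabilities and the outcomes that later arrive. A minimal verification sketch (predictions and outcomes invented for illustration):

```python
# Sketch: a reader checks stated predictive probabilities against observed
# outcomes, without knowing D or M. All numbers are invented.
predictions = [0.55, 0.55, 0.6, 0.5, 0.55, 0.6, 0.5, 0.55]
outcomes    = [1,    1,    0,   1,   0,    1,   0,   1]   # 1 = improved

# Simplest check: compare mean predicted probability with the observed rate.
mean_pred = sum(predictions) / len(predictions)
observed_rate = sum(outcomes) / len(outcomes)

# A proper score (here the Brier score) rewards calibrated, sharp predictions;
# lower is better.
brier = sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)
print(round(mean_pred, 3), round(observed_rate, 3), round(brier, 3))
```

Whether a given score is “good enough” is, of course, a decision for the model user, not a mathematical universal.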

A prime reason p-values were embraced was that they made automatic, universal decisions about whether to “drop” variables or to keep them (in a given model schema). But probability is not decision; p-values conflated the concepts. P-values cannot discover cause.

There are an infinite number of “variables” (observations, assumptions, information, premises, etc.) that can be added to the right-hand-side of (1). In our example, these can be anything—they can always be anything!—from a measure of hospital cleanliness to physician sock color to the number of three-teated cows in Cleveland. The list really is endless. Each time one is put into or removed from (1), the probability changes. Which list of variables is correct? *They all are*. This is true because all probability is conditional: there is no such thing as unconditional probability (this is also proven in *Uncertainty*).
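That every probability is conditional, and shifts whenever the conditioning list shifts, can be seen in a toy sketch (the tiny data set is invented; the “ward cleanliness” variable stands in for any of the endless candidates):

```python
# Sketch: the probability of Y depends entirely on what is conditioned on.
# A tiny invented data set of (improved, protocol, ward_clean) records.
records = [
    (1, "new", True), (1, "new", True), (0, "new", False), (1, "new", False),
    (1, "usual", True), (0, "usual", True), (0, "usual", False), (0, "usual", False),
]

def pr_improved(condition):
    subset = [r for r in records if condition(r)]
    return sum(r[0] for r in subset) / len(subset)

p_given_new = pr_improved(lambda r: r[1] == "new")
p_given_new_clean = pr_improved(lambda r: r[1] == "new" and r[2])
print(p_given_new, p_given_new_clean)  # same Y, different conditions, different numbers
```

Neither number is “the” probability of Y; each is correct given its own conditions.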

The goal of all modeling is to find a list of true premises (which might include data, etc.) which allows us to determine or know the cause of Y. This list (call it C) will give extreme probabilities in (1); i.e.

Pr(Y | X_i D C) = 0 or 1.

Note that *to determine* and *to cause* are not the same; the former means *to ascertain*, while the latter is more complex. Readers generally think of efficient causes, and that is enough for here, though these comprise only one aspect of cause. (Because of underdetermination, C is also not unique.) Discovering cause is rare because of the complexity of C (think of the myriad causes of patient improvement). It is still true that the probabilities in (1) are correct when M is not C, for they are calculated based on different assumptions.

What goes into M? Suppose some W (an observation, assumption, etc.) is considered. The (conditional) probability of Y with and without W in (1) is found; if these differ such that the model user would make different decisions, W is kept; else not. Since decision is relative, there is thus *no* universal solution to variable selection. A model or variable important to one person can be irrelevant to another. Since model creators have always had infinite choice, this was always obvious.
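This decision-based test for W can be sketched as follows (the counts, the stratum, and the decision threshold are all invented; the threshold belongs to one particular user, which is the whole point):

```python
# Sketch of decision-based variable selection: keep W only if including it
# changes the decision this particular user would make. Numbers invented.

def predictive_prob(improved, total, a=1.0, b=1.0):
    # Beta(a,b)-binomial predictive probability, parameters integrated out.
    return (a + improved) / (a + b + total)

# Without W: pool all patients on the new protocol.
p_without_w = predictive_prob(improved=40, total=62)

# With W (say, a ward-cleanliness measure): condition on the relevant stratum.
p_with_w = predictive_prob(improved=25, total=32)

threshold = 0.7  # this user adopts the new protocol only above a 70% chance
decision_without = p_without_w > threshold
decision_with = p_with_w > threshold
keep_w = decision_without != decision_with
print(keep_w)  # True for this user; a user with another threshold may drop W
```

Swap in a different threshold (a different decision-maker) and W may be discarded: there is no universal answer, only answers relative to decisions.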

About mathematical models in economics, Ludwig von Mises had much to say, all of it disparaging. Don’t go there, he said, because the basis of human action isn’t measurable. He said that about a century ago, long before electronic calculators were a thing and when Fisher’s and others’ statistical theories were just beginning to take hold.

A big portion of the problem, not addressed above, is that the decision-makers are never the same people who perform the analysis and make the models; hence the presentation of the analysis and model must be fairly idiot-proof. Take actuarial work as an example for a moment.

It is customary to present to management a credibility-weighted point estimate of the indicated rate change, despite this custom’s many inaccuracies (that the number required for “full” credibility is a sample number only for a p of 90/10, that there is always debate about what the complement of credibility should be, and that there is absolutely no mathematical basis for the process of “credibility weighting”).

It is possible, however, to express the indicated rate change as a range with a confidence interval. While this is more accurate from the standpoint of the mathematics – leaving aside the issues of the two dozen or so rather debatable assumptions embedded in the analysis – it is far more problematic for the decision-makers. Given the choice of seeing “we need to raise rates by 4.1%” or “there is a 90% probability that we need to change rates by somewhere between minus 1.1% and plus 9.1%,” management prefers the false certainty of the point estimate, and would question both the legitimacy of the measurement (rightly so) and the utility of the actuarial department (wrongly so). In addition, the insurance regulators of the State in question would abuse the given range for political purposes, and admitting the inherent inaccuracy would make any defense far more tenuous.

While this is admittedly a specific example, it would be trivial to transport it to other spheres of discussion such as social sciences or economics (but I repeat myself, as economics is merely a social science).

tl;dr

Mathematical solutions to the use of p-values ignore the fact that there is a disconnect between analysts and management (loosely defined as those who make the decisions based on the data).

cdquarles —

However, LvM was a fellow traveler with his brother, Richard, in regard to frequentism, a view that is anathema to Briggs.

TipTipTopKek,

Yes, as I say often (and in the book), probability is not decision. A model may be good or useful to one decision maker and bad and useless to another.