
# New Paper! Everything Wrong With P-values Under One Roof

Here is a link to the PDF.

Briggs, William M., 2019. Everything Wrong with P-Values Under One Roof. In Beyond Traditional Probabilistic Methods in Economics, V Kreinovich, NN Thach, ND Trung, DV Thanh (eds.), pp 22–44. DOI 978-3-030-04200-4_2

Here is the Abstract:

P-values should not be used. They have no justification under frequentist theory; they are pure acts of will. Arguments justifying p-values are fallacious. P-values are not used to make all decisions about a model, where in some cases judgment overrules p-values. There is no justification for this in frequentist theory. Hypothesis testing cannot identify cause. Models based on p-values are almost never verified against reality. P-values are never unique. They cause models to appear more real than reality. They lead to magical or ritualized thinking. They do not allow the proper use of decision making. And when p-values seem to work, they do so because they serve as loose proxies for predictive probabilities, which are proposed as the replacement for p-values.

“Dude, that’s harsh.”

It is, indeed. Here are more words from The Beginning of the End (i.e. the Introduction):

A book could be written summarizing all of the literature for and against p-values. Here I tackle only the major arguments against p-values. The first arguments are those showing they have no or sketchy justification, that their use reflects, as Neyman originally said, acts of will; that their use is even fallacious. These will be less familiar to most readers. The second set of arguments assume the use of p-values, but show the severe limitations arising from that use. These are more common. Why p-values seem to work is also addressed. When they do seem to work it is because they are related to or proxies for the more natural predictive probabilities.

The emphasis in this paper is philosophical, not mathematical. Technical mathematical arguments and formulas, though valid and of interest, must always assume, tacitly or explicitly, a philosophy. If the philosophy on which a mathematical argument is based is shown to be in error, the “downstream” mathematical arguments supposing this philosophy are thus not independent evidence for or against p-values, and, whatever mathematical interest they may have, become irrelevant.

Trust me, you haven’t seen many of these arguments against p-values.

Glad you asked, friend. It’s the Infinity of Null Hypotheses, which is as damning a proof as can be. But it’s not just a negative proof. It also constructively points the way toward the replacement (predictive methods) and it highlights the hidden notions of cause in statistics, which badly need our understanding. I’m working on a paper on that subject, to highlight the material in the award-eligible book Uncertainty, which all the better sort of people own, or will own.

Here’s a quotation about that proof—but you have access to the full paper, too.

For every measure included in a model, an infinity of measures have been tacitly excluded, exclusions made without benefit of hypothesis tests. Suppose in a regression the observable is patient weight loss, and the measures the usual list of medical and demographic states. One potential measure is the preferred sock color of the third nearest neighbor from the patient’s main residence. It is a silly measure because, we judge using outside common-sense knowledge, that this neighbor’s sock color cannot have any causal bearing on our patient’s weight loss. The point is not that nobody would add such a measure—nobody would—but that it could have been but was excluded without the use of hypothesis testing.

If we can exclude an infinity of hypotheses without hypothesis testing—based on causal decisions using probability notions, mostly—we can exclude the few more we put into a model without testing.

I’ve been getting emails from certain named persons in statistics who think they have found reasons to keep p-values (while mainly ignoring the arguments against them in the paper). A popular thrust is to say smart people wouldn’t use something dumb, like p-values. To which I respond that smart people do lots of dumb things. And voting doesn’t give truth.

I’m sympathetic to why p-values seem to work, sometimes. When they do, it’s because they either mimic predictive methods, or they already agree with the causal knowledge we have in place. That’s in the paper, too.
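To make the contrast between a p-value and a predictive probability concrete, here is a minimal sketch. It is not taken from the paper. It assumes simulated normal data with a tiny built-in difference between two groups, a large-sample z approximation for the p-value, and a "plug-in" predictive probability that treats the fitted normals as if they were the truth (so it ignores parameter uncertainty). The function names and the numbers 0.05 and 50,000 are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sided_p_value(a, b):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def plug_in_predictive_prob(a, b):
    """Pr(new observation from a > new observation from b), assuming both
    groups are normal with exactly the fitted means and standard deviations."""
    spread = sqrt(stdev(a) ** 2 + stdev(b) ** 2)
    # The difference of two independent normals is normal; ask how often it is > 0.
    return 1 - NormalDist(mean(a) - mean(b), spread).cdf(0)


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 50_000               # hypothetical sample size per group
treated = [rng.gauss(0.05, 1.0) for _ in range(n)]  # tiny assumed shift
control = [rng.gauss(0.00, 1.0) for _ in range(n)]

p = two_sided_p_value(treated, control)
prob = plug_in_predictive_prob(treated, control)

print(f"p-value: {p:.3g}")
print(f"Pr(new treated > new control): {prob:.3f}")
```

Under these assumptions the p-value comes out wee, yet the chance that a new treated observation beats a new control observation is barely better than a coin flip. The predictive number is stated in terms of observables and can be checked against new data; the p-value cannot.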

The moral of the story is: do not use p-values.


### 14 replies

1. My, you have been busy. Keep up the good work. You seem to be annoying all the right people.

Jaded consumer comment: The ebook version of “Uncertainty” costs more than the hardback plus shipping. Your publisher must use artisanal electrons. And the paperback costs twice what the hardback does. It’s amazing how much more it costs to produce a book with fewer and cheaper materials.

2. Thanks for this paper! Maybe you can correct the DOI to 10.1007/978-3-030-04200-4_2? Currently, the prefix is missing from your bibliographic reference and the DOI lookup does not work without that prefix.

3. Briggs says:

Arne,

Once I figure it out, will do (I just cut and pasted what Springer sent). Thanks.

McChuck,

The ebook is Amazon’s idea. Springer has it, too, at cheaper rates. See the book page. Pricing models used by all of them make no sense.

4. Ken says:

From the paper: “The p-value is (usually) the conditional probability an ad hoc test statistic being larger (in absolute value) than the observed statistic, assuming the null hypothesis is true, given the values of the observed data, and ASSUMING THE TRUTH OF THE MODEL.” [EMPHASIS added]

The assumption that a model is valid is often made implicitly (without any or very little thought), an issue raised here with models/papers/conclusions reached about “fuzzy” topics relating to social or behavioral themes, a favorite subject here.

In practice, in the so-called “hard” disciplines, in which I’ll include engineering, p-values have a value in model-building. Basically, if a model (e.g. suitable maintenance intervals for a machine under a particular operating profile) achieves a low p-value with sufficient data, the lesson is that collecting more data will not improve the model. That is important when data collection involves an expense(s).

That’s a sign to change a parameter(s) and see if the model predicts reasonably under the adjusted parameters (e.g. going from a typical homeowner’s commuting profile to a taxi’s profile to a police car’s profile in a model for engine maintenance planning).

IF the key factors have been properly included, modeled, and then integrated in a system-level model, then going from a family commuter profile (limited easy driving) to a police profile (extended idling with surges of hard acceleration and high speeds), the predictions should still align. Typically they do not, and this indicates where adjustments need to be made for the new scenarios. Those adjustments must also be back-tested to ensure compatibility with the earlier profiles, etc.

As such testing and data collection is typically expensive, low p-values indicate when further data collection will not be beneficial — and knowing when to not spend more money on useless additional data without adverse impact is always beneficial.

The value of [acceptable cost for] additional data is a statistical consideration notably absent from this blog and its focus.

Only after a model is refined can predictions within the scope of the model be accepted with confidence.

For “social justice” themes [a favorite of this blog] low p-values as an excuse to accept a theory are often [very often?] accepted to reach a desired conclusion…not out of an objective, but perhaps misguided, misunderstanding of their value. Put another way, the model assumed to be accurate is often the actual objective, so anything that validates the model is accepted blindly (at best) and too often out of deceit.

This is where a sequel to the paper is warranted: How people deceive (themselves and often willfully others) to abuse p-values. And why stick to p-values? The act of modeling, and use of meta-data to build an initial model, has plenty of abuse.

The human factor, and willful deceit, needs to be addressed directly. There’s no shortage of this. And where this occurs, it often becomes extremely costly to society: many of the investment “bubbles” were fueled by nonsensical speculation that many peddling hyped products knew were doomed. The mortgage-backed securities behind the mid-to-late 2000s real-estate “bubble” are but one such example. General Electric’s recent woes appear to be the result of a similar kind of executive malpractice.

5. DAV says:

“p-values have a value in model-building”

P-values tell you nothing about the model. At best they give a measure of correlation, and if the data are numerous they will almost always indicate correlation.

The Hypothesis Test was supposed to rule out causation, on the grounds that absence of correlation equals no causation, not to confirm it. The use of p-values has been subverted over time, and now the claim is the opposite (sufficient correlation equals likely causation) in medicine and the many ologies.

6. Bill_Raynor says:

Please post the raw data you use in your paper.