Susan Holmes has done us a service by writing clearly the philosophy of the p-value in her new paper “Statistical Proof? The Problem of Irreproducibility” in the Bulletin of the American Mathematical Society (Volume 55, Number 1, January 2018, Pages 31–55).
The thesis of her paper, with which I am in the fullest possible agreement, is this: “Data currently generated in the fields of ecology, medicine, climatology, and neuroscience often contain tens of thousands of measured variables. If special care is not taken, the complexity associated with statistical analysis of such data can lead to publication of results that prove to be irreproducible.”
About how to fix the problem we disagree. I say it won’t be any kind of p-value, or p-value-like creation.
Here from the opening are the clear words:
Statisticians are willing to pay “some chance of error to extract knowledge” (J. W. Tukey [87]) using induction as follows.
If, given (A => B), then the existence of a small ε such that P(B) < ε tells us that A is probably not true.
This translates into an inference which suggests that if we observe data X, which is very unlikely if A is true (written P(X|A) < ε), then A is not plausible. [A footnote to this sentence is pasted next.]
We do not say here that the probability of A is low; as we will see in a standard frequentist setting, either A is true or not and fixed events do not have probabilities. In the Bayesian setting we would be able to state a probability for A.
I agree with her definition of the p-value. In notation, the words (of the third paragraph) translate to this:
(1) Pr(A|X & Pr(X|A) = small) = small.
The argument behind this equation is fallacious. To see why, first convince yourself the notation is correct.
I also agree—with a loud yes!—that under the theory of frequentism “fixed events do not have probabilities.”
But in reality, of course they do. Every frequentist acts as if they do when they say things like “A is not plausible”. Not plausible is a synonym for not likely, which is a synonym for of low probability. In other words, every time a frequentist uses a p-value, he makes a probability judgement, which is forbidden by the theory he claims to hold.
Limiting relative frequency, as we have discussed many times, and often in Uncertainty, is an incorrect theory of probability. But let that pass. Believe it if you like; say that singular events like A cannot have probabilities (which does follow from the theory), and then give A a (non-quantified) probability after all. Let’s pretend we do not see the inconsistency.
Let’s instead examine (1). It helps to have an example. Let A be the theory “There is a six-sided object that when activated must show one of the six sides, just one of which is labeled 6.” And, for fun, let X = “6 6s in a row.” Then Pr(X|A) = small, where “small” (about 2×10^-5) is much wee-er than the magic number. So we want to calculate
(1) Pr(A|6 6s on six-sided device & Pr(6 6s|A) = 2×10^-5) = ?
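A quick check of the arithmetic, assuming (as the usual statistical syllogism does, though A itself does not say so) that each side gets probability 1/6:

```python
# Probability of 6 6s in a row, assuming each side has probability 1/6
p_six_sixes = (1 / 6) ** 6
print(p_six_sixes)  # about 2.14e-05, far smaller than the magic 0.05
```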
Well, it should be obvious there is no (direct) answer to (1). Unless we draw out some implicit premises, or add new ones entirely.
The right-hand side (the givens) tells us that if we accept A as true, then 6 6s are a possibility; and so when we see 6 6s, if anything, it is evidence in favor of A’s truth. After all, something A said could happen did happen!
Another implicit premise might be that, in noticing we just rolled 6 6s in a row, we recognize there were other possibilities. We also notice we can’t identify the precise causes of the 6s showing, but understand the causes are related to standard physics. These implicit premises can be used to infer A.
We now come to the classic objection, which is that no alternative to A is given. A is the only thing going. Unless we add new implicit premises that give us a hint about something besides A. Whatever this premise is, it cannot be “Either A is true or something else is”, because that is a tautology, and in logic adding a tautology to the premises is like multiplying an equation by 1. It changes nothing.
Not only that, if you told a frequentist that you were rejecting A because you just saw 6 6s in a row, and that therefore “another number is due”, he’d probably accuse you of falling prey to the gambler’s fallacy. Again, we cannot expect consistency in any limiting relative frequency argument.
But what’s this about the gambler’s fallacy? That can only be judged were we to add more information to the right hand side of (1). This is the key. Everything we are using as evidence for or against A goes on the right hand side of (1). Even if it is not written, it is there. This is often forgotten in the rush to make everything mathematical.
In our case, to have any evidence of the gambler’s fallacy would entail adding evidence to the RHS of (1) that is similar to, “We’re in a casino, where I’m sure they’re real careful about the dice, replacing worn and even ‘lucky’ ones, and the way they make you throw the dice makes it next to impossible to control the outcome”. That’s only a small summary of a large thought. All evidence that points to A.
But what if we’re over on 34th street at Tannen’s Magic Store and we’ve just seen the 6 6s, or even 20 6s, or however many you like? The RHS of (1), for you in that situation, changes dramatically, adding possibilities other than A.
In short, it is not the observations alone in (1) that get you anywhere. It is the extra information you add that works the magic, as it were. And whatever you add to (1), (1) is no longer (1), but something else. If you understand that, you understand all. P-values are a dead end.
Bonus argument: A similar argument I wrote appears in many places, including in a new paper about which more another day:
Fisher said: “Belief in null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null is false, or the p-value has attained by chance an exceptionally low value.” Something like this is repeated in every elementary textbook.
Yet Fisher’s “logical disjunction” is evidently not one, since his either-or describes different propositions, i.e. the null and p-values. A real disjunction can however be found. Re-writing Fisher gives: Either the null is false and we see a small p-value, or the null is true and we see a small p-value. Or just: Either the null is true or it is false and we see a small p-value. Since “Either the null is true or it is false” is a tautology, and is therefore necessarily true no matter what, and because prefixing any argument with a tautology does not change that argument’s logical status, we are left with, “We see a small p-value.” The p-value thus casts no light on the truth or falsity of the null. Everybody knows this, but this is the formal proof of it.
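For those who like to see the reduction mechanically, here is a minimal brute-force sketch (mine, not part of the formal proof), treating the propositions as simple truth values:

```python
# Truth-table check: (null or not null) and small_p  <=>  small_p
from itertools import product

for null_is_true, small_p in product([True, False], repeat=2):
    tautology = null_is_true or (not null_is_true)   # always True
    rewritten_fisher = tautology and small_p         # "either the null is true or false, and we see a small p-value"
    assert rewritten_fisher == small_p               # reduces to "we see a small p-value"
print("Prefixing the tautology changes nothing.")
```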
Frequentist theory claims, assuming the truth of the null, we can equally likely see any p-value whatsoever, i.e. the p-value under the null is uniformly distributed. To emphasize: assuming the truth of the null, we deduce we can see any p-value between 0 and 1. And since we always do see any value, all p-values are logically evidence for the null and not against it. Yet practice insists small p-values are evidence the null is (likely) false. That is because people argue: For most small p-values I have seen in the past, I believe the null has been false; I now see a new small p-value, therefore the null hypothesis in this new problem is likely false. That argument works, but it has no place in frequentist theory (which anyway has innumerable other difficulties). It is the Bayesian-like interpretation.
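If you doubt the uniformity claim, a quick simulation sketch makes the point; the two-sample t-test setup here is my own assumption, not part of the argument:

```python
# Simulate p-values when the null is true: both samples come from the same
# normal distribution, so any difference is noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue
    for _ in range(10_000)
])

# Under the true null the p-values are (approximately) uniform on (0, 1):
# about 5% fall below 0.05, about 50% below 0.5, and so on.
print((pvals < 0.05).mean(), (pvals < 0.5).mean())
```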
The decisions made using p-values are thus an “act of will”, as Neyman criticized, not realizing his own method of not-rejecting and rejecting nulls had the same flaw.
What to use instead? Pure probability, baby. See our class for examples. Or read all about it in Uncertainty.
Here’s another view: p-values are not about the null hypothesis, the probability of which is sometimes not of much interest when we feel sure it is zero for all practical purposes. The p-value is merely one modest device for assessing the evidence in a sample of data, regardless of one’s prior beliefs. Assuming the various conditions have been met (generally the greatest weakness of this exercise imho), then the lower the p-value, the more support for the sample as a piece of evidence in support of the alternative hypothesis. That’s not much, but it is something.
This blog needs reference sections in which key essays of a given topic are indexed and made available for quick easy reference (implying by “reference” such sections to be limited to truly useful content, not necessarily all on the given topic).
Wonderful: “And since we always do see any value, all p-values are logically evidence for the null and not against it.”
But: “Again, we cannot expect consisting in any limiting relative frequency argument.” The word ‘consistency’ is meant instead of ‘consisting’ in that sentence.
On a related topic: I begin to think that you should consider declaring forthrightly that the expression “the null” has no more meaning than the expression “due to ‘chance'”. After all, in common parlance, “the null” appears to have no meaning, aside from “due to ‘chance'”.
The preceding paragraph was prompted by this: “…therefore the null hypothesis in this new problem is likely false. That argument works….” But how can an argument ‘work’ when it is about a term (“the null hypothesis”) that is devoid of meaning, at least as the term is commonly understood both by Bayesians and frequentists?
“The null” can only mean: the imponderable infinity of other possible causes besides the ones I stated in my model.
Providing that definition of “the null” upfront is, I think, the only sure way to avoid generating difficulties in argument.
One of your meta-arguments is that the former ways of thinking about statistics self-generate dilemmas, paradoxes, and confusions that can not be resolved within their own conceptual frameworks.
Thus, you confront, among other things, the problem of translatability. What you are saying can not be completely translated into either frequentist or Bayesian terms. Their problematics and terminology, while partly valuable, are fundamentally inadequate to your task.
The philosopher Alasdair MacIntyre, particularly in “Whose Justice? Which Rationality?” if I remember, did some valuable work in that area. One crucial task of a rival inquiry is to understand the former inquiry better than the former inquiry’s proponents can. That is, the new inquiry can explain why it is that the former inquiry can not, and will not be able to, adequately resolve its own professed problematic, on its own terms. (The work you and others have done regarding frequentism’s built-in inconsistencies is a shining example of this.)
I am arguing that sometimes, you just have to bite the bullet, and refuse the former terminology entirely. Yes, you want to build bridges, but no, you don’t want to defeat your purpose or muddy your argument, either. I think you should get more serious about (and the pun works) always and everywhere rejecting (the term) “the null hypothesis”.
Ken,
Excellent point. I’ve tried to start that at the About & Classics page. It needs work.
JohnK.
Another typo!
No! No! No!… You are still confusing the p-value with statistical significance. They are different things. (See a video here [ https://youtu.be/lZQcYQz_zUc ] and a small article here [ https://doi.org/10.3389/fpsyg.2017.01434 ]).
Indeed, the p-value can accommodate the claimed tautology: “Either the null is false and we see a small p-value, or the null is true and we see a small p-value. Or just: Either the null is true or it is false and we see a small p-value.” This is so because the p-value is rather a descriptive statistic of how your sample scored assuming the null is true (i.e., the “tautology” is subsumed, and thus evaporates, into “the null is true and we see a small p-value”; see, for example, https://doi.org/10.3389/fpsyg.2015.00341 ).
Statistical significance is an external decision (including whether we are following convention rather than actively deciding) whereby we set up a threshold for calling a result significant or not. That is, statistical significance helps us build a Modus Tollens. The p-value is merely a proxy to locate the research result on either side of the significance threshold. Therefore, either the result is significant or not. If significant, we take the null to be false. If not, we make no statement in favor (Fisher) or we act as if, i.e. take, the null to be true (Neyman-Pearson, provided we also have adequate power).
Notice also that the Modus Tollens works with true, certain hypotheses, not probabilistic ones. Furthermore, because of the independence between p-values and significance, calling a p-value ‘significant’ needs the significance context as a reference: i.e., p = 0.05 may not be significant if the researcher is using 1% as the level of significance (or 5‰, or whatever).
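A minimal sketch of that thresholding step, with an assumed p-value of 0.03 purely for illustration:

```python
# The same p-value crosses one threshold but not another; the "significance"
# verdict depends entirely on the externally chosen alpha.
p_value = 0.03
for alpha in (0.05, 0.01):
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"alpha = {alpha}: p = {p_value} is {verdict}")
```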
Jose,
Nope, no confusion. Spoke of these things long and often (and in Uncertainty).
Worse, you can’t get a Modus Tollens from a fallacious argument. The p-value isn’t what you think it is; it is merely an act of will, as is “statistical significance.”
In action, what happens is people reason in a Bayesian/predictive way. All this I’ve discussed elsewhere. As Ken pointed out, check out the About & Classics page for more material.
Hmmm… wmbriggs.com/post/20492/ – Pascal’s Mugging Is Silly: Events Don’t “Have” Probabilities
A = “There is a six-sided object that when activated must show one of the six sides, just one of which is labeled 6.”
What is meant by “in favor of A’s truth”?
Contrapositively, it is not evidence in favor of A’s truth when we don’t see six 6s. Right? One has to ask:
(1) How about when we see zero 6s, or one 6, or two 6s, or three 6s, or four 6s, or five 6s?
(2) Does the above statement also hold when we see twenty 6s?! Say someone offers a gambling game, in which (1) you lose your bet if he rolls a 6, and (2) you keep your bet and he pays you an amount equivalent to your bet if he doesn’t roll a 6. After losing 20 times, would you continue to play the game?
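A quick calculation, assuming as before that each side of the device gets probability 1/6, puts a number on those 20 losses:

```python
# Probability of the house rolling a 6 twenty times in a row, under A
p_twenty_losses = (1 / 6) ** 20
print(p_twenty_losses)  # roughly 2.7e-16
```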
It is true that an inference that A can be rejected as improbable doesn’t offer any alternative. It would be nice if a method could answer all questions. However, such an inference can be sufficient for certain decision-making.
The paper by Holmes is worth reading. It includes great examples. Thank you for bringing it to my attention.
“What to use instead? Pure probability, baby”. I agree, if probability is what you are after.
But I also think that your (1) is a misrepresentation because it includes the Pr. Holmes is, with good reason, careful to talk about this using “implausible”. To me “implausible” is not what you state above:
“Not plausible is a synonym for not likely, which is a synonym for of low probability.”
To me this makes your argument semantic, or at least not a “proof”.
To explicate: say X is that someone says she observed 1 tail when tossing a fair coin, and we know that some A is responsible. Say A’ is “A series of 100 heads were thrown in a row with that coin” and A” is “99 tails were thrown in a row with that coin”. Both A’ and A” have very low probability indeed by themselves, regardless of any evidence, but A” is more plausible to be true given X, even if A” is still incredibly unlikely (and in reality A”’, which was “1 toss with a fair coin”, is the reason for X).
So to my understanding (1) should read: (A is TRUE | X & Pr(X|A) = small) = implausible, which is the decision made in “statistical significance”, not the (1) you write. If your (1) were what one is after, then yes, pure probability, baby.