I’ve lost count of the lesson numbers.
The definition of a p-value, here phrased in terms of the incorrectly named “test of difference in means”, is:
Given the truth of a probability model used to represent the uncertainty in the observables in group A and in group B, and given that one or more of the parameters of those probability models are equal, and given that the “experiment” from which the data in A and B were collected were to be repeated a number of times approaching the limit, the p-value is the probability of seeing a statistic calculated from each of these repetitions larger (in absolute value) than the one we actually found.
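In symbols, a compressed restatement of that definition (with $T$ the statistic computed from a repetition, $t_{\text{obs}}$ the one we actually found, $M$ the probability model, and $\theta_A = \theta_B$ the equated parameters; the notation is mine, introduced only for this restatement):

\[
p \;=\; \Pr\!\left(\,|T| > |t_{\text{obs}}| \;\middle|\; M \text{ is true},\ \theta_A = \theta_B,\ \text{repetitions continued toward the limit}\,\right).
\]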
Incidentally, there is never any need to “test” for a difference in means, because means can be computed; that is, observed. You can tell at a glance whether they are different. The actual hypothesis test says something indirect about parameters. Anyway, if the parameters are equal, the models for the two groups are the same. In this case, the common model is called the “null” model (statisticians are rarely clever in naming their creations).
There are a number of premises in the p-value definition, some more controversial than others. Begin with the truth of the model.
It is rare to have deduced the model that represents the uncertainty in some observable. To deduce a model requires outside evidence or premises. Usually, this evidence is such that we can only infer that the model is probably true. And in even more cases, we know, based on the stated evidence, that the model is false.
Now, if this is so, if the model is known to be false, then the p-value cannot be computed. Oh, the mechanics of the calculation can still be performed. But the output taken in conjunction with the truth of the model is false, because the model is false. If the model, again based on the stated evidence, is only probably true, then the p-value can be calculated, but its truth and the probable truth of the model must be accounted for together, and they almost never are.
Equating the parameters is relatively uncontroversial, but small p-values are taken to mean the parameters are not equal; yet the p-value offers no help about how different they are. In any case, this is one of the exceedingly well-known standard objections to p-values, which I won’t rehearse here beyond the small sketch below. After that comes a better argument, unknown to most.
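Here is that sketch, a minimal simulation (assuming numpy and scipy are available, and using a two-sample t-test only as a stand-in for the usual procedure): the same practically negligible difference in means yields wildly different p-values depending on nothing but the sample size, so a small p-value cannot tell you how different the parameters are.

```python
# A minimal illustrative sketch: the same tiny true difference in means
# (0.01 standard deviations) gives very different p-values as the sample
# size grows, so the p-value by itself says nothing about HOW different
# the parameters are.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (50, 5_000, 500_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.01, scale=1.0, size=n)  # practically identical group
    result = stats.ttest_ind(a, b)
    print(f"n = {n:>7}: observed difference = {a.mean() - b.mean():+.4f}, "
          f"p = {result.pvalue:.4g}")
```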
Fisher often said something like this (you can find a modified version of this statement in any introductory book):
Belief in the null model as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null model is false, or the p-value has attained by chance an exceptionally low value.
Many have found this argument convincing. They should not have. First, this “logical disjunction” is evidently not one, because the first part of the sentence makes a statement about the unobservable null model, and the second part makes a statement about the observable p-value. But it is clear that there are implied missing pieces, and his quote can be fixed easily like this:
Either the null model is false and we see a small p-value, or the null model is true and we see a small p-value.
And this is just
Either the null model is true or it is false and we see a small p-value.
But since “Either the null model is true or it is false” is a tautology (it is always true), and since any tautology can be conjoined with any statement and not change its truth value, what we have left is
We see a small p-value.
Which is of no help at all. The p-value casts no direct light on the truth or falsity of the null model. This result should not be surprising: recall that Fisher himself argued that the p-value could not be used to deduce that the null was true; but if it cannot be used to deduce that the null is true, it cannot, logically, be used to deduce that it is false. That is, the p-value cannot falsify the null model (which was Fisher’s main hope in creating p-values).
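To see the collapse in symbols, write $N$ for “the null model is true” and $S$ for “we see a small p-value”; the repaired disjunction then reduces step by step, exactly as above:

\[
(\neg N \wedge S) \vee (N \wedge S) \;\equiv\; (N \vee \neg N) \wedge S \;\equiv\; \mathrm{T} \wedge S \;\equiv\; S.
\]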
Recall that making probabilistic statements about the truth value of parameters or models is forbidden in classical statistics. An argument pulled from classical theory illustrates this.
If the null model is true, then the p-value will be uniformly distributed between 0 and 1.
We see a small p-value.
——————————————————————————————————————-
The null model is false.
Under the null, the p-value is uniformly distributed (the first premise); which is another way of saying, “If the null is true, we will see any p-value whatsoever.” That we see any value thus gives no evidence for the conclusion.
Importantly, the first premise is not that “If the null model is true, then we expect a ‘large’ p-value,” because we clearly do not.
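The premise is easy to check by brute force. A minimal sketch, assuming numpy and scipy and using a normal model with a t-test purely as a stand-in for whatever null model is in play: draw both groups from one and the same distribution, so the null is true by construction, and the p-values spread themselves roughly evenly over 0 to 1, with about 5% of them falling below 0.05.

```python
# A minimal sketch: when the null is true by construction (both groups
# drawn from the very same normal distribution), the t-test p-value is
# roughly uniform on [0, 1]: "we will see any p-value whatsoever."
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])

# Roughly 5% of the p-values fall below 0.05, 10% below 0.10, and so on,
# even though the null model is true in every single repetition.
for cut in (0.05, 0.10, 0.50):
    print(f"fraction of p-values < {cut:.2f}: {np.mean(pvals < cut):.3f}")
```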
Since p-values—by design!—give no evidence about the truth or the falsity of the null model, it’s a wonder that their use ever caught on. But there is a good reason why they did. That’s for next time.
Did you come across John D Cook yet? He sounds much like you, sometimes. I wanted to know how to estimate the probability of an event that hasn’t happened (though I know it can). His answer’s here: http://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/
But, of course, I couldn’t resist wandering off after “Four Reasons to Use Bayesian Inference” and others.
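For anyone who does not chase the link: the “rule of three” described there says, roughly, that if an event has not occurred in $n$ independent trials, an approximate 95% upper bound on its probability is $3/n$, which falls out of solving $(1-p)^n = 0.05$ and noting that $-\ln 0.05 \approx 3$.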
Okay, this is one of those times when I think I might understand, but am not sure. If you (or your readers) would point out any flaws in my thinking/understanding, I’d be very much obliged.
First, two versions of “what a p-value is” just for fun:
(a) The typical “stats 101” understanding of the p-value is roughly “if the p-value is small, then the null hypothesis (usually equality of the means of two groups) is false with probability 1-p”.
(b) The slightly more sophisticated version is “if the assumptions underlying the model under which the p-value is computed are true, and the p-value is small, then the p-value is the probability of observing a statistic as large or larger than we actually observed if the null hypothesis were actually true. Of course, we never know that the assumptions are true, but we can have the software spit out tests of non-normality and heteroscedasticity etc., and unless those themselves come back with unusually large test statistics, we can be comfortable that the p-value is just about accurate (and ignore the apparent infinite regress of test statistics founded on assumptions that can only be evaluated with respect to other test statistics based on other assumptions).”
And (c) my version of your statement…
* “if the assumptions underlying the model under which the p-value is computed are true (Given the truth of a probability model used to represent the uncertainty in the observables in group A and in group B),
* and if the null hypothesis were actually true (“and given that one or more of the parameters of those probability models are equal”),
* then the p-value is the likelihood of obtaining a statistic as large or larger than we did in this experiment (“and given that the ‘experiment’ from which the data in A and B were collected were to be repeated a number of times approaching the limit, the probability of seeing a statistic calculated from each of these repetitions larger (in absolute value) than the one we actually found”).
I’m pretty sure that my understanding is equivalent through points one and two, but the third one requires that “given this experiment” is equivalent to “given that the ‘experiment’ from which the data in A and B were collected were to be repeated a number of times approaching the limit”. Logically, they aren’t necessarily equivalent. But is there a reason that the “repeated a number of times approaching the limit” is necessary? Assuming that we could rest easy on the say-so of Levene and Kolmogorov-Smirnov (or on whatever authority we base non-violation of the distributional assumptions of the model), why can’t we just let the theoretical distribution of test statistics stand in for the observed distribution as the experiment is “repeated a number of times approaching the limit”?
Thanks again for the fascinating blog. As you can see, I’m a partially reconstructed user of frequentist statistics, trying to get my head around Bayesian thinking.
A mathematician would say that this is lesson n+1, where n is the number of the previous lesson.
I remember p-values from college genetics class. I also remember being somewhat confused, and that it was kind of a cookbook process getting to your answer.
Ok, so you do a dihybrid cross with your little Drosophila in genetics lab. You know you’re supposed to have a proportion of 9:3:3:1. You wind up with 500 little flies and 33 of them are double recessive. From what I remember, you then do a calculation and check some kind of table, and then you know whether your cross conforms to the theory. Is the table where the p-values are?
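For what it’s worth, here is a minimal sketch of the calculation that table lookup stood in for, assuming (as is usual for this lab exercise) a Pearson chi-squared goodness-of-fit test against the 9:3:3:1 ratio. Only the 500 total flies and the 33 double recessives come from the comment above; the other three phenotype counts below are hypothetical, and scipy’s chi-squared routine plays the role of the printed table.

```python
# A hedged sketch of the cookbook step: Pearson chi-squared goodness-of-fit
# against the expected 9:3:3:1 dihybrid ratio. Only the total of 500 and the
# 33 double recessives come from the comment; the other three observed
# counts are hypothetical placeholders.
from scipy import stats

observed = [281, 97, 89, 33]                      # hypothetical except the 33; sums to 500
expected = [500 * r / 16 for r in (9, 3, 3, 1)]   # 281.25, 93.75, 93.75, 31.25

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {chi2:.3f}, p-value = {p:.3f}")

# The table in the lab manual lists critical chi-squared values for
# 3 degrees of freedom; checking your statistic against those critical
# values answers the same question the p-value printed here answers directly.
```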