A review. We have sales data from two campaigns, A and B, data for which we choose (as a premise) to quantify our uncertainty with normal distributions. We assume the “null” hypothesis that the parameters of these two distributions are equal: mA = mB and sA = sB. This says that our uncertainty in sales at A or B is identical. It does not say that A and B are “the same” or that “there is no difference” between A and B.
All that is step one of hypothesis testing. Now step two: choose a “test statistic.” This is any function of the data you like. The most popular, in this situation, is some form of the “t-statistic” (there is more than one form). Call our statistic “t”. But you are free to choose one of many, or even make up your own. There is nothing in hypothesis testing theory which requires picking this and not that statistic.
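To make the "more than one form" point concrete, here is a minimal sketch of two common versions of the t-statistic: the pooled (Student) form and the unequal-variance (Welch) form. The sales figures and the function names are invented for illustration and are not the campaign data discussed in this series.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    # Unbiased sample variance (divides by n - 1).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(a, b):
    # Student's form: one pooled variance estimate for both groups.
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def welch_t(a, b):
    # Welch's form: each group keeps its own variance estimate.
    se = math.sqrt(sample_var(a) / len(a) + sample_var(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical sales figures, made up for this example.
sales_a = [12.1, 9.8, 11.4, 10.7, 13.0, 10.2]
sales_b = [13.5, 12.9, 14.8, 11.7]

print("pooled t:", pooled_t(sales_a, sales_b))
print("Welch t: ", welch_t(sales_a, sales_b))
```

Same data, same "null," two different numbers. The two forms coincide when the groups have equal sample sizes and part ways otherwise, which is one small instance of the free choice described above.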
Incidentally, there are practical (and legal) implications over this free choice of test statistic. See this old post for how different test statistics for the same problem were compared in the Wall Street Journal.
Finally, calculate this object:
(4) Pr( |T| > |t| | “null” true, normals, data, statistic)
This is the p-value. In words, it is the probability of seeing a test statistic (T) larger (in absolute value) than the test statistic we actually saw (t) in infinite repetitions of the “experiment” that gave rise to our data, given the “null” hypothesis is exactly true, that normal distributions are the right choice, the actual data we saw, and the statistic we used.
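The "infinite repetitions" in that definition can be imitated, though only approximately, by a large finite number of simulated experiments. The sketch below assumes the pooled form of the t-statistic and invented sales figures. It also draws the repeated "experiments" from a standard normal, which is adequate only because the pooled t-statistic does not change when both groups are shifted and rescaled together, so under the "null" the common m and s drop out. The function names, the number of repetitions, and the seed are all arbitrary choices made for illustration.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(a, b):
    # Student's two-sample t-statistic with a pooled variance estimate.
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def simulated_p_value(a, b, reps=10000, seed=1):
    # Fraction of simulated "null" experiments whose |T| exceeds the observed |t|.
    rng = random.Random(seed)
    t_obs = abs(pooled_t(a, b))
    na, nb = len(a), len(b)
    exceed = 0
    for _ in range(reps):
        # Premises made explicit: "null" true (same m, same s), normals.
        fake_a = [rng.gauss(0.0, 1.0) for _ in range(na)]
        fake_b = [rng.gauss(0.0, 1.0) for _ in range(nb)]
        if abs(pooled_t(fake_a, fake_b)) > t_obs:
            exceed += 1
    return exceed / reps

# Hypothetical sales figures, made up for this example.
sales_a = [12.1, 9.8, 11.4, 10.7, 13.0, 10.2]
sales_b = [13.5, 12.9, 14.8, 11.7]

print("approximate p-value:", simulated_p_value(sales_a, sales_b))
```

Every premise on the right-hand side of (4) shows up as a line of code: the "null," the normals, the data, and the statistic. Swap any one of them and the number changes.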
There is no way to make this definition pithy—without sacrificing accuracy. Which most do: sacrifice accuracy, that is. Although it does a reasonable job, Wikipedia, for instance, says, “In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This leaves out the crucial infinite repetitions, and the premises of the distribution and test statistic we used. In frequentist definitions of probability, it is always infinity or bust. Probabilities just do not exist for unique or finite events (of course, people always assume that these probabilities exist; but that is because they are natural Bayesians).
Now there has developed a tradition that whenever (4) is less than the magic number, by an act of sheer will, you announce “I reject the null hypothesis,” which is logically equivalent to saying, “I claim that mA does not equal mB” (let’s, as nearly everybody does, just ignore sA and sB).
The magic, never-to-be-questioned number is 0.05, chosen, apparently, by God Himself. If (4) is less than 0.05 you are allowed to claim “statistical significance.” This term means only that (4) is less than 0.05—and nothing else.
There is no theory which claims that 0.05 is best, or that links the size of (4) with the rejection of the “null.” Before we get to that, understand that if (4) is larger than the magic number you must announce, “I fail to reject the ‘null’” but you must never say, “I accept the ‘null.’” This contortion is derived from R.A. Fisher’s love of Karl Popper’s “falsifiability” ideas, ideas which regular readers will recall no longer have any champions among philosophers.
This “failing to reject” is just as much an act of will as “rejecting the ‘null’” was when (4) was less than 0.05. Consider: if I say, as I certainly may say, “mA does not equal mB” I am adding a premise to my list, but this is just as much an act of my will as adding the normals, etc., was. (4) is not evidence that “mA does not equal mB”. That is, given (4) the probability of “mA does not equal mB” cannot be computed. In fact, it is forbidden (in frequentist theory) to even attempt to calculate this probability. Let’s be clear. We are not allowed to even write
(5) Pr ( “mA does not equal mB” | (4) ) = verboten!
This logically implies, and it is true, that the size of (4) has no relation whatsoever to the proposition “mA does not equal mB.” (See this paper for formal proofs of this.) This is what makes it an act of will that we either declare “mA does not equal mB” or “mA equals mB.”
But, really, why would we want to compute (5) anyway? The customer really wants to know
(6) Pr ( B continuing better than A | data ).
There is nothing in there about unobservable parameters or test statistics, and why should there be? We learn to answer (6) later.
But before we go, let me remind you that we have only begun criticisms of p-values and hypothesis testing. There are lists upon lists of objections. Before you defend p-values, please read through this list of quotations.