Teaching Journal: Day 9—Hypothesis Testing: Part II

A review. We have sales data from two campaigns, A and B, and we choose (as a premise) to quantify our uncertainty in those data with normal distributions. We assume the “null” hypothesis that the parameters of these two distributions are equal: mA = mB and sA = sB. This says that our uncertainty in sales at A or B is identical. It does not say that A and B are “the same” or “there is no difference” in A and B.

All that is step one of hypothesis testing. Now step two: choose a “test statistic.” This is any function of the data you like. The most popular, in this situation, is some form of the “t-statistic” (there is more than one form). Call our statistic “t”. But you are free to choose one of many, or even make up your own. There is nothing in hypothesis testing theory which requires picking this and not that statistic.
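To make the free choice concrete, here is a minimal sketch of three candidate statistics computed on the same data. The sales figures are invented for illustration, and the function names (welch_t, pooled_t, median_gap) are my own labels, not anything from a library. The first two are common forms of the t-statistic; the third is a made-up alternative of the "make up your own" kind.

```python
import math
import statistics


def welch_t(a, b):
    """Welch form: difference in means over an unpooled standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def pooled_t(a, b):
    """Pooled form: same numerator, but one shared variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


def median_gap(a, b):
    """A home-made statistic: the difference in sample medians."""
    return statistics.median(a) - statistics.median(b)


# Hypothetical sales figures (assumed for illustration only).
sales_a = [12.0, 15.0, 11.0, 14.0, 13.0]
sales_b = [16.0, 18.0, 15.0, 19.0]

for stat in (welch_t, pooled_t, median_gap):
    print(stat.__name__, round(stat(sales_a, sales_b), 4))
```

Same data, three different numbers. Each statistic leads to its own version of the calculation that follows, and the theory does not say which one to prefer.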

Incidentally, there are practical (and legal) implications of this free choice of test statistic. See this old post for how different test statistics for the same problem were compared in the Wall Street Journal.

Finally, calculate this object:

     (4) Pr( |T| > |t|   | “null” true, normals, data, statistic)

This is the p-value. In words, it is the probability of seeing a test statistic (T) larger (in absolute value) than the test statistic we actually saw (t) in infinite repetitions of the “experiment” that gave rise to our data, given the “null” hypothesis is exactly true, that normal distributions are the right choice, the actual data we saw, and the statistic we used.

There is no way to make this definition pithy—without sacrificing accuracy. Which most do: sacrifice accuracy, that is. Although it does a reasonable job, Wikipedia, for instance, says, “In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This leaves out the crucial infinite repetitions, and the premises of the distribution and test statistic we used. In frequentist definitions of probability, it is always infinity or bust. Probabilities just do not exist for unique or finite events (of course, people always assume that these probabilities exist; but that is because they are natural Bayesians).
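The repetitions in (4) can be imitated, though only finitely, by simulation. What follows is a rough sketch under stated assumptions: the sales figures are invented, the pooled t-statistic is my choice of statistic, and a fixed number of simulated "experiments" stands in for the infinite sequence the definition demands. Because the pooled t-statistic is unchanged by shifting or rescaling both samples together, drawing from a standard normal is enough to represent "null true, normals."

```python
import math
import random


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def pooled_t(a, b):
    """Pooled two-sample t-statistic."""
    na, nb = len(a), len(b)
    ma, va = mean_and_variance(a)
    mb, vb = mean_and_variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def simulated_p_value(a, b, reps=20000, seed=0):
    """Fraction of simulated null experiments with |T| > |t|.

    Each repetition draws both samples from the same normal distribution
    (the "null" and the normality premise), keeping the observed sample sizes.
    """
    rng = random.Random(seed)
    t_observed = abs(pooled_t(a, b))
    exceed = 0
    for _ in range(reps):
        sim_a = [rng.gauss(0.0, 1.0) for _ in a]
        sim_b = [rng.gauss(0.0, 1.0) for _ in b]
        if abs(pooled_t(sim_a, sim_b)) > t_observed:
            exceed += 1
    return exceed / reps


# Hypothetical sales figures (assumed for illustration only).
sales_a = [12.0, 15.0, 11.0, 14.0, 13.0]
sales_b = [16.0, 18.0, 15.0, 19.0]

print(simulated_p_value(sales_a, sales_b))
```

Every premise listed in (4) appears in the code: the "null" and the normals in how the fake data are drawn, the data in t_observed, and the statistic in pooled_t. Change any one of them and the number changes.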

Now there has developed a tradition that whenever (4) is less than the magic number, by an act of sheer will, you announce “I reject the null hypothesis,” which is logically equivalent to saying, “I claim that mA does not equal mB” (let’s, as nearly everybody does, just ignore sA and sB).

The magic, never-to-be-questioned number is 0.05, chosen, apparently, by God Himself. If (4) is less than 0.05 you are allowed to claim “statistical significance.” This term means only that (4) is less than 0.05—and nothing else.

There is no theory which claims that 0.05 is best, or that links the size of (4) with the rejection of the “null.” Before we get to that, understand that if (4) is larger than the magic number you must announce, “I fail to reject the ‘null’” but you must never say, “I accept the ‘null.’” This contortion is derived from R.A. Fisher’s love of Karl Popper’s “falsifiability” ideas, ideas which regular readers will recall no longer have any champions among philosophers.

This “failing to reject” is just as much an act of will as “rejecting the ‘null’” was when (4) was less than 0.05. Consider: if I say, as I certainly may say, “mA does not equal mB” I am adding a premise to my list, but this is just as much an act of my will as adding the normal etc. was. (4) is not evidence that “mA does not equal mB”. That is, given (4) the probability “mA does not equal mB” cannot be computed. In fact, it is forbidden (in frequentist theory) to even attempt to calculate this probability. Let’s be clear. We are not allowed to even write

     (5) Pr ( “mA does not equal mB” | (4) ) = verboten!

This logically implies, and it is true, that the size of (4) has no relation whatsoever to the proposition “mA does not equal mB.” (See this paper for formal proofs of this.) This is what makes it an act of will that we either declare “mA does not equal mB” or “mA equals mB.”

But, really, why would we want to compute (5) anyway? The customer really wants to know

     (6) Pr ( B continuing better than A | data ).

There is nothing in there about unobservable parameters or test statistics, and why should there be? We learn to answer (6) later.

But before we go, let me remind you that we have only begun criticisms of p-values and hypothesis testing. There are lists upon lists of objections. Before you defend p-values, please read through this list of quotations.


  1. This reminds me of philosophical discussions of the basis of quantum mechanics. The standard reply is “shut up and calculate”.

  2. This all looks fine to me except for two things:

    One is that “the crucial infinite repetitions” only applies if one has a strictly “frequentist” definition of probability – which (despite your frequent references to it) is a philosophical position which I do NOT see adopted by the vast majority of mathematical statisticians – even those who prefer a traditional non-Bayesian approach to inference.

    The other is that any correctly formulated claim of “statistical significance” includes a statement of the significance level – which does NOT always have to be 5%.

    I appreciate your concern with providing what the “customer really wants to know”, and look forward to seeing whether or not you can do it. I am not totally without hope on this but it is still not clear to me that your notion of logical probability really makes sense – ie that the “statistical syllogism” in your paper on “non-arbitrary assignment of equi-probable priors” applies generally enough to cover realistic situations.

    In any case if it works it will stand more strongly when presented on its own merits rather than by comparison with a false representation of the alternative.

  3. I had similar thoughts as Alan. I have met and worked with many statisticians, but never have encountered one who described himself as a frequentist. Modern statisticians ground their analyses in probability theory, in the sense of Kolmogorov, and don’t rely on infinite coin flips and the like.

    On the other hand, while p-values are cited as Alan says, there really is a bizarre aura about p=.05 out there.

  4. On the infinity thing, I’ve always understood (that’s an assumption right there) that the point of the (frequentist) exercise was to estimate population parameters from sample parameters in order to be able to say something about the whole population while having only a sample from it. And the problem arises because some populations are by definition infinite, so “this is the answer we would have got if we’d sampled the whole population” means “if we did an infinite number of tests”.

    “That’s what I think” said Pooh. “But I expect I’m wrong” he said.
