William M. Briggs

Statistician to the Stars!

Page 150 of 414

Teaching Journal: Day 9—Hypothesis Testing: Part II

A review. We have sales data from two campaigns, A and B, data in which we choose (as a premise) to quantify our uncertainty with normal distributions. We assume the “null” hypothesis that the parameters of these two distributions are equal: mA = mB or sA = sB. This says that our uncertainty in sales at A or B is identical. It does not say that A and B are “the same” or “there is no difference” in A and B.

All that is step one of hypothesis testing. Now step two: choose a “test statistic.” This is any function of the data you like. The most popular, in this situation, is some form of the “t-statistic” (there is more than one form). Call our statistic “t”. But you are free to choose one of many, or even make up your own. There is nothing in hypothesis testing theory which requires picking this and not that statistic.

Incidentally, there are practical (and legal) implications over this free choice of test statistic. See this old post for how different test statistics for the same problem were compared in the Wall Street Journal.

Finally, calculate this object:

     (4) Pr( |T| > |t|   | “null” true, normals, data, statistic)

This is the p-value. In words, it is the probability of seeing a test statistic (T) larger (in absolute value) than the test statistic we actually saw (t) in infinite repetitions of the “experiment” that gave rise to our data, given the “null” hypothesis is exactly true, that normal distributions are the right choice, the actual data we saw, and the statistic we used.

There is no way to make this definition pithy—without sacrificing accuracy. Which most do: sacrifice accuracy, that is. Although it does a reasonable job, Wikipedia, for instance, says, “In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This leaves out the crucial infinite repetitions, and the premises of the distribution and test statistic we used. In frequentist definitions of probability, it is always infinity or bust. Probabilities just do not exist for unique or finite events (of course, people always assume that these probabilities exist; but that is because they are natural Bayesians).

Now there has developed a traditional that whenever (4) is less than the magic number, by an act of sheer will, you announce “I reject the null hypothesis,” which is logically equivalent to saying, “I claim that mA does not equal mB” (let’s, as nearly everybody does, just ignore sA and sB).

The magic, never-to-be-questioned number is 0.05, chosen, apparently, by God Himself. If (4) is less than 0.05 you are allowed to claim “statistical significance.” This term means only that (4) is less than 0.05—and nothing else.

There is no theory which claims that 0.05 is best, or that links the size of (4) with the rejection of the “null.” Before we get to that, understand that if (4) is larger than the magic number you must announce, “I fail to reject the ‘null’” but you must never say, “I accept the ‘null.’” This contortion is derived from R.A. Fisher’s love of Karl Popper’s “falsifiability” ideas, ideas which regular readers will recall no longer have any champions among philosophers.

This “failing to reject” is just as much an act of will as “rejecting the ‘null’” was when (4) was less than 0.05. Consider: if I say, as I certainly may say, “mA does not equal mB” I am adding a premise to my list, but this is just as much an act of my will as adding the normal etc. was. (4) is not evidence that “mA does not equal mB“. That is, given (4) the probability “mA does not equal mB” cannot be computed. In fact, it is forbidden (in frequentist theory) to even attempt to calculate this probability. Let’s be clear. We are not allowed to even write

     (5) Pr ( “mA does not equal mB” | (4) ) = verboten!

This logically implies, and it is true, that the size of (4) has no relation whatsoever to the proposition “mA does not equal mB.” (See this paper for formal proofs of this.) This is what makes it an act of will that we either declare “mA does not equal mB” or “mA equals mB.”

But, really, why would we want to compute (5) anyway? The customer really wants to know

     (6) Pr ( B continuing better than A | data ).

There is nothing in there about unobservable parameters or test statistics, and why should there be? We learn to answer (6) later.

But before we go, let me remind you that we have only begun criticisms of p-values and hypothesis testing. There are lists upon lists of objections. Before you defend p-values, please read through this list of quotations.

Teaching Journal: Day 8—Hypothesis Testing: Part I

Hypothesis testing nicely encapsulates all that is wrong with frequentist statistics. It is a procedure which hides the most controversial assumption/premise. It operates under a “null” belief which nobody believes. It is highly ad hoc and blatantly subjective. It incorporates magic p-values. And it ends all with a pure act of will.

Here is how it works. Imagine (no need, actually: go to the book page and download the advertising.csv file and follow along; to learn to use R, read the book, also free) you have run two advertising campaigns A and B and are interested in weekly sales under these two campaigns. I rely on you to extend this example to other areas. I mean, this one is simple and completely general. Do not fixate on the idea of “advertising.” This explanation works equally well on any comparison.

I want to make the decision which campaign, A or B, to use country-wide and I want to base this decision on 20 weeks of data where I ran both campaigns and collected sales (why 20? it could have been any number, even 1; although frequentist hypothesis testing won’t work with just one observation each; another rank failure or the theory).

Now I could make the rule that whichever campaign had higher median sales is the better. This was B. I could have also made the rule that whichever campaign had higher third-quartile sales is better. This was A. Which is “better” is not a statistical question. It is up to you and the relates to the decisions you will make. So I could also rule that whichever had the higher mean sales was better. This was B. I could have made reference, too, to a fixed number of sales, say 500. Whichever had a greater percentage of sales greater than 500 was “better.” Or whatever else made sense to the bottom line.

Anyway, point is, if all I did was to look at the old data and make direct decisions, I do not need probability or statistics (frequentist or Bayesian). I could just look at the old data and make whatever decision I like.

But that action comes with the implicit premise that “Whatever happened in the past will certainly happen in likewise characteristic in the future.” If do not want to make this premise, and usually I don’t, I then need to invoke probability and ask something like, “Given the data I observed, what is the probability that B will continue to be better than A” if by “better” I mean higher median or mean sales. Or “A will continue to be better” if by “better” I meant higher third-quartile sales.” Or whatever other question makes sense to me about the observable data.

Hypothesis testing (nearly) always by assuming that we can quantify our uncertainty in the outcome (here, sales) with normal distributions. When I say “(nearly) always” I mean statistics as she is actually practiced. This “normality” is a mighty big assumption. It is usually false on the premise that, like here, sales cannot be less than 0. Often sociologists and the like ask questions which force answers from “1 to 5″ (which they magnificently call a “Likert scale”). Two (or more) groups will answer a question, and the uncertainty in the mean of each group is assumed to follow a normal distribution. This is usually wildly false, given that, as we have just said, the numbers cannot be smaller than 1 nor larger than 5.

Normal distributions, then, are often wrong, and often wrong by quite a lot. (And if you don’t yet believe this, I’ll prove it with real data later.) This says that hypothesis testing starts badly. But ignore this badness, or the chance of it, like (nearly) everybody else does and let’s push on.

If it is accepted that our uncertainty in A is quantified by a normal distribution with parameters mA and sA, and similarly B with mB and mB, then the “null” hypothesis is that mA = mB and (usually, but quietly) sA = sB.

Stare at this and be sure to understand what it implies. It DOES NOT say that “A and B are the same.” It says our uncertainty in A and B is the same. This is quite, quite different. Obviously—as in obviously—A and B are not the same. If they were the same we could not tell them apart. This is not, as you might think, a minor objection. Far from it.

Suppose it were true that as the “null” says mA = mB (exactly, precisely equal). Now if sA were not equal to sB, then our uncertainty in A and B can be very different. It could be, depending on the exact values of sA and sB, that the probability of higher sales under A was larger than B, or the opposite could also be true. Stop and understand this.

Just saying something about the central parameters m does not tell us enough, not nearly enough. We need to know what is going on with all four parameters. This is why if we assume that mA = mB we must also assume that sA = sB.

The kicker is that we can never know whether mA = mB or sA = sB; no, not even for a Bayesian. These are unobservable, metaphysical parameters. This means they are unobservable. As in “cannot be seen.” So what do we do? Stick around and discover.

Teaching Journal: Day 7

The joke is old and hoary and so well known that I risk the reader’s ire for repeating it. But it contains a damning truth.

Most academic statistical studies are like a drunk searching for his keys under a streetlight. He looks there not because that is where he lost his keys, but because that is where the light is.

To prove this comes these four quotations from Jacqueline Stevens, professor of political science at Northwestern University (original source):

In 2011 Lars-Erik Cederman, Nils B. Weidmann and Kristian Skrede Gleditsch wrote in the American Political Science Review that “rejecting ‘messy’ factors, like grievances and inequalities,” which are hard to quantify, “may lead to more elegant models that can be more easily tested…”

Professor Tetlock’s main finding? Chimps randomly throwing darts at the possible outcomes would have done almost as well as the experts…

Research aimed at political prediction is doomed to fail. At least if the idea is to predict more accurately than a dart-throwing chimp…

I look forward to seeing what happens to my discipline and politics more generally once we stop mistaking probability studies and statistical significance for knowledge.

If our only evidence is that “Some countries which face economic injustice go to war and Country A is a country which faces economic injustice” then given this the probability that “Country A goes to war” is some number between 0 and 1. And not only is this the best we can do, but it is all we can do. It becomes worse when we realize the vagueness of the term “economic injustice.”

I mean, if we cannot even agree on the implicit (there, but hidden) premise “Economic injustice is unambiguously defined as this and such” we might not even be sure that Country A actually suffers economic injustice.

But supposing we really want to search for the answer to the probability that “Country A goes to war”, what we should not do is to substitute quantitative proxies just to get some equations to spit out numbers. This is no different than a drunk searching under the streetlight.

The mistake is in thinking that not only that all probabilities are quantifiable (which they are not), but that all probabilities should be quantified, which leads to false certainty. And bad predictions.

Incidentally, Stevens also said, “Many of today‚Äôs peer-reviewed studies offer trivial confirmations of the obvious and policy documents filled with egregious, dangerous errors.”

Modeling, which we begin today in a formal sense, is no different than what we have been doing up until now: identifying propositions which we want to quantify the uncertainty of, then identifying premises which are probative of this “conclusion.” As the cautionary tale by Stevens indicates, we must not seek quantification just for the sake of quantification. That is the fundamental error.

A secondary error we saw developed at the end of last week: substituting knowledge about parameters of probability models as knowledge of the “conclusions.” This error is doubled when we realize that the probability models should often not be quantified in the first place. We end up with twice the overconfidence.

Now, if our model and data are that “Most Martians wear hats and George is a Martian” the probability of “George wears a hat” is greater than 1/2 but less than 1. That is the best we can do. And even that relies on the implicit assumption about the meaning of the English word “Most” (of course, there are other implicit assumptions, including definitions of the other words and knowledge of the rules of logic).

This ambiguity—the answer is a very wide interval—is intolerable to many, which is why probability has come to seem subjective to some and why others will quite arbitrarily insert and quantifiable probability model in place of “Most…”

It’s true that both these groups are free to add to the premises such that probabilities of the conclusions do become hard-and-fast numbers. We are all free to add any premises we like. But this makes the models worse in the sense that they match reality at a rate far less than the more parsimonious premises. That, however, is a topic for another day.


Read about all this. More is to come. In another hurry today. Get your data in hand by end of the day. Look for typos.

Teaching Journal: Day 6

(I’m assuming you have been reading previous posts. If not, do so.)

We still want this:

     (1) Pr (Distance > 1 meter | normal with m and s specified) = something

Actually, we don’t; not really. We want somebody to tell us (1) or something like it. The customer doesn’t really care that it was a normal distribution that was used. What we really want are the exact list of premises which all us to say

     (2) Pr (Distance > 1 meter | oracular premises) = 0 or 1

or, that is, we want the oracular premises which tell us the precise distance the boule will be from the cochonette. We want this:

     (2′) Pr (Distance = x meters | oracular premises) = 1

where the x is filled in. But oracular premises don’t exist for most of life. We have to suffice ourselves with something less. This is why we can live with the premise that our uncertainty in the distance is quantified by a normal (or some other) distribution.

We can of course say, “It isn’t really a normal distribution” but this is a conclusion from probability argument, and as we recall all probability propositions are conditional on premises. What are the premises which tell us “It isn’t really a normal distribution” is true? Well, these are easy: we have them (look in the book; Chapter 4). Call this list NN (for “not normal”). That is, Given NN, it is true that “It isn’t really a normal distribution.”

But we do not list NN in (1), (2), or (2′). If we did, we could not compute any numbers. The premises would be self-negating. Just as we do not add the premise “There are no Martians” to the argument “All Martians wear hats and George is a Martian.” Well, we could add it of course. It is up to us, as adding any premise to a list in an argument is always up to us. But the point is this: Given just the original “All Martians…” the conclusion “George wears a hat” is deduced (and is probability 1). And given just the “We use a normal with a specified m and s” the probability the “Distance > 1 meter” is deduced (and is some number).

Incidentally, both the “All Martians…” and the “We use a normal…” are therefore models. So we can see that the word “model” is just another way to say “list of premises.”

When last we left our customer, he had just met a frequentist and a classical Bayesian to which he had put (1). Both the frequentist and the Bayesian declined to answer (1). Instead, the pair starting going on about the value of m (and maybe s, too) by discussing “confidence” and “credible” intervals. None of which are the least interest to the customer, who still wants to know (1). Or questions like (1), questions that have to do with actual distances of actual balls.

The frequentist declines to help, but if pressed might utter something about a “null” hypothesis that “m isn’t 0.” We’ll figure that out later. The classical Bayesian, if he can be jarred awake, can help. What he can do is to say, “Given the data and that I used a normal distribution, and given the assumptions which provides me the same numerical answers as the frequentist, I can say that I don’t know the precise value of the pair—the pair, I say—of (m,s), I can take my uncertainty of them into account to answer (1).”

What this now-modern Bayesian does is to say (m,s) = (m-value 1, s-value 1) with some probability, that (m,s) = (m-value 2, m-value 2) with some probability, and so on for each possible value that (m,s) can take. He knows these from the credible intervals he just calculated. Now for each of these values, he plugs in the guess of (m,s) and calculates (1). Then he takes all the possible values of (1) and weights them by the probability (m,s) take each of these values. In the end he produces

     (3) Pr (Distance > 1 meter | normal and past data) = the answer.

There is no more talk of m and s, which are of no interest to anybody, most specifically the customer. There is only the answer to the question the customer wanted. Notice that this answer is still conditional on the “model”, the normal distribution. It is also conditional on the past data, which is no surprise.

But this means that if originally assumed the premise, “Our uncertainty in the distance is quantified by a gamma distribution” the answer to (3) will be different. Just as it would be different if we began with a Weibull (say) or any other mathematical probability distribution.

Which probability distribution is the “right” one? Well, that is a conclusion to a probability argument. Which premises will we supply to ascertain the probability that that normal, or gamma, or whatever, is the “right” one? That again is up to us. We’ll talk more about this in detail at another time. But for now first suppose we have the evidence/premises, “I have three probability models, normal, gamma, and Weibull. Just one of these is the right one to quantify uncertainty in distance.” Given just this information, the probability that any is right is 1/3.

We could then take this information and compute a (3) for each model, then weight the three answers (the three numerical answers to (3)) to produce this

     (4) Pr (Distance > 1 meter | assumptions about distributions and past data) = better answer.

Notice that there is no talk about which distributions make up (4). They disappeared just as the m and s disappeared when we went from (1) to (3).

The point: every statistical problem the modern Bayesian does is just like this. He attempts to answer the actual questions real customers ask him.


Check for typos.

Wine tour today.

Also, have your spreadsheets ready for tomorrow.

Another Try With A New Look

Unless there is a general revolt, this is it. Tweaks can of course be made—fonts darkened or lightened, background colors shaded, some widgets shifted. But this is it.

One of the big reasons I had to switch is because the old format was difficult to use on phones, tablets, and the like. This one looks fantastic on my HTC, and from what I can tell, soars on iPads. It is also swell on screen. And all is automatic. I mean, there shouldn’t have to be any “pinching” or “tapping” to have the words show properly,

There certainly isn’t anything fancy about this theme, but then we don’t really do fancy. Focus is still on the words and the occasional graphic.

I want to put a guide at the bottom of the comments to show the allowable tags. The blockquote is annoyingly always italicized. I don’t like while in a post the arrows to previous and more recent posts. The images on the right bar with rounded edges have the wrong color for their edges. Things like that. They’ll get fixed.

This would have all been done earlier, but the class is taking all my time.

Speaking of the class, has anybody collected any data?

« Older posts Newer posts »

© 2014 William M. Briggs

Theme by Anders NorenUp ↑