William M. Briggs

Statistician to the Stars!


Teaching Journal: Day 8—Hypothesis Testing: Part I

Hypothesis testing nicely encapsulates all that is wrong with frequentist statistics. It is a procedure which hides the most controversial assumption/premise. It operates under a “null” belief which nobody believes. It is highly ad hoc and blatantly subjective. It incorporates magic p-values. And it ends all with a pure act of will.

Here is how it works. Imagine (no need, actually: go to the book page and download the advertising.csv file and follow along; to learn to use R, read the book, also free) you have run two advertising campaigns A and B and are interested in weekly sales under these two campaigns. I rely on you to extend this example to other areas. I mean, this one is simple and completely general. Do not fixate on the idea of “advertising.” This explanation works equally well on any comparison.

I want to decide which campaign, A or B, to use country-wide and I want to base this decision on 20 weeks of data where I ran both campaigns and collected sales (why 20? it could have been any number, even 1; although frequentist hypothesis testing won’t work with just one observation each; another rank failure of the theory).

Now I could make the rule that whichever campaign had higher median sales is the better. This was B. I could have also made the rule that whichever campaign had higher third-quartile sales is better. This was A. Which is “better” is not a statistical question. It is up to you and relates to the decisions you will make. So I could also rule that whichever had the higher mean sales was better. This was B. I could have made reference, too, to a fixed number of sales, say 500. Whichever had a greater percentage of sales greater than 500 was “better.” Or whatever else made sense to the bottom line.
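
For those following along in R, here is a minimal sketch of those decision rules. The column names A and B (weekly sales under each campaign) are my assumption about the layout of advertising.csv; adjust them to whatever the file actually holds.

     # Each rule is just a direct summary of the old data
     ads <- read.csv("advertising.csv")              # columns A and B assumed

     median(ads$A); median(ads$B)                    # rule: higher median wins (B)
     quantile(ads$A, 0.75); quantile(ads$B, 0.75)    # rule: higher third quartile wins (A)
     mean(ads$A); mean(ads$B)                        # rule: higher mean wins (B)
     mean(ads$A > 500); mean(ads$B > 500)            # rule: larger share of weeks over 500 wins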

Anyway, point is, if all I did was to look at the old data and make direct decisions, I do not need probability or statistics (frequentist or Bayesian). I could just look at the old data and make whatever decision I like.

But that action comes with the implicit premise that “Whatever happened in the past will certainly happen in like manner in the future.” If I do not want to make this premise, and usually I don’t, I then need to invoke probability and ask something like, “Given the data I observed, what is the probability that B will continue to be better than A?” if by “better” I mean higher median or mean sales. Or “A will continue to be better” if by “better” I meant higher third-quartile sales. Or whatever other question makes sense to me about the observable data.

Hypothesis testing (nearly) always begins by assuming that we can quantify our uncertainty in the outcome (here, sales) with normal distributions. When I say “(nearly) always” I mean statistics as she is actually practiced. This “normality” is a mighty big assumption. It is usually false on the premise that, like here, sales cannot be less than 0. Often sociologists and the like ask questions which force answers from “1 to 5” (which they magnificently call a “Likert scale”). Two (or more) groups will answer a question, and the uncertainty in the mean of each group is assumed to follow a normal distribution. This is usually wildly false, given that, as we have just said, the numbers cannot be smaller than 1 nor larger than 5.

Normal distributions, then, are often wrong, and often wrong by quite a lot. (And if you don’t yet believe this, I’ll prove it with real data later.) This says that hypothesis testing starts badly. But ignore this badness, or the chance of it, like (nearly) everybody else does and let’s push on.

If it is accepted that our uncertainty in A is quantified by a normal distribution with parameters mA and sA, and similarly B with mB and sB, then the “null” hypothesis is that mA = mB and (usually, but quietly) sA = sB.
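
For reference, the procedure as she is practiced here is the two-sample t-test; a sketch follows, again assuming columns A and B in advertising.csv. This is how the test is usually run, not an endorsement.

     ads <- read.csv("advertising.csv")         # columns A and B assumed, as before
     t.test(ads$A, ads$B, var.equal = TRUE)     # var.equal = TRUE is the quiet sA = sB premise
     # The p-value in the output is the magic number against which the "null" mA = mB is judged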

Stare at this and be sure to understand what it implies. It DOES NOT say that “A and B are the same.” It says our uncertainty in A and B is the same. This is quite, quite different. Obviously—as in obviously—A and B are not the same. If they were the same we could not tell them apart. This is not, as you might think, a minor objection. Far from it.

Suppose it were true that, as the “null” says, mA = mB (exactly, precisely equal). Now if sA were not equal to sB, then our uncertainty in A and B can be very different. It could be, depending on the exact values of sA and sB, that the probability of higher sales under A was larger than under B, or the opposite could also be true. Stop and understand this.

Just saying something about the central parameters m does not tell us enough, not nearly enough. We need to know what is going on with all four parameters. This is why if we assume that mA = mB we must also assume that sA = sB.
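
To see the point with made-up numbers: give both campaigns the same central parameter but different spreads, and the chance of clearing a fixed sales target (say 500, as earlier) depends entirely on the spread, with the ordering flipping as the target moves from one side of the center to the other.

     # Equal central parameters, unequal spreads: illustrative numbers only
     m <- 400; sA <- 50; sB <- 150

     # Chance of a week with sales above 500 under each campaign
     pnorm(500, mean = m, sd = sA, lower.tail = FALSE)   # about 0.02
     pnorm(500, mean = m, sd = sB, lower.tail = FALSE)   # about 0.25

     # Move the target below the common center and the ordering flips
     pnorm(300, mean = m, sd = sA, lower.tail = FALSE)   # about 0.98
     pnorm(300, mean = m, sd = sB, lower.tail = FALSE)   # about 0.75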

The kicker is that we can never know whether mA = mB or sA = sB; no, not even for a Bayesian. These are unobservable, metaphysical parameters. This means they are unobservable. As in “cannot be seen.” So what do we do? Stick around and discover.

Teaching Journal: Day 7

The joke is old and hoary and so well known that I risk the reader’s ire for repeating it. But it contains a damning truth.

Most academic statistical studies are like a drunk searching for his keys under a streetlight. He looks there not because that is where he lost his keys, but because that is where the light is.

In proof of this, consider these four quotations from Jacqueline Stevens, professor of political science at Northwestern University (original source):

In 2011 Lars-Erik Cederman, Nils B. Weidmann and Kristian Skrede Gleditsch wrote in the American Political Science Review that “rejecting ‘messy’ factors, like grievances and inequalities,” which are hard to quantify, “may lead to more elegant models that can be more easily tested…”

Professor Tetlock’s main finding? Chimps randomly throwing darts at the possible outcomes would have done almost as well as the experts…

Research aimed at political prediction is doomed to fail. At least if the idea is to predict more accurately than a dart-throwing chimp…

I look forward to seeing what happens to my discipline and politics more generally once we stop mistaking probability studies and statistical significance for knowledge.

If our only evidence is that “Some countries which face economic injustice go to war and Country A is a country which faces economic injustice” then given this the probability that “Country A goes to war” is some number between 0 and 1. And not only is this the best we can do, but it is all we can do. It becomes worse when we realize the vagueness of the term “economic injustice.”

I mean, if we cannot even agree on the implicit (there, but hidden) premise “Economic injustice is unambiguously defined as this and such” we might not even be sure that Country A actually suffers economic injustice.

But supposing we really want to search for the answer to the probability that “Country A goes to war”, what we should not do is to substitute quantitative proxies just to get some equations to spit out numbers. This is no different than a drunk searching under the streetlight.

The mistake is in thinking not only that all probabilities are quantifiable (which they are not), but that all probabilities should be quantified, which leads to false certainty. And bad predictions.

Incidentally, Stevens also said, “Many of today’s peer-reviewed studies offer trivial confirmations of the obvious and policy documents filled with egregious, dangerous errors.”

Modeling, which we begin today in a formal sense, is no different than what we have been doing up until now: identifying propositions which we want to quantify the uncertainty of, then identifying premises which are probative of this “conclusion.” As the cautionary tale by Stevens indicates, we must not seek quantification just for the sake of quantification. That is the fundamental error.

A secondary error we saw developed at the end of last week: substituting knowledge about the parameters of probability models for knowledge of the “conclusions.” This error is doubled when we realize that the probability models should often not be quantified in the first place. We end up with twice the overconfidence.

Now, if our model and data are that “Most Martians wear hats and George is a Martian” the probability of “George wears a hat” is greater than 1/2 but less than 1. That is the best we can do. And even that relies on the implicit assumption about the meaning of the English word “Most” (of course, there are other implicit assumptions, including definitions of the other words and knowledge of the rules of logic).

This ambiguity—the answer is a very wide interval—is intolerable to many, which is why probability has come to seem subjective to some and why others will quite arbitrarily insert a quantifiable probability model in place of “Most…”

It’s true that both these groups are free to add to the premises such that probabilities of the conclusions do become hard-and-fast numbers. We are all free to add any premises we like. But this makes the models worse in the sense that they match reality at a rate far less than the more parsimonious premises. That, however, is a topic for another day.

Homework

Read about all this. More is to come. In another hurry today. Get your data in hand by end of the day. Look for typos.

Teaching Journal: Day 6

(I’m assuming you have been reading previous posts. If not, do so.)

We still want this:

     (1) Pr (Distance > 1 meter | normal with m and s specified) = something

Actually, we don’t; not really. We want somebody to tell us (1) or something like it. The customer doesn’t really care that it was a normal distribution that was used. What we really want is the exact list of premises which allows us to say

     (2) Pr (Distance > 1 meter | oracular premises) = 0 or 1

or, that is, we want the oracular premises which tell us the precise distance the boule will be from the cochonette. We want this:

     (2′) Pr (Distance = x meters | oracular premises) = 1

where the x is filled in. But oracular premises don’t exist for most of life. We have to content ourselves with something less. This is why we can live with the premise that our uncertainty in the distance is quantified by a normal (or some other) distribution.

We can of course say, “It isn’t really a normal distribution” but this is a conclusion from a probability argument, and as we recall all probability propositions are conditional on premises. What are the premises which tell us “It isn’t really a normal distribution” is true? Well, these are easy: we have them (look in the book; Chapter 4). Call this list NN (for “not normal”). That is, given NN, it is true that “It isn’t really a normal distribution.”

But we do not list NN in (1), (2), or (2′). If we did, we could not compute any numbers. The premises would be self-negating. Just as we do not add the premise “There are no Martians” to the argument “All Martians wear hats and George is a Martian.” Well, we could add it of course. It is up to us, as adding any premise to a list in an argument is always up to us. But the point is this: given just the original “All Martians…” the conclusion “George wears a hat” is deduced (and has probability 1). And given just the “We use a normal with a specified m and s” the probability of “Distance > 1 meter” is deduced (and is some number).

Incidentally, both the “All Martians…” and the “We use a normal…” are therefore models. So we can see that the word “model” is just another way to say “list of premises.”

When last we left our customer, he had just met a frequentist and a classical Bayesian to whom he had put (1). Both the frequentist and the Bayesian declined to answer (1). Instead, the pair started going on about the value of m (and maybe s, too) by discussing “confidence” and “credible” intervals. None of which is of the least interest to the customer, who still wants to know (1). Or questions like (1), questions that have to do with actual distances of actual balls.

The frequentist declines to help, but if pressed might utter something about a “null” hypothesis that “m isn’t 0.” We’ll figure that out later. The classical Bayesian, if he can be jarred awake, can help. What he can do is to say, “Given the data and that I used a normal distribution, and given the assumptions which provide me the same numerical answers as the frequentist, then even though I don’t know the precise value of the pair—the pair, I say—of (m,s), I can take my uncertainty in them into account to answer (1).”

What this now-modern Bayesian does is to say (m,s) = (m-value 1, s-value 1) with some probability, that (m,s) = (m-value 2, s-value 2) with some probability, and so on for each possible value that (m,s) can take. He knows these from the credible intervals he just calculated. Now for each of these values, he plugs in the guess of (m,s) and calculates (1). Then he takes all the possible values of (1) and weights them by the probability that (m,s) takes each of these values. In the end he produces

     (3) Pr (Distance > 1 meter | normal and past data) = the answer.

There is no more talk of m and s, which are of no interest to anybody, most specifically the customer. There is only the answer to the question the customer wanted. Notice that this answer is still conditional on the “model”, the normal distribution. It is also conditional on the past data, which is no surprise.
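
Here is a sketch of the calculation behind (3), with made-up distances and the usual “flat prior” normal model, which is the set of assumptions the text says reproduces the frequentist numbers. The simulation draws plausible (m,s) pairs, then a future distance for each pair, and counts how often it exceeds 1 meter; the closed-form line at the end gives the same answer without the simulation.

     # Made-up past distances (meters); substitute real data
     d <- c(1.2, 0.8, 1.5, 0.6, 1.1, 0.9)
     n <- length(d)

     # Draw plausible (m, s) pairs from their posterior, then a future
     # distance for each pair, and count how often it beats 1 meter
     draws <- 1e5
     s2 <- (n - 1) * var(d) / rchisq(draws, df = n - 1)
     m  <- rnorm(draws, mean(d), sqrt(s2 / n))
     future <- rnorm(draws, m, sqrt(s2))
     mean(future > 1)                           # the answer to (3)

     # Same answer in closed form: the predictive distribution here is a
     # shifted, scaled Student-t
     pt((1 - mean(d)) / (sd(d) * sqrt(1 + 1/n)), df = n - 1, lower.tail = FALSE)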

But this means that if we had originally assumed the premise, “Our uncertainty in the distance is quantified by a gamma distribution” the answer to (3) would be different. Just as it would be different if we began with a Weibull (say) or any other mathematical probability distribution.

Which probability distribution is the “right” one? Well, that is a conclusion to a probability argument. Which premises will we supply to ascertain the probability that that normal, or gamma, or whatever, is the “right” one? That again is up to us. We’ll talk more about this in detail at another time. But for now first suppose we have the evidence/premises, “I have three probability models, normal, gamma, and Weibull. Just one of these is the right one to quantify uncertainty in distance.” Given just this information, the probability that any is right is 1/3.

We could then take this information and compute a (3) for each model, then weight the three answers (the three numerical answers to (3)) to produce this

     (4) Pr (Distance > 1 meter | assumptions about distributions and past data) = better answer.

Notice that there is no talk about which distributions make up (4). They disappeared just as the m and s disappeared when we went from (1) to (3).
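
Here is a sketch of the averaging step in (4), with the same made-up distances. For brevity it plugs in maximum-likelihood fits for each candidate model rather than carrying the full parameter uncertainty of (3), so treat it as an illustration of the weighting only; MASS::fitdistr does the fitting.

     library(MASS)
     d <- c(1.2, 0.8, 1.5, 0.6, 1.1, 0.9)     # made-up past distances, meters

     fit_n <- fitdistr(d, "normal")
     fit_g <- fitdistr(d, "gamma")
     fit_w <- fitdistr(d, "weibull")

     p_n <- pnorm(1, fit_n$estimate["mean"], fit_n$estimate["sd"], lower.tail = FALSE)
     p_g <- pgamma(1, shape = fit_g$estimate["shape"], rate = fit_g$estimate["rate"],
                   lower.tail = FALSE)
     p_w <- pweibull(1, shape = fit_w$estimate["shape"], scale = fit_w$estimate["scale"],
                     lower.tail = FALSE)

     (p_n + p_g + p_w) / 3                    # each model weighted 1/3, as the premise says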

The point: every statistical problem the modern Bayesian does is just like this. He attempts to answer the actual questions real customers ask him.

Homework

Check for typos.

Wine tour today.

Also, have your spreadsheets ready for tomorrow.

Another Try With A New Look

Unless there is a general revolt, this is it. Tweaks can of course be made—fonts darkened or lightened, background colors shaded, some widgets shifted. But this is it.

One of the big reasons I had to switch is because the old format was difficult to use on phones, tablets, and the like. This one looks fantastic on my HTC, and from what I can tell, soars on iPads. It is also swell on screen. And all is automatic. I mean, there shouldn’t have to be any “pinching” or “tapping” to have the words show properly.

There certainly isn’t anything fancy about this theme, but then we don’t really do fancy. Focus is still on the words and the occasional graphic.

I want to put a guide at the bottom of the comments to show the allowable tags. The blockquote is annoyingly always italicized. While in a post, I don’t like the arrows to previous and more recent posts. The images on the right bar with rounded edges have the wrong color for their edges. Things like that. They’ll get fixed.

This would have all been done earlier, but the class is taking all my time.

Speaking of the class, has anybody collected any data?

Teaching Journal: Day 5

Let’s make sure we grasped yesterday’s lesson. Emails and comments suggest we have not. These concepts are hardest for those who have only had classical training.

We want to know something like this: what is the probability the boule will land at least 1 meter from the cochonette? Notice that this is an observable, measurable, tangible question. A natural question, immediately understandable, not requiring a degree in statistics to comprehend. Of course, it needn’t be “1 meter”, it could be “2 meters” or 3 or any number which is of interest to us.

Now, as the rules of logic admit, I could just assume-for-the-sake-of-argument premises which specify a probability distribution for the distance the boule will be from the cochonette. Or I could assume the uncertainty in this distance is quantified by a normal distribution. Why not? Everybody uses these creatures, right or wrong. We may as well, too.

A normal distribution requires two parameters, m and s. They are NOT, I emphasize again, the “mean” and “standard deviation.” They are just two parameters which, when given, fully specify the normal and let us make calculations. The mean and standard deviation are instead functions of data. Everybody knows what the mean function looks like (add all the numbers, divide by the number of numbers). It isn’t of the slightest interest to us what the standard deviation function is. If you want to know, search for it.

Since I wanted to use a normal—and this is just a premise I assumed—I repeat and you should memorize that this is just a premise I assumed—since, I say, I want to use a normal, I must specify m and s. There is nothing in the world wrong with also assuming values for these parameters. After all (you just memorized this), I just assumed the normal and I am getting good at assuming.

With m and s in hand, I can calculate this:

     (1) Pr (Distance > 1 meter | normal with m and s specified) = something

The “something” will depend on the m and s I choose. If I choose different m and s then the “something” will change. Obviously.

The question now becomes: what do statisticians do? They keep the arbitrary premise “The normal quantifies my uncertainty in the distance” but then add to it these premises, “I observed in game 1 the distance D1. In game 2 I observed the distance D2 and so on.”

These “observational” premises are uninteresting by themselves. They are not useful, unless we add to them the premise, the quite arbitrary premise, “I use these observations to estimate m and s via the mean and standard deviation.” This is all we need to answer (1). That is, we needed a normal distribution with the m and s specified, and any way we guess m and s gives us values for m and s (right?). It matters naught to (1) how m and s are specified. But without the m and s specified, (1) CANNOT be calculated. Notice the capitals.
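
The plug-in calculation looks like this in R, with made-up distances standing in for D1, D2, and so on: estimate m and s by the mean and standard deviation, then read the probability off the assumed normal.

     d <- c(1.2, 0.8, 1.5, 0.6, 1.1, 0.9)     # made-up observed distances D1, D2, ... (meters)

     m_hat <- mean(d)                         # the guess for m
     s_hat <- sd(d)                           # the guess for s
     pnorm(1, mean = m_hat, sd = s_hat, lower.tail = FALSE)   # the "something" in (1)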

Here is what the frequentist will do. She will calculate the mean (and standard deviation; but ignore this) and then report the “95% confidence interval” for this guess. We saw yesterday the interpretation of this strange object. But never mind that today. The point is the frequentist statistician ignores equation (1) and instead answers a question that was not asked. She contents herself with saying “The mean of the distances was this number; the confidence interval is this and such.”
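
In R, what she reports instead is something like the sketch below (same made-up distances): the mean and the usual 95% t-interval around it.

     d <- c(1.2, 0.8, 1.5, 0.6, 1.1, 0.9)     # same made-up distances as above
     n <- length(d)

     mean(d)                                  # "the mean of the distances"
     mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)   # the 95% confidence interval
     # or simply: t.test(d)$conf.int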

And this quirky behavior is accepted by the customer. He forgets he wanted to know (1) or assumes the statement he just received is a sort of approximate answer to (1). Very well.

Here is what the classical Bayesian will do. The same thing as the frequentist. In this case, at least. The calculations the Bayesian does and the calculation the frequentist does, though they begin at different starting points, end at the same place.

The classical Bayesian will also compute the mean and he will also say “The mean is my best guess for m.” And he will also compute the exact same confidence interval but he will instead call it a credible interval. And this in fact represents a modest improvement, even though the numbers of the interval are identical. It is an improvement because the classical Bayesian can then say things like this, “There is a 95% chance the true value of m lies inside the credible interval” whereas the frequentist can only repeat the curious tongue twister we noted yesterday.

The classical Bayesian, proud of this improvement and pleased the numbers match his frequentist sister’s, also forgets (1). Ah well, we can’t have everything.

There is one more small thing. The classical Bayesian also recognizes that his numbers will not always match his frequentist sister’s. If for instance the frequentist and classical Bayesian attack a “binomial” problem, the numbers won’t match. But when normal distributions are used, as they were here and as they are in ordinary linear regression, statisticians are one big happy family. And isn’t that all that matters?

No.

Homework

You should have been collecting your data by now. If not, start. We’ll only be doing ordinary linear regression according to the modern slogan: Regression Right Or Wrong!
