William M. Briggs

Statistician to the Stars!

Why Do Statisticians Answer Silly Questions That No One Ever Asks?

Julian Champkin, editor of Significance magazine, somehow came across the percipient insights of yours truly and asked me to write l'article controversé. Which I did. And with gusto. Champkin, a perspicacious individual with the insight and experience of one long accustomed to the peculiarities and peccadillos of publishing, added the word "silly" to the title. I find myself not objecting.

The article is here. I beg forgiveness that reading the piece requires a subscription (yours or an institution’s).

Statisticians are, in actual fact, as an Englishman would put it (Significance being an organ of the Royal Statistical Society), in the bad habit of answering questions in which nobody has the slightest interest. More rottenness is put forth in the name of Science because of the twisted cogitations of statisticians than because of any other cause.

The problem is that the questions statisticians answer are not the questions civilians put to us. But the poor trusting saps who come to us, on seeing the diplomas on our walls and upon viewing the perplexing mathematics in which we couch our responses, go away intimidated and convinced that what we have told them are the answers to their queries. They can’t, then, be blamed for writing results as if they had received the One Final Word.

There are many reasons why we lead our flocks astray, but the main culprit is that we instill a sort of scientific cockiness. A civilian appears and asks, "How much more likely is drug B than drug A at curing this disease?" We do not answer this. We instead tell him which drug, in the opinion of our theory, is "better", imputing to our pronouncement a certainty which is unwarranted.

We're tired of these examples, but they are paradigmatic. It is through the wiles of statistics that sociologists can "conclude" that watching a 4th of July parade, or seeing, oh so briefly, a miniature picture of the American flag, can turn one into a Republican.

The old ways of statistics allowed over-certainty in the face of small sample sizes. The new ways of doing statistics (now not always called statistics, but perhaps artificial intelligence, data mining, and machine learning) allow over-the-top surety in the face of large sample sizes, a.k.a. Big Data. The difference is that the latter methods are automated, while the former are hands-on. False beliefs can now be generated at a much faster rate, so some progress is being made.

If you followed last week's "Let's try this again" on temperatures, you'll have an idea what I mean about over-certainty (incidentally, due to time constraints, I will not be able to answer questions posted there until tomorrow). Also click the Start Here tab at the top of this page and look through the various articles under Statistics.

Update: Posting date changed to allow more comments.

Teaching Journal: Day 11—Rewrite, Red Wine, Hat Clips

We started by learning that probability is hard and not always quantifiable. For instance, I imagine many of you would have judged it more likely than not that the Supreme Court would have invalidated at least the “mandate” portion of Obamacare. Clearly, many of us had the wrong premises.

Just as many of us have new incorrect premises about what the Roberts ruling means. As to that, follow my Twitter stream from yesterday (see the panel to the right), where I sequentially pull out what I think are relevant quotes from the ruling. However, this is a topic for another time.

I've taught this class at Cornell for several years. The in-class version is, naturally, quite different from the on-screen presentation, so don't be misled by what you've seen here the past fortnight. After many (but not infinite) repetitions of teaching it, I have learned three things.

Lesson One: try not to do too much. The book/class notes were originally designed for a semester-length course for undergraduates. They work for that, but not as well for a class meant to teach the fundamentals of analysis. For instance, I could leave out all the stuff on counting and how to build a binomial distribution. Basic probability rules might stay, because they're easy and useful. But even learning to compute the chance of winning the (say) Mega Millions requires learning some basic combinatorics. Can't have it all, though.
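As a taste of the combinatorics involved, here is a minimal R sketch, assuming the 5-balls-from-56 plus 1-Mega-Ball-from-46 format the game used around the time of writing (the rules have changed since, so treat the numbers as illustrative):

    # Chance of a single ticket winning the Mega Millions jackpot,
    # assuming 5 white balls drawn from 56 and one Mega Ball from 46.
    white <- choose(56, 5)   # ways to choose the five white balls
    mega  <- 46              # ways to choose the Mega Ball
    1 / (white * mega)       # about 1 in 175.7 million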

So I think it better to stop after Bayes’s rule and present everything else as a “given”, like I do with the ubiquitous, and ubiquitously inappropriate, normal distribution. This would allow more time to discuss ideas instead of mechanics. More time to cover how things go wrong, and why there is so much over-certainty produced using classical statistical methods.

This means a re-write of the book/notes is in order. Which I have been doing slowly, but now must finish to get it ready in time for next year.

Lesson Two: red wine does not go with white linen.

Lesson Three: there are churches left that still have hat clips in the backs of pews. I had tremendous fun snapping these during mass at Divine Child when I was a kid, usually during homilies. Just for the sake of the good old days, I did so last weekend. But only once. It might have been twice.

Teaching Journal: Day 9—Hypothesis Testing: Part II

A review. We have sales data from two campaigns, A and B, data in which we choose (as a premise) to quantify our uncertainty with normal distributions. We assume the "null" hypothesis that the parameters of these two distributions are equal: mA = mB and sA = sB. This says that our uncertainty in sales at A or B is identical. It does not say that A and B are "the same" or that "there is no difference" between A and B.

All that is step one of hypothesis testing. Now step two: choose a “test statistic.” This is any function of the data you like. The most popular, in this situation, is some form of the “t-statistic” (there is more than one form). Call our statistic “t”. But you are free to choose one of many, or even make up your own. There is nothing in hypothesis testing theory which requires picking this and not that statistic.

Incidentally, there are practical (and legal) implications of this free choice of test statistic. See this old post for how different test statistics for the same problem were compared in the Wall Street Journal.
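To see the free choice in action, here is a small R sketch on invented sales numbers (not the advertising.csv data): the pooled-variance t and Welch's t are both perfectly respectable "t-statistics", and they give different numbers, hence different p-values, for the very same data.

    # Two legitimate "t-statistics" computed on the same invented data.
    set.seed(1)
    A <- rnorm(20, mean = 420, sd = 50)   # made-up weekly sales, campaign A
    B <- rnorm(20, mean = 450, sd = 90)   # made-up weekly sales, campaign B

    t.test(A, B, var.equal = TRUE)$statistic    # classic pooled-variance t
    t.test(A, B, var.equal = FALSE)$statistic   # Welch's t: a different statistic
    # Different statistics, different p-values, same data and same question.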

Finally, calculate this object:

     (4) Pr( |T| > |t|   | “null” true, normals, data, statistic)

This is the p-value. In words, it is the probability of seeing a test statistic (T) larger (in absolute value) than the test statistic we actually saw (t) in infinite repetitions of the “experiment” that gave rise to our data, given the “null” hypothesis is exactly true, that normal distributions are the right choice, the actual data we saw, and the statistic we used.
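To make the "infinite repetitions" in (4) concrete, here is an R sketch on invented data, with a large but finite number of repetitions standing in for infinity: force the "null" to be true by drawing both campaigns from one and the same normal, recompute the statistic each time, and count how often it beats the |t| we actually saw.

    # Approximating (4) by brute force: many repetitions of the "experiment"
    # with the "null" made true (A and B drawn from the very same normal).
    set.seed(2)
    A <- rnorm(20, 440, 60); B <- rnorm(20, 480, 60)     # invented data
    t_obs <- t.test(A, B, var.equal = TRUE)$statistic    # the t we actually saw

    T_null <- replicate(10000, {
      a <- rnorm(20, 450, 60)   # null world: identical uncertainty for A...
      b <- rnorm(20, 450, 60)   # ...and for B
      t.test(a, b, var.equal = TRUE)$statistic
    })
    mean(abs(T_null) > abs(t_obs))            # simulated version of (4)
    t.test(A, B, var.equal = TRUE)$p.value    # the textbook p-value, for comparison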

There is no way to make this definition pithy—without sacrificing accuracy. Which most do: sacrifice accuracy, that is. Although it does a reasonable job, Wikipedia, for instance, says, “In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This leaves out the crucial infinite repetitions, and the premises of the distribution and test statistic we used. In frequentist definitions of probability, it is always infinity or bust. Probabilities just do not exist for unique or finite events (of course, people always assume that these probabilities exist; but that is because they are natural Bayesians).

Now there has developed a tradition that whenever (4) is less than the magic number, by an act of sheer will, you announce "I reject the null hypothesis," which is logically equivalent to saying, "I claim that mA does not equal mB" (let's, as nearly everybody does, just ignore sA and sB).

The magic, never-to-be-questioned number is 0.05, chosen, apparently, by God Himself. If (4) is less than 0.05 you are allowed to claim “statistical significance.” This term means only that (4) is less than 0.05—and nothing else.

There is no theory which claims that 0.05 is best, or that links the size of (4) with the rejection of the “null.” Before we get to that, understand that if (4) is larger than the magic number you must announce, “I fail to reject the ‘null’” but you must never say, “I accept the ‘null.’” This contortion is derived from R.A. Fisher’s love of Karl Popper’s “falsifiability” ideas, ideas which regular readers will recall no longer have any champions among philosophers.

This "failing to reject" is just as much an act of will as "rejecting the 'null'" was when (4) was less than 0.05. Consider: if I say, as I certainly may say, "mA does not equal mB" I am adding a premise to my list, but this is just as much an act of my will as adding the normal etc. was. (4) is not evidence that "mA does not equal mB". That is, given (4) the probability of "mA does not equal mB" cannot be computed. In fact, it is forbidden (in frequentist theory) to even attempt to calculate this probability. Let's be clear. We are not allowed to even write

     (5) Pr ( “mA does not equal mB” | (4) ) = verboten!

This logically implies, and it is true, that the size of (4) has no relation whatsoever to the proposition “mA does not equal mB.” (See this paper for formal proofs of this.) This is what makes it an act of will that we either declare “mA does not equal mB” or “mA equals mB.”

But, really, why would we want to compute (5) anyway? The customer really wants to know

     (6) Pr ( B continuing better than A | data ).

There is nothing in there about unobservable parameters or test statistics, and why should there be? We learn to answer (6) later.
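We answer (6) properly later. Purely to show the kind of number (6) is, here is one illustrative R sketch, which is not the course's method: read "B continuing better than A" as "next week's B sales beat next week's A sales", assume normal models with flat priors, and simulate from the posterior-predictive distributions. Every modelling choice here is an assumption made only for the illustration.

    # One illustrative reading of (6): Pr(next week's B sales beat next week's
    # A sales | data), under assumed normal models with flat priors.
    set.seed(3)
    A <- rnorm(20, 440, 60); B <- rnorm(20, 480, 60)   # invented sales data

    pred_draw <- function(x, n_draw = 1e5) {
      n <- length(x)
      # Posterior-predictive draws for a normal model with a flat prior:
      # a t distribution with n - 1 degrees of freedom, centered at the
      # sample mean, scaled by s * sqrt(1 + 1/n).
      mean(x) + sd(x) * sqrt(1 + 1/n) * rt(n_draw, df = n - 1)
    }

    mean(pred_draw(B) > pred_draw(A))   # Monte Carlo estimate of (6), on this reading

Notice there are no test statistics and no statements about parameters in the answer: it is a direct probability about observable sales.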

But before we go, let me remind you that we have only begun criticisms of p-values and hypothesis testing. There are lists upon lists of objections. Before you defend p-values, please read through this list of quotations.

Teaching Journal: Day 8—Hypothesis Testing: Part I

Hypothesis testing nicely encapsulates all that is wrong with frequentist statistics. It is a procedure which hides the most controversial assumption/premise. It operates under a “null” belief which nobody believes. It is highly ad hoc and blatantly subjective. It incorporates magic p-values. And it ends all with a pure act of will.

Here is how it works. Imagine (no need, actually: go to the book page and download the advertising.csv file and follow along; to learn to use R, read the book, also free) you have run two advertising campaigns A and B and are interested in weekly sales under these two campaigns. I rely on you to extend this example to other areas. I mean, this one is simple and completely general. Do not fixate on the idea of “advertising.” This explanation works equally well on any comparison.

I want to make the decision of which campaign, A or B, to use country-wide, and I want to base this decision on 20 weeks of data where I ran both campaigns and collected sales (why 20? it could have been any number, even 1; although frequentist hypothesis testing won't work with just one observation each; another rank failure of the theory).

Now I could make the rule that whichever campaign had higher median sales is the better. This was B. I could have also made the rule that whichever campaign had higher third-quartile sales is better. This was A. Which is "better" is not a statistical question. It is up to you and relates to the decisions you will make. So I could also rule that whichever had the higher mean sales was better. This was B. I could have made reference, too, to a fixed number of sales, say 500. Whichever had a greater percentage of sales greater than 500 was "better." Or whatever else made sense to the bottom line.
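Here is what those direct, no-model summaries look like in R, assuming advertising.csv has columns named campaign (with values "A" and "B") and sales; the actual column names in the book's file may differ.

    # Direct decision rules from the old data alone: no probability models.
    # The column names `campaign` and `sales` are assumptions about the file.
    adv <- read.csv("advertising.csv")
    A <- adv$sales[adv$campaign == "A"]
    B <- adv$sales[adv$campaign == "B"]

    median(A); median(B)                   # "better" = higher median sales
    mean(A); mean(B)                       # "better" = higher mean sales
    quantile(A, 0.75); quantile(B, 0.75)   # "better" = higher third-quartile sales
    mean(A > 500); mean(B > 500)           # "better" = more weeks with sales over 500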

Anyway, point is, if all I did was look at the old data and make direct decisions, I would not need probability or statistics (frequentist or Bayesian). I could just look at the old data and make whatever decision I like.

But that action comes with the implicit premise that "Whatever happened in the past will certainly happen in like manner in the future." If I do not want to make this premise, and usually I don't, I then need to invoke probability and ask something like, "Given the data I observed, what is the probability that B will continue to be better than A?" if by "better" I mean higher median or mean sales. Or "A will continue to be better" if by "better" I mean higher third-quartile sales. Or whatever other question makes sense to me about the observable data.

Hypothesis testing (nearly) always begins by assuming that we can quantify our uncertainty in the outcome (here, sales) with normal distributions. When I say "(nearly) always" I mean statistics as she is actually practiced. This "normality" is a mighty big assumption. It is usually false because, as here, sales cannot be less than 0. Often sociologists and the like ask questions which force answers from "1 to 5" (which they magnificently call a "Likert scale"). Two (or more) groups will answer a question, and the uncertainty in the mean of each group is assumed to follow a normal distribution. This is usually wildly false, given that, as we have just said, the numbers cannot be smaller than 1 nor larger than 5.

Normal distributions, then, are often wrong, and often wrong by quite a lot. (And if you don’t yet believe this, I’ll prove it with real data later.) This says that hypothesis testing starts badly. But ignore this badness, or the chance of it, like (nearly) everybody else does and let’s push on.
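A quick R sketch of the badness, with invented numbers: fit a normal to data that cannot go below 0 (or outside 1 to 5) and see how much probability the fitted normal assigns to values that cannot happen.

    # How much probability a fitted normal gives to impossible values.
    sales <- c(120, 80, 30, 200, 15, 60, 45, 90, 10, 150)   # invented weekly sales, all >= 0
    pnorm(0, mean(sales), sd(sales))                        # Pr(sales < 0) under the fit: not zero

    likert <- c(1, 2, 2, 3, 5, 1, 4, 5, 2, 1)               # invented answers, forced into 1..5
    pnorm(1, mean(likert), sd(likert)) +                    # probability below 1 ...
      (1 - pnorm(5, mean(likert), sd(likert)))              # ... plus probability above 5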

If it is accepted that our uncertainty in A is quantified by a normal distribution with parameters mA and sA, and similarly B with mB and sB, then the "null" hypothesis is that mA = mB and (usually, but quietly) sA = sB.

Stare at this and be sure to understand what it implies. It DOES NOT say that “A and B are the same.” It says our uncertainty in A and B is the same. This is quite, quite different. Obviously—as in obviously—A and B are not the same. If they were the same we could not tell them apart. This is not, as you might think, a minor objection. Far from it.

Suppose it were true that, as the "null" says, mA = mB (exactly, precisely equal). Now if sA were not equal to sB, then our uncertainty in A and B can be very different. It could be, depending on the exact values of sA and sB, that the probability of sales exceeding some fixed level was larger under A than under B, or the opposite could be true. Stop and understand this.

Just saying something about the central parameters m does not tell us enough, not nearly enough. We need to know what is going on with all four parameters. This is why if we assume that mA = mB we must also assume that sA = sB.
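A two-line R illustration of the point, with invented numbers: let the central parameters be exactly equal, as the "null" insists, but let the spread parameters differ, and the probability of clearing a sales target of (say) 500 is wildly different under A and B.

    # Equal central parameters, unequal spreads: very different probability
    # statements about sales. All numbers invented.
    m <- 450                  # mA = mB, exactly as the "null" says
    sA <- 20; sB <- 120       # but the spread parameters differ
    1 - pnorm(500, m, sA)     # Pr(sales > 500) under A: well under 1%
    1 - pnorm(500, m, sB)     # Pr(sales > 500) under B: roughly a third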

The kicker is that we can never know whether mA = mB or sA = sB; no, not even for a Bayesian. These are unobservable, metaphysical parameters. This means they are unobservable. As in “cannot be seen.” So what do we do? Stick around and discover.

Teaching Journal: Day 7

The joke is old and hoary and so well known that I risk the reader’s ire for repeating it. But it contains a damning truth.

Most academic statistical studies are like a drunk searching for his keys under a streetlight. He looks there not because that is where he lost his keys, but because that is where the light is.

To prove this come these four quotations from Jacqueline Stevens, professor of political science at Northwestern University (original source):

In 2011 Lars-Erik Cederman, Nils B. Weidmann and Kristian Skrede Gleditsch wrote in the American Political Science Review that “rejecting ‘messy’ factors, like grievances and inequalities,” which are hard to quantify, “may lead to more elegant models that can be more easily tested…”

Professor Tetlock’s main finding? Chimps randomly throwing darts at the possible outcomes would have done almost as well as the experts…

Research aimed at political prediction is doomed to fail. At least if the idea is to predict more accurately than a dart-throwing chimp…

I look forward to seeing what happens to my discipline and politics more generally once we stop mistaking probability studies and statistical significance for knowledge.

If our only evidence is that “Some countries which face economic injustice go to war and Country A is a country which faces economic injustice” then given this the probability that “Country A goes to war” is some number between 0 and 1. And not only is this the best we can do, but it is all we can do. It becomes worse when we realize the vagueness of the term “economic injustice.”

I mean, if we cannot even agree on the implicit (there, but hidden) premise “Economic injustice is unambiguously defined as this and such” we might not even be sure that Country A actually suffers economic injustice.

But supposing we really want to search for the answer to the probability that “Country A goes to war”, what we should not do is to substitute quantitative proxies just to get some equations to spit out numbers. This is no different than a drunk searching under the streetlight.

The mistake is in thinking not only that all probabilities are quantifiable (which they are not), but that all probabilities should be quantified, which leads to false certainty. And bad predictions.

Incidentally, Stevens also said, "Many of today's peer-reviewed studies offer trivial confirmations of the obvious and policy documents filled with egregious, dangerous errors."

Modeling, which we begin today in a formal sense, is no different than what we have been doing up until now: identifying propositions which we want to quantify the uncertainty of, then identifying premises which are probative of this “conclusion.” As the cautionary tale by Stevens indicates, we must not seek quantification just for the sake of quantification. That is the fundamental error.

A secondary error we saw developed at the end of last week: substituting knowledge about parameters of probability models for knowledge of the "conclusions." This error is doubled when we realize that the probability models should often not be quantified in the first place. We end up with twice the overconfidence.

Now, if our model and data are that “Most Martians wear hats and George is a Martian” the probability of “George wears a hat” is greater than 1/2 but less than 1. That is the best we can do. And even that relies on the implicit assumption about the meaning of the English word “Most” (of course, there are other implicit assumptions, including definitions of the other words and knowledge of the rules of logic).

This ambiguity—the answer is a very wide interval—is intolerable to many, which is why probability has come to seem subjective to some and why others will quite arbitrarily insert a quantifiable probability model in place of "Most…"

It's true that both these groups are free to add to the premises such that the probabilities of the conclusions do become hard-and-fast numbers. We are all free to add any premises we like. But this makes the models worse in the sense that they match reality at a far lower rate than those built on the more parsimonious premises. That, however, is a topic for another day.

Homework

Read about all this. More is to come. In another hurry today. Get your data in hand by end of the day. Look for typos.
