William M. Briggs

Statistician to the Stars!

Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

Drs Howard, Fine, and Howard check the result of a statistical model.

Our friend Christos Argyropoulos (@ChristosArgyrop) points us to a popular medical site in which Stephen Reznick asks, “Keep statistics simple for primary care doctors.”

He can’t read the journals, because why? Because “Medical school was a four year program. The statistics course was a brief three week interlude in the midst of a tsunami of new educational material presented in a new language…While internship and residency included a regular journal club, there was little attention paid to analyzing a paper critically from a statistical mathematical viewpoint.”

Reznick has been practicing for some time and admits his “statistical analysis skills have grown rusty…When the Medical Knowledge Self Assessment syllabus arrives every other year, the statistics booklet is probably one of the last we look at because not only does it involve re-learning material but you must first re–learn a vocabulary you do not use day to day or week to week.”

What he’d like is for journals to “Let authors and reviewers say what they mean at an understandable level.”

Now I’ve taught and explained statistics to residents, docs fresh out of med school, for a long time. And few to none of them remember the statistics they were taught either. Why should they? Trying to squeeze a chi-square test among all those muscles and blood vessels they must memorize isn’t easy, and not so rewarding either.

Medical students learn why the ankle bone is connected to the ulniuous, or whatever the hell it is, and what happens when this or that artery is choked off. Useful stuff—and all to do with causality. They never learn why the chi-square does what it does. It is presented as mystery, a formula or incantation to invoke when the data take such-and-such a form. Worse, the chi-square and all other tests have nothing to do with causality.

A physician reading a journal article about some new procedure asks himself questions like, “What is the chance this would work for patients like mine?”, or “If I give my patient this drug, what are the chances he gets better?”, or “How does the cure for this disease work?” All good, practical, commonsense queries.

But classical statistics isn’t designed to answer commonsense questions. In place of clarity, we have the “null” and “alternate” hypotheses, which in the end are nothing but measures of model fit (to the data at hand and none other). Wee p-values are strewn around papers like fairy dust. What causes what cannot be discovered, but readers are invited to believe what the author believes caused the data.

I’ve beat this drum a hundred times, but what statistical models should do is to predict what will happen, given or conditioned on the data which came before and the premises which led to the particular model used. Then, since we have a prediction, we wait for confirmatory, never-observed-before data. If the model was good, we will have skillful predictions. If not, we start over.

“But, Briggs, that way sounds like it will take longer.”

True, it will. Think of it like the engineering approach to statistics. We don’t rely on theory and subjectively chosen models to build bridges or aircraft, right? We project and test. Why should we trust our health to models which have never been put through the fire?

One benefit would be a shoring up of the uncertainty of side effects, especially the long-term side effects, of new drugs. Have you seen the list of what can go wrong when you eat one of these modern marvels? Is it only us civilians who cringe when hearing “suicide” is a definite risk of an anti-depressant? Dude. Ask your doctor if the risk of killing yourself is right for you.

What the patient wants to know is something like, “If I eat this pill, what are the chances I’ll stroke out?” The answer “Don’t worry” is insufficient. Or should be. How many medicines are released only to be recalled because a particular side effect turned out more harmful than anticipated?

“Wouldn’t your scheme be difficult to implement?”

It’s a little known but open secret that every statistical model in use logically implies a prediction of new data. All we have to do is use the models we have in that way. This would allow us to spend less time talking about model fit and more about the consequences of particular things.
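To make the point concrete, here is a minimal sketch of a model used in its predictive sense. It assumes the simplest case I could pick: a binomial model with a flat prior on the unknown improvement rate (Laplace's rule of succession). That choice, the function name, and the trial numbers are all illustrative assumptions, not anything taken from Reznick's article.

```python
from fractions import Fraction

def predictive_probability(successes: int, patients: int) -> Fraction:
    """Probability the next patient improves, given past results.

    Assumes a binomial model with a flat prior on the unknown rate
    (Laplace's rule of succession). This modelling choice is an
    illustration only.
    """
    return Fraction(successes + 1, patients + 2)

# Hypothetical trial: 30 of 40 treated patients improved.
p_next = predictive_probability(30, 40)
print(f"Chance the next patient like these improves: {float(p_next):.3f}")
```

The output is a statement about a not-yet-seen patient, which is the kind of answer the physician asked for, and one that new data can confirm or embarrass.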

“What are the chances people will switch to this method?”

Slim.

Not a simulation.

Introit

Ever heard of somebody “simulating” normal “random” or “stochastic” variables, or perhaps “drawing” from a normal or some other distribution? Such things form the backbone of many statistical methods, including bootstrapping, Gibbs sampling, Markov Chain Monte Carlo (MCMC), and several others.

Well, it’s both right and wrong—but more wrong than right. It’s wrong in the sense that it encourages magical thinking, confuses causality, and is an inefficient use of time. It’s right that, if assiduously applied, reasonably accurate answers from these algorithms can be had.

The way it’s said to work is that “random” or “stochastic” numbers are input into some algorithm and out pop answers to some statistical question which is not analytic; which, that is, cannot be solved by pencil and paper (or could, but at seemingly too great a difficulty).

For example, one popular way of “generating normals” is to use what’s called a Box-Muller transformation. It starts by “generating” two “random” “independent” “uniform” numbers U1 and U2 and then calculating this creature:

$Z = R \cos(\Theta) =\sqrt{-2 \ln U_1} \cos(2 \pi U_2)$,

where Z is now said to be “standard normally distributed.” Don’t worry if you don’t follow the math, though try because we need it for later. Point is that any algorithm which needs “normals” can use this procedure.
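For those who prefer code to symbols, the formula above can be sketched as a plain function. The name `box_muller` is my own label; the body is just the equation, with no “generating” anywhere in sight.

```python
import math

def box_muller(u1: float, u2: float) -> float:
    """Map a pair (u1, u2), each in (0, 1), to Z = R * cos(Theta)."""
    r = math.sqrt(-2.0 * math.log(u1))   # R = sqrt(-2 ln U1)
    theta = 2.0 * math.pi * u2           # Theta = 2 pi U2
    return r * math.cos(theta)

# Any pair in (0, 1) will do; these inputs are picked by hand.
print(box_muller(0.01, 0.01))
print(box_muller(0.5, 0.25))
```

Feed it the same pair twice and you get the same Z twice. It is a mapping and nothing more.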

Look at all those scare quotes! Yet each of them is proper and indicates an instance of magical thinking, a legacy of our (frequentist) past which imagined aleatory ghosts in the machines of nature, ghosts which even haunt modern Bayesians.

Scare quotes

First, random or stochastic means unknown, and nothing more. The outcome of a coin flip is random, i.e. unknown, because you don’t know all the causes at work upon the spinning object. It is not “random” because “chance” somehow grabs the coin, has its way with it, and then deposits the coin into your hand. Randomness and chance are not causes. They are not real objects. The outcome is determined by physical forces and that’s it.

Second, there is the unfortunate, spooky tendency in probability and statistics to assume that “randomness” somehow blesses results. Nobody knows how it works; that’s why it’s magic. Yet how can unknowingness influence anything if it isn’t an ontological cause? It can’t. Yet it is felt that if the data being input to algorithms aren’t “random” then the results aren’t legitimate. This is false, but it accounts for why simulations are so often sought.

Third, since randomness is not a cause, we cannot “generate” “random” numbers in the mystical sense implied above. We can, of course, make up numbers which are unknown to some people. I’m thinking of a number between 32 and 1400: to you, the number is random, “generated”, i.e. caused, by my feverish brain. (The number is hidden in the source code of this page, incidentally.)

Fourth, there are no such things as “uniforms”, “normals”, or any other distribution-entities. No thing in the world is “distributed uniformly” or “distributed normally” or distributed anything. Distributed-as talk is more magical thinking. To say “X is normal” is to ascribe to X a hidden power to be “normal” (or “uniform” or whatever). It is to say that magical random occult forces exist which cause X to be “normal,” that X somehow knows the values it can take and with what frequency.

This is false. The only things we are privileged to say are things like this: “Given this-and-such set of premises, the probability X takes this value equals that”, where “that” is calculated via some distribution implied by the premises. (Ignore that the probability X takes any value for continuous distributions is always 0.) Probability is a matter of ascribable or quantifiable uncertainty, a logical relation between accepted premises and some specified proposition, and nothing more.

Practicum

Fifth, since this is what probability is, computers cannot “generate” “random” numbers. What happens, in the context of our math above, is that programmers have created algorithms which will create numbers in the interval (0,1) (notice this does not include the end points); not in a coherent way, but with reference to some complex formula. This formula, if run long enough, will produce all the numbers in (0,1) at the resolution of the computer.

Say this is every 0.01; that is, our resolution is to the nearest hundredth. Then all the numbers 0.01, 0.02, …, 0.99 will eventually show up (many will be repeated, of course). Because they do not show up in sequence, many fool themselves into thinking the numbers are “random”, and others, wanting to hold onto the mysticism but understanding the math, call the numbers “pseudo random”, an oxymoron.
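A toy version shows what is going on. The sketch below is a linear congruential formula with constants I chose myself (a = 21, c = 17, m = 100, which satisfy the Hull-Dobell full-cycle conditions); it is not the formula any real software uses, and unlike real generators at this resolution it repeats nothing until the cycle is complete.

```python
from itertools import islice

def toy_generator(seed: int = 0, a: int = 21, c: int = 17, m: int = 100):
    """Yield numbers in (0, 1) at resolution 1/m from a fixed formula.

    The constants are illustrative choices that give a full cycle
    (they satisfy the Hull-Dobell conditions for m = 100).
    """
    state = seed
    while True:
        state = (a * state + c) % m
        if state != 0:          # skip 0 so the end points are excluded
            yield state / m

values = list(islice(toy_generator(), 99))
print(values[:10])              # out of sequence, yet fully determined
print(sorted(values) == [k / 100 for k in range(1, 100)])
```

Every hundredth from 0.01 to 0.99 turns up, just not in order. Nothing unknown went in, so nothing “random” came out.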

But we can sidestep all this and simply write down all the numbers in the sequence, i.e. all the numbers in $(0,1)^2$ (since we need U1 and U2) at whatever resolution we have; this might be (0.01, 0.01), (0.01, 0.02), …, (0.99, 0.99) (this is a sequence of pairs of numbers, of length 9801). We then apply the mapping of (U1, U2) to Z as given above, which produces (3.028866, 3.010924, …, 1.414971e-01).
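The write-it-all-down step can be sketched in a few lines. The function name and the `step` parameter are my own; the ordering (U1 changing slowest, U2 fastest) is assumed because it reproduces the first and last values quoted above.

```python
import math

def grid_normals(step: int = 100) -> list:
    """Apply the Box-Muller mapping to every pair on a regular grid.

    The grid is (i/step, j/step) for i, j = 1..step-1, so step=100
    gives the 0.01 resolution and 99 * 99 = 9801 pairs.
    """
    z = []
    for i in range(1, step):            # U1 varies slowest
        r = math.sqrt(-2.0 * math.log(i / step))
        for j in range(1, step):        # U2 varies fastest
            z.append(r * math.cos(2.0 * math.pi * j / step))
    return z

z = grid_normals(100)
print(len(z), z[0], z[1], z[-1])
```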

What it looks like is shown in the picture up top.

The upper plot shows the mappings of (U1, U2) to Z, along the index of the number pairs. If you’ve understood the math above, the oscillation, size, and sign changes are obvious. Spend a few moments with this. The bottom plot shows the empirical cumulative distribution of the mapped Z (black), overlaid by the (approximate) analytic standard normal distribution (red), i.e. the true distribution to high precision.

There is tight overlap between the two, except for a slight bump or step in the ECDF at 0, owing to the crude discretization of (U1, U2). Computers can do better than the nearest hundredth. Still, the error even at this crude level is trivial. I won’t show it, but even a resolution 5 times worse (nearest 0.05; number sequence length of 361) is more than good enough for most applications (a resolution of 0.1 is pushing it).

This picture gives a straightforward, calculate-this-function analysis, with no mysticism. But it works. If what we were after was, say, “What is the probability that Z is less than -1?”, all we have to do is ask. Simple as that. There are no epistemological difficulties with the interpretation.

The built-in analytic approximation is 0.159 (this is our comparator). With the resolution of 0.01, the direct method shows 0.160, which is close enough for most practical applications. A resolution of 0.05 gives 0.166, and 0.1 gives 0.172 (I’m ignoring that we could have shifted U1 or U2 to different start points; but you get the idea).
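Here is a rough sketch of that comparison, for anybody who wants to check my arithmetic. The function names are mine, and the comparator is computed with the error function from Python's standard library, which may not be the same built-in approximation I used for the figures above.

```python
import math

def prob_z_below(threshold: float, step: int) -> float:
    """Fraction of grid-mapped Z values that fall below `threshold`."""
    below = 0
    total = 0
    for i in range(1, step):
        r = math.sqrt(-2.0 * math.log(i / step))
        for j in range(1, step):
            z = r * math.cos(2.0 * math.pi * j / step)
            if z < threshold:
                below += 1
            total += 1
    return below / total

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function (the comparator)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(f"analytic comparator: {normal_cdf(-1.0):.3f}")
for step in (100, 20, 10):   # resolutions 0.01, 0.05, 0.1
    print(f"resolution {1 / step}: {prob_z_below(-1.0, step):.3f}")
```

Run it twice and the answers do not budge, which is the whole point.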

None of these have pluses or minuses, though. Given our setup (starting points for U1 and U2, the mapping function), these are the answers. There is no probability attached. But we would like to have some idea of the error of the approximation. We’re cheating here, in a way, because we know the right answer (to a high degree), which we won’t always. In order to get some notion of how far off that 0.160 is we’d have to do more pen-and-paper work, engaging in what might be a fair amount of numerical analysis. Of course, for many standard problems, just like in MCMC approaches, this could be worked out in advance.

MCMC etc.

Contrast this to the mystical approach. Just like before, we have to specify something like a resolution, which is the number of times we must “simulate” “normals” from a standard normal—which we then collect and form the estimate of the probability of less than -1, just as before. To make it fair, pick 9801, which is the length of the 0.01-resolution series.

I ran this “simulation” once and got 0.162; a second time 0.164; a third showed 0.152. There’s the first problem. Each run of the “simulation” gives different answers. Which is the right one? They all are; a non-satisfying but true answer. So what will happen if the “simulation” itself is iterated, say 5000 times, where each time we “simulate” 9801 “normals” and each time estimate the probability, keeping track of all the estimates? Let’s see, because that is the usual procedure.

Turns out 90% of the results are between 0.153 and 0.165, with a median and mean of 0.159, which equals the right answer (to the thousandth). It’s then said there’s a 90% chance the answer we’re after is between 0.153 and 0.165. This or similar intervals are used as error bounds, which are “simulated” here but (should be) calculated mechanically above. Notice that the uncertainty in the mystical approach feels greater, because the whole process is opaque and purposely vague. The numbers seem like they’re coming out of nowhere. The uncertainty is couched probabilistically, which is distracting.
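The iterated procedure looks roughly like the sketch below. It is not the code behind the numbers above: it uses Python's `random` module rather than R, an arbitrary seed, crude order-statistic quantiles, and only 200 iterations so that it finishes quickly. All of those are my assumptions for illustration.

```python
import math
import random

def simulate_estimate(n: int, rng: random.Random) -> float:
    """One "simulation": n Box-Muller draws, fraction with Z < -1."""
    below = 0
    for _ in range(n):
        u1 = 1.0 - rng.random()      # in (0, 1], avoids log(0)
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        if z < -1.0:
            below += 1
    return below / n

rng = random.Random(1)               # arbitrary seed, for repeatability
runs = 200                           # illustrative; far fewer than in the text
estimates = sorted(simulate_estimate(9801, rng) for _ in range(runs))

low = estimates[int(0.05 * runs)]            # rough 5% point
high = estimates[int(0.95 * runs) - 1]       # rough 95% point
median = estimates[runs // 2]
print(f"90% of runs fall between {low:.3f} and {high:.3f}; median {median:.3f}")
```

Change the seed and every figure shifts a little, which is exactly the complaint.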

It took 19 million calculations to get us this answer, incidentally, rather than the 9801 the mechanical approach produced. But if we increased the resolution to 0.005 there, we also get 0.159 at a cost of just under 40,000 calculations. Of course, MCMC fans will discover short cuts and other optimizations to implement.

Why does the “simulation” approach work, though? It does (at some expense) give reasonable answers. Well, if we remove the mysticism about randomness and all that, we get this picture:

Mystical versus mechanical.

The upper two plots are the results of the “simulation”, while the bottom two are the mechanical mapping. The bottom two show the empirical cumulative distribution of U1 (U2 is identical) and the subsequent ECDF of the mapped normal distribution, as before. The bump at 0 is there, but is small.

Surprise ending!

The top left ECDF shows all the “uniforms” spit out by R’s runif() function. The only real difference between this and the ECDF of the mechanical approach is that the “simulation” is at a finer resolution (the first U happened to be 0.01031144, 6 orders of magnitude finer; the U’s here are not truly plain-English uniform as they are in the mechanical approach). The subsequent ECDF of Z is also finer. The red lines are the approximate truth, as before.

But don’t forget, the “simulation” just is the mechanical approach done more often. After all, the same Box-Muller equation is used to map the “uniforms” to the “normals”. The two approaches are therefore equivalent!

Which is now no surprise: of course they should be equivalent. We could have taken the (sorted) Us from the “simulation” as if they were the mechanical grid (U1, U2) and applied the mapping, or we could have pretended the Us from the “simulation” were “random” and then applied the mapping. Either way, same answer.

The only difference (and advantage) seems to be in the built-in error guess from the “simulation”, with its consequent fuzzy interpretation. But we could have a guess of error from the mechanical algorithm, too, either by numerical analysis means as mentioned, or even by computer approximation (one way: estimate quantities using a coarse, then fine, then finest grid and measure the rate of change of the estimates; with a little analysis thrown in, this makes a fine solution).
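The coarse-then-fine idea can be sketched as follows. The particular grid sizes, and the use of the change between successive estimates as a crude error indicator, are my illustrative choices; a proper bound still needs the “little analysis thrown in”.

```python
import math

def grid_estimate(step: int, threshold: float = -1.0) -> float:
    """Mechanical estimate of Pr(Z < threshold) on a (step-1)^2 grid."""
    below = 0
    for i in range(1, step):
        r = math.sqrt(-2.0 * math.log(i / step))
        for j in range(1, step):
            if r * math.cos(2.0 * math.pi * j / step) < threshold:
                below += 1
    return below / (step - 1) ** 2

steps = (10, 20, 100, 200)          # resolutions 0.1, 0.05, 0.01, 0.005
estimates = [grid_estimate(s) for s in steps]
# How much the answer moves each time the grid is refined.
changes = [abs(b - a) for a, b in zip(estimates, estimates[1:])]

for s, e in zip(steps, estimates):
    print(f"resolution {1 / s}: {e:.4f}")
print("change between successive grids:", [round(c, 4) for c in changes])
```

When the answer stops moving as the grid gets finer, the last change gives a rough, probability-free feel for how far off we might still be.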

The benefit of the mechanical approach is the demystification of the process. It focuses the mind on the math and reminds us that probability is nothing but a numerical measure of uncertainty, not a live thing which imbues “variables” with life and which by some sorcery gives meaning and authority to results.

She said, he said.

Our beneficent government, through its Department of Education’s Office for Civil Rights, “sent a letter to colleges nationwide on April 4, 2011, mandating policy changes in the way schools handle sexual assault complaints, including a lowering of the burden of proof from ‘clear and convincing’ evidence to a ‘preponderance’ of evidence. Not surprisingly, there has been a marked increase in women coming forward with such complaints.”

The preponderance of evidence criterion is asinine and harmful and bound to lead to grief. Here’s why.

Suppose a woman, Miss W, instead of going to the police, shows up at one of her university’s various Offices Of Indignation[1] & Diversity and complains she was “sexually assaulted” by Mr X, a fellow student. By means of a lengthy and secretive process, Mr X is called eventually to deny the claim. He does so.

Incidentally, we may as well inject here the advice that if celibacy outside marriage were promoted at colleges, while the success rate of this program would never reach 100%, any rate above 0% solves for its dedicated individuals the sorts of problems discussed below.

Anyway, ignoring all other details, here is what we have: Miss W says Mr X did it, and Mr X denies. Using only that evidence and none other, there is to the neutral observer a 50-50 chance Mr X did the deed. Fifty-fifty does not a preponderance make; a preponderance is any amount over 50%. But since we start at 50% given she-said-he-said, it takes only the merest sliver of additional evidence to push the probability beyond 50% and into preponderance.

What might that evidence be? Anything, really. A campus Diversity Tzar might add to Miss W’s claim, “Miss W almost certainly wouldn’t have made the charge if it weren’t true”, which brings the totality of guilt probability to “almost certainly” (we cannot derive a number). Or the Tzar might say, “Most men charged with this crime are guilty”, which brings the guilt probability to “nearly certain”—as long as we supply the obvious tacit premises like “Mr X is a man and is charged with this crime.”

But this is going too far, and, depending on the university, our Tzar knows she might not be able to get away with such blanket statements. Instead she might use as evidence, “Miss W was crying, and victims of this crime often or always cry”, or “Miss W told another person about Mr X’s crime, which makes it more likely she was telling me the truth as telling more than one person, if her story is a lie, would be to compound a lie.”

Now none of these are good pieces of evidence; indeed, they are circumstantial to the highest degree. But. They are not completely irrelevant premises, either. As long as we can squeeze the weest, closest-to-epsilon additional probability from them, they are enough to push the initial 50% to something greater than 50%.

And that is all we need to crush Mr X, for we have reached a preponderance of evidence. Of course, Mr X may counter or cancel this evidence with his own protestations, or even physical proof that he was nowhere near the scene in question, or that Miss W drunk-texted him first and asked for the services which she later claimed were “assault.” But the Tzar, having all the woes of all feminine society on her mind, is free to ignore any or all of this.

Mr X, guilty or innocent, is therefore easy to “prove” guilty using this slight standard. He can then be punished in whatever way thought appropriate by the university.

That brings up another question. Suppose you gather all the relevant evidence and decide that the chance of the zombie apocalypse is just under 50%. Or again, given reliable premises you calculate the probability that the woman who just winked at you from across the bar has Ebola is 49.999%. You therefore decide that since the preponderance of evidence is against both propositions, you needn’t protect yourself.

You have it. The probability of 50% is in no way the probability to use for all yes-no decisions. Decisions have consequences and these must be taken into account. Should we wreck a man when the evidence against him amounts only to 50.001%? Too, if we use in every situation the preponderance criterion, the number of mistakes made will be great.
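One way to see why 50% is not a universal cutoff is a bare-bones expected-cost comparison. The rule below (act when probability times loss exceeds the cost of acting) is a standard textbook device that I am assuming for illustration, and the numbers are hypothetical.

```python
def should_act(p_event: float, cost_of_acting: float, loss_if_unprepared: float) -> bool:
    """Act when the expected loss from not acting exceeds the cost of acting."""
    return p_event * loss_if_unprepared > cost_of_acting

# Hypothetical numbers, chosen only to show that the break-even
# probability depends on the stakes and is not fixed at 50%.
cheap_precaution = should_act(0.49999, cost_of_acting=1.0, loss_if_unprepared=1000.0)
ruinous_action = should_act(0.50001, cost_of_acting=1000.0, loss_if_unprepared=1.0)
print(cheap_precaution)   # True: protect yourself even below 50%
print(ruinous_action)     # False: hold off even above 50%
```

Same sliver either side of 50%, opposite decisions, because the consequences differ.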

This is why in actual criminal courts, where the standards of evidence are in play and the accused is allowed to confront his accuser and so on, the standard is guilt beyond reasonable doubt, a sane and sober principle.

—————————————————-

1. The indignation quip came from this.

Was it Justice Ginsburg who popularized the fallacy that statistics could prove discrimination? Somebody check me on that. Busy day here.

The fallacy is that statistical models which have “statistically significant” findings identify causes. Which they sometimes can, in an informal way, but probably don’t. Anyway, that’s for another day. Point now is that discovering “disparities” and “gaps” and “discrimination” via statistics is silly.

Headline from the New York Post: Goldman Sachs’ differentiating stats dominate sex suit.

Standard story. Couple of dissatisfied women accused Goldman Sachs of being boy friendly. “They are suing the financial powerhouse, alleging a pattern of underpaying women and promoting men over them.” The ladies’ lawyer and Goldman Sachs each hired their own statistician. The fallacy is already in place, ready to be called upon.

If there was real discrimination against women because they were women what should happen is that it should be proved. How? Interviews with employees, managers, ex-employees, examination of emails, memos, that sort of thing. Hard work, which, given the nature of human interactions, may ultimately be ambiguous, useless to prove anything.

Statistics certainly can’t prove discrimination, because statistics don’t identify causes. And it’s what causes the alleged discrimination that is the point in question. Since statistics can’t answer that question—which everybody should know—why would anybody ever use it?

Laziness, for one. Who wants to do all that other work? For two, it’s easy to get people to accept “discrimination” happened because math. Lawyers working on commission therefore love statistics.

The Post said, “The bank’s expert, Michael Ward of Welch Consulting, said the pay disparities between men and women are statistically insignificant and said Farber [the ladies' expert] was overly broad in his analysis.” “Significance” and “insignificance” are model and test dependent, so it’s easy for one expert to say “insignificant” and another “significant.” The data can “prove” both conclusions.

But something else is going on here, I think. Note that according to Farber, “Female vice presidents at Goldman made an average of 24 percent less than their male counterparts”. Ah, means. An easily abused statistic.

There’s more. Here are the final two paragraphs (ellipsis original). See if you can spot the probable error. Hint: the mistake, if there is one, appears to be Farber’s. Of course, there might be no error at all. We’re just guessing.

Farber looked at divisions across the bank, rather than at smaller business units, which, according to Ward, muddied the statistical data.

“Breaking it up into these little pieces means you just won’t find these pay gaps,” Farber said. “It’s always a trade-off in this kind of analysis in getting lost in the trees…or saying something about the forest.”

Get it? Take a moment and think before reading further. It’s more fun for you to figure it out than for me to tell you.

This sentence is to fill space so you don’t easily see the answer.

So is this one.

This might be a case of Simpson’s (so-called) paradox. This happens when data looked at in the aggregate, such as mean pay for men and women at the Division level, shows (say) men with higher means, but when the same data is examined at finer levels like business units, it can show women with higher or the same means as men in each unit (or a mix).

The reason this happens is that the percent of men and women isn’t the same inside each of the finer levels, and the mean pay differs by level (no surprise). This link shows some easy examples. It’s more common than you’d think.
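A small made-up example shows the mechanics. The units, headcounts, and pay figures below are entirely hypothetical and have nothing to do with the actual Goldman Sachs data.

```python
# Hypothetical pay data: (unit, sex) -> (headcount, mean pay).
pay = {
    ("A", "men"): (80, 100.0),
    ("A", "women"): (20, 105.0),
    ("B", "men"): (20, 50.0),
    ("B", "women"): (80, 55.0),
}

def overall_mean(sex: str) -> float:
    """Headcount-weighted mean pay for one sex across all units."""
    total_pay = sum(n * mean for (_unit, s), (n, mean) in pay.items() if s == sex)
    headcount = sum(n for (_unit, s), (n, _mean) in pay.items() if s == sex)
    return total_pay / headcount

for unit in ("A", "B"):
    print(unit, "men:", pay[(unit, "men")][1], "women:", pay[(unit, "women")][1])
print("overall men:", overall_mean("men"), "women:", overall_mean("women"))
```

Women out-earn men inside each unit, yet the aggregate shows men with the higher mean, because most of the men sit in the better-paid unit.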

Farber looked at aggregates and Ward (properly) examined smaller units, a practice which Farber calls “muddying” the data. Well, it’s a strategy. The name-calling, I mean. Judges looking for an excuse appreciate it.

Even though it’s still looking at statistics, and to be discouraged, it’s better to look at the entire pay distributions, not just means, and at even finer levels, say business units and various years of experience in the same job title. But it’s chasing fairies. It can never prove anything.

And even if the data show a difference everywhere it could be that women in each unit are paid less, but not because they are women, but because women in negotiating their “packages” might do so inefficiently compared to men. Who knows?

Statistics is no substitute for hard work.

The official numbers

When does more crime happen, in winter or summer? Why? Too easy. How about this one: according to the FBI, what was the violent crime rate over time? No need to guess. It’s pictured above. The per capita all violent crime percent from 1960 to 2012 (the last year available). Looks to be coming down some since 1991, wouldn’t you say? (The plots for other crime types, including gun crimes, all have the same general shape.)

Say, isn’t the time range of this plot the period where the out-of-control global warming “climate catastrophe” began in earnest? Let’s look at what NOAA’s GISS says:

The official numbers

I’m not in the least interested in arguing about this data; for the sake of argument, let’s just accept it as it is. Look, however, only at the black dots, which are the actual data. The red line is a smoother, i.e. a model, and is not what happened. The model is not the data! Don’t smooth your time series data! (Look here and here for why.)

Let’s tie it all together. Does it look to you like climate change is “correlated” with the violent crime rate? If you’re Chris Mooney or an academic hot for a sensational paper or a member of the media anxious to signal your cooperation with government, you must say yes. Us ordinary folk, not addled by ideology, will say no.

The Washington Post put up yet another fantasy of Mooney’s entitled “There’s a surprisingly strong link between climate change and violence”. I don’t mean to be snarky, I really don’t. But this guy routinely provokes me beyond my ability to resist. May the Lord forgive me.

Mooney cites some new meta analysis, a study I’ll dissect in due course, “of the existing research examining the relationship between climate change and violence and conflict.” Here’s the meat:

Climate variables considered in these papers included temperature increases as well as drought and rainfall changes. Conflict was analyzed in terms of clashes between individuals (like fistfights) and fights between groups (like wars). After taking it all in, the authors found compelling evidence of a link between changes in temperature and increases in conflict, noting that “deviations from moderate temperatures and precipitation patterns systematically increase the risk of conflict, often substantially, with average effects that are highly statistically significant.” Bottom line: In an ever warming world, expect more wars, civil unrest, and strife, and also more violent crime in general.

Yes, that makes sense. A statistical model which analyzes simultaneously fist fights and wars. Almost as sensible as measuring how eight-year-olds spend their allowance and the machinations of the World Bank. Hey! It’s science!

The lesson is: never ever not ever never never believe a meta analysis at its face value. It is one of the most abused statistical techniques. Smoothing time series data is another. Never mind.

Mooney gets one thing partly right when he asks, “Why do hotter temperatures produce more violence?” The obvious answer—as long as we factor out all modern wars, many of which inconveniently occur in winter; in olden days, winter made it difficult to fight; who could have guessed?—is the one we started this post with. People are out in the summer’s long warm days, and inside in the winter’s short cold days. Easy.

Yet not so easy for Mooney and for academics for whom the obvious is never good enough.

Now I would have ignored the article, putting it down as yet another attempt to prove our lying eyes aren’t seeing what they’re seeing (the two graphs above). But Mooney had to go and mention baseball. (I’m a Tigers fan. I don’t want to talk about it.) Mooney thinks a paper he uncovered is terrific proof that climate change makes us more violent.

He quotes from the awful peer-reviewed paper “Temper, Temperature, and Temptation: Heat-Related Retaliation in Baseball” in Psychological Science (2011; 22(4) 423–428) by Richard P. Larrick and some others. Larrick checked whether increasing temperatures were associated with more beanballs. The authors admitted they were not.

So, their theory busted but still desiring a paper, the authors had to try something else. How about retaliation? Do increasing temperatures cause more? Mooney shows a graph from their paper which is so silly that I refuse to picture it. He presents this graph, as do the authors, as if it were data. Which it is not. It is the output from a preposterously complex regression model (they “control” for 13 things!).

Baseball fans: when do more beanballs, and hence more retaliations take place, in chilly April when the season has just begun and all are of good cheer, or late in hot August when tempers are up and when games start to feel a lot more crucial? Is the observed discrepancy therefore caused by climate change?

Good grief, what a rotten paper, what a rotten theory.