William M. Briggs

Statistician to the Stars!


How To Mislead With P-values: Logistic Regression Example

Today’s evidence is not new; it is, in fact, well known. Well, make that just plain known. It’s learned and then forgotten, dismissed. Everybody knows about these kinds of mistakes, but everybody is sure they never happen to them. They’re too careful; they’re experts; they care.

It’s too easy to generate “significant” answers which are anything but significant. Here’s yet more—how much do you need!—proof. The pictures below show how easy it is to falsely generate “significance” by the simple trick of adding “independent” or “control variables” to logistic regression models, something which everybody does.

Let’s begin!

Recall our series on selling fear and the difference between absolute and relative risk, and how easy it is to scream, “But what about the children!” using classical techniques. (Read that link for a definition of a p-value.) We anchored on EPA’s thinking that an “excess” probability of catching some malady when exposed to something regulatable of around 1 in 10 thousand is frightening. For our fun below, be generous and double it.

Suppose the probability of having the malady is the same for exposed and not exposed people—in other words, knowing people were exposed does not change our judgment that they’ll develop the malady—and answer this question: what should any good statistical method do? State with reasonable certainty there aren’t different chances of infection between being exposed and not exposed groups, that’s what.

Frequentist methods won’t do this because they never state the probability of any hypothesis. They instead answer a question nobody asked, about the values of (functions of) parameters in experiments nobody ran. In other words, they give p-values. Find one less than the magic number and your hypothesis is believed true—in effect and by would-be regulators.

Logistic regression

Logistic regression is a common method to identify whether exposure is “statistically significant”. Readers interested in the formalities should look at the footnotes in the above-linked series. The idea is simple enough: data showing whether people have the malady or not, and whether they were exposed or not, is fed into the model. If the parameter associated with exposure has a wee p-value, then exposure is believed to be trouble.

So, given our assumption that the probability of having the malady is identical in both groups, a logistic regression fed data consonant with our assumption shouldn’t show wee p-values. And the model won’t, most of the time. But it can be fooled into doing so, and easily. Here’s how.

Not just exposed/not-exposed data is input to these models, but “controls” are, too; sometimes called “independent” or “control variables.” These are things which might affect the chance of developing the malady. Age, sex, weight or BMI, smoking status, prior medical history, education, and on and on. Indeed models which don’t use controls aren’t considered terribly scientific.

Let’s control for things in our model, using the same data consonant with probabilities (of having the malady) the same in both groups. The model should show the same non-statistically significant p-value for the exposure parameter, right? Well, it won’t. The p-value for exposure will on average become wee-er (yes, wee-er). Add in a second control and the exposure p-value becomes wee-er still. Keep going and eventually you have a “statistically significant” model which “proves” exposure’s evil effects. Nice, right?


Take a gander at this:

Figure 1

Follow me closely. The solid curve is the proportion of times in a simulation the p-values associated with exposure were less than the magic number as the number of controls increases. Only here, the controls are just made-up numbers. I fed 20,000 simulated malady yes-or-no data points consistent with the EPA’s threshold (times 2!) into a logistic regression model, 10,000 for “exposed” and 10,000 for “not-exposed.” For the point labeled “Number of Useless Xs” equal to 0, that’s all I did. Concentrate on that point (lower-left).

About 0.05 of the 1,000 simulations gave wee p-values (dotted line), which is what frequentist theory predicts. Okay so far. Now add 1 useless control (or “X”), i.e. 20,000 made-up numbers1 which were picked out of thin air. Notice that now about 20% of the simulations gave “statistical significance.” Not so good: it should still be 5%.

Add some more useless numbers and look what happens: it becomes almost a certainty that the p-value associated with exposure will fall below the magic number. In other words, adding in “controls” guarantees you’ll be making a mistake and saying exposure is dangerous when it isn’t.2 How about that? Readers needing grant justifications should be taking notes.

The dashed line is for p-values less than the not-so-magic number of 0.1, which is sometimes used in desperation when a p-value of 0.05 isn’t found.

The number of “controls” here is small compared with many studies, like the Jerrett papers referenced in the links above; Jerrett had over forty. Anyway, these numbers certainly aren’t out of line for most research.

A sample of 20,000 is a lot, too (but Jerrett had over 70,000), so here’s the same plot with 1,000 per group:

Figure 2

Same idea, except here notice the curve starts well below 0.05; indeed, at 0. Pay attention! Remember: there are no “controls” at this point. This happens because it’s impossible to get a wee p-value for sample sizes this small when the probability of catching the malady is low. Get it? You cannot show “significance” unless you add in controls. Even just 10 are enough to give a 50-50 chance of falsely claiming success (if it’s a success to say exposure is bad for you).

Key lesson: even with nothing going on, it’s still possible to say something is, as long as you’re willing to put in the effort.3

Update You might suspect this “trick” has been played when, in reading a paper, you never discover the “raw” numbers, and all that is presented is a model. This does happen.


1To make the Xs in R: rnorm(1)*rnorm(20000); the first rnorm is for a varying “coefficient”. The logistic regression simulations were done 1,000 times for each fixed sample size at each number of fake Xs, using the base rate of 2e-4 for both groups and adding the Xs in linearly. Don’t trust me: do it yourself.
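The R recipe in this footnote can be sketched in Python as well. A rough, scaled-down sketch, not the original code: the function names are mine, the base rate is raised to 0.05 so this one small fit stays numerically stable, and only a single repetition is shown (the figures above came from 1,000 repetitions at base rate 2e-4).

```python
# One repetition of the simulation: generate exposed/not-exposed outcomes
# with IDENTICAL malady probability, append useless "control" columns of
# pure noise, fit a logistic regression by Newton-Raphson, and read off
# the Wald p-value for the exposure coefficient.
import math
import numpy as np

rng = np.random.default_rng(42)

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson MLE for logistic regression; returns (beta, se)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1.0 - mu)
        H = (X * W[:, None]).T @ X          # Fisher information X'WX
        beta = beta + np.linalg.solve(H, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    W = mu * (1.0 - mu)
    cov = np.linalg.inv((X * W[:, None]).T @ X)
    return beta, np.sqrt(np.diag(cov))

def exposure_pvalue(n_per_group=1000, n_useless=5, base_rate=0.05):
    n = 2 * n_per_group
    exposed = np.repeat([0.0, 1.0], n_per_group)
    # Same malady probability in both groups: exposure truly does nothing.
    y = (rng.random(n) < base_rate).astype(float)
    cols = [np.ones(n), exposed]
    cols += [rng.normal(size=n) for _ in range(n_useless)]  # useless Xs
    X = np.column_stack(cols)
    beta, se = fit_logistic(X, y)
    z = beta[1] / se[1]                           # exposure coefficient
    return math.erfc(abs(z) / math.sqrt(2.0))     # two-sided Wald p-value

p = exposure_pvalue()
print(p)
```

Looping exposure_pvalue over increasing n_useless and counting how often the result falls below 0.05 reproduces the shape of the curves above; with the rare-event base rate of 2e-4 the inflation is far more dramatic.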

2The wrinkle is that some researchers won’t keep some controls in the model unless they are also “statistically significant.” But some which are not are also kept. The effect is difficult to generalize, but is in the direction of what we’ve done here. Why? Because, of course, in these 1,000 simulations many of the fake Xs were statistically significant. Then look at this (if you need more convincing): a picture as above but only keeping, in each iteration, those Xs which were “significant.” Same story, except it’s even easier to reach “significance”.

3The only thing wrong with the pictures above is that half the time the “significance” in these simulations indicates a negative effect of exposure. Therefore, if researchers are dead set on keeping only positive effects, then the numbers (everywhere but at 0 Xs) should be divided by about 2. Even then, p-values perform dismally. See Jerrett’s paper, where he has exposure to increasing ozone as beneficial for lung diseases. Although this was the largest effect he discovered, he glossed over it by calling it “small.” P-values blind.


Bacteria Found In Holy Water

Safe at last!

Study making the rounds yesterday was “Holy springs and holy water: underestimated sources of illness?” in the Journal of Water & Health, by Kirschner (a national chess master) and others.

They sampled holy water in Vienna churches and hospital chapels and discovered traces of Pseudomonas aeruginosa and Staphylococcus aureus, and where these come from you don’t want to know. However, it is clear from this evidence that at least some parishioners did not heed sister’s rule to wash after going.

The authors also traveled the city to its holy springs and found that about eighty percent of these had various impurities, some of them at (European) regulatable levels.

Doubtless the findings of Kirschner are true—and of absolutely no surprise to anybody who reads (or helps create) the medical literature. Three or four times a year new studies issue forth showing that doorknobs have bacteria on them, or that the pencil you’re chewing on has lingering traces of some bug, or that doctor’s ties (I did this) are not only ugly but happy home to nasties of all sorts.

So many studies like this are there that it is safe to conclude that absolutely everywhere and everything is infected, and that the only sterile place on the planet is inside one of those bubbles John Travolta gadded about in during the beloved 1976 classic The Boy in the Plastic Bubble.

Since the stated purpose of the authors was to “raise public awareness” of the dangers lurking in holy water, I’ll do my bit to help. It’s good advice not to sip from the parish font or to get too cozy with the aspersory. Not only could it be injurious to your health, but it’s in bad taste.

The authors also recommend not drinking from holy springs because they fret over its little wigglies. But since there’s little evidence of a practical effect from this—lots of people drink from the springs without keeling over—it’s probably not worth changing your habits. Keep opening doors, too, and chewing on pencils and go to your doctor even though he wears a tie.

(There’s a nun joke in there somewhere, but I’m still jet lagged. Invent your own.)


Econometric Drinking Games, WSJ Edition: Update

Two economists researching new games.

Jim Fedako sent in this Wall Street Journal column, written by one Dan Ariely, a “Professor of Psychology and Behavioral Economics.”

A lady wrote Ariely asking for economic party games. Ariely suggested this one:

Give each of your guests a quarter and ask them to predict whether it will land heads or tails, but they should keep that prediction to themselves. Also tell them that a correct forecast gets them a drink, while a wrong one gets them nothing.

Then ask each guest to toss the coin and tell you if their guess was right. If more than half of your guests “predicted correctly,” you’ll know that as a group they are less than honest. For each 1% of “correct predictions” above 50% you can tell that 2% more of the guests are dishonest. (If you get 70% you will know that 40% are dishonest.) Also, observe if the amount of dishonesty increases with more drinking. Mazel tov, and let me know how it turns out!

Let’s see how useful these rules are.

Regular readers have had it pounded into their heads that probability is always conditional: we proceed from fixed evidence and deduce its logical relation to some proposition of interest. The proposition here is some number of individuals guessing correctly on coin flips.

What is our evidence? The standard bit about coins plus what we know about a group of thirsty bored people. Coin evidence: two-sided object, just one side of which is H, the other T, which when flipped shows only one. Given that evidence, the probability of an H is 1/2, etc. That’s also the probability of guessing correctly, assuming just the coin evidence.

If there were one party guest, the probability is thus 1/2 she’ll guess right. Suppose she guessed right and said so honestly. Then 100% of the guests claimed accuracy, and we can score the game using Ariely’s rules. Take the percentage of guests who predicted accurately over 50% and multiply this percentage by 2%. (He gave the example of 70% correct guesses, which is 20% over 50%, and 20% x 2% = 40% dishonest guests.)

Since 100% of the guests claimed accuracy, our example has 50% above 50%, thus “you can tell” 2% x 50% = 100% of the guests are cheating. Harsh! You’d toss your invitee out on her ear before she could even take a sip.

If there were two guests, the probability both honestly shout “Down the hatch!” is 25%. How? Well, both could guess wrong, the first one right with the second wrong, the first wrong with the second right, or both right. 25% chance for the last, as promised. Suppose both were honestly right. We again have 100% correct answers, making another 50% above 50%. According to Ariely, we can tell 2% x 50% or 100% “of the guests are dishonest.” Tough game! Seems we’re inviting people over for the express purpose of calling them liars.

Now suppose just one guest (of two) claimed he was right. We have 0% over 50%, or 2% x 0% = 0% dishonest guests. But the gentleman who claimed accuracy, or even both guests, easily could have been lying. The second who said she guessed incorrectly might have been a teetotaler wanting to be friendly. Or the second could have guessed incorrectly, and so did the first, but he really needed a drink. Who knows?

If you had 10 guests and 6 claimed accuracy, then (with an excess of 10%) 2% x 10% = 20% of your guests, or two of them, are labeled liars. Yet there is a 21% chance 6 people would guess correctly using just the coin information. Saying there are 2 liars with such a high chance of that many correct guesses is pretty brutal.
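The arithmetic above can be checked exactly (a quick sketch; ariely_dishonest_pct is my own name for Ariely’s rule, not his):

```python
# Exact binomial check of the figures in the text.
from math import comb

def ariely_dishonest_pct(correct_pct):
    """Ariely's rule: 2% branded dishonest per 1% of correct claims above 50%."""
    return max(0.0, 2.0 * (correct_pct - 50.0))

# 10 guests, 6 claiming a correct guess: the rule brands 20% (2 guests) liars...
print(ariely_dishonest_pct(60.0))   # 20.0

# ...yet the chance exactly 6 of 10 honestly guess a fair flip correctly:
p6 = comb(10, 6) / 2**10
print(round(p6, 3))                 # 0.205, the ~21% in the text

# And for 100 guests, exactly 51 correct (the 7.8% in the update below):
p51 = comb(100, 51) / 2**100
print(round(p51, 3))                # 0.078
```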

Ariely’s rules, in other words, are fractured.

So let’s think of workable games. I suggest two.

(1) Invite economists to use their favorite theory to make accurate predictions of any kind, three times successively. Those who fail must resign their posts, those who succeed are re-entered into the game and must continue playing until they are booted or they retire.

(2) Have guests be contestants in your own version of Monty Hall. Use cards: two number cards as “empty” doors and an Ace as the prize. Either reward your guests with a drink for (eventually) picking correctly, or punish them with one for picking incorrectly (if you think drinking is a sin).

Update In the original version I misspelled, in two different ways (not a record), Ariely’s name. I beg his pardon.

Update Mr Ariely was kind enough to respond to me via email, where he said he had in mind a party with a very large number of guests. This was my reply:

Hi Dan,

I supposed that’s what you meant, but it’s still wrong, unfortunately.

If you had 100 guests there’s a 7.8% chance 51 guess correctly (and truthfully). But the rules say 1% x 2% = 2% of the guests, or 2 of them, are certainly lying. Just can’t get there from here.

Worse, the more people there are the more the situation resembles the one with just two guests, where both forecasted incorrectly but where one said he was right. In that case the rules say nobody cheated. But one did.

The more guests there are the easier it is to cheat and not be accused of cheating, too. You just wait until you see how many people said they were right, and as long as this number isn’t going to make 50 or so, you can lie (if you had to) and never be accused.

There’s no fixing the game, either. Suppose all 100 guests said they answered correctly. Suspicious, of course, but since there is a positive chance this could happen, you can’t claim (with certainty) *anybody* lied. All you could do is glare at the group and say, “The chance that all of you are telling the truth is only 10^-30!”

But then some wag will retort, “Rare things happen.” To which there is no reply.

There might be a way to make a logic game of this, but my head is still fuzzy from jet lag and I can’t think of it.

Also, apologies for (originally) misspelling your name!



Most Probabilities Aren’t Quantifiable

Look at those colorful numbers!

We’ve done this before in different form. But it hasn’t stuck; plus we need this for reference.

Not all probability is quantifiable. The proof of this is simple: all that must be demonstrated is one probability that cannot be made into a unique number. I’ll do this in a moment, but first it is interesting to recall that in its infancy it wasn’t clear probability could or should be represented numerically. (See Jim Franklin’s terrific The Science of Conjecture: Evidence and Probability Before Pascal.) It is only obvious probability is numerical when you’ve grown up subsisting solely on a diet of numbers, a condition true of any working scientist.

The problem is that because some probabilities are numerical, a probability only feels real, scientific, and weighty when it is stated numerically. Nobody wants to make decisions based on mere words, not when figures can be used. Result? Over-certainty.


Kolmogorov, in 1933’s Foundations of the Theory of Probability, gave us axioms which put probability on a firm footing. Problem is, the first axiom said, or seemed to say, “probability is a number”, and so did the second (the third gave a rule for manipulating these numbers). The axioms also require a good dose of mathematical training to comprehend, which contributed to the idea that probabilities are numbers.

Different, not-so-rigorous, but nevertheless appealing axioms were given by Cox in 1961. Their appeal was their statement in plain English and concordance with common sense. (Cox’s lack of mathematical rigor was subsequently fixed by several authors.1) Now these axioms yield two interesting results. First is that probability is always conditional. We can never write (in standard symbols) Pr(A), which reads “The probability of proposition A”, but must write Pr(A|B), “The probability of A given the premise or evidence B.” This came as no shock to logicians, who knew that the conclusion of any argument must be “conditioned on” premises or evidence of some kind, even if this evidence is just our intuition. This result didn’t shock anybody else, either, because it’s rarely remembered: another victim of treating probability exclusively mathematically.

The second result sounds like numbers. Certainty has probability 1, falsity probability 0, just as expected. And, given some evidence B, the probability of some A plus the probability that A is false must equal 1: that is, it is a certainty (given B) that either A or not-A is true. Numbers, but only sort of, because there is no proof that for any A or B, Pr(A|B) will be a number. And indeed, there can be no proof, as you’ll discover. In short: Cox’s proofs are not constructive.

Cox’s axioms (and their many variants) are known, or better to say, followed by only a minority of physicists and Bayesian statisticians. They are certainly not as popular as Kolmogorov’s, even though following Cox’s trail can and usually does lead to Kolmogorov. Which is to say, to mathematics, i.e. numbers.

Numberless probability

Here’s our example of a numberless probability: B = “A few Martians wear hats” and A = “The Martian George wears a hat.” There is no unique Pr(A|B) because there is no unique map from “a few” to any number. The only way to generate a unique number is to modify B. Say B’ = “A few, where ‘a few’ means 10%, Martians wear hats.” Then Pr(A|B’) = 0.1. Or B” = “A few, where ‘a few’ means never more than one-half…” Then 0 < Pr(A|B”) < 0.5. It should be obvious that B is not B’ nor B” (if it isn’t, you’re in deep kimchi). More examples are had by changing “a few” to “some”, “most”, “a bunch”, “not so many” and on and on, none of which lead to a unique probability. This is all true even though, in each case, Pr(A|B) + Pr(not-A|B) = 1. (Why? Because that formula is a tautology.)
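The point can be made concrete with a toy interval representation (the Interval class is purely illustrative, my own invention, not any standard library): under B″ the probability of A is known only to lie in an interval, yet Pr(A|B″) + Pr(not-A|B″) = 1 still holds as a constraint.

```python
# A probability known only up to an interval still obeys the complement rule.
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def complement(self):
        """If Pr(A) lies somewhere in [lo, hi], Pr(not-A) lies in [1-hi, 1-lo]."""
        return Interval(1.0 - self.hi, 1.0 - self.lo)

# B'': "a few" means never more than one-half, so Pr(A|B'') is in (0, 0.5)
pr_A_given_Bpp = Interval(0.0, 0.5)
pr_notA = pr_A_given_Bpp.complement()
print(pr_notA)   # Interval(lo=0.5, hi=1.0)
```

No unique number for Pr(A|B″) ever appears, yet the tautology Pr(A|B″) + Pr(not-A|B″) = 1 is respected at every point of the interval.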

It turns out most probability isn’t quantifiable, because most judgments of uncertainty cannot be and are not stated numerically. “Scientific” propositions, many of which can be quantified, are very rare in human discourse. Consider this, from which you will see it is easy to generate endless examples. B (spoken by Bill) = “I might go over to Bob’s” as the sole premise for A = “Bill will go to Bob’s”. Note very carefully that this is your premise, not Bill’s. It is your uncertainty in A given B that is of interest. The only way to come to a definite number is by adding to B; perhaps by your knowledge of Bill’s habits. But if you were a bystander and overheard the conversation, you wouldn’t know how to add to B, unless you did so by subtle hints of Bill’s dress, his mannerisms, and things like that. Anyway, all these change B, and make it into something which is not B. That’s cheating. If asked for Pr(A|B) one must provide Pr(A|B) and not Pr(A|B’) or anything else.

This seemingly trivial rule is astonishingly difficult to remember or to heed if one is convinced probability is numerical. It would never be violated when working through a syllogism, say, or calculating a mathematical proof, where blatant additions to specified evidence are rejected out of hand. A professor would never let a student change the problem so that the student can answer it. Not so with probabilities. People will change the problem to make it more amenable. “Subjective” Bayesians make a career out of it.

Why is the rule so hard? No sooner will you ask somebody what is Pr(A|B) and they’ll say, “Well there’s lot of factors to consider…” There are not. There is only one, and that is B’s logical relation to A. Anything else, however interesting, is not relevant. Unless one wants to change the problem and discover the plausible evidence B’ which gives A its most extreme probability (nearest to 0 or 1). The modifier “plausible” is needed, because it is always possible to create evidence which makes A true or false (e.g. B = “A is impossible”). The plausibility is to fit the evidence into a larger scheme of propositions. This is a large topic, skipped here, because it is incidental.

Lots of detail left out here, which you have to fill in. See the classic posts page for how.

Update 2 Fixed the d*&^%^*&& typo that one of my enemies placed in the equation below. Rats!

Update An algebraic analogy. “If y = 1 and x + y < 7, solve for x.” There isn’t enough information provided to derive a unique value for x. It thus would be absurd, and obviously so, to say, “Well, I feel most x are positive; I mean, if I were to bet. And I’ve seen a lot of them around 3, though I’ve come across a few 4s too. I’m going with 3.”

Precision is often denied us. As silly as this example is, we see its equivalent occur in probability all the time.


1See inter alia Dupré and Tipler, 2009. “New Axioms for Rigorous Bayesian Probability.” Bayesian Analysis, 4(3), 599–606.


© 2015 William M. Briggs
