# William M. Briggs

### Statistician to the Stars!


Today’s title is lifted directly from the paper of E. J. Masicampo & Daniel R. Lalande, published in The Quarterly Journal of Experimental Psychology. The paper is here, and is free to download. The abstract says it all:

In null hypothesis significance testing (NHST), p values are judged relative to an arbitrary threshold for significance (.05). The present work examined whether that standard influences the distribution of p values reported in the psychology literature. We examined a large subset of papers from three highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals. We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.

File our reaction under “Captain Renault, Shocked.” Regular readers will know that the cult of the p-value practically guarantees Masicampo & Lalande’s result (p < 0.0001). We will not be surprised if this finding is duplicated in other journals.

Here’s what happened: Our stalwart pair thumbed through back issues of several psychology journals and tabulated the appearance of 3,627 p-values, then plotted them:

A significant bump in p-values (p < 0.0001)

Though perhaps hard to see in this snapshot, there are unexpected bumps in the distribution of p-values at the magic value: the value below which life is grand, the number above which consists of weeping and gnashing of teeth. Understand that these are the p-values scattered throughout papers, and not just “the” p-values which “prove” that the authors’ preconceptions are “true,” i.e. the p-values of the main hypotheses.

Masicampo and Lalande rightly conclude:

This anomaly is consistent with the proposal that researchers, reviewers, and editors may place undue emphasis on statistical significance to determine the value of scientific results. Biases linked to achieving statistical significance appear to have a measurable impact on the research publication process.

The only thing wrong with the first sentence is the word “may”, which can be deleted; the only thing wrong with the second sentence is the word “appear”, which can likewise be deleted.

Why p-values? Why are they so beloved? Why, given their known flaws and their ease of abuse, are they tolerated? Well, they are a form of freedom. P-values make the decision for you: thinking is not necessary. A number less than the magic threshold is seen as conclusive, end of story. Plug your data into automatic software and out pops the answer, ready for publishing.

But this reliance “exposes an overemphasis on statistical significance, which statisticians have long argued is hurtful to the field (Cohen, 1994; Schmidt, 1996) due in part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009; Rozeboom, 1960).”

I left those references in so you can see that it is not just Yours Truly who despairs over the use of p-values. One of these references is “Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49, 997–1003.” This is a well-known paper, written by a non-statistician, which I encourage you to seek out and read.

The real finding is the subtle confirmation bias that seeps into peer-reviewed papers: the conscious-or-not shading of results in the direction of the authors’ hopes. Everybody thinks confirmation bias happens to the other guy. Nobody can see his own fingers slip.

Everybody also assumes that the other fellows publishing papers are “basically honest.” And they are, basically. More or less.

Update Reader Koene Van Dijk notes the paper is no longer available free, but gives us the email address of the authors: masicaej@wfu.edu or lalande.danielr@gmail.com.

———————————————————-

Thanks to the many readers who sent this one in, including Dean Dardno, Bob Ludwick, Gary Boden; plus @medskep at Twitter from whom I first learnt of this paper.

I apologize for the abruptness of the notation. It will be understandable only to a few. I don’t like to use it without sufficient background because the risk of reification is enormous and dangerous. But if I did the build up (as we’re doing in the Evidence thread), I’d risk a revolt. So here is the alternative to p-values—to be used only in those rare cases where probability is quantifiable.

Warning two: for non-mathematical statisticians, the recommendations here won’t make much sense. Sorry for that. But stick around and I’ll do this all over more slowly, starting from the beginning. Start with this thread.

Note in vain attempt to ward off reification: discrete probability, assumed here, is always preferred to continuous, because nothing can be measured to infinite precision, nor can we distinguish infinite gradations in decisions.

Our Goal

We want:

$\Pr(Y = y | X_1 = a, X_2 = b,\dots, X_q = z, \mbox{other evidence})$

where we are interested in the proposition Y = “We see the value y (taken by some thing)” given, or conditioned on, the propositions X1 = “We assume a”, etc., and “other evidence”, which is usually but need not be old values of y and the “Xs”.

The relationship between the Xs and Y, and the old data, is usually specified by a formal probability model itself characterized by unobservable parameters. The number of parameters is typically close to the number of Xs, but could be higher or lower depending on the type of probability model and how much causality is built into it. The “other evidence” incorporates whatever (implicit) evidence suggested the probability model.

P-values are born in frequentist thinking and are usually conditioned on one of these parameters taking a specific value. Bayesian practice at least inverts this to something more sensible, and states the “posterior” probability distribution of the “parameter of interest.”

Problem is, the parameter isn’t of interest. The value of y is. Asking a statistician about the value of y is like asking a crazed engineer what the temperature of the room is, only to have him talk of nothing but the factory setting of the bias voltage of some small component in the thermostat.

The Alternative

The goal of the model is to say whether X1 etc. is important in understanding the uncertainty of Y. P-values and posteriors dance around the question. Why not answer it directly? Instead of p-values and posteriors, calculate the probability of y given various values of the Xs. One way is this:

$p_1 = \Pr(Y = y | X_1 = \theta_1, X_2 = b,\dots, X_q = z, \mbox{other evidence})$

and

$p_2 = \Pr(Y = y | X_1 = \theta_2, X_2 = b,\dots, X_q = z, \mbox{other evidence})$

where $\theta_1$ and $\theta_2$ are values of X1 that are “sensibly different” (enough that you can make a decision on the difference), and where the values b, c, …, z make sense for the other Xs in the model. Notice the absence of parameters: if they were there once, they are now “integrated out” (actually summed over, since we’re discrete here). They are not “estimated” here because they are of zero interest.

If p1 and p2 are far apart, such that it would alter a decision you would make about y, then X1 is important and can be kept in consideration (in the model). If p1 and p2 are close, and would not cause you to change a decision about y were X1 to move from $\theta_1$ to $\theta_2$, then X1 is not important. Whether it’s dropped from the model is up to you.
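To make “integrating out” concrete, here is a minimal Python sketch of the simplest possible case. The promotion scenario and every number are invented; the beta-binomial model is chosen only because its parameter sums out in closed form.

```python
# Hypothetical scenario: Y = "the next customer buys", X1 = a promotion
# flag (theta_1 = on, theta_2 = off). Under a uniform Beta(1,1) prior the
# parameter integrates out analytically, leaving the posterior predictive
#   Pr(Y = yes | old data) = (successes + 1) / (trials + 2),
# Laplace's rule of succession. No parameter is estimated or reported.

def predictive_prob(successes: int, trials: int) -> float:
    """Pr(next Y = yes | old data), with the parameter summed out."""
    return (successes + 1) / (trials + 2)

# Invented old data under each setting of X1:
p1 = predictive_prob(successes=30, trials=100)  # X1 = theta_1
p2 = predictive_prob(successes=20, trials=100)  # X1 = theta_2

print(p1, p2)  # whether the gap matters is the decision maker's call
```

Note that the output is a probability of the observable y itself, not a statement about a parameter.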

Gee, that’s a lot of work. “I have to decide about a, b, c and all the rest as well as $\theta_1$ and $\theta_2$, and I have to figure out how far apart p1 and p2 must be to count as ‘far’ apart?” Well, yes. Hey, it was you who put all those other Xs into consideration. If they’re in the model, you have to think about them. All that stuff interacts with, or rather affects, your knowledge of y. Tough luck. Easy answers are rare. The problem was that people, using p-values, thought answers were easy.

All this follows from the truth that all probability is conditional. The conditions are the premises or evidence we put there, and the model (if any) that is used. Whether any given probability is “important” depends entirely on what decisions you make based on it. That means a probability can be important to one person and irrelevant to another.

Now it’s easy enough to give recommendations about picking $\theta_1$ and $\theta_2$ and all the rest, but I’m frightened to do so, because these can attain mythic status, like the magic number for p-values. If you’re presenting a model’s results for others, you can’t anticipate what decisions they’ll make based on it, so it’s better to present results in as “raw” a fashion as possible.

Why is this method preferred? Because decisions made using p-values are fallacious; because p-values, and even Bayesian posteriors, do not answer the questions you really want answered; and, best of all, because this method allows you to directly check the usefulness of the model.

P-values and Bayesian posteriors are hit-and-run statistics. They gather evidence, posit a model, then speak (more or less) about some setting of a knob of that model as if that knob were reality. Worst, the model and conclusions reached are never checked using new information. Using this observable-based method, as is already done in physics, chemistry, etc. (though practitioners might not know it), allows one to verify the model. And, boy, would that cut down on the rampant over-certainty plaguing science.

Variation On A Theme

Note: another method for the above is:

$\Pr(y_{\theta_1} > y_{\theta_2} | X_1, X_2 = b,\dots, X_q = z, \mbox{other evidence})$

assuming (the notation changes slightly here) y can take lots of values (like sales, or temperature, etc.). If the probability of seeing larger values of y under $\theta_1$ is “large” then X1 is important, else not.
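A Monte Carlo sketch of this variant, with invented predictive distributions standing in for a real model’s output (the sales scenario, means, and scales are all made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented predictive distributions for, say, tomorrow's sales under the
# two settings of X1. Parameters are already integrated out into these
# predictives; every number here is for illustration only.
y_theta1 = rng.normal(loc=105.0, scale=10.0, size=100_000)
y_theta2 = rng.normal(loc=100.0, scale=10.0, size=100_000)

# Pr(y under theta_1 exceeds y under theta_2 | model, evidence),
# approximated by Monte Carlo:
prob = float(np.mean(y_theta1 > y_theta2))
print(round(prob, 3))  # analytically Phi(5 / (10 * sqrt(2))) ~ 0.64
```

If a probability of roughly 0.64 that sales are higher under $\theta_1$ would change your decision, X1 is important to you; if not, not.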

Statistics is the only field in which men boast of their wee p-values.


They are based on a fallacious argument.

Repeated in introductory texts, and begun by Fisher himself, are words very like these (adapted from Fisher, R. 1970. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, fourteenth edition):

Belief in a null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null hypothesis is false, or the p-value has attained by chance an exceptionally low value.

Fisher’s choice of words was poor. This is evidently not a logical disjunction, but can be made into one with slight surgery:

Either the null hypothesis is false and we see a small p-value, or the null hypothesis is true and we see a small p-value.

Stated another way, “Either the null hypothesis is true or it is false, and we see a small p-value.” Of course, the first clause of this proposition, “Either the null hypothesis is true or it is false”, is a tautology, a necessary truth, which transforms the proposition to “TRUE and we see a small p-value.” Or, in the end, Fisher’s dictum boils down to:

We see a small p-value.

In other words, a small p-value has no bearing on any hypothesis (unrelated to the p-value itself, of course). Making a decision because the p-value takes any particular value is thus always fallacious. The decision may be serendipitously correct, as indeed any decision based on any criterion might be, and it is often correct because experimenters are good at controlling their experiments, but it is still reached by a fallacy.
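The reduction can even be checked mechanically. A trivial Python sketch, enumerating every truth value:

```python
# For every truth value of the null hypothesis H, the proposition
# "(H is false and we see a small p-value) or (H is true and we see a
# small p-value)" is logically identical to "we see a small p-value"
# alone: H drops out entirely.
all_equal = all(
    ((not H and small_p) or (H and small_p)) == small_p
    for H in (True, False)
    for small_p in (True, False)
)
print(all_equal)  # True: the hypothesis contributes nothing
```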

People believe them.

Whenever the p-value is less than the magic number, people believe or “act like” the alternate hypothesis is true, or very likely true. (The alternate hypothesis is the contradiction of the null hypothesis.) We have just seen this is fallacious. Compounding the error, the smaller the p-value is, the more likely people believe the alternate hypothesis true.

This is also despite the strict injunction in frequentist theory that no probability may be assigned to the truth of the alternate hypothesis. (Since the null is the contradiction of the alternate, putting a probability on the truth of the alternate also puts a probability on the truth of the null, which is also thus forbidden.) Repeat: the p-value is silent as the tomb on the probability the alternate hypothesis is true. Yet nobody remembers this, and all violate the injunction in practice.

People don’t believe them.

Whenever the p-value is less than the magic number, people are supposed to “reject” the null hypothesis forevermore. They do not. They argue for further testing, additional evidence; they say the result from just one sample is only a guide; etc., etc. This behavior tacitly puts a (non-numerical) probability on the alternate hypothesis, which is forbidden.

It is not the non-numerical bit that makes it forbidden, but the act of assigning any probability, numerical or not. The rejection is said to have a probability of being in error, but this is only for samples in general in “the long run”, and never for the sample at hand. If it were for the sample at hand, the p-value would be putting a probability on the truth of the alternate hypothesis, which is forbidden.

They are not unique: 1.

Test statistics, which are formed in the first step of the p-value hunt, are arbitrary, subject to whim, experience, culture. There is no unique or correct test statistic for any given set of data and model. Each test statistic will give a different p-value, none of which is preferred (except by pointing to evidence outside the experiment). Therefore, each of the p-values is “correct.” This is perfectly in line with the p-value having nothing to say about the alternate hypothesis, but it encourages bad and sloppy behavior on the part of p-value purveyors as they seek to find that which is smallest.
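To see the non-uniqueness concretely, here is a hedged Python sketch: one invented data set, three perfectly standard test statistics. The choice of tests and all numbers are mine, not from any paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# One made-up data set, two groups (sizes chosen arbitrarily):
a = rng.normal(0.0, 1.0, 20)
b = rng.normal(0.5, 1.0, 30)

# Three respectable test statistics for "the groups differ":
p_student = stats.ttest_ind(a, b).pvalue                   # pooled-variance t
p_welch = stats.ttest_ind(a, b, equal_var=False).pvalue    # Welch's t
p_mann = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

print(p_student, p_welch, p_mann)  # same data, three different p-values
```

Nothing in the data or model picks out one of the three as “the” p-value.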

They are not unique: 2.

The probability model representing the data at hand is usually ad hoc; other models are possible. Each model gives different p-values for the same (or rather equivalent) null hypothesis. Just as with test statistics, each of these p-values is “correct,” etc.

They can always be found.

Increasing the sample size drives p-values lower. This is so well known in medicine that people quote the difference between “clinical” versus “statistical” significance. Strangely, this line is always applied to the other fellow’s results, never one’s own.
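A quick simulation sketch of the point. The effect size and all numbers are invented, and a simple z-test stands in for whatever test a study might use:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def z_pvalue(n: int, true_mean: float = 0.02) -> float:
    """Two-sided z-test of 'mean = 0' on N(true_mean, 1) data.
    The true effect, 0.02 standard deviations, is clinically trivial."""
    z = rng.normal(true_mean, 1.0, n).mean() * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability

pvals = {n: z_pvalue(n) for n in (100, 10_000, 1_000_000)}
for n, p in pvals.items():
    print(n, p)
# the same trivial effect drops below the magic number once n is big enough
```

The “statistical” significance arrives on schedule; the “clinical” significance never does.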

They encourage magical thinking.

Few remember the definition of the p-value, which is this: Given the model used and the test statistic dependent on that model and given the data seen and assuming the null hypothesis (tied to a parameter) is true, the p-value is the probability of seeing a test statistic larger (in absolute value) than the one actually seen if the experiment which generated the data were run an indefinite number of future times and where the milieu of the experiment is precisely the same except where it is “randomly” different. The p-value says nothing about the experiment at hand, by design.

Since that is a mouthful, all that is recalled is that if the p-value is less than the magic number, there is success, else failure. P-values work as charms do. “Seek and ye shall find a small p-value” is the aphorism on the lips of every researcher who delves into his data for the umpteenth time looking for that which will excite him. Since wee p-values are so easy to generate, his search will almost always be rewarded.
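That mouthful of a definition can be acted out in code. A sketch with invented data, approximating the p-value by literally re-running the experiment with the null true:

```python
import numpy as np

rng = np.random.default_rng(7)

# The experiment actually run (data invented): n = 30 values; the test
# statistic is the standardized sample mean; the null says the mean is 0.
n = 30
observed = rng.normal(0.3, 1.0, n)
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(n))

# The definition, acted out: re-run the experiment many times *with the
# null true*, and see how often the statistic beats the one actually seen.
reruns = rng.normal(0.0, 1.0, (100_000, n))
t_sim = reruns.mean(axis=1) / (reruns.std(axis=1, ddof=1) / np.sqrt(n))
p_value = float(np.mean(np.abs(t_sim) >= abs(t_obs)))

print(round(p_value, 3))  # a statement about imaginary re-runs, not these data
```

Every quantity in the calculation, apart from t_obs itself, belongs to experiments that were never performed.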

They focus attention on the unobservable.

Parameters, the creatures which live inside probability models but which cannot be seen, touched, or tasted, are the bane of statistics. Inordinate attention is given them. People wrongly assume that the null hypotheses ascribed to parameters map precisely to hypotheses about observables. P-values are used to “fail to reject” hypotheses which nobody believes true; i.e. that the parameter in a regression is precisely, exactly, to infinite decimal places zero. Confidence in real-world observables must always be lower than confidence in parameters. Null hypotheses are never “accepted”, incidentally, because that would violate Fisher’s (and Popper’s) falsificationist philosophy.

They base decisions on what did not occur.

They calculate the probability of what did not happen on the assumption that what didn’t happen should be rare. As Jeffreys famously said: “What the use of P[-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.”

Fans of p-values are strongly tempted to this fallacy.

If a man shows that a certain position you cherish is absurd or fallacious, you multiply your error by saying, “Sez you! The position you hold has errors, too. That’s why I’m going to still use p-values. Ha!” Regardless of whether the position the man holds is keen or dull, you have not saved yourself from ignominy. Whether you adopt logical probability or Bayesianism or something else, you must still abandon p-values.

Confidence intervals.

No, confidence intervals are not better. That is for another day.


This is what over-exposure to p-values can lead to.

Today’s lesson: If the government wants you bad enough, it will get you. If that isn’t already obvious, consider what befell W. Scott “Don’t Call Me Baron” Harkonen.

Just kidding with the Dune reference. Harkonen was imprisoned by the Padishah—stop that!—by our beneficent government for the most heinous crime of using a p-value which his competitors did not like.

I do not joke nor jest. Harkonen got six months house arrest for writing these words in a press release:

InterMune Announces Phase III Data Demonstrating Survival Benefit of Actimmune in IPF [idiopathic pulmonary fibrosis]. Reduces Mortality by 70% in Patients with Mild to Moderate Disease.

According to the Washington Post, “What’s unusual is that everyone agrees there weren’t any factual errors in the [press release]. The numbers were right; it’s the interpretation of them that was deemed criminal.” The Post further said, “There was some talk that if Harkonen had just admitted more uncertainty in the press release—using the verb ‘suggest’ rather than ‘demonstrate’—he might have avoided prosecution.”

Harkonen followed FDA-government rules and ran a trial of his company’s drug actimmune (interferon gamma-1b) in treating IPF, hoping patients who got the drug would live longer than those fed a placebo. This happened: 46% of actimmune patients kicked over while 52% of the placebo patients handed in their dinner pails.

Unfortunately, the p-value for this observed difference was just slightly higher than the magic number: it was 0.08.

Wait! Can you tell me the practical difference between 0.08 and the magic number? You cannot. That is what makes the magic number magic. Occult thinking is rife in classical statistics. There is no justification given for the magic of the magic number other than it is magic. And it is magic because other people, Bene Gesserit fashion (last one), have said it is magic.

Therefore, p-values greater than the magic number are “insignificant.” The FDA shuns p-values that don’t fit into the special magic slot. Harkonen, holding his extra-large p-value, knew this. And wept.

I’m guessing about the weeping. But Harkonen surely knew about the mystical threshold, because he dove back into his data where he discovered that the survival difference in patients with “mild to moderate cases of the disease” was even greater, a difference which gave the splendiferously magical p-value of 0.004.

So wee was this new p-value and so giddy was Harkonen that he wrote that press release.

Which caught the attention of his enemies (rival drug company?) who ratted him out to the Justice Department’s office of consumer litigation, which, being populated by lawyers paid to snare citizens, did their duty on Harkonen.

Harkonen’s crime? Well, in classical statistics the pre-announced “primary endpoint”, what happened to all and not a subset of patients, is the only thing that should have counted. The “secondary analysis”, especially when it’s not expected, is feared and should not be used.

And rightly so when using p-values, because as long as the data set is large and rich enough, wee p-values can always be discovered even when nothing is happening, which in this case means even when the drug doesn’t work. The government therefore assumed the drug didn’t work and that Harkonen should not have used the word “demonstrated”, which it interpreted as meaning “a wee p-value less than the magic number was found.”
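That claim is easy to check by simulation. A hedged Python sketch, with every number invented and the null true by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A null world: a "drug" with no effect whatsoever. 2,000 patients, the
# outcome is pure noise, and we also record 50 binary traits (age bands,
# sex, site, ...), all random too. Everything here is invented.
n = 2000
outcome = rng.normal(size=n)
drug = rng.integers(0, 2, size=n).astype(bool)
traits = rng.integers(0, 2, size=(n, 50)).astype(bool)

# The hunt: one subgroup test per trait.
p_values = [
    stats.ttest_ind(outcome[drug & traits[:, j]],
                    outcome[~drug & traits[:, j]]).pvalue
    for j in range(50)
]
print(round(min(p_values), 3))  # with 50 looks, a wee one is very likely
```

With 50 independent looks at pure noise, the chance that at least one p-value dips below 0.05 is about 1 − 0.95⁵⁰ ≈ 0.92.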

What makes the story pathetic is that Harkonen forgot, when he got his 0.08, that the p-value depends on the model he picked. He could have picked another, one which gave him a smaller p-value. He could have kept searching for models until one issued a magic p-value. He might not have found one, but there are so many different classical test statistics that it would have been worth looking.
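The model-dependence is easy to exhibit: one 2×2 table, three standard tests, three p-values. The counts below are hypothetical; nothing here comes from the actual trial.

```python
from scipy import stats

# One invented 2x2 mortality table (rows: drug, placebo; columns: died,
# survived). Counts are hypothetical, chosen to sit near the threshold.
table = [[74, 88],
         [84, 76]]

_, p_chi2_yates, _, _ = stats.chi2_contingency(table, correction=True)
_, p_chi2_plain, _, _ = stats.chi2_contingency(table, correction=False)
_, p_fisher = stats.fisher_exact(table)

print(p_chi2_yates, p_chi2_plain, p_fisher)
# one table, three standard tests, three different "correct" p-values
```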

Which of these p-values is “the” correct one? All of them!

Insult onto injury time. As Harkonen rattled his coffee cup against his mullions (house arrest, remember), his old company did a new, bigger trial on just the subset of patients who did better before. Result: more deaths in the drug than placebo group. Oops.

Anyway, maybe we should let the government, for a limited period of time, arrest and jail scientists who publicly boast of wee p-values and whose theories turn out to be garbage. Nah. Our prisons aren’t nearly big enough to handle it.

Update Don’t miss the comment by Nathan Schachtman, who filed an amicus brief on Harkonen’s behalf. It’s linked below.

——————————————————————–

Thanks to Al Perrella for finding this.

Today’s evidence is not new; is, in fact, well known. Well, make that just plain known. It’s learned and then forgotten, dismissed. Everybody knows about these kinds of mistakes, but everybody is sure they never happen to them. They’re too careful; they’re experts; they care.

It’s too easy to generate “significant” answers which are anything but significant. Here’s yet more—how much do you need!—proof. The pictures below show how easy it is to falsely generate “significance” by the simple trick of adding “independent” or “control variables” to logistic regression models, something which everybody does.

Let’s begin!

Recall our series on selling fear and the difference between absolute and relative risk, and how easy it is to scream, “But what about the children!” using classical techniques. (Read that link for a definition of a p-value.) We anchored on EPA’s thinking that an “excess” probability of catching some malady when exposed to something regulatable of around 1 in 10 thousand is frightening. For our fun below, be generous and double it.

Suppose the probability of having the malady is the same for exposed and not-exposed people—in other words, knowing people were exposed does not change our judgment that they’ll develop the malady—and answer this question: what should any good statistical method do? State with reasonable certainty that there aren’t different chances of infection between the exposed and not-exposed groups, that’s what.

Frequentist methods won’t do this because they never state the probability of any hypothesis. They instead answer a question nobody asked, about the values of (functions of) parameters in experiments nobody ran. In other words, they give p-values. Find one less than the magic number and your hypothesis is believed true—in effect and by would-be regulators.

Logistic regression

Logistic regression is a common method to identify whether exposure is “statistically significant”. Readers interested in the formalities should look at the footnotes in the above-linked series. The idea is simple enough: data showing whether people have the malady or not, and whether they were exposed or not, are fed into the model. If the parameter associated with exposure has a wee p-value, then exposure is believed to be trouble.

So, given our assumption that the probability of having the malady is identical in both groups, a logistic regression fed data consonant with our assumption shouldn’t show wee p-values. And the model won’t, most of the time. But it can be fooled into doing so, and easily. Here’s how.

Not just exposed/not-exposed data is input to these models, but “controls” are, too; sometimes called “independent” or “control variables.” These are things which might affect the chance of developing the malady. Age, sex, weight or BMI, smoking status, prior medical history, education, and on and on. Indeed models which don’t use controls aren’t considered terribly scientific.

Let’s control for things in our model, using the same data consonant with probabilities (of having the malady) the same in both groups. The model should show the same non-statistically significant p-value for the exposure parameter, right? Well, it won’t. The p-value for exposure will on average become wee-er (yes, wee-er). Add in a second control and the exposure p-value becomes wee-er still. Keep going and eventually you have a “statistically significant” model which “proves” exposure’s evil effects. Nice, right?

Simulations

Take a gander at this:

Figure 1

Follow me closely. The solid curve is the proportion of times in a simulation that the p-values associated with exposure were less than the magic number, as the number of controls increases. Only here, the controls are just made-up numbers. I fed 20,000 simulated malady yes-or-no data points consistent with the EPA’s threshold (times 2!) into a logistic regression model, 10,000 for “exposed” and 10,000 for “not-exposed.” For the point labeled “Number of Useless Xs” equal to 0, that’s all I did. Concentrate on that point (lower-left).

About 5% of the 1,000 simulations gave wee p-values (dotted line), which is what frequentist theory predicts. Okay so far. Now add 1 useless control (or “X”), i.e. 20,000 made-up numbers[1] picked out of thin air. Notice that now about 20% of the simulations gave “statistical significance.” Not so good: it should still be 5%.

Add some more useless numbers and look what happens: it becomes almost a certainty that the p-value associated with exposure will fall below the magic number. In other words, adding in “controls” guarantees you’ll be making a mistake and saying exposure is dangerous when it isn’t.[2] How about that? Readers needing grant justifications should be taking notes.

The dashed line is for p-values less than the not-so-magic number of 0.1, which is sometimes used in desperation when a p-value of 0.05 isn’t found.

The number of “controls” here is small compared with many studies, like the Jerrett papers referenced in the links above; Jerrett had over forty. Anyway, these numbers certainly aren’t out of line for most research.

A sample of 20,000 is a lot, too (but Jerrett had over 70,000), so here’s the same plot with 1,000 per group:

Figure 2

Same idea, except here notice the curve starts well below 0.05; indeed, at 0. Pay attention! Remember: there are no “controls” at this point. This happens because it’s impossible to get a wee p-value for sample sizes this small when the probability of catching the malady is low. Get it? You cannot show “significance” unless you add in controls. Even just 10 are enough to give a 50-50 chance of falsely claiming success (if it’s a success to say exposure is bad for you).
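The small-sample claim can be checked exactly. A sketch with Fisher’s exact test, keeping the group sizes from the text but granting the exposed group every case (a deliberately extreme, hypothetical arrangement):

```python
from scipy import stats

# 1,000 per group at a base rate of 2 in 10,000: well under one case is
# expected in total. Grant the most lopsided outcomes imaginable, with
# every case landing in the exposed group, and the exact test still
# cannot approach the magic number:
p_values = []
for cases in (1, 2, 3):
    table = [[cases, 1000 - cases],  # exposed: all the cases
             [0, 1000]]              # not exposed: none
    _, p = stats.fisher_exact(table)
    p_values.append(float(p))
    print(cases, round(p, 3))
```

Even three cases, all in one group (itself a wildly improbable haul at this base rate), leaves the p-value an order of magnitude above 0.05.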

Key lesson: even with nothing going on, it’s still possible to say something is, as long as you’re willing to put in the effort.[3]

Update You might suspect this “trick” has been played when in reading a paper you never discover the “raw” numbers, where all that is presented is a model. This does happen.

———————————————————————

[1]To make the Xs in R: rnorm(1)*rnorm(20000); the first rnorm is for a varying “coefficient”. The logistic regression simulations were done 1,000 times for each fixed sample size at each number of fake Xs, using the base rate of 2e-4 for both groups and adding the Xs in linearly. Don’t trust me: do it yourself.

[2]The wrinkle is that some researchers won’t keep some controls in the model unless they are also “statistically significant.” But some which are not are also kept. The effect is difficult to generalize, but it runs in the direction of what we’ve done here. Why? Because, of course, in these 1,000 simulations many of the fake Xs were statistically significant. Then look at this (if you need more convincing): a picture as above but only keeping, in each iteration, those Xs which were “significant.” Same story, except it’s even easier to reach “significance”.

[3]The only thing wrong with the pictures above is that half the time the “significance” in these simulations indicates a negative effect of exposure. Therefore, if researchers are dead set on keeping only positive effects, then the numbers (everywhere but at 0 Xs) should be divided by about 2. Even then, p-values perform dismally. See Jerrett’s paper, where he has exposure to increasing ozone as beneficial for lung diseases. Although this was the largest effect he discovered, he glossed over it by calling it “small.” P-values blind.