Use The Wrong P-value, Go To Jail: Not A Joke: Updated With Amicus Brief

This is what over-exposure to p-values can lead to.

Today’s lesson: If the government wants you bad enough, it will get you. If that isn’t already obvious, consider what befell W. Scott “Don’t Call Me Baron” Harkonen.

Just kidding with the Dune reference. Harkonen was imprisoned by the Padishah—stop that!—by our beneficent government for the most heinous crime of using a p-value which his competitors did not like.

I do not joke nor jest. Harkonen got six months’ house arrest for writing these words in a press release:

InterMune Announces Phase III Data Demonstrating Survival Benefit of Actimmune in IPF [idiopathic pulmonary fibrosis]. Reduces Mortality by 70% in Patients with Mild to Moderate Disease.

According to the Washington Post, “What’s unusual is that everyone agrees there weren’t any factual errors in the [press release]. The numbers were right; it’s the interpretation of them that was deemed criminal.” The Post further said, “There was some talk that if Harkonen had just admitted more uncertainty in the press release—using the verb ‘suggest’ rather than ‘demonstrate’—he might have avoided prosecution.”

Harkonen followed FDA-government rules and ran a trial of his company’s drug Actimmune (interferon gamma-1b) in treating IPF, hoping patients who got the drug would live longer than those fed a placebo. This happened: 46% of Actimmune patients kicked over while 52% of the placebo patients handed in their dinner pails.

Unfortunately, the p-value for this observed difference was just slightly higher than the magic number: it was 0.08.

Wait! Tell me the practical difference between 0.08 and the magic number? You cannot do so. That is what makes the magic number magic. Occult thinking is rife in classical statistics. There is no justification given for the magic of the magic number other than it is magic. And it is magic because other people, Bene Gesserit fashion (last one), have said it is magic.

Therefore, p-values greater than the magic number are “insignificant.” The FDA shuns p-values that don’t fit into the special magic slot. Harkonen, holding his extra-large p-value, knew this. And wept.

I’m guessing about the weeping. But Harkonen surely knew about the mystical threshold, because he dove back into his data where he discovered that the survival difference in patients with “mild to moderate cases of the disease” was even greater, a difference which gave the splendiferously magical p-value of 0.004.

So wee was this new p-value and so giddy was Harkonen that he wrote that press release.

Which caught the attention of his enemies (rival drug company?) who ratted him out to the Justice Department’s Office of Consumer Litigation, which, being populated by lawyers paid to snare citizens, did its duty on Harkonen.

Harkonen’s crime? Well, in classical statistics the pre-announced “primary endpoint”, here what happened to all patients and not just a subset, is the only thing that should have counted. The “secondary analysis”, especially when it’s not expected, is feared and should not be used.

And rightly so when using p-values, because as long as the data set is large and rich enough, wee p-values can always be discovered even when nothing is happening, which in this case means even when the drug doesn’t work. The government therefore assumed the drug didn’t work and that Harkonen should not have used the word “demonstrated”, which it interpreted as meaning “a wee p-value less than the magic number was found.”
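A small simulation makes the point concrete. This is only a sketch under made-up assumptions: ten non-overlapping subgroups, fifty patients per arm in each, and a drug that does nothing at all (both arms die at the same rate). None of these numbers come from the InterMune trial, and the pooled two-proportion z-test is just one convenient choice of test.

```python
import math
import random


def two_proportion_p(d1, n1, d2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (d1 + d2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:  # everyone died or nobody did: no evidence of a difference
        return 1.0
    z = (d1 / n1 - d2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_trials=500, n_subgroups=10, n_per_arm=50, death_rate=0.5, seed=1):
    """Share of do-nothing trials in which at least one subgroup has p < 0.05.

    All parameters are hypothetical; the drug has no effect by construction.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        found = False
        for _ in range(n_subgroups):
            drug = sum(rng.random() < death_rate for _ in range(n_per_arm))
            placebo = sum(rng.random() < death_rate for _ in range(n_per_arm))
            if two_proportion_p(drug, n_per_arm, placebo, n_per_arm) < 0.05:
                found = True
        hits += found
    return hits / n_trials


share = simulate()
print(f"Share of null trials with at least one subgroup p < 0.05: {share:.2f}")
```

With ten independent looks, theory says roughly 1 - 0.95^10, or about 40%, of do-nothing trials will hand over at least one wee p-value. The simulated share should land in that neighborhood.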

What makes the story pathetic is that Harkonen forgot when he got his 0.08 that the p-value is dependent on the model he picked. He could have picked another, one which gave him a smaller p-value. He could have kept searching for models until one issued a magic p-value. He might not have found one, but there are so many different classical test statistics that it would have been worth looking.

Which of these p-values is “the” correct one? All of them!
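To see how the answer moves with the model, here is a sketch that runs three standard tests on one and the same table of deaths. The counts (40 of 100 dead on drug, 53 of 100 dead on placebo) are invented for illustration and are not Harkonen’s data. The three tests are a pooled z-test, an unpooled (Wald) z-test, and Fisher’s exact test, the last computed by summing every table no more probable than the observed one.

```python
import math

# Hypothetical counts, chosen only for illustration.
DRUG_DEATHS, DRUG_N = 40, 100
PLACEBO_DEATHS, PLACEBO_N = 53, 100


def z_to_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def pooled_z_p(d1, n1, d2, n2):
    """z-test using one pooled death rate for the standard error."""
    pooled = (d1 + d2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z_to_p((d1 / n1 - d2 / n2) / se)


def wald_z_p(d1, n1, d2, n2):
    """z-test using each arm's own death rate for the standard error."""
    p1, p2 = d1 / n1, d2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z_to_p((p1 - p2) / se)


def fisher_exact_p(d1, n1, d2, n2):
    """Two-sided Fisher exact test: sum tables no more likely than observed."""
    deaths = d1 + d2
    total = n1 + n2

    def prob(k):
        # Hypergeometric probability of k deaths landing in the first arm.
        return math.comb(n1, k) * math.comb(n2, deaths - k) / math.comb(total, deaths)

    lo, hi = max(0, deaths - n2), min(n1, deaths)
    observed = prob(d1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))


args = (DRUG_DEATHS, DRUG_N, PLACEBO_DEATHS, PLACEBO_N)
p_pooled = pooled_z_p(*args)
p_wald = wald_z_p(*args)
p_fisher = fisher_exact_p(*args)

for name, p in [("pooled z", p_pooled), ("Wald z", p_wald), ("Fisher exact", p_fisher)]:
    print(f"{name:>12}: p = {p:.4f}")
```

Same data, three p-values, and nothing in the data says which one to prefer.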

Insult onto injury time. As Harkonen rattled his coffee cup against his mullions (house arrest, remember), his old company did a new, bigger trial on just the subset of patients who did better before. Result: more deaths in the drug group than in the placebo group. Oops.

Anyway, maybe we should let the government, for a limited period of time, arrest and jail scientists who publicly boast of wee p-values and whose theories turn out to be garbage. Nah. Our prisons aren’t nearly big enough to handle it.

Return to this page Sunday for the highly anticipated post: Everything Wrong with P-Values Under One Roof.


Update: Don’t miss the comment by Nathan Schachtman, who filed an amicus brief on Harkonen’s behalf. It’s linked below.

——————————————————————–

Thanks to Al Perrella for finding this.

31 replies »

  1. You are certainly correct–our prisons are way too full to jail people for publicly boasting of wee p-values.

    Most disheartening with the second drug trial. Makes you want to cry……

  2. Funny thing, that’s how they got Galileo: for being way too certain about his conclusion, given the evidence then on hand. He got house arrest, too, though he knew nothing of p-values.

    But the fault, dear Brutus, lies not in our p-values, but in the foolish manner in which we use them. We always tried to teach folks that if (for example) one production line using fresh hydroxide “differs significantly” in a particular product characteristic from another production line using recycled hydroxide, the physical cause may be some other difference between the two production lines, and not necessarily in the kind of hydroxide used. As Brian Joiner always said, we must beware of “lurking variables.” That is why, inter alia, we taught that one must always perform confirmation runs. BMS did this, we note, and the problem was Twitter, which leads to premature ejaculations of success. Pfui, sez I.

  3. MattS: Legalize a lot of things and there will be room in prisons.

    Seriously, though, we put people in prison for writing bad checks, etc. There is a serious problem with the punishment fitting the crime. Maybe we could do a study. 🙂

  4. YOS,

    Yet in those examples you give, the p-value remains silent. Death to p-values!

    The real problem is a government that can make a criminal of a man who did so little.

  5. The source article ends with:

    United States v. Harkonen is one more milestone in the long and winding road toward determining when in America a false statement is a crime.

    As a general rule, for speech to lose protection of the First Amendment, it must fall into such categories as obscenity, incitement to imminent lawless action, fraud, perjury and false commercial speech. Only some of those categories involve untrue statements. Over the years, rulings have occasionally hinged on such things as the size of the lettering on a product label and the views of people in focus groups.

    That’s why justice happens by accident in the courtroom.

  6. Sheri,

    My understanding is that over half the US prison population is in on drug charges, so legalizing recreational drugs would free up more space than anything else.

  7. MattS,
    You may be correct. However, one must decide if there are compelling reasons for keeping some drugs illegal. What about prescription drugs that are legal but controlled? Just some thoughts.

    It’s a complex issue, unless the only question is how to free up the most jail space. If that’s the question, then I would agree with your answer. Not jailing people on drug charges would clean out cells. (Hopefully we don’t then put people into those cells for improper p-values. 🙂 )

  8. I’m a bit foxed by the conclusion to be drawn from this. The federal government prosecuted for what they deemed an over-interpretation. Subsequently a rerun demonstrated that it was indeed an over-interpretation: pretty much what you might expect after finding a marginal effect in the first place.
    The lesson seems to be that the first p-value was giving the right general impression – the treatment difference could reasonably have been chance. So the FDA got it right.
    Or have I missed something?

  9. The reason that Galileo got into trouble is long and complicated. There is a series of posts giving the background on this at The TOF Spot. Fascinating reading.

    The rule of law is rapidly being abandoned. I read somewhere that a hundred or so years ago, a person could live an entire life without contact with a government official. Oh halcyon days.

  10. Sheri,

    “Hopefully we don’t then put people into those cells for improper p-values.”

    Oh, I don’t know, I think that people who dump their p-values in public places should at least spend a couple of days in jail. 😀

  11. “Or have I missed something?”

    The thing you missed is that someone now has a criminal conviction on their record for over-interpreting the results of a scientific study.

  12. Scotian,

    “I read somewhere that a hundred or so years ago, a person could live an entire life without contact with a government official. Oh halcyon days.”

    Yes, and the way things are going, in another hundred years the government officials will outnumber the rest of us and some government officials will be able to go their entire lives without having contact with a citizen. 🙁

  13. Some obvious thoughts (disclosure; a smattering of reckless sarcasm is included):

    1. The government in this situation got it right and the side on free enterprise got it wrong [on purpose]…so…clearly this means we need more government to rein in excess [inherently corrupt] free enterprise…Right?

    2. “splendiferously” — Late Latin splendōrifer splendor radiance + ferre to bring; facetious; grand; splendid; gorgeous.

    3. From comments…for what happened to Galileo & why read the actual transcripts (translated to English) at: http://law2.umkc.edu/faculty/projects/ftrials/galileo/galileo.html (from the U. of Missouri-Kansas City (UMKC) School of Law’s famous trials website at: http://law2.umkc.edu/faculty/projects/ftrials/ftrials.htm )

    4. About legalizing, vs. not, drugs: Reefer Madness is the classic propaganda film … and it was produced by a religious group! http://en.wikipedia.org/wiki/Reefer_madness

  14. “What makes the story pathetic is that Harkonen forgot when he got his 0.08 that the p-value is dependent on the model he picked. He could have picked another, one which gave him a smaller p-value. He could have kept searching for models until one issued a magic p-value. He might not have found one, but there’s so many different classical test statistics that it would have been worth looking.”

    He couldn’t have done that. You have to decide on your statistical methods before the clinical trial, not after.

  15. The “whistle blower” was Thomas Fleming, a well-known statistician who served on the data safety monitoring board. Of course, the board’s function was completed when it handed over the data to the company, but Fleming was incensed by the interpretation given to the data by Dr. Harkonen. Fleming was the government’s key witness, and he advanced his ultra-orthodox statistical views, tested only by cross-examination. Harkonen did not testify, and his counsel did not call an expert witness. After trial, statisticians Donald Rubin and Steve Goodman both filed affidavits in support of Harkonen, but the trial judge worked hard to uphold the conviction. The trial judge was however hard pressed to find anyone who was harmed by the press release, and she acknowledged that some may have been helped. She sentenced Harkonen to 6 months harm incarceration; the government wanted TEN YEARS in prison. Both sides appealed; and the 9th Circuit affirmed in a cursory opinion. Harkonen has asked the Court to take the case. The government has till next week to submit a brief to argue that the Court should not take the case.

    For more details on the statistical issues, I filed an amicus brief in the Supreme Court, with Professors Kenneth Rothman and Timothy Lash. See http://schachtmanlaw.com/wp-content/uploads/2010/03/KJR-TLL-NAS-Amicus-Brief-in-US-v-Harkonen-090413A.pdf

    Nathan

  16. Nathan S,

    Hey, thanks for this! I love your typo “harm incarceration.” What would Freud say? (“Have a cigar”, probably.)

    You guys should have called me. I would have loved to tell the judge that p-values are nutty (but I’d have to admit that I’m forced to use them by my clients).

    Here’s a link to Fleming.

  17. Ha! No, sometimes typos are simply the result of attention deficits. (You don’t want to overinterpret the data now.) It was not easy putting the p-values and the causal inference into perspective for the Court. In Matrixx Initiatives v. Siracusano, the Court bought into sweeping, equally unacceptable pronouncements from the Solicitor General’s office (completely at odds with the prosecution/Fleming’s position in this case).

    NAS

  18. NAS,

    I didn’t know that case: will look it up.

    See this same channel on Sunday for the article “Everything Wrong With P-Values Under One Roof.”

  19. All,

    Everybody should read Schachtman’s brief, in which are found these familiar lines:

    Fleming’s view, based upon dichotomizing p-values into only two categories, ignores the continuity of p-values within the range 0 to 1, and ignores the widely held rejection of classifying complex biomedical studies into binary categories of success or failure by comparing their p-values with a standard of 0.05. Fleming’s interpretation would lead one to conclude that a treatment trial fails if p = 0.050001, but succeeds if p = 0.049999.

    It’s also so that a trial is a failure at 0.05000000000000000001 but not at 0.04999999999999999 (I didn’t count the 9s). Because why? Because 0.05 is magic.

    Update

    Oh my, what a mistake!

    “The government’s principal brief in the Ninth Circuit misstated the concept that is at the heart of this prosecution:”

    Generally, the significance of primary endpoint results is primarily expressed through the p-value, which is a number between 1 and 0. ER 43; SER 437. The lower the p-value, the greater the probability that the result reflected by the data is meaningful, and not due to chance. ER 43; SER 437-39. For example, a p-value of 0.05 indicates that the data obtained in the trial would occur by chance less than 5% of the time…

    No no no no no and as many nos as you like. It is false that “The lower the p-value, the greater the probability that the result reflected by the data is meaningful, and not due to chance.”

    I love this brief!

  20. Good call Briggs! I just assumed that harm incarceration was an obscure legal term, sort of like hard labor. The thing that I find strange about modern prison terms is that they seem so out of line with the severity of the crime (10 years for p-values in a public place?). My rule of thumb is that if the defendant gets more than a mob hit-man who has made a deal with the court, it is too long. I once read an article that claimed the advantages of flogging over incarceration for most crimes. It sounds medieval but in reality which would you prefer? Prison is not really humane but should be considered the punishment of last resort for the incorrigible.

  21. Thanks for the kind words. The brief was my pro bono publico project for this year. I teach a course in statistics and probability in the law at Columbia Law School, and I initially learned about the case when I reached out to Dr. Harkonen’s appellate counsel for briefs and for Professors Goodman and Rubin’s affidavits. Harkonen’s lawyer, Mark Haddad, and I then discussed the case at some length. After hearing what was going on, I agreed to put together an amicus brief.

  22. While I admit to not having read anything more than the post and the comments above, I can see one avenue where the courts may have a case. If (a big if) Harkonen had owned a number of shares in the company developing the drug, he could have profited financially from the press release if the result was an increase in the company’s share price.

    I don’t know if there was any potential for a financial gain (or even if one occurred), but people have been prosecuted for false statements intended to mislead investors.

    In that case, the issue would be less about the ‘p’-value and more about the intent of the press release to manipulate the stock price.

  23. Dr. Harkonen was the CEO of the company; I am sure he owned shares. The prosecution however was not for stock price manipulation; it was for Wire Fraud, which is rather different. I believe that there was a feeble attempt to sue the company for securities fraud, but the case was dismissed.

    As for Dr. Harkonen’s motive or intent, I urge you to read the brief. Of course, as CEO of the company, he had an apparent motive to interpret the evidence most favorably to his company and his interest, but that is the case with government and academic scientists as well. Every scientist is looking to the next grant, or tenure, or whatnot. We give a few examples of similar statements in the amicus brief. If such statements should be a crime, then the prisons will fill up with a better class of criminals.

    NAS

  24. Ironically, the difference between statistically significant and not statistically significant is itself not statistically significant.
