August 16, 2008

On Thursday 14 August, the *Wall Street Journal* had two excellent articles, which expertly described the statistics and uncertainty of their topics. Several readers have written in asking for an analysis of these articles.

**1.** The first was by Thomas M. Burton: “New Therapy for Sepsis Infections Raises Hope but Many Questions.” Sepsis is a nasty disease that often is the result of other trauma or infection, and is often deadly. Curing it is difficult; usually a third or more of the patients who contract it die. So when a study published by Emanuel Rivers, a doc in the emergency medicine department at Henry Ford Hospital in Detroit, appeared with a new therapy that seemed to have a higher cure rate than traditional therapy, doctors were excited. (Incidentally, I made my first appearance at that same hospital.)

But a few were skeptical. The “questions” in Burton’s title hinge on the statistical methods used in the journal article—which was published in the most prestigious medical journal. Turns out that Rivers did not use all the patients he had entered into his study in the actual statistical analysis. “Statisticians were especially concerned when they noticed that a relatively high proportion of the other 25 — those not included in the final analysis — were either conventional-therapy patients who survived or patients on aggressive therapy who died.”

Why were these patients left out of the analysis? Well, doctor judgment: these 25 patients were not evaluated, at the time, to be as sick, so they were left out. In medical statistics, there is a concept called *intent-to-treat*, and it means that you must analyze the data putting all patients into the groups that you first put them in no matter what. This procedure is meant to guard against the *experimenter effect*, which is the boost in results got by the researcher when he, consciously or not, fiddles the patient rolls to get his desired result.

Why wasn’t the original paper criticized on these grounds? A peer-reviewed paper, I should emphasize. Are we saying it is possible that published research could be wrong?

**2.** Thanks to reader Gabe Thornhill for pointing out another excellent piece, this one by Keith J. Winstein: “Boston Scientific Stent Study Flawed.” You might remember Mr Winstein was the only reporter to get the story about boys’ and girls’ mathematical abilities correct.

The story is that Boston Scientific (BS) introduced a new stent called the Taxus Liberte. A stent is an artificial pipe support that is stuck in blood vessels to keep them from being choked off by gunk. BS did the proper study to show the stent worked, but analyzed their data in a peculiar way.

Readers of this blog might remember Chapter 14 of my book: How to Cheat. In classical statistics, an excellent way to cheat, and a method you can almost always get away with, is to *change your test statistic so you get the p-value you desire*. For any set of data there are dozens of test statistics from which to choose. Each of them will give you a different p-value. For no good reason, the p-value has to be less than 0.05 for a success. So what you do is keep computing different statistics until you find the one which gives you the lowest p-value.

This trick nearly always works, too. It carries a double-bang, because not only can you nearly always find a publishable p-value, nobody can ever remember the actual definition of a p-value. Smaller p-values are usually accompanied by the claim that the results are “stronger” or “more significant”. False, of course, but since everybody says so you will be in good company.

Actually, Mr Winstein has two definitions in his piece that aren’t quite right. The first:

Using a standard probability measure known as the “p-value”, it said that there was less than a 5% chance that its finding was wrong

and

[S]cience traditionally requires 95% certainty that a study proved its premise.

**Pay attention**. Here is the actual definition of a p-value, adapted to the stent study. For the two sets of data, one for the BS stent, one for another stent, posit a probability distribution which describes your uncertainty in the measures resulting from using these stents. These probability distributions have *parameters*, which are unknown unobservable numbers that are needed to fully specify the probability distributions.

Now, ignore some of these parameters, and concentrate on just one from each distribution (one for the BS stent, one for the other) and then say that the parameter for the BS stent is *exactly equal to* the parameter for the other stent. Then calculate a *statistic*. From above, we know we have the choice of several, and Mr Winstein has an excellent graph showing some possible choices. Here comes the p-value. It is the probability that, if you repeated the same experiment an infinite number of times, you would see a statistic as large or larger than the one you actually got *given* those two parameters you picked *were* exactly equal.

Make sense? Or is it confusing? I’d say the latter. One thing you *cannot* say is that, for example with a p-value of 0.04, there is a 96% chance that the two stents are the same (BS sought to say their stent was equivalent to its competitor’s). Nor can you say there is a 4% chance you are wrong. All you can say is that, if you repeated the experiment many times, each time calculating the same statistic, there is a 4% chance that any one of those statistics would be as large or larger than the one you got (again, given the two parameters are exactly equal).
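The "repeat the experiment many times" part of the definition can be mimicked by brute force. The sketch below is only an illustration under loud assumptions: the data are hypothetical (7 of 20 versus 2 of 20), the statistic is the absolute difference in observed rates, each group is treated as a binomial, and the shared rate under "exactly equal" is set to the pooled observed rate, which is a choice of mine and not part of the definition.

```python
import random

def simulated_p_value(x1, n1, x2, n2, reps=10_000, seed=0):
    """Fraction of repeated experiments, run with the two rates forced
    equal, whose statistic is as large or larger than the observed one."""
    rng = random.Random(seed)
    observed = abs(x1 / n1 - x2 / n2)
    # Assumption: the common rate is the pooled observed rate.
    common = (x1 + x2) / (n1 + n2)

    def draw(n):
        # One simulated group: n patients, each an event with prob. `common`.
        return sum(rng.random() < common for _ in range(n))

    hits = 0
    for _ in range(reps):
        stat = abs(draw(n1) / n1 - draw(n2) / n2)
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / reps

# Hypothetical data, invented for illustration only.
p = simulated_p_value(7, 20, 2, 20)
print(f"simulated p-value: {p:.3f}")
```

Notice what the number is a statement about: the statistic, in imaginary repetitions, assuming the parameters are equal. Nothing in the loop computes the chance that the two groups really are the same, or the chance that you are wrong.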

Whew. A lot of work to get to this point, I agree. But this is it, because nobody (even professorial classical statisticians, as we’ll see in a moment) can actually remember this definition. This is what makes it possible to cheat.

Boston Scientific used something called a Wald test, which is a way to approximate the p-value, because often p-values cannot be computed exactly. It is well known, however, that this method gives poor approximations and often gives p-values that are smaller than they should be. Moreover, all this is conditional on the test statistic used being correct, and on the probability distributions chosen for the observable data being correct, and on the parameters you ignored to set up the p-value being ignorable, always *huge* assumptions. This is why it is strange to see, near the very end of the article, a professor of statistics say that the imperfect Wald method is commonly used but that

Most statisticians would accept this approximation. But since this was right on the border [meaning the p-value was barely under the magic number], greater scrutiny reveals that the true, the real, p-value was slightly more than 5%

The true, the real? The problem here is there is no true or real p-value. Each of the p-values computed by using the different statistics is the true, real one. This is one of the main problems with classical statistics. Another is the persnickety insistence on *exactly* 0.05 as the cutoff. Got a p-value of 0.050000001? Too bad, chum. Have a 0.0499999999 instead? Success! It’s silly.
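To see what "p-values that are smaller than they should be" can look like for a Wald test, here is a small sketch that skips simulation and simply enumerates every possible outcome. The setup is hypothetical and chosen for convenience: two groups of 10, both with a true event rate of 0.5, so the two parameters really are exactly equal. If the approximation were accurate, the Wald p-value would dip below 0.05 in about 5% of experiments.

```python
import math

def wald_p(x1, n1, x2, n2):
    """Two-sided Wald p-value for a difference in two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        # Both observed rates are 0 or 1: the statistic is 0 or infinite.
        return 1.0 if p1 == p2 else 0.0
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def binom_pmf(k, n, p):
    """Probability of exactly k events in n binomial trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def wald_rejection_rate(n, rate, alpha=0.05):
    """Exact probability that the Wald p-value falls below alpha when
    both groups have size n and share the same true event rate."""
    total = 0.0
    for x1 in range(n + 1):
        for x2 in range(n + 1):
            if wald_p(x1, n, x2, n) < alpha:
                total += binom_pmf(x1, n, rate) * binom_pmf(x2, n, rate)
    return total

# Hypothetical setup: 10 patients per group, true rate 0.5 in both.
print(f"nominal 0.050, actual {wald_rejection_rate(10, 0.5):.3f}")
```

In this configuration the actual rate comes out noticeably above the advertised 5%. That is not a universal law: other sample sizes and event rates behave differently, and some go the other way, which is one more reason a p-value sitting right on the border deserves a second look.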

Obviously, misinterpreting p-values is a big problem. But ignore that. Winstein and the *WSJ* have done a wonderful job summarizing a difficult topic. Are you ready for this? They actually got the data and recomputed the statistical tests themselves! *This* is real science reporting. It must have taken them a lot of effort. If only more journalists would put in half as much work as Mr Winstein, we’d have eighty percent less junk being reported as “news.” In short, read Winstein’s article. He has quotes from Larry Brown, one of the top theoretical statisticians alive, and comments from officials at the FDA about why these kinds of studies are accepted or not.