Wall Street Journal: Better than a statistics textbook.

On Thursday 14 August, the Wall Street Journal had two excellent articles, which expertly described the statistics and uncertainty of their topics. Several readers have written in asking for an analysis of these articles.

1. The first was by Thomas M. Burton: “New Therapy for Sepsis Infections Raises Hope but Many Questions.” Sepsis is a nasty disease that often is the result of other trauma or infection, and is often deadly. Curing it is difficult; usually a third or more of the patients who contract it die. So when a study published by Emanuel Rivers, a doc in the emergency medicine department at Henry Ford Hospital in Detroit, appeared with a new therapy that seemed to have a higher cure rate than traditional therapy, doctors were excited. (Incidentally, I made my first appearance at that same hospital.)

But a few were skeptical. The “questions” in Burton’s title hinge on the statistical methods used in the journal article—which was published in the most prestigious medical journal. Turns out that Rivers did not use all the patients he had entered into his study in the actual statistical analysis. “Statisticians were especially concerned when they noticed that a relatively high proportion of the other 25 — those not included in the final analysis — were either conventional-therapy patients who survived or patients on aggressive therapy who died.”

Why were these patients left out of the analysis? Well, doctor judgment: these 25 patients were not evaluated, at the time, to be as sick, so they were left out. In medical statistics, there is a concept called intent-to-treat, and it means that you must analyze the data putting all patients into the groups that you first put them in no matter what. This procedure is meant to guard against the experimenter effect, which is the boost in results got by the researcher when he, consciously or not, fiddles the patient rolls to get his desired result.
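A small sketch makes the mechanism concrete. The counts below are invented purely for illustration and are not the data from the sepsis study; the only thing borrowed from the story is the pattern of exclusions (conventional-therapy survivors and aggressive-therapy deaths dropped after the fact, 25 in total). The function names are likewise just illustrative.

```python
# Hypothetical counts chosen only to illustrate the mechanism; they are NOT
# the data from the sepsis study.
enrolled = {
    "conventional": {"died": 40, "survived": 60},
    "aggressive": {"died": 35, "survived": 65},
}

# Hypothetical after-the-fact exclusions: 15 conventional-therapy survivors
# and 10 aggressive-therapy deaths (25 patients in all).
excluded = {
    "conventional": {"died": 0, "survived": 15},
    "aggressive": {"died": 10, "survived": 0},
}


def mortality(counts):
    """Fraction of patients in one arm who died."""
    return counts["died"] / (counts["died"] + counts["survived"])


def mortality_gap(groups):
    """Conventional-arm mortality minus aggressive-arm mortality."""
    return mortality(groups["conventional"]) - mortality(groups["aggressive"])


def drop_excluded(groups, removed):
    """Return the counts that remain once the excluded patients are removed."""
    return {
        arm: {outcome: groups[arm][outcome] - removed[arm][outcome]
              for outcome in groups[arm]}
        for arm in groups
    }


# Intent-to-treat: everybody stays in the arm they were assigned to.
itt_gap = mortality_gap(enrolled)
# Analysis after the exclusions: same trial, fewer patients.
trimmed_gap = mortality_gap(drop_excluded(enrolled, excluded))

print(f"intent-to-treat mortality gap: {itt_gap:.3f}")
print(f"mortality gap after exclusions: {trimmed_gap:.3f}")
```

With these made-up numbers the apparent benefit of the aggressive therapy grows severalfold once the 25 patients are dropped, even though nothing about the treatment changed.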

Why wasn’t the original paper criticized on these grounds? A peer-reviewed paper, I should emphasize. Are we saying it is possible that published research could be wrong?

2. Thanks to reader Gabe Thornhill for pointing out another excellent piece, Keith J. Winstein’s article “Boston Scientific Stent Study Flawed.” You might remember Mr Winstein was the only reporter to get the story about boys’ and girls’ mathematical abilities correct.

The story is that Boston Scientific (BS) introduced a new stent called the Taxus Liberte. (A stent is an artificial pipe support that is stuck in blood vessels to keep them from being choked off by gunk.) BS did the proper study to show the stent worked, but analyzed their data in a peculiar way.

Readers of this blog might remember Chapter 14 of my book: How to Cheat. In classical statistics, an excellent way to cheat, and a method you can almost always get away with, is to change your test statistic so you get the p-value you desire. For any set of data there are dozens of test statistics from which to choose. Each of them will give you a different p-value. For no good reason, the p-value has to be less than 0.05 for a success. So what you do is keep computing different statistics until you find the one which gives you the lowest p-value.

This trick nearly always works, too. It carries a double-bang, because not only can you nearly always find a publishable p-value, nobody can ever remember the actual definition of a p-value. Smaller p-values are usually accompanied by the claim that the results are “stronger” or “more significant”. False, of course, but since everybody says so you will be in good company.
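To see how the same data can yield several p-values, here is a minimal sketch using only the Python standard library. It applies three common test statistics for comparing two proportions (an unpooled Wald z-test, a pooled score z-test, and Fisher's exact test, all one-sided) to one table. The counts, 18 events in 100 versus 9 events in 100, are hypothetical and have nothing to do with the stent data.

```python
import math


def upper_tail_normal(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def wald_p(x1, n1, x2, n2):
    """One-sided Wald test: standard error uses each group's own rate."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return upper_tail_normal((p1 - p2) / se)


def score_p(x1, n1, x2, n2):
    """One-sided score test: standard error uses the pooled rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return upper_tail_normal((p1 - p2) / se)


def fisher_p(x1, n1, x2, n2):
    """One-sided Fisher exact test via the hypergeometric distribution."""
    events = x1 + x2
    total = n1 + n2
    tail = sum(
        math.comb(events, k) * math.comb(total - events, n1 - k)
        for k in range(x1, min(events, n1) + 1)
    )
    return tail / math.comb(total, n1)


# Hypothetical data: 18 of 100 events in group 1, 9 of 100 in group 2.
data = (18, 100, 9, 100)
p_values = {
    "wald": wald_p(*data),
    "score": score_p(*data),
    "fisher": fisher_p(*data),
}
for name, p in p_values.items():
    print(f"{name:>6}: p = {p:.4f}")
```

Same numbers, same question, three different answers. Anyone hunting for the smallest one only has to pick the row he likes.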

Actually, Mr Winstein has two definitions in his piece that aren’t quite right. The first:

Using a standard probability measure known as the “p-value”, it said that there was less than a 5% chance that its finding was wrong

and

[S]cience traditionally requires 95% certainty that a study proved its premise.

Pay attention. Here is the actual definition of a p-value, adapted to the stent study. For the two sets of data, one for the BS stent, one for another stent, posit a probability distribution which describes your uncertainty in the measures resulting from using these stents. These probability distributions have parameters, which are unknown unobservable numbers that are needed to fully specify the probability distributions.

Now, ignore some of these parameters, and concentrate on just one from each distribution (one for the BS stent, one for the other) and then say that one parameter for the BS stent is exactly equal to the parameter for the other stent. Then calculate a statistic. From above, we know we have the choice of several (and Mr Winstein has an excellent graph showing some possible choices). Here comes the p-value. It is the probability that, if you repeated the same experiment an infinite number of times, you would see a statistic as large or larger than the one you actually got, given those two parameters you picked were exactly equal.

Make sense? Or is it confusing? I’d say the latter. One thing you cannot say is that, for example with a p-value of 0.04, there is a 96% chance that the two stents are the same (BS sought to say their stent was equivalent to its competitor’s). Nor can you say there is a 4% chance you are wrong. All you can say is that there is a 4% chance that, if you repeated the experiment many times, each time calculating the same statistic, any one of those statistics would be as large or larger than the one you got (again, given the two parameters are exactly equal).
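The definition can be acted out directly. The sketch below repeats a hypothetical two-group experiment many times with the two rates forced to be exactly equal, and counts how often the statistic comes out as large or larger than the one "observed". Everything here is an assumption made for illustration: the counts (18 of 100 versus 9 of 100), the binomial model, and especially the common rate, which is set to the pooled observed rate because the equality assumption by itself does not say what the shared value is.

```python
import random


def simulated_p_value(x1, n1, x2, n2, common_rate, repeats=5000, seed=0):
    """Fraction of simulated experiments, run with both groups sharing
    `common_rate`, whose difference in proportions is as large or larger
    than the observed one."""
    rng = random.Random(seed)
    # Compare x1/n1 - x2/n2 using integers to avoid floating-point ties.
    observed = x1 * n2 - x2 * n1
    as_large = 0
    for _ in range(repeats):
        sim1 = sum(rng.random() < common_rate for _ in range(n1))
        sim2 = sum(rng.random() < common_rate for _ in range(n2))
        if sim1 * n2 - sim2 * n1 >= observed:
            as_large += 1
    return as_large / repeats


# Hypothetical data, not the stent study.
x1, n1, x2, n2 = 18, 100, 9, 100
pooled_rate = (x1 + x2) / (n1 + n2)  # assumed value of the shared parameter

p_sim = simulated_p_value(x1, n1, x2, n2, pooled_rate)
print(f"simulated p-value: {p_sim:.3f}")
```

Note what the printed number is: a frequency of statistics among imaginary repetitions in which the parameters were equal by construction. Nowhere in the calculation does the probability that the two groups are the same, or that you are wrong, appear.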

Whew. A lot of work to get to this point, I agree. But this is it, because nobody (not even professorial classical statisticians, as we’ll see in a moment) can actually remember this definition. This is what makes it possible to cheat.

Boston Scientific used something called a Wald test, which is a way to approximate the p-value, because often p-values cannot be computed exactly. It is well known, however, that this method gives poor approximations and often gives p-values that are smaller than they should be. However, all this is conditional on the test statistic used being correct, and on the probability distributions chosen for the observable data being correct, and on the parameters you ignored to set up the p-value being ignorable, always huge assumptions. This is why it is strange to see, near the very end of the article, a professor of statistics say that the imperfect Wald method is commonly used but that

Most statisticians would accept this approximation. But since this was right on the border [meaning the p-value was barely under the magic number], greater scrutiny reveals that the true, the real, p-value was slightly more than 5%

The true, the real? The problem here is there is no true or real p-value. Each of the p-values computed by using the different statistics is the true, real one. This is one of the main problems with classical statistics. Another is the persnickety insistence on exactly 0.05 as the cutoff. Got a p-value of 0.050000001? Too bad, chum. Have a 0.0499999999 instead? Success! It’s silly.
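The earlier remark that the Wald approximation tends to produce p-values that are too small can be checked by brute force in a simplified setting. The sketch below is a one-sample analogue, not the two-group design of the stent study, and every number in it is hypothetical: 30 patients, a success rate of 0.8 under the assumed model, and a one-sided test at the nominal 5% level (normal critical value 1.645). It enumerates every possible outcome and adds up the binomial probability of those where the Wald test declares success.

```python
import math


def wald_rejects(x, n, p0, z_crit=1.645):
    """One-sided Wald test of 'rate > p0' using the estimated standard error."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if se == 0:
        # All successes or all failures: the Wald statistic is undefined,
        # so fall back to comparing the estimate with p0 directly.
        return p_hat > p0
    return (p_hat - p0) / se > z_crit


def actual_rejection_rate(n, p0):
    """Probability, when the rate really is p0, that the Wald test rejects."""
    return sum(
        math.comb(n, x) * p0**x * (1 - p0) ** (n - x)
        for x in range(n + 1)
        if wald_rejects(x, n, p0)
    )


# Hypothetical setting: 30 patients, assumed rate 0.8, nominal level 5%.
rate = actual_rejection_rate(30, 0.8)
print(f"nominal level: 0.05, actual rejection rate: {rate:.4f}")
```

In this toy setting the test that advertises a 5% error rate rejects more than twice that often when the assumed model holds. This says nothing about which p-value is the "real" one; it only shows that the approximation and the exact binomial calculation, under the very same assumptions, disagree.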

Obviously, misinterpreting p-values is a big problem. But ignore that. Winstein and the WSJ have done a wonderful job summarizing a difficult topic. Are you ready for this? They actually got the data and recomputed the statistical tests themselves! This is real science reporting. It must have taken them a lot of effort. If only more journalists would put in half as much work as Mr Winstein, we’d have eighty percent less junk being reported as “news.” In short, read Winstein’s article. He has quotes from Larry Brown, one of the top theoretical statisticians alive, and comments from officials at the FDA about why these kinds of studies are accepted or not.

5 Comments

tesla

August 16, 2008, 8:54 am

I have a lot of experience in meeting arbitrary thresholds. In engineering you have a lot more freedom than with medical studies. If you need to meet a threshold but are slightly over (or under, depending on which side is good) there are many tricks you can pull. Sometimes rounding and truncating numbers at the correct decimal place is enough. Smoothing the data with certain assumptions about the noise can also work wonders, and no one can say much about the filtering process unless they know a lot about the noise mechanism.

The best way to meet arbitrary thresholds if you’re far away is to throw out “bad” data, so long as you can tack on a convincing sounding argument about why it would be idiotic to include the “bad” figures. When explaining why you threw out the bad data be sure to include lots of “obviously” and “clearly” descriptors. These seem to greatly reduce the probability of getting called on the data removal.

If you can’t meet spec despite all of this the best thing is to invent a new statistic where you do much better and mount a vigorous offensive as to why this is really what matters. Boston Scientific was constrained by the dictates of their field to work with p-values so there wasn’t much they could do except change the criteria (Wald, etc.).

I’ve just scratched the surface here. There are countless things you can do to meet the numbers without lying at all.

Clearly I’ve never actually done any of this, and none of it has ever been published in refereed journals 🙂

Brian

August 16, 2008, 11:21 am

“Another is the persnickety insistence on exactly 0.05 as the cutoff. Got a p-value of 0.050000001? Too bad, chum. Have a 0.0499999999 instead? Success! It’s silly.”

Is this outlook the general reason AGW proponents are unperturbed by findings from stat bloggers such as LUCIA that IPCC’s projections “falsify”? For many (if not most), 90% certainty is as good as 95%? The “cause” (e.g., save the planet) requires more flexibility, and “drawing the line” becomes statistically more irrelevant?

BRIAN M FLYNN

lucia

August 16, 2008, 4:04 pm

Brian–

Actually, for the tests I happen to be doing, for data from most measurement groups, we are nowhere near the threshold for the cut-off of p=0.05, using a classic test. (With GISS temp, we aren’t falsifying at all though.) HadCrut in particular is well, well past p=0.05 using that test.

From what I can tell, different people are not perturbed for different reasons. The primary ones I am aware of are:

1) There is no particularly strong phenomenological theory saying weather noise must be AR(1). The statistical tests I do assume the noise is AR(1).

I actually agree with this criticism generally. However, until such time as someone either proposes another statistical model that makes sense or shows that AR(1) is inconsistent with the data during periods free of stratospheric volcanic eruptions, I’m going with that one. (I take the same point of view in treating the residuals as normally distributed. Unless someone can show they aren’t or suggest a reason they should do something else, I’m assuming that.)

2) Some believe we must take the variability of trends calculated over 90 months based on the variability in an ensemble of model runs. That variability is twice the variability estimated using OLS corrected for red noise (i.e. AR(1) noise).

This argument might be reasonable if models do get the spectral properties of weather sufficiently well to give us a reasonably correct distribution of trends calculated over 90 months. (Or, at least we can’t discount that idea if we can’t prove they don’t.)

I’m in the process of looking at the model runs and at a few questions we can answer by looking at the period during the 20s and 30s when volcanoes weren’t erupting, and then also at the data in the recent period. It looks like the models overestimated the variability of 10 year trends back then noticeably. But I haven’t finished downloading all the data, so I can’t say anything definitive about that.

3) Some suggest we specifically need to consider El Nino as the “non-AR(1)” feature. However, if we use the correction Gavin discussed in his post correcting for El Nino and apply that to the data, the “El Nino” corrected data ends up “falsifying” worse. (There are problems with doing this correction. But since the issue was raised, I showed what we would get if we just assume we can correct the data as Gavin described in his post. BTW: Gavin wasn’t the originator of the correction. He got it from another paper.)

As for other reasons some aren’t perturbed: You’ll probably have to ask each person what specifically they don’t like. But the “biggies” are really (1) and (2) above. I’ve been paying attention to the various criticisms, and I’m trying to do things to deal with those that make sense. But… alas… I am not a real statistician. I’m a mechanical engineer. 🙂

Joe Triscari

August 17, 2008, 11:10 pm

My impression of the stent study is that while they discuss some of the issues well, there is a journalistic overreaction going on.

I’m not part of the pharmaceutical industry (my father who has the same name is a biochemist so don’t be confused). My impression is that the FDA requires that the experiment and statistical tests be identified before the experiment starts and that any deviations have to be justified through a fairly rigorous process. So I don’t think BS could have been fishing for a result by changing the test after the case. Looking at the article it appears BS claims they followed the process.

If they didn’t follow the procedures, this is a big problem.

The FDA statisticians are not naive. They understand the risks of dredging for results in a sea of data. They have procedures to prevent it.

Given the FDA quotes, I don’t think the WSJ uncovered anything like malfeasance; it is more like they are pointing at the exotic test and saying, “Why?” They answer their own question: The FDA wants to be more lenient on tests with devices than tests with drugs and companies want to maximize the chance of approval.

They should be asking, “Is that good FDA policy?” I don’t know. I suppose the more up front way would be to say, “We’re going to allow you to reject the null hypothesis when there’s a 3 in 50 chance of doing so mistakenly when you are testing devices but you have to use the usual tests.” That could lead to other problems like people trying to get drug procedures classified as “devices” on the thinnest of excuses.

I don’t think it’s a statistics problem although it does raise a legitimate policy concern.