Many, even most, studies that contain a statistical component use *frequentist*, also called classical, techniques. The gist of those methods is this: data is collected, a probability model for that data is proposed, a function of the observed data—a *statistic*—is calculated, and then a thing called the *p-value* is calculated.

If the p-value is less than the **magic number** of 0.05, the results are said to be “statistically significant” and we are asked to believe that the study’s results are true.

I’ll not talk here in detail about p-values; but briefly, to calculate one, a belief about certain mathematical *parameters* (or indexes) of the probability model is stated. It is usually that these parameters equal 0. *If* the parameters truly are equal to 0, then the study is said to have no result. Roughly, the p-value is the probability of seeing another statistic (in infinite repetitions of the experiment) *larger than the statistic the researcher got in* this *study, assuming that the parameters in fact equal 0*.

For example, suppose we are testing the difference between a drug and a placebo. If there truly is no difference in effect between the two, i.e. the parameters are actually equal to 0, then 1 out of 20 times we did this experiment, we would expect to see a p-value less than 0.05, and so falsely conclude that there *is* a statistically significant difference between the drug and placebo. We would be making a mistake, and the published study would be false.
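This 1-in-20 behavior is easy to see by simulation. A minimal sketch (my own, not from the post; the group sizes and simulation count are arbitrary choices) that repeatedly runs a drug-vs-placebo experiment where the true difference is exactly zero:

```python
# Simulate many experiments with NO true drug effect and count how
# often an ordinary t-test still produces p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_per_group = 2000, 50

false_positives = 0
for _ in range(n_sims):
    drug = rng.normal(0, 1, n_per_group)     # both groups drawn from the
    placebo = rng.normal(0, 1, n_per_group)  # same distribution: no effect
    _, p = ttest_ind(drug, placebo)
    if p < 0.05:
        false_positives += 1

print(false_positives / n_sims)  # hovers near 0.05, i.e. about 1 in 20
```

The fraction of "significant" results settles near 0.05 no matter how many experiments you run, which is exactly the 1-out-of-20 mistake rate described above.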

##### Is 1 out of 20 a lot?

Suppose, as is true, that about 10,000 issues of medical journals are published in the world each year. This is about right to within an order of magnitude. The number may seem surprisingly large, but there are an enormous number of specialty journals, in many languages, hundreds coming out monthly or quarterly, so a total of 10,000 over the course of the year is not too far wrong.

Estimate that each journal has about 10 studies it is reporting on. That’s about right, too: some journals report dozens, others only one or two; the average is around 10.

So that’s 10,000 x 10 = 100,000 studies that come out each year, in medicine alone.

If all of these used the p-value method to decide significance, then about 1 out of 20 studies will be falsely reported as true, thus about 5000 studies will be reported as true but will actually be false. And these will be in the best journals, done by the best people, and taking place at the best universities.

It’s actually worse than this. Most published studies do not have just one result which is reported on (and found by p-value methods). Typically, if the main effect the researchers were hoping to find is insignificant, the search for other interesting effects in the data is commenced. Other studies look for more than one effect by design. Plus, for all papers, there are usually many subsidiary questions that are asked of the data. It is no exaggeration, then, to estimate that 10 (or even more) questions are asked of each study.

Let’s imagine that a paper will report a “success” if just one of the 10 questions gives a p-value less than the **magic number**. Suppose, just for fun, that every question in every study in every paper is false. We can then calculate the chance that a given paper **falsely reports success**: it is just over **40%**.

This would mean that about 40,000 out of the 100,000 studies each year would falsely claim success!
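The arithmetic behind that 40% figure is a one-liner. A small sketch (mine, using the assumptions stated above: 10 independent questions per paper, all truly null, threshold 0.05):

```python
# Chance that at least one of 10 truly-null questions slips under p < 0.05:
# the complement of all 10 staying above the threshold.
questions_per_paper = 10
p_false_success = 1 - 0.95 ** questions_per_paper
print(round(p_false_success, 3))  # 0.401 -- "just over 40%"

# Applied to the estimated 100,000 medical studies per year:
studies_per_year = 100_000
print(round(studies_per_year * p_false_success))  # roughly 40,000 false successes
```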

That’s too high a rate for actual papers—after all, many research questions are asked which have a high prior probability of being true—but the 5000 out of 100,000 is also too low, because the temptation to go fishing in the data is too great. It is far too easy to make these kinds of mistakes using classical statistics.

The lesson, however, is clear: read all reports, especially in medicine, with a skeptical eye.

Bill,

I am surprised that there is even the likelihood that medical reports, i.e. peer reviewed papers, would have such a high error rate even at the rate of 1:20, or about 5000 in total.

Doesn’t say much for the peer review process, when lives are at risk. Imagine what it would be for AGW science.

Cheers

Malcolm Hill

Adelaide

Malcolm,

That error rate is a direct consequence of using the p-value approach: by construction, there is a 1 in 20 chance of seeing a statistic as large or larger (in absolute value) than the one you happened to get, *given* that no effect exists. You cannot guard against it, either, in the peer review process. After all, both the people who wrote the paper and the reviewer see a p-value less than the magic number. So both sides assume that the effect is real, and the data you have do seem to support it (using classical statistics).

Nowhere in classical statistics can you answer the question: “What is the probability that the effect is real?” That always surprises people, but I’m afraid it’s true.

I’ll soon post an article about how you can make even bigger mistakes if you follow a very standard procedure, one that is almost guaranteed to give you a p-value less than the magic number, but where all the data is completely made up.

(What’s “AGW”?)

Briggs

Bill

AGW = Anthropogenic Global Warming

Look forward to the post referred to above.

Methinks that one should read all reports with a sceptical eye, not just medical.

Cheers

I see a couple of issues in the 1 in 20 claim. First, the 1 in 20 would apply to the proportion of all studies (published or not) where the true effect is 0 (this ignores your good point that many studies look at multiple outcomes). Second, the p-value is a conditional probability, and the wrong one. You really want the probability that the result is false, given that it is published.

I’ll get to the second point below, but as for the first: when you limit the analysis to published studies, all the p-values are less than .05. In studies published in major journals, I would imagine most of them are much lower than 0.05, thus indicating a conditional probability of a lot less than 1 in 20.

Let’s take the lead article of the current issue of the NE journal of Medicine. This article (see http://content.nejm.org/cgi/content/short/358/1/9 ) talks about the effect of delayed time to defibrillation. The total sample size is more than 6,000, of whom more than 2,000 had delayed defib. The p-value is well below 1 in 10,000. NEJM studies are certainly more likely to be highly significant, so I am not suggesting it is indicative, but we do tend to have more faith (yes I saw your post on this word) in journals like NEJM, and one of the reasons why is that they tend not to publish borderline results.

My next point has to do with which probability to measure. The p-value is the probability of getting a result at least as different from the “no effect” result as the one obtained, given that the true state of the world is that there is no effect (I can work to get the language better, but we both know the definition so I’ll move on). What we are really looking for is the probability that there is no effect, given the study is published. I tend to think that in better reviewed journals, reviewers will be wary of publishing startling new results without some outside proof or even independent replication.

Even ignoring this, the probability relies in part on what percentage of research studies are of something that truly has no effect. If we assume this is one-half, then around 1 in 20 published studies have spurious results, supporting your claim (consider 1000 studies, 500 with no effect and 500 with an effect; assuming a “power” of 95%, about 475 of the “with effect” studies will get published, and at an alpha of 5% about 25 of the no-effect studies will get published). But in order to do these studies, you have to get funding, etc., which is difficult to do. If this extra layer of difficulty means that, in fact, 80% of research is on something that has a non-zero effect, then we have closer to a 1% chance that any given published study has incorrectly statistically significant results.
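Alan’s back-of-envelope calculation can be written out directly. A sketch (mine, using the numbers from his comment: 95% power, 5% alpha, and the assumption that every significant result gets published):

```python
# Fraction of *published* (i.e. significant) studies whose result is
# spurious, given what fraction of studied effects are truly non-zero.
def false_publication_rate(n_studies, frac_real_effect, power=0.95, alpha=0.05):
    real = n_studies * frac_real_effect
    null = n_studies * (1 - frac_real_effect)
    published_true = real * power    # real effects correctly detected
    published_false = null * alpha   # null effects crossing p < 0.05
    return published_false / (published_true + published_false)

# Half of all studied effects real: about 1 in 20 published results false.
print(round(false_publication_rate(1000, 0.5), 3))  # 0.05

# If funding hurdles mean 80% of studied effects are real: nearer 1%.
print(round(false_publication_rate(1000, 0.8), 3))  # 0.013
```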

-Alan Salzberg

Thanks for your insightful comments, Alan.

You’re quite right that the calculations I did ignored—purposely—the prior probabilities of the studies being true. I wanted to highlight the non-intuitive, and even surprising, result that so many false studies could be published given that all effects are actually zero. But you’re right to stress that this is not the whole answer to the question.

That the p-value *is* a conditional probability is nearly always under-appreciated. To emphasize (for our readers): it is the probability of seeing a larger statistic than what you got *given* the effect is zero. But it is impossible, even forbidden, in classical statistics to calculate what you actually want to know: (1) the probability the effect is non-zero *given* the actual data observed, and, as you rightly point out, (2) the probability the result is false *given* that it is published. P-values simply cannot answer these questions. What is shocking, however, is that you are not even allowed to ask them in the first place! (In classical, frequentist statistics.)

Now, you mention a particular article in which “The p-value is well below 1 in 10,000” which is “highly significant.” There is trouble here, because all we know is that we are unlikely to see a particular value of some statistic given the effect is 0. Unfortunately, a low p-value does *not* always imply a high probability that the effect is non-zero.

Regarding your last point, i.e., that about “a 1% chance that any given published study has incorrectly statistically significant results”, there has been some work on this in the published literature; I think in the *British Medical Journal* and also the on-line *PLoS Medicine*. I’ll have to dig up these articles to quote more exactly, but one of them guessed that as many as 50% of all studies are incorrect! I think that estimate is high, but only because I am an optimist. Though I’ll soon make a posting about a standard analysis practice that is bound to give a false result, so perhaps I should be more of a pessimist.

I actually found one of the studies fairly quickly. It is by John P. A. Ioannidis and can be found in the Public Library of Science: Medicine.

Here is his abstract:

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

The conceptual problem, of course, is the perceived truth. Even if we have a rigorous study, with a large, random data set, and even if the alpha is 0.01, statistical significance is still a statement of probability, not certainty. An honest (or thinking) statistician should always make this clear to reporters.

One thing to consider seriously, however, is that there is more value to a report than just the study’s hypothesis claim. For instance, the data have value. The approach, if novel, has value.