December 26, 2007
Many, even most, studies that contain a statistical component use frequentist, also called classical, techniques. The gist of those methods is this: data is collected, a probability model for that data is proposed, a function of the observed data—a statistic—is calculated, and then a thing called the p-value is calculated.
If the p-value is less than the magic number of 0.05, the results are said to be “statistically significant” and we are asked to believe that the study’s results are true.
I’ll not talk here in detail about p-values; but briefly, to calculate one, a belief about certain mathematical parameters (or indexes) of the probability model is stated. Usually the belief is that these parameters equal 0. If the parameters truly are equal to 0, then the study is said to have no result. Roughly, the p-value is the probability of seeing another statistic (in infinite repetitions of the experiment) larger than the statistic the researcher got in this study, assuming that the parameters in fact equal 0.
For example, suppose we are testing the difference between a drug and a placebo. If there truly is no difference in effect between the two, i.e. the parameters are actually equal to 0, then in about 1 out of every 20 runs of this experiment we would expect to see a p-value less than 0.05, and so falsely conclude that there is a statistically significant difference between the drug and placebo. We would be making a mistake, and the published study would be false.
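To see the 1-in-20 figure come out of a simulation, here is a minimal sketch. It is not how any particular study is analyzed: the names (`two_sided_p_value`, `false_positive_rate`), the sample size of 30 per arm, the known standard deviation of 1, and the choice of a two-sided z-test are all my assumptions for illustration. Both arms are drawn from the same distribution, so every "significant" result is a false one.

```python
import random
from statistics import NormalDist, mean

def two_sided_p_value(drug, placebo, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (sigma assumed known)."""
    n, m = len(drug), len(placebo)
    standard_error = sigma * (1 / n + 1 / m) ** 0.5
    z = (mean(drug) - mean(placebo)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=5000, n_per_arm=30, alpha=0.05, seed=0):
    """Fraction of no-effect experiments that still give p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Drug and placebo come from the same distribution: no real difference.
        drug = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        placebo = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        if two_sided_p_value(drug, placebo) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"Share of null experiments declared significant: {rate:.3f}")
```

The printed share should land close to 0.05, which is the 1 out of 20 described above.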
Is 1 out of 20 a lot?
Suppose, as is true, that about 10,000 issues of medical journals are published in the world each year. This is about right to within an order of magnitude. The number may seem surprisingly large, but there are an enormous number of specialty journals, in many languages, hundreds coming out monthly or quarterly, so a total of 10,000 over the course of the year is not too far wrong.
Estimate that each journal has about 10 studies it is reporting on. That’s about right, too: some journals report dozens, others only one or two; the average is around 10.
So that’s 10,000 x 10 = 100,000 studies that come out each year, in medicine alone.
If all of these used the p-value method to decide significance, then about 1 out of 20 studies will be falsely reported as true, thus about 5000 studies will be reported as true but will actually be false. And these will be in the best journals, done by the best people, and taking place at the best universities.
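The back-of-the-envelope arithmetic so far can be written out as a short sketch. The variable names are mine; the numbers are the rough estimates given above, and the 1-in-20 rate is applied to every study, as the paragraph does.

```python
# Rough estimates from the text; all order-of-magnitude guesses.
issues_per_year = 10_000      # medical journal issues published worldwide
studies_per_issue = 10        # average number of studies per issue
alpha = 0.05                  # the "magic number" threshold for the p-value

studies_per_year = issues_per_year * studies_per_issue
# Expected count of studies that clear the threshold by chance alone.
expected_false = round(studies_per_year * alpha)

print(f"Studies per year: {studies_per_year:,}")
print(f"Expected false 'significant' studies: {expected_false:,}")
```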
It’s actually worse than this. Most published studies do not have just one result that is reported on (and found by p-value methods). Typically, if the main effect the researchers were hoping to find is insignificant, the search for other interesting effects in the data is commenced. Other studies look for more than one effect by design. Plus, for all papers, there are usually many subsidiary questions that are asked of the data. It is no exaggeration, then, to estimate that 10 (or even more) questions are asked of each study.
Let’s imagine that a paper will report a “success” if just one of the 10 questions gives a p-value less than the magic number. Suppose, for fun, that every question in every study in every paper is false. We can then calculate the chance that a given paper falsely reports success, treating the 10 questions as independent of one another: it is just over 40%.
This would mean that about 40,000 out of the 100,000 studies each year would falsely claim success!
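Where the 40% comes from: if each question has a 0.95 chance of not crossing the threshold, the chance that all 10 stay below it is 0.95 raised to the 10th power, and the chance that at least one crosses is 1 minus that. A minimal sketch, assuming the 10 questions are independent (real questions asked of the same data usually are not, so treat this as an illustration); the variable names are mine.

```python
alpha = 0.05                 # chance a single no-effect question gives p < 0.05
questions_per_study = 10     # estimated questions asked of each study
studies_per_year = 100_000   # rough yearly count of medical studies

# Chance that at least one of the independent questions is falsely "significant".
p_false_success = 1 - (1 - alpha) ** questions_per_study
false_successes = studies_per_year * p_false_success

print(f"Chance a paper falsely reports success: {p_false_success:.1%}")
print(f"Papers per year falsely claiming success: about {false_successes:,.0f}")
```

Changing `questions_per_study` shows how quickly the chance climbs as more questions are asked of the same study.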
That’s too high a rate for actual papers (after all, many research questions are asked which have a high prior probability of being true), but the 5000 out of 100,000 is also too low because the temptation to go fishing in the data is too high. It is far too easy to make these kinds of mistakes using classical statistics.
The lesson, however, is clear: read all reports, especially in medicine, with a skeptical eye.