New Paper Shows Statistical Errors At 50% to 100% In Peer-Reviewed Neuroscience Journals

The error is often reported like this: “The percentage of neurons showing cue-related activity increased with training in mutant mice (P < 0.05), but not in control mice (P > 0.05).”

Do you see it? That error, to be explained momentarily, was found in half the peer-reviewed neuroscience papers examined by Sander Nieuwenhuis, Birte Forstmann, and Eric-Jan Wagenmakers, in the paper “Erroneous analyses of interactions in neuroscience: a problem of significance”.

The researchers who made these statements wanted to claim that one effect (for example, the training effect on neuronal activity in mutant mice) was larger or smaller than the other effect (the training effect in control mice). To support this claim, they needed to report a statistically significant interaction (between amount of training and type of mice), but instead they reported that one effect was statistically significant, whereas the other effect was not. Although superficially compelling, the latter type of statistical reasoning is erroneous because the difference between significant and not significant need not itself be statistically significant.

If that is still confusing, here is more detail. The trained mice have measurements before and after training; the control mice also have measurements before and after, but have no (or different) training. The error lies in reporting that the change between the before and after means of the trained mice was “statistically significant,” claiming that the change for the control mice was not, and stopping there.

What should have been reported was the differences in changes in the before and after means between the two groups; that is, between the training and control mice. Stated another way, if the difference in means from the before and after measures in the training mice was T, and the difference in means from the before and after measures in the control mice was C, then the proper thing to do (classically) is to report on T – C.

The other way is an error because attaining “statistical significance” is easy (way too easy). The before-after difference for T can reach “significance” while the before-after difference for C does not. But that does not mean that T – C will reach “significance.” Even if the before-after difference for C is also “significant”, it still does not mean that T – C will reach “significance.” And so on.
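To make the arithmetic concrete, here is a minimal sketch in Python. The numbers are invented for illustration and come from no study. It also assumes the two groups are independent and that a plain normal (z) approximation is good enough, which a real analysis would have to check.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical before-to-after changes and their standard errors.
T, se_T = 25.0, 10.0   # change in the trained group
C, se_C = 10.0, 10.0   # change in the control group

# The erroneous procedure: test each change separately.
p_T = two_sided_p(T / se_T)
p_C = two_sided_p(C / se_C)

# The comparison that actually bears on "T is bigger than C".
diff = T - C
se_diff = math.sqrt(se_T**2 + se_C**2)  # assumes independent groups
p_diff = two_sided_p(diff / se_diff)

print(f"T alone: p = {p_T:.3f}")
print(f"C alone: p = {p_C:.3f}")
print(f"T - C:   p = {p_diff:.3f}")
```

With these made-up inputs, T alone comes in at roughly p = 0.01 and C alone at roughly p = 0.32, which is exactly the "significant in one group, not in the other" pattern. Yet T – C sits at roughly p = 0.29, nowhere near the magic 0.05.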

The gist is that those people making the mistake go away more certain about their results than they should be. Results which are thought “significant” are not.

Nieuwenhuis et al. found 157 neuroscience papers which reported on differences: half of them made the error noted above. They then

reviewed an additional 120 cellular and [peer-reviewed] molecular neuroscience articles published in Nature Neuroscience in 2009 and 2010 (the first five Articles in each issue). We did not find a single study that used the correct statistical procedure to compare effect sizes. In contrast, we found at least 25 studies that used the erroneous procedure and explicitly or implicitly compared significance levels.

Among the studies that made such a comparison, an error rate of 100% can’t be beat! But there’s more: “we found that the error also occurs when researchers compare correlations.” And then there’s the latest fad of electronic phrenology, where hardly a week passes without some important result being announced, the result based on different areas of the brain glowing between test and control subjects:

[T]he error of comparing significance levels is especially common in the [peer-reviewed] neuroimaging literature, in which results are typically presented in color-coded statistical maps indicating the significance level of a particular contrast for each (visible) voxel. A visual comparison between maps for two groups might tempt the researcher to state, for example, that “the hippocampus was significantly activated in younger adults, but not in older adults”. However, the implied claim is that the hippocampus is activated more strongly in younger adults than in older adults, and such a claim requires a direct statistical comparison of the effects.
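The same trap is easy to show for the correlation case mentioned above. The sketch below uses the Fisher z-transformation, a standard large-sample approximation and not necessarily the procedure Nieuwenhuis et al. have in mind. The correlations and sample sizes are invented, and the two samples are assumed to be independent.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return math.atanh(r)

# Hypothetical correlations from two independent groups.
r1, n1 = 0.50, 20
r2, n2 = 0.20, 20

# The erroneous procedure: test each correlation against zero separately.
# Approximate standard error of fisher_z(r) is 1 / sqrt(n - 3).
p1 = two_sided_p(fisher_z(r1) * math.sqrt(n1 - 3))
p2 = two_sided_p(fisher_z(r2) * math.sqrt(n2 - 3))

# The direct comparison: is r1 different from r2?
se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z_diff = (fisher_z(r1) - fisher_z(r2)) / se_diff
p_diff = two_sided_p(z_diff)

print(f"r1 vs 0:  p = {p1:.3f}")
print(f"r2 vs 0:  p = {p2:.3f}")
print(f"r1 vs r2: p = {p_diff:.3f}")
```

Under these assumptions the first correlation is “significant”, the second is not, and the difference between them is not. Saying “the correlation was present in group one but not in group two” would repeat the very error under discussion.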

The authors never answer the most important question: why? How can so many peers review and approve results which are wrong? I think the answer lies in the poor statistical training these scientists receive as graduate students.

For classwork, most life scientists are crammed into lecture halls and taught cookbook formulas, with an emphasis on calculations (by hand many times, just for the fun of it) and the math (because it is so pretty). Canned examples with pleasing results are used.

Understanding is given short shrift. Instead students get rules like, “You have data like this? Plug ’em into formula 32.” Only a lack of understanding can explain so silly a mistake as the one our authors found. Researchers set out to prove that “training works” and then forget, mid-analysis, what “works” means. The peers who review the paper also forget. Yet more evidence that statistical training needs to be completely revamped.


Thanks to Bruce Andrews who brought this topic to our notice.

Apropos peer-review: See this new column.


  1. We recently discussed a similar error in a paper:
    We have treatments A and B, with means 2.45 and 2.40, respectively, and the control C with mean 2.2. A is reported significantly better than C (P<0.05) but B is not (no p-value reported, my guess is ~0.05 too). So from now on we should only use A… which was not proved to be better than B anyway. In fact, to us, the rationale behind B is more compelling than A's.

    As to the stats training of reviewers, yeah! I think I read in your blog about the unfounded "at least 30-points sample or we won't publish" argument. And I could tell you some examples of people asking: "I ran correlations, regressions, t-tests…what else could I report?"… that is, that thing that you should think of *before* running the experiment.

    Anyway, nice reading as usual.

  2. But this isn’t a strictly statistical question–that’s the problem. This is more fundamental– a LOGIC question. They’re not failing at what test to use, they’re failing at determining what they’re supposed to prove. They’re focused on the means, when they should focus on the delta.

  3. I think it’s a problem of interpretation. In my example, A>C and B=C, so intuitively one might expect A>B as well. The extreme case would be X>Y, Y>Z and X>Z, but sometimes you might get X=Z, which breaks the intuitive transitivity. Simply put: you could prove X>Y, but not X=Y. Failing to reject the null hypothesis does not mean it is true. In fact, it rarely is in these cases.

  4. Good post.

    I agree with Brandon. A test statistic is actually a distance measure. So, I live 1 mile away (not significantly far) from the Strand Bookstore, and you live 2 miles away (significantly far, at least for me anyway) from it. It doesn’t necessarily mean we live far away from each other. Do we need to be a statistician like Mr. Briggs to understand the reason behind it? No.

    Of course, I’d say that the solution lies in collaborating with qualified and experienced statisticians because

    s t a t i s t i c i a n s   r u l e.

    My objective opinion. ^_^
