Update: The paper on which this post was based has hit the presses, which Nature took as an occasion to opine that, just maybe, some of the softer sciences have a problem with replication. I thought it important enough to repost (the original appeared on 3 March 2012).
Nature’s article is “Replication studies: Bad copy,” subtitled “In the wake of high-profile controversies, psychologists are facing up to problems with replication.” The meat is found in these two paragraphs:
Positive results in psychology can behave like rumours: easy to release but hard to dispel. They dominate most journals, which strive to present new, exciting research. Meanwhile, attempts to replicate those studies, especially when the findings are negative, go unpublished, languishing in personal file drawers or circulating in conversations around the water cooler. “There are some experiments that everyone knows don’t replicate, but this knowledge doesn’t get into the literature,” says Wagenmakers. The publication barrier can be chilling, he adds. “I’ve seen students spending their entire PhD period trying to replicate a phenomenon, failing, and quitting academia because they had nothing to show for their time.”
These problems occur throughout the sciences, but psychology has a number of deeply entrenched cultural norms that exacerbate them. It has become common practice, for example, to tweak experimental designs in ways that practically guarantee positive results. And once positive results are published, few researchers replicate the experiment exactly, instead carrying out ‘conceptual replications’ that test similar hypotheses using different methods. This practice, say critics, builds a house of cards on potentially shaky foundations.
Ed Yong, who wrote this piece, also opines on fraud, especially the suspicion that this activity has been increasing. Yong’s piece, the Simmons et al. paper, and the post below are all well worth reading.
Thanks to James Glendinning for the heads-up.
My heart soared like a hawk when I read Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn’s “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” published in Psychological Science.
From their abstract:
[W]e show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We…demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.
Preach it, brothers! Sing it loud and sing it proud. Oh, how I wish your colleagues would take your admonitions to heart and abandon the Cult of the P-value!
Rarely have I read such a quotable paper. False positives—that is, false “confirmations” of hypotheses—are “particularly persistent”; “because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them.” False positives “can lead to ineffective policy changes.”
Many false positives are found because of the “researcher’s desire to find a statistically significant result.” Researchers are “self-serving in their interpretation of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires.” And “[a]mbiguity is rampant in empirical research.”
Our goal as scientists is not to publish as many articles as we can, but to discover and disseminate truth.
I am man enough to admit that I wept when I read those words.
The authors include a hilarious—actual—study where they demonstrate that listening to “When I’m Sixty-Four” makes people younger. Not just feel younger, but younger chronologically. Is there nothing statistics cannot do?
They first had two groups listen to an adult song or a children’s song and then asked participants how old they felt afterwards. They also asked participants for their ages and their fathers’ ages, “allowing us to control for variation in baseline age across participants.” They got a p-value of 0.033 “proving” that listening to the children’s song made people feel older.
They then forced participants to listen to a Beatles song or the same children’s song (they assumed there was a difference), and again asked for ages. “We used father’s age to control for variation in baseline age across participants.”
According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to [the children's song] (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.
Ha ha! How did they achieve such a deliciously publishable p-value for a necessarily false result? Because of the broad flexibility in classical statistics, which allows users to “data mine” for small p-values.
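To see what that flexibility buys you, here is a minimal simulation sketch (mine, not the authors’ code; the sample sizes, correlation, and number of outcomes are arbitrary choices). Two groups are drawn from the same distribution, so every “effect” is false, yet a researcher who measures a handful of outcomes and reports whichever gives the smallest p-value crosses p < .05 noticeably more often than 5% of the time.

```python
# Minimal sketch (my own, not the authors' code): two groups drawn from the
# SAME distribution, so every "effect" is false. Measuring several (correlated)
# outcomes and reporting whichever gives the smallest p-value inflates the
# false-positive rate well past the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group, n_outcomes, alpha = 2000, 20, 4, 0.05

# Outcomes are moderately correlated (r = 0.5), as real measures tend to be.
cov = np.full((n_outcomes, n_outcomes), 0.5) + 0.5 * np.eye(n_outcomes)

naive_hits = flexible_hits = 0
for _ in range(n_sims):
    a = rng.multivariate_normal(np.zeros(n_outcomes), cov, size=n_per_group)
    b = rng.multivariate_normal(np.zeros(n_outcomes), cov, size=n_per_group)
    pvals = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_outcomes)]
    naive_hits += pvals[0] < alpha        # one pre-specified outcome
    flexible_hits += min(pvals) < alpha   # best of several outcomes

print(f"False-positive rate, one pre-specified outcome: {naive_hits / n_sims:.3f}")
print(f"False-positive rate, best of {n_outcomes} outcomes: {flexible_hits / n_sims:.3f}")
```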
Ways to Cheat
The authors list six requirements, each aimed at a major mistake that users of statistics make. They themselves committed many of these mistakes in “proving” the results in the experiment above. (We have covered all of these before: click Start Here and look under Statistics.)
“1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.” If not, it is possible to use a stopping rule which guarantees a publishable p-value: just stop when the p-value is small! (See the sketch after this list.)
“2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” Small samples are always suspicious. Were they the result of just one experiment? Or the fifth, discarding the first four as merely warm ups?
“3. Authors must list all variables collected in a study.” A lovely way to cheat is to cycle through dozens and dozens of variables, only reporting the one(s) that are “significant.” If you don’t report all the variables you tried, you make it appear that you were looking for the significant effect all along.
“4. Authors must report all experimental conditions, including failed manipulations.” Self-explanatory.
“5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.” Or: there are no such things as outliers. Tossing data that does not fit preconceptions always skews the results toward a false positive.
“6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.” This is a natural mate for rule 3.
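Rule 1 is the easiest to see by simulation. Below is a minimal sketch of optional stopping (my own illustration, with arbitrary batch and sample-size choices, not code from the paper): collect a batch, run the test, stop the moment p dips under .05, otherwise keep collecting. The data are pure noise, yet the procedure “finds” an effect far more than 5% of the time.

```python
# Minimal sketch (my own, not from the paper) of optional stopping: the data
# are pure noise, but we test after every batch of 10 observations per group
# and stop as soon as p < .05. The false-positive rate ends up well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, max_n, alpha = 2000, 10, 100, 0.05
hits = 0

for _ in range(n_sims):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(size=batch))   # group A: no real effect
        b.extend(rng.normal(size=batch))   # group B: same distribution
        if stats.ttest_ind(a, b).pvalue < alpha:   # peek, stop if "significant"
            hits += 1
            break

print(f"False-positive rate with optional stopping: {hits / n_sims:.3f}")
# A sample size fixed in advance would give roughly 0.05 instead.
```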
The authors also ask that peer reviewers hold researchers’ toes to the fire: “Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.”
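Here is a minimal sketch of why that demand matters (my own illustration with made-up data, not code from the paper): there is no true effect, yet reporting whichever of two specifications, with or without an arbitrary covariate, happens to be “significant” pushes the false-positive rate above the nominal 5%, and the more specifications tried, the worse it gets.

```python
# Minimal sketch (my own, not from the paper): why results should not hinge on
# arbitrary analytic decisions. No true effect exists, yet reporting whichever
# specification (with or without an arbitrary covariate) gives the smaller
# p-value raises the false-positive rate above 5%.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_sims, n, alpha = 2000, 40, 0.05
hits_fixed = hits_shopped = 0

for _ in range(n_sims):
    df = pd.DataFrame({
        "group": np.repeat([0, 1], n // 2),  # treatment indicator, no real effect
        "y": rng.normal(size=n),             # outcome: pure noise
        "cov": rng.normal(size=n),           # arbitrary covariate (think "father's age")
    })
    p_plain = smf.ols("y ~ group", data=df).fit().pvalues["group"]
    p_cov = smf.ols("y ~ group + cov", data=df).fit().pvalues["group"]
    hits_fixed += p_plain < alpha                 # analysis chosen in advance
    hits_shopped += min(p_plain, p_cov) < alpha   # report whichever "works"

print(f"False-positive rate, pre-specified analysis: {hits_fixed / n_sims:.3f}")
print(f"False-positive rate, specification shopping: {hits_shopped / n_sims:.3f}")
```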
Our trio is also suspicious of Bonferroni-type corrections, seeing these as yet another way to cheat after the fact. And it is true that most statistics textbooks say design your experiment and analysis before collecting data. It’s just that almost nobody ever follows this rule.
Bayesian statistics also doesn’t do it for them, because they worry that it increases the researchers’ “degrees of freedom” in picking the prior, etc. This isn’t quite right, because most common frequentist procedures have a Bayesian interpretation with a “flat” prior.
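To make that last point concrete, here is a small numerical check (my own sketch, with simulated data; the true mean, sigma, and sample size are arbitrary): for a normal mean with known spread, the usual 95% confidence interval and the 95% credible interval computed from a flat prior come out the same.

```python
# Minimal sketch (my own illustration, made-up data): for a normal mean with
# known sigma, the usual 95% confidence interval and the 95% credible interval
# under a flat prior are the same interval (up to grid error here).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sigma, n = 2.0, 25
x = rng.normal(loc=10.0, scale=sigma, size=n)
xbar, se = x.mean(), sigma / np.sqrt(n)

# Frequentist: the familiar xbar +/- 1.96 * se interval.
freq_lo, freq_hi = stats.norm.interval(0.95, loc=xbar, scale=se)

# Bayesian with a flat prior: the posterior is proportional to the likelihood,
# so compute it numerically on a grid and read off the central 95% interval.
mu_grid = np.linspace(xbar - 5 * se, xbar + 5 * se, 4001)
log_like = (-0.5 * ((x[:, None] - mu_grid[None, :]) / sigma) ** 2).sum(axis=0)
post = np.exp(log_like - log_like.max())
d = mu_grid[1] - mu_grid[0]
post /= post.sum() * d                      # normalize to a density
cdf = np.cumsum(post) * d
bayes_lo = mu_grid[np.searchsorted(cdf, 0.025)]
bayes_hi = mu_grid[np.searchsorted(cdf, 0.975)]

print(f"Frequentist 95% interval: ({freq_lo:.3f}, {freq_hi:.3f})")
print(f"Flat-prior 95% interval:  ({bayes_lo:.3f}, {bayes_hi:.3f})")
```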
Anyway, the real problem isn’t Bayes versus frequentist. It is the mania for quantification that is responsible for most mistakes. It is because researchers quest for small p-values, and after finding them confuse them for holy grails, that they wander into epistemological thickets.
Now how’s that for a metaphor!
Thanks to reader Richard Hollands for suggesting this topic and alerting me to this paper.