How To Present Anything As Significant. Update: Nature Weighs In

Update: The paper on which this post was based has hit the presses, which Nature used as an occasion to opine that, just maybe, some of the softer sciences have a problem with replication. I thought it important enough to repost (the original ran on 3 March 2012).

Nature’s article is “Replication studies: Bad copy: In the wake of high-profile controversies, psychologists are facing up to problems with replication.” The meat is found in these two paragraphs.

Positive results in psychology can behave like rumours: easy to release but hard to dispel. They dominate most journals, which strive to present new, exciting research. Meanwhile, attempts to replicate those studies, especially when the findings are negative, go unpublished, languishing in personal file drawers or circulating in conversations around the water cooler. “There are some experiments that everyone knows don’t replicate, but this knowledge doesn’t get into the literature,” says Wagenmakers. The publication barrier can be chilling, he adds. “I’ve seen students spending their entire PhD period trying to replicate a phenomenon, failing, and quitting academia because they had nothing to show for their time.”

These problems occur throughout the sciences, but psychology has a number of deeply entrenched cultural norms that exacerbate them. It has become common practice, for example, to tweak experimental designs in ways that practically guarantee positive results. And once positive results are published, few researchers replicate the experiment exactly, instead carrying out ‘conceptual replications’ that test similar hypotheses using different methods. This practice, say critics, builds a house of cards on potentially shaky foundations.

Ed Yong, who wrote this piece, also opines on fraud, especially the suspicion that this activity has been increasing. Yong’s piece, the Simmons et al. paper, and the post below are all well worth reading.

Thanks to James Glendinning for the heads-up.

———————————————————————————————————

[Figure: false positives]

My heart soared like a hawk when I read Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn’s “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” published in Psychological Science1.

From their abstract:

[W]e show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We…demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.

Preach it, brothers! Sing it loud and sing it proud. Oh, how I wish your colleagues would take your admonitions to heart and abandon the Cult of the P-value!

Rarely have I read such a quotable paper. False positives—that is, false “confirmations” of hypotheses—are “particularly persistent”; “because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them.” False positives “can lead to ineffective policy changes.”

Many false positives are found because of the “researcher’s desire to find a statistically significant result.” Researchers “are self-serving in their interpretation of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires.” And “[a]mbiguity is rampant in empirical research.”

Our goal as scientists is not to publish as many articles as we can, but to discover and disseminate truth.

I am man enough to admit that I wept when I read those words.

Chronological Rejuvenation

The authors include a hilarious—actual—study where they demonstrate that listening to a song makes people younger. Not just feel younger, but younger chronologically. Is there nothing statistics cannot do?

They first had two groups listen to an adult song or a children’s song and then asked participants how old they felt afterwards. They also asked participants for their ages and their fathers’ ages, “allowing us to control for variation in baseline age across participants.” They got a p-value of 0.033 “proving” that listening to the children’s song made people feel younger.

They then forced the groups to listen to a Beatles song or the same children’s song (they assumed there was a difference), and again asked the ages. “We used father’s age to control for variation in baseline age across participants.”

According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to [the children’s song] (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.

Ha ha! How did they achieve such a deliciously publishable p-value for a necessarily false result? Because of the broad flexibility in classical statistics, which allows users to “data mine” for small p-values.
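
To make the mechanism concrete, here is a minimal sketch, entirely my own and not the authors’ code, of what that flexibility buys. Two groups are identical by construction, yet a simulated researcher who tries two outcome variables, each with and without an irrelevant covariate, and reports only the smallest p-value will “find” an effect far more often than the nominal 5 percent of the time.

```python
# A sketch (not the authors' code) of how "researcher degrees of freedom"
# inflate the false-positive rate. The groups differ in nothing; the
# simulated researcher tries two outcomes, each with and without a
# covariate, and reports whichever p-value is smallest.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_experiment(n=20):
    group = np.repeat([0, 1], n)          # condition labels
    dv1 = rng.normal(size=2 * n)          # outcome 1: pure noise
    dv2 = rng.normal(size=2 * n)          # outcome 2: pure noise
    cov = rng.normal(size=2 * n)          # an irrelevant covariate
    ps = []
    for dv in (dv1, dv2):
        # plain two-sample t-test
        ps.append(stats.ttest_ind(dv[group == 0], dv[group == 1]).pvalue)
        # "controlling" for the covariate via multiple regression
        X = np.column_stack([np.ones(2 * n), group, cov])
        beta, *_ = np.linalg.lstsq(X, dv, rcond=None)
        resid = dv - X @ beta
        se = np.sqrt(resid @ resid / (2 * n - 3) *
                     np.linalg.inv(X.T @ X)[1, 1])
        t = beta[1] / se
        ps.append(2 * stats.t.sf(abs(t), df=2 * n - 3))
    return min(ps)                        # report the best-looking p-value

sims = 10_000
rate = sum(one_experiment() < 0.05 for _ in range(sims)) / sims
print(f"Nominal alpha: 0.05, actual false-positive rate: {rate:.3f}")
# Typically prints something in the neighborhood of 0.10-0.15, not 0.05.
```

Run it a few times: each individual test is perfectly well behaved, yet the freedom to choose among them after the fact is what produces the inflated rate.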

Ways to Cheat

The authors list six requirements for researchers, each aimed at a major mistake that users of statistics make. They themselves exploited many of these loopholes in “proving” the results in the experiment above. (We have covered all of these before: click Start Here and look under Statistics.)

“1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.” If not, it is possible to use a stopping rule which guarantees a publishable p-value: just stop when the p-value is small! (A small simulation of this trick follows the list.)

“2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” Small samples are always suspicious. Were they the result of just one experiment? Or the fifth, discarding the first four as merely warm-ups?

“3. Authors must list all variables collected in a study.” A lovely way to cheat is to cycle through dozens and dozens of variables, only reporting the one(s) that are “significant.” If you don’t report all the variables you tried, you make it appear that you were looking for the significant effect all along.

“4. Authors must report all experimental conditions, including failed manipulations.” Self-explanatory.

“5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.” Or: there are no such things as outliers. Tossing data that does not fit preconceptions always skews the results toward a false positive.

“6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.” This is a natural mate for rule 3.
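
As promised under rule 1, here is a small simulation, again my own invention rather than anything from the paper, of the peeking stopping rule: test after every few new observations and stop the moment the p-value dips below .05. Both groups are drawn from the same distribution, yet the researcher “wins” far more often than 5 percent of the time.

```python
# A sketch of the optional-stopping abuse behind requirement 1: peek at
# the p-value after every batch of new observations and stop as soon as
# it falls below .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def peeking_experiment(start=10, step=5, max_n=100):
    a = list(rng.normal(size=start))
    b = list(rng.normal(size=start))   # same distribution as group a
    while len(a) <= max_n:
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True                # stop and declare "significance"
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))
    return False                       # never got lucky

sims = 5_000
rate = sum(peeking_experiment() for _ in range(sims)) / sims
print(f"False-positive rate with peeking: {rate:.3f}")
# Well above 0.05; somewhere around 0.15-0.2 in runs like this.
```

Which is why the authors insist the termination rule be fixed, and reported, before the first observation is collected.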

The authors also ask that peer reviewers hold researchers’ toes to the fire: “Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.”

Our trio is also suspicious of Bonferroni-type corrections, seeing these as yet another way to cheat after the fact. And it is true that most statistics textbooks say to design your experiment and analysis before collecting data. It’s just that almost nobody ever follows this rule.
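
For readers who have not met them, a Bonferroni-type correction is easy to state: with k tests at a familywise level alpha, either test each p-value against alpha/k or, equivalently, multiply each p-value by k. The numbers below are invented purely for illustration.

```python
# Bonferroni in miniature: shrink the threshold to alpha/k, or equivalently
# inflate each p-value by k. These p-values are made up for illustration.
alpha = 0.05
p_values = [0.012, 0.030, 0.047, 0.20]
k = len(p_values)
adjusted = [min(1.0, p * k) for p in p_values]   # [0.048, 0.12, 0.188, 0.8]
survives = [p < alpha / k for p in p_values]     # only the first test survives
print(adjusted, survives)
```

The complaint is not with the arithmetic but with applying it after the fact, when only the researcher knows how many tests were actually run.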

Bayesian statistics also doesn’t do it for them, because they worry that it increases the researchers’ “degrees of freedom” in picking the prior, etc. This isn’t quite right, because most common frequentist procedures have a Bayesian interpretation with “flat” prior.
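
To illustrate the flat-prior remark with the simplest possible case, and with simulated data of my own: for a normal mean with known sigma, the posterior under a flat prior is N(ȳ, σ²/n), so the 95% credible interval lands exactly on the frequentist 95% confidence interval.

```python
# A minimal check of the flat-prior remark for a normal mean with known
# sigma: the flat-prior posterior is N(ybar, sigma^2/n), so the credible
# interval and the confidence interval coincide. Data simulated for show.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma, n = 2.0, 25
y = rng.normal(loc=10.0, scale=sigma, size=n)
ybar, se = y.mean(), sigma / np.sqrt(n)

z = stats.norm.ppf(0.975)
conf_int = (ybar - z * se, ybar + z * se)                 # frequentist CI
cred_int = stats.norm.interval(0.95, loc=ybar, scale=se)  # flat-prior posterior

print(conf_int)
print(cred_int)   # numerically identical to the confidence interval
```

The two intervals coincide; whether that observation answers the worry about prior-picking is argued over in the comments below.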

Anyway, the real problem isn’t Bayes versus frequentist. It is the mania for quantification that is responsible for most mistakes. It is because researchers quest for small p-values, and, after finding them, mistake them for holy grails, that they wander into epistemological thickets.

Now how’s that for a metaphor!

——————————————————————————————–

Thanks to reader Richard Hollands for suggesting this topic and alerting me to this paper.

1. doi:10.1177/0956797611417632

19 Comments

  1. DAV

    Just when I thought the world couldn’t get any funnier, I find you had to be altered to a paper. Was the alteration necessary? If so, in what way? Squashed to paper thin? Rendered in pdf? Or …?

  2. Wits' End

    Each of the points numbered 1 through 6 was taught more than 45 years ago when I was in university.

    I have always assumed that they were still taught and it seems they are.

    So I am now wondering what forces are at work that allow so much work to be produced that can only be described as nonsense.

    The explanation cannot be that “statistics is badly taught” since the points numbered 1 through 6 mainly deal with the design of a study and involve only a modest understanding of statistics (e.g., the issue of covariation).

    ‘Publish or Perish’ might be part of the explanation. If there are many more journals now than 50 years ago ‘standards’ could change and might be more variable.

    Whatever the explanation it likely does not rest completely with the teaching of statistics.

  3. I don’t think this has anything to do with squashing you flat to make paper (though altering the analysis to make publishable papers might be closer) but I was puzzled by the following paragraph:

    “Bayesian statistics also doesn’t do it for them, because they worry that it increases the researchers’ ‘degrees of freedom’ in picking the prior, etc. This isn’t quite right, because most common frequentist procedures have a Bayesian interpretation with ‘flat’ prior.”

    What I don’t get is what “isn’t quite right” about the assertion that a Bayesian approach gives more freedom to choose an analysis which favours the desired conclusion. If the “frequentist” analysis has a Bayesian interpretation with ‘flat’ prior, then doesn’t that make it a special case which would then mean that the general Bayesian approach really does provide more freedom? (I don’t know that I’d refer to that as having more ‘degrees of freedom’ in the technical sense but that doesn’t seem to be your objection.)

  4. DAV

    A summary (more or less) of 1-6 appeared in Rick Brant’s Science Projects IIRC. Years must parallel distance as the haze increases annually. My memory isn’t as good as I remember it to be.

    For (point #4 above): sample rug sweeping jargon as in “Specimen strained during mounting” meaning “accidentally dropped onto the floor”.

    http://www.rickbrant.com/Books/SP/booksp.html

    Then there was this: http://www.rickbrant.com/Books/12/book12.html with an adventure surrounding some recent topics here albeit using a somewhat more portable device than a MRI scanner.

    (*sigh*) The world looks a lot different now than when I was ten.

  5. Most published research is done by people without training in statistics, probability, or the scientific method for that matter. They may have a canned statistics program they have been advised to use, or occasionally, a statistical consultant is brought in, after the fact, and his/her input is generally ignored anyway.

    There is a vast quantity of “scientific” research being done these days, in the main by people who cannot do it properly. And there is huge demand for preconceived findings. The result is an unending flood of agenda-driven pseudo science.

  6. George Steiner

    Mr. Briggs, your heart keeps soaring like a hawk, but you may want to consider having it try to soar like an eagle. Hawks are OK but smallish and their soaring is mediocre. Now an eagle is something else. This bird really soars and is rather majestic. Particularly the bald eagle. I had one fly past me about 20 yards away at about 10 feet from the ground. The sound of the wings beating… now that is a bird.

  7. Big Mike

    “They then forced the groups to listen to a Beatles song or the same children’s song (they assumed there was a difference)”

    Nice shot!

  8. Carmen D'Oxide

    “Let me take you down
    To Strawberry Fields.
    Nothing is real
    And nothing to get hung about.”

  9. cb

    The complete failure of the procedure of linking the pretty numbers with reality itself is the single greatest flaw in the application of statistics. Period.

    The second greatest flaw follows from “Most published research is done by people without training in statistics, probability, or the scientific method for that matter.”
    People who have no idea what the hell they are doing, but do it anyway, and damn the consequences.
    Oh, and “training” here would be, at the least, a full degree in statistics: courses do not, will not, and never have meant much. I am an engineer, for what it’s worth, but this is so damn obvious that I regularly feel like screaming when people start yakking on about stats.

  10. Hi,

    Just like the “Bayesian statistics prove God exists!” vs “Bayesian statistics prove God doesn’t exist!” stuff, we shouldn’t judge the tool by their misuses.

    If people are being dishonest in their experiments, they’d be dishonest whether frequentist, Bayesian, or any other flavor.

    I like the graph above. It is a good reminder that a) p-values are random variables, b) replication is important, and c) larger samples are most often better.

    Justin Z. Smith
    http://www.statisticool.com

  11. Ray

    It has long been my contention that people whose field of study ends in “ology” take a course in statistics and learn to apply the methods in a rote way but don’t really understand the basic theory and don’t interpret the results correctly. My favorite example of this is the VIOXX study. That was the study that launched a thousand lawsuits with people claiming VIOXX caused their heart attack.

  12. Kurt

    “Bayesian statistics also doesn’t do it for them, because they worry that it increases the researchers’ “degrees of freedom” in picking the prior, etc. This isn’t quite right, because most common frequentist procedures have a Bayesian interpretation with “flat” prior. ”

    Though this is a great post generally, I want to reiterate what Alan Cooper said and state that what is quoted is a non sequitur.

  13. K2

    The corruption in science is a hell of a lot worse for humanity than any of the IPCC’s predictions for the effects of global warming. Yet the amount of effort and attention being expended on it is essentially zero in comparison.

  14. JH

    Alan B., Thanks for the link to the original paper.

    If the “frequentist” analysis has a Bayesian interpretation with ‘flat’ prior, then doesn’t that make it a special case which would then mean that the general Bayesian approach really does provide more freedom?

    Alan C., good observation. IOW, more ways of obtaining different conclusions. Ha…
    ____
    I am surprised how the ubiquitous statistical term degrees of freedom is used!

    The conclusions in the authors’ examples should be something like “there is a significant difference in the mean response (rating for feeling younger, or age) between the Kalimba group and the Hot-Potato group for the population of interest.” (The ANCOVA employed is not appropriate for the categorical response.)

    The key point for Figure 2: VIP
    ”The example shown in Figure 2 contradicts the often-espoused yet erroneous intuition that if an effect is significant with a small sample size then it would necessarily be significant with a larger one.”

    Another key point about false-positive results in the paper: it simply says that the more parameters to be tested (students) at the same significance level, the more likely that at least one parameter (student) will be significant (cheat).

    If statistics is applied correctly, it won’t tell you whatever you want. However, people will misuse it, regardless of frequentist or Bayesian method, and cheat with data manipulation. One can only lie to/fool you with statistics if you don’t know it well!
