Clever Statistical Method To Discover Fraud (Or Mistakes)

This is in the I-wish-I-had-thought-of-it category: a simple tool that suggests where fraud or major mistakes in statistical research might be hiding.

First, a description of the tool; second, a description of the suspicious studies which called for its use. The tool is SPRITE, Sample Parameter Reconstruction via Iterative TEchniques, thought up by James Heathers.

Research reports statistical results in all sorts of forms, but a common one is the sample size together with the sample's mean and standard deviation (SD). Since these are simple calculations, with a given n, given mean, and given SD, only certain sets of data can support them. For instance, with n = 2 and a mean of 3, the sample (1000, 1) is impossible regardless of what the SD is, since its mean is 500.5. Specifying the SD further restricts the possible samples, as does other information, such as the requirement that each number must be positive.
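To make that arithmetic concrete, here is a minimal sketch (my own, not Heathers's code) of a checker that asks whether a candidate sample could have produced the reported n, mean, and SD, assuming the reported figures were rounded to one decimal place:

```python
def matches_report(sample, n, mean, sd, decimals=1):
    """Could this sample have produced the reported n, mean, and SD?
    Assumes the reported mean and SD were rounded to `decimals` places."""
    if len(sample) != n:
        return False
    m = sum(sample) / n                                       # sample mean
    s = (sum((x - m) ** 2 for x in sample) / (n - 1)) ** 0.5  # sample SD
    return round(m, decimals) == mean and round(s, decimals) == sd

# The sample (1000, 1) cannot yield a mean of 3, whatever the SD;
# the sample (1, 5) does give mean 3.0 and SD 2.8 (sqrt(8), rounded).
```

Run in reverse over every candidate sample, a check like this is the seed of SPRITE.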

SPRITE takes the reported mean, SD, and other specifications and runs through the possible samples. Think of it as a form of reverse engineering. Once the possible samples are arrayed in front of you, interesting things might emerge.
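Heathers's published SPRITE is more elaborate, but the core idea can be sketched in a few lines: fix the sum so the mean comes out right, then nudge pairs of values up and down, keeping the sum constant, until the SD matches too. The function below is an illustrative reimplementation under those assumptions; the name, defaults, and tolerance are mine, not SPRITE's.

```python
import random

def sprite_sample(n, target_mean, target_sd, lo=0, hi=None,
                  tol=0.05, max_iter=100000, seed=0):
    """Search for one integer sample of size n whose mean and sample SD
    match the reported values, with every value in [lo, hi].
    Returns a sorted sample, or None if the search gives up."""
    rng = random.Random(seed)
    total = round(target_mean * n)      # fixing the sum fixes the mean
    if hi is None:
        hi = total                      # loose default upper bound
    xs = [total // n] * n               # start near-uniform with the right sum
    for i in range(total - n * (total // n)):
        xs[i] += 1

    def sd(v):
        m = sum(v) / len(v)
        return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

    for _ in range(max_iter):
        if abs(sd(xs) - target_sd) <= tol:
            return sorted(xs)
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or xs[i] + 1 > hi or xs[j] - 1 < lo:
            continue
        before = sd(xs)
        xs[i] += 1                      # move one unit from xs[j] to xs[i];
        xs[j] -= 1                      # the sum (hence the mean) is unchanged
        # keep the move only if it pushed the SD in the right direction
        if (sd(xs) > before) != (before < target_sd):
            xs[i] -= 1
            xs[j] += 1
    return None
```

Repeated runs with different seeds turn up many feasible samples; collecting a statistic across runs, such as each sample's largest value, is the kind of reverse engineering described above.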

Now for the study which led Heathers to SPRITE. It is one of a set of curious, which is to say suspicious, works coming out of the Food and Brand Lab at Cornell University. That, at least, is the opinion of, inter alia, New York magazine and Slate.

The paper here in question is “Attractive Names Sustain Increased Vegetable Intake in Schools” by Wansink et al. (2012). See Retraction Watch for words with Wansink. Heathers says the paper “presents a simple thesis: change the name of ‘carrots’ and ‘beans’ and ‘broccoli’ to something exciting that the kids are doing (I don’t know, ‘Buzz Lightyear chard’ or ‘Pokemon kale’ etc.) and children will eat more of it.”

In the control group (at some elementary school), which called carrots carrots, the number of carrots served by the lunch lady had a reported mean of 19.4, an SD of 19.9, and n = 45 (kids). The number of those served carrots actually eaten was small, at least compared to the group in which carrots were called “X-Ray Vision Carrots”. Wee p-values confirmed the “findings.”

Wait. Did Wansink et al. say kids in the control group were served, on average, almost 20 carrots? They did. X-Ray carrot kids were served an average of 17 (but reportedly ate more).

Well, if the mean was 19.4 and the SD was 19.9, what are the possibilities for the maximum number of carrots served, given a sample size of n = 45 and the constraint that “you can’t have less than zero carrots (there are no negative carrots, this isn’t Star Trek)”?

Heathers’s SPRITE showed the minimum max-carrots was 53 and the maximum max-carrots 73, with more likely values in the neighborhood of 60-some carrots.
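Heathers's 53 comes from SPRITE's random search. A cruder, purely arithmetic check (my own sketch, not part of SPRITE) gives a hard floor: for a candidate largest value `cap`, the most spread-out arrangement puts every kid at 0 or at `cap`, so if even that extreme cannot reach an SD of 19.9, no sample with maximum `cap` exists at all. For the reported mean 19.4, SD 19.9, and n = 45, this floor works out to 40 carrots; SPRITE's higher figure of 53 presumably reflects samples that look like real serving data rather than these all-or-nothing extremes. Either way, some kid got served an absurd pile.

```python
def max_sd_given_cap(n, total, cap):
    """Largest sample SD for n integers in [0, cap] summing to `total`.
    The sum of squares is convex, so its maximum over this constraint set
    sits at a 'vertex': as many values as possible at cap, at most one
    value in between, and the rest at zero."""
    k = total // cap                    # values pinned at the cap
    rem = total - k * cap               # one leftover value
    ssq = k * cap ** 2 + rem ** 2
    mean = total / n
    return ((ssq - n * mean ** 2) / (n - 1)) ** 0.5

def min_possible_max(n, mean, sd):
    """Smallest maximum value any sample with this n, mean, and SD can have
    (ignoring rounding windows on the reported figures, for simplicity)."""
    total = round(mean * n)
    for cap in range(1, total + 1):
        if cap * n < total:             # can't even reach the required sum
            continue
        if max_sd_given_cap(n, total, cap) >= sd:
            return cap
    return None

# For the carrot numbers: min_possible_max(45, 19.4, 19.9) returns 40.
```

A cap of 39 tops out below an SD of 19.9 no matter how the 873 total carrots are arranged, while a cap of 40 just clears it.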

Given what we know of lunch ladies, serving trays, and size of carrots, is it plausible to suggest some kid was really served 60 carrots? Only if, according to Heathers, “at least one of [the students] is a Clydesdale horse.”

What makes the story cute is that Heathers assembled 60 baby carrots to see what the pile looked like. (I suppose Wansink could have meant slices of carrots and not carrots, but there’s no indication of this that I could discover.)

[Photo: Heathers’s pile of 60 baby carrots. Caption: “Eat this.”]

This was not the only difficulty of the Wansink study; Heathers details more. And it’s not the only difficulty with the research group. New York magazine called their work “really shoddy”.

…Wansink published a strange blog post last month, which led to the subsequent discovery of 150 errors in just four of his lab’s papers, strong signs of major problems in the lab’s other research, and a spate of questions about the quality of the work that goes on there. Wansink, meanwhile, has refused to share data that could help clear the whole thing up.

Here come the wee p-values.

Wansink was acknowledging, with surprising openness, taking a “failed study which had null results,” slicing and dicing the data until something interesting came out, and then publishing not one but four papers based on said slicing and dicing…One of the truisms of statistics, after all, is that if you analyze enough data from enough angles, you will discover relationships that are “significant,” in the statistical sense of the term, but that don’t actually mean anything.

God bless the magazine for its scare quotes around “significant”. If you can’t find a wee p-value in your data, you’re not trying hard enough. And God bless them for these final true words:

“Many of psychology’s most exciting ‘This One Simple Trick Can X’—style findings have turned out to be little more than statistical noise shaped sloppily into something that, in the right light and if you don’t look too hard, looks meaningful.”

8 Comments

  1. Ye Olde Statistician

    Please tell us that at least one of these stories was titled “What’s up, Doc?”

  2. Ray

    Isn’t it a truism that if you torture the data enough it will tell what you want?

  3. Gary

    From the link: What SPRITE does, at the most basic level, is simple: it automates the above [the experimental data], and it’s very fast. The median time it takes to generate a sample with the properties described above (M=3.00, SD=1.56, n=20) is 7.37ms.

    In other words, for any given mean, we shuffle the available values (very quickly) until we generate a sample with the parameters we’re interested in. Then we do it again, and again, and again. We find hundreds or thousands of plausible solutions. We can model them further if we want.

    Oh, great. Now you can generate fake data after you’ve picked your parameters. And rapidly too. All that’s left is to fake the lab notebooks. No need for messy experimenting and ex post facto fudging…

  4. I wrote about this “method” back in 2011, because Diederik Stapel made the same “errors” in several of his papers. That was in Dutch and for a non-technical audience (https://www.foodlog.nl/artikel/sprookjes-hoe-je-van-tofu-vergeetachtig-wordt-en-van-vleeseten-asocialer/ and http://www.mrooijer.nl/blog/skepsis/vonk-stapel/), just a week before his large-scale fraud was discovered. Stapel published means (and SDs) that could not exist, given the size of his samples.

    A year later I wrote a small note, “Not all numbers are Mean”, here: http://www.mrooijer.nl/stats/2012/not-all-numbers/.

  5. Ken

    All that talk of carrots — and the real conundrum is: “Why is a carrot more orange than an orange?”

    From Ted Nugent & the Amboy Dukes’ album, ‘Journey to the Center of the Mind’: http://www.youtube.com/watch?v=2WwjCl7ZAhw

  6. Ye Olde Statistician

    Ray, If you torture the data enough, it will p itself.
