# The Ease of Cheating With Statistics

*Thanks to readers Ari Schwartz and Tom Pollard for suggesting this article.*

Take any two sets of numbers, where the only restriction is that a reasonable chunk of the numbers inside each set have to differ from one another. That is, we don’t want all the numbers inside a set to be equal to one another. We also want the sets to be, more or less, different, though it’s fine to have some matches. Make sure there are at least a dozen or two numbers in each set: for ease, each set should be the same size.

You could collect numbers like this in under two minutes. Just note the calories in a “Serving size” for a dozen different packages of food in your cupboard. That’s the first set. For the second, I don’t know, write down the total page counts from a dozen books (don’t count these! just look at the last page number and write that down).

All set? Type these into any spreadsheet in two columns. Label the first column “Outcome” and label the second column “Theory.” It doesn’t matter which is which. If you’re too lazy to go to the cupboard, just type a jumble of numbers by placing your fingers over the number keys and closing your eyes: however, this will make trouble for you later.

The next step is trickier, but painless for anybody who has had at least one course in “Applied Statistics.” You have to migrate your data from that spreadsheet so that it’s inside some statistical software. Any software will do.

OK so far? You now have to *model* “Outcome” as a function of “Theory.” Try linear regression first. What you’re after is a small p-value (less than the publishable 0.05) for the coefficient on “Theory.” Don’t worry if this doesn’t make sense to you, or if you don’t understand regression. All software is set up to flag small p-values.
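For readers who would like to see that step spelled out, here is a minimal sketch in R. The numbers are simulated stand-ins, not anything read off real packages or books, and the names `outcome`, `theory`, `coefs`, and `p_value` are only illustrative; `lm()` and `summary()` are standard R.

```r
# Hypothetical stand-in data: two unrelated columns of a dozen numbers each.
set.seed(1)                               # arbitrary seed, only for repeatability
outcome <- round(runif(12, 50, 400))      # pretend calories per serving
theory  <- round(runif(12, 100, 900))     # pretend page counts

# Model "Outcome" as a linear function of "Theory".
fit <- lm(outcome ~ theory)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit)$coefficients
p_value <- coefs["theory", "Pr(>|t|)"]    # the number the software flags

cat("p-value for Theory:", format(p_value, digits = 3), "\n")
cat("publishable (p < 0.05)?", p_value < 0.05, "\n")
```

The sign of `coefs["theory", "Estimate"]` is what would decide between "Lower" and "Higher" in a title.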

If you find one—a small p-value, that is—then begin to write your scientific paper. It will be titled “Theory is associated with Outcome.” But you have to substitute “Theory” and “Outcome” with suitable scientific-sounding names based on the numbers you observed. The advantage of going to the cupboard instead of just typing numbers is now obvious.

For our example, “Outcome” is easy: “Calorie content”, but “Theory” is harder. How about “Literary attention span”? Longer books, after all, require a longer attention span.

Thus, if you find a publishable p-value, your title will read “Literary attention span is associated with diet”. If you know more about regression and can read the coefficient on “Theory”, then you might be cleverer and entitle your piece, “Lower literary attention spans associated with high caloric diets.” (It might be “Higher” attention spans if the regression coefficient is positive.)

That sounds plausible, does it not? It’s suitably scolding, too, just as we like our medical papers to be. We don’t want to hear about how gene X’s activity is modified in the presence of protein Y, we want admonishment! And we can deliver it with my method.

If you find a small p-value, all you have to do is to think up a Just-So story based on the numbers you have collected, and academic success is guaranteed. After your article is published, write a grant to explore the “issue” more deeply. For example, we haven’t even begun to look for racial disparities (the latest fad) in literary and body heft. You’re on your way!

But that only works if you find a small p-value. What if you don’t? *Do not despair!* Just because you didn’t find one with regression does not mean you can’t find one in another way. The beauty of classical statistics is that it was designed to produce success. You can find a small, publishable p-value in any set of data using ingenuity and elbow grease.

For a start, who said we *had* to use linear regression? Try a non-parametric test like the Mann-Whitney or Wilcoxon, or some other permutation test. Try non-linear regression like a neural net. Try MARS or some other kind of smoothing. There are dozens of tests available.
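To get a rough feel for how much shopping around helps, here is a small simulation sketch in R. Everything in it is an assumption made for illustration: the two columns are unrelated uniform random numbers, only four analyses are tried (Pearson, Spearman, Kendall, and a regression on the log of the outcome), and the names `all_p_values`, `p_matrix`, `rate_single`, and `rate_any` are made up. In simple regression the p-value on the slope is the same as the Pearson correlation test's, so `pearson` stands in for the plain linear regression.

```r
set.seed(42)     # arbitrary seed, only for repeatability
n_sims <- 500    # number of simulated "studies" (assumption)
n <- 24          # numbers per column (assumption)

# Run several analyses on the same two columns and return every p-value.
all_p_values <- function(outcome, theory) {
  c(
    pearson  = cor.test(outcome, theory, method = "pearson")$p.value,
    spearman = cor.test(outcome, theory, method = "spearman", exact = FALSE)$p.value,
    kendall  = cor.test(outcome, theory, method = "kendall", exact = FALSE)$p.value,
    log_lm   = summary(lm(log(outcome) ~ theory))$coefficients["theory", "Pr(>|t|)"]
  )
}

# Each column of p_matrix holds the four p-values from one simulated study.
p_matrix <- replicate(n_sims, {
  outcome <- runif(n, 50, 400)    # unrelated by construction
  theory  <- runif(n, 100, 900)
  all_p_values(outcome, theory)
})

rate_single <- mean(p_matrix["pearson", ] < 0.05)       # stick with one test
rate_any    <- mean(apply(p_matrix, 2, min) < 0.05)     # keep the best of four

cat("share 'significant' with one test:    ", rate_single, "\n")
cat("share 'significant' with best of four:", rate_any, "\n")
```

Because the smallest of the four p-values can never be larger than the Pearson one, `rate_any` can only match or exceed `rate_single`. Each additional method tried can only push it higher.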

If none of those work, then try dichotomizing your data. Start with “Theory”: call all the page counts larger than some number “large”, and all those smaller “small.” Then go back and re-try all the tests you tried before. If that still doesn’t give satisfaction, un-dichotomize “Theory” and dichotomize “Outcome” in the same way. Now, a whole new world of classification methods awaits! There’s logistic regression, quadratic discrimination, and on and on and on… And I haven’t even told you about adding more numbers or adding more columns! Those tricks guarantee small p-values.
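The dichotomizing trick can be sketched the same way. The R snippet below is illustrative only: the data are simulated and unrelated, the choice to keep at least five numbers in each group is arbitrary, and `cutoffs`, `p_by_cutoff`, and `best_p` are hypothetical names. It tries every allowed cutoff for “large” versus “small” and keeps the most flattering p-value from a two-sample t-test.

```r
set.seed(7)                       # arbitrary seed, only for repeatability
outcome <- runif(24, 50, 400)     # pretend calories, unrelated to theory
theory  <- runif(24, 100, 900)    # pretend page counts

# Candidate cutoffs: the 5th through 19th smallest page counts, so that
# both the "small" and the "large" group keep at least five numbers.
cutoffs <- sort(theory)[5:19]

# For each cutoff, split Theory into two groups and t-test Outcome between them.
p_by_cutoff <- sapply(cutoffs, function(cutoff) {
  size <- ifelse(theory > cutoff, "large", "small")
  t.test(outcome ~ size)$p.value
})

best_p <- min(p_by_cutoff)
cat("cutoffs tried:", length(cutoffs), "\n")
cat("smallest p-value found:", format(best_p, digits = 3), "\n")
```

Fifteen looks at the same data give fifteen chances at a small p-value, and only the winner needs to appear in the paper.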

In short, if you do not find a publishable p-value with your set of data, then you just aren’t trying hard enough.
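A back-of-the-envelope version of that claim, under the simplifying assumption (mine, and not quite true when the tests are run on the same data) that the $k$ tests are independent and that there is nothing real to find: each test has a chance $\alpha$ of a small p-value, so

$$
P(\text{at least one } p < \alpha \text{ in } k \text{ tests}) = 1 - (1 - \alpha)^k .
$$

With $\alpha = 0.05$ and $k = 20$ this is $1 - 0.95^{20} \approx 0.64$. Tests run on the same numbers are correlated, so the real figure will differ, but the direction is the same: the chance only grows as more tests are tried.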

Don’t believe just me. Here’s an article over at Ars Technica called “We’re so good at medical studies that most of them are wrong” that says the same thing. A statistician named Moolgavkar said “that two models can be fed an identical dataset, and still produce a different answer.”

The article says, “Moolgavkar also made a forceful argument that journal editors and reviewers needed to hold studies to a minimal standard of biological plausibility.” That’s a good start, but if we’re clever in our wording, we could convince an editor that a correlation between book length and calories is biologically plausible.

The real solution? As always, *prediction and replication*. About which, we can talk another time.

Cereal bowl empiricism strikes again!

Cool. Does this apply even when the researchers are really, really sincere, and have our best interests at heart?

    > o = rpois(1000,50)
    > t = rpois(1000,20)
    > summary(lm(o ~ t))

    Call:
    lm(formula = o ~ t)

    Residuals:
         Min       1Q   Median       3Q      Max
    -17.9077  -4.8109  -0.6173   4.2980  24.9955

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 48.84288    1.04754   46.63   <2e-16 ***
    t            0.04840    0.05096    0.95    0.342
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7.044 on 998 degrees of freedom
    Multiple R-squared:  0.000903,  Adjusted R-squared:  -9.814e-05
    F-statistic: 0.902 on 1 and 998 DF,  p-value: 0.3425

No good.

    > summary(lm(o ~ t-1))

    Call:
    lm(formula = o ~ t - 1)

    Residuals:
        Min      1Q  Median      3Q     Max
    -45.959  -5.775   2.595  10.474  37.817

    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    t  2.37024    0.01931   122.7   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 12.55 on 999 degrees of freedom
    Multiple R-squared:  0.9378,  Adjusted R-squared:  0.9378
    F-statistic: 1.507e+04 on 1 and 999 DF,  p-value: < 2.2e-16

Bingo!

Note: no statisticians were harmed in the making of this nonsense.

Rich,

Brilliant! Now you need to write the paper. A good, believable story is just as important as the wee p-value.

Rich,

Nice job. You have just got to be a dendrochronologist to have come up with such a significant finding so quickly. Come to think about it, perhaps not. You published your code.

Is this a function of telecommunication between the bookshelf and the cereal bowl?

The referenced article is interesting. A commenter to the article pointed to a talk by Dr. Young:

http://www.americanscientist.org/science/pub/everything-is-dangerous-a-controversy

In his talk, Dr. Young picks on epidemiology, but you could replace “epidemiology” with “global warming” and completely devastate the whole AGW scam.

Highly recommended to anyone interested at all in statistics.

Person of Choler, All,

I wept with delight listening to Young’s lecture. It should be mandatory listening for all regulars. Young and I differ, trivially, on “randomness” and a few other things, but on his main point we are brothers. I suggest another title for his talk: Epidemiologists Run Amok.

(I worked with one of Young’s co-authors, Heejung Bang, at Cornell.)

Perhaps all of the readers here have read Darrell Huff’s *How to Lie with Statistics*. If not, the book, published in 1954, would still be a good addition to your Kindle in 2010.

I have two copies of Huff’s book. Is that a statistic? Indeed, is that a lie?

Anyway, it correlates with my wife’s having been a social scientist.

Huff is required reading for all of my direct (and in some cases, indirect) reports.

Wow, I had no idea it was this easy. For my dissertation, I did multiple experiments and compared my results with results from other algorithms and even then never made any strong statements as in my field those can get you ripped apart.

Anyway, thanks for that concise description. I will continue to look at the results of any statistical analysis carefully.

If you torture the statistics enough, they will confess.

My favorite story is of a real estate broker who wanted to build a model of home prices in his neighborhood. I don’t remember his inputs (sq footage, lot size, etc.). He collected his data, ran a multivariate linear regression, and rationalised an excuse to exclude an observation. He was left with 4 comps, 3 variables and a 100% R squared. He was thrilled.

Is part of the problem that it is just too easy to collect data? Is 95% confidence out of date? If 99% became the new standard of significance, you would have to squeeze the data much harder. However, if you had a valid correlation, it shouldn’t be hard to acquire sufficient data. Or, does the rot go deeper?

lies, damn lies, and statistics. same as it ever was.

(twain or disraeli – depending).

but you already knew this.

If p = 0.09, but my theory data involves tobacco, may I still publish?

I listened to Stan Young’s lecture. It was clear and entertaining – much better than just trying to read his annotated slide deck ( http://niss.org/sites/default/files/Young_Safety_June_2008.pdf ). The Maverick Solitaire example certainly makes the point about multiple models.

My first thought was how could we get him to look at dendrochronology. I fear, however, he would have a heart attack on the spot.

Matt, do you think you can ask to see if he has a viewpoint?

What struck me listening to the first part of the lecture was that Young seemed to have successfully alerted the editors at the Royal Society as to the scope of the issues of authors not archiving their data and methods.

I followed up with Dr. Stuart Taylor, the very senior editor mentioned by Young at the Royal Society, to see whether Young’s efforts to get them to look at his article refuting the earlier cereal ==> sex of child article actually triggered or was connected in some way to the release of Briffa’s Yamal data — Briffa’s article was published in the same journal in roughly the same time period.

We shall see.

The issues discussed in the referenced link have been around for a long time. The link also indirectly presents one of the reasons why articles published in some medical journals are deemed to be of little statistical merit and add little value to an academic statistician’s credentials at a research university. Oh… I better stop giving away some of the deep, dark secrets of academe.

Rich, how come I didn’t get the same output! Ha… this reminds me of the joke about a student who checks his answers to true/false questions by tossing a coin.

Matt,

Funny you mention neural-nets. Give me enough nodes and I can fit one to any data set with almost perfect accuracy. Just like fitting a polynomial, given enough terms. And their predictive power is not any better than a polynomial fit (in fact worse when there are gaps in the data set). I would never consider using either method to prove any point.

While I use neural-nets for classification problems, I also apply some simple statistical tests to the data to keep the network from over-fitting the data. Funny why I don’t seem to see much discussion of this in the popular textbooks, but then sometimes I think the professors who write them may not always have practical experience with real world problems.