# How to Fool Yourself—And Others—With Statistics

*See the news box to the left. I wrote this long ago and never used it. I do not love it. But since I am so busy, I haven’t the time to write something new. Feel free to disparage.*

Remember how much you hated your college statistics course? It made little sense. It was confusing, even nonsensical. It was an endless stream of meaningless, hard-to-remember formulas.

All that is true—it *was* awful—but you were wrong to hate it. Because it has been a balm and a boon to mankind, especially to researchers in need of a paper. Publish or perish rules academia, and no other tool has been as useful in generating papers as statistics has.

Statistics is so powerful that it can create positive results in nearly any situation, including those in which it shouldn’t. For example, this week we read in the newspaper that “statistics show mineral X” is good for you, only to read next week that “statistics show” it isn’t. How can statistics be used to simultaneously prove *and* disprove the same theory? Easy.

But first note that I am talking about statistics as she is practiced by the unwary or unscrupulous. Statisticians themselves, as everybody knows, are the most conscientious and honest bunch of people on the planet.

**How to prove your theory**

Step 1: Start with a theory or hypothesis you want to be true.

Step 2: Gather data that might be related to that theory; more is better.

Step 3: Choose a probability model for that data. Remember the “bell-shaped curve”? That’s a model, one of hundreds at your disposal.

Step 4: These models have knobs called *parameters* which are tuned—via complex mathematics—so that the model fits your data.

Step 5: Now it gets tricky. Pick a test from that set of formulae you were made to memorize. This test must say how your theory relates to the model’s parameters. For example, you might declare, “If my theory is true, then this certain knob cannot be set to zero.” The test then calculates a *statistic*, which is some mathematical function of your data.

You then calculate the probability of seeing a statistic as large as you just calculated *given* that the relevant knob *is* set to zero. That is, the test says how unusual the observed statistic is given that the probability-parameter statement about your theory is true—*and given the model you picked is correct*.

You might dimly recall that the result of this calculation is called a *p-value*. Its true definition is so difficult that nobody can remember it. What people do remember is that a small one—less than 0.05—is good.
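The whole procedure through Step 5 can be sketched in a few lines of Python. The numbers below are invented for illustration, and the test is a plain z-test on the mean (normal model, with the "knob" being the true mean, hypothesized to be zero):

```python
import math

# Hypothetical measurements (invented numbers, for illustration only).
data = [0.8, 1.2, -0.3, 0.9, 1.5, 0.4, -0.1, 1.1, 0.7, 1.3]

n = len(data)
mean = sum(data) / n
# Sample standard deviation; the model says the data are normal.
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# The statistic: how many standard errors the mean sits from zero.
z = mean / (sd / math.sqrt(n))

# Two-sided p-value: chance of a statistic at least this large,
# *given* the knob (the true mean) is set to zero and the model is right.
p = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p:.5f}")
```

Everything here is conditional on the model: change the model, and the same data produce a different p.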

If that level is reached, you’re allowed to declare *statistical significance*. This is *not* the same as saying your theory is true, but nobody remembers that, either. Significance is vaguely meaningful only if both the model and the test used are true and optimal. It gives *no* indication of the truth or falsity of any theory.

Statistical significance is easy to find in nearly any set of data. Remember that we can choose our model. If the first doesn’t give joy, pick another and it might. And we can keep going until one does.

We also must pick a test. If the first doesn’t offer “significance”, you can try more until you find one that does. Better, each test can be tried for each model.
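A sketch of how the same data can pass one test and fail another. The paired differences below are invented; the point is that a sign test (binomial model) and a z-test (normal model), applied to identical data, land on opposite sides of 0.05:

```python
import math

# Invented before/after differences: eight small gains and one big loss.
diffs = [1, 1, 1, 1, 1, 1, 1, 1, -6]
n = len(diffs)

# Test 1: z-test on the mean (normal model).
mean = sum(diffs) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
p_z = math.erfc(abs(mean / (sd / math.sqrt(n))) / math.sqrt(2))

# Test 2: sign test (binomial model: under the null, + and - equally likely).
pos = sum(d > 0 for d in diffs)
tail = sum(math.comb(n, k) for k in range(min(pos, n - pos) + 1)) / 2 ** n
p_sign = min(1.0, 2 * tail)

print(f"z-test p = {p_z:.3f}, sign-test p = {p_sign:.3f}")
```

Neither test is wrong; they simply encode different models, which is exactly what makes the shopping possible.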

If that sounds like too much work, there’s a trick. Due to a quirk in statistical theory, for any model and any test, statistical “significance” is *guaranteed* as long as you collect enough data. Once the sample size reaches a critical level, small p-values practically rain from the data.
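The sample-size trick is pure arithmetic, no simulation needed. With an invented, trivially small effect (mean 0.01, standard deviation 1), the p-value collapses as n grows:

```python
import math

def p_value(effect, sd, n):
    """Two-sided p-value for an observed mean of `effect` under a normal model."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# An effect this small is of no practical interest to anybody,
# yet with enough data it becomes as "significant" as you please.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {p_value(0.01, 1.0, n):.4f}")
```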

But if you’re impatient, you can try *subgroup analysis*. This is where you pick your way through the data, keeping only what’s pretty, trying various tests and models until such a time as you find a small p-value.
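A sketch of subgroup analysis in action. The data are pure noise by construction, sliced into 300 hypothetical subgroups; by chance alone, a handful will reach “significance”:

```python
import math
import random

random.seed(0)  # fixed seed so the "discovery" is reproducible

def p_value(xs):
    """Two-sided z-test p-value for the hypothesis that the mean is zero."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return math.erfc(abs(m / (sd / math.sqrt(n))) / math.sqrt(2))

# Pure noise: no effect exists anywhere, by construction.
# Slice it into 300 "subgroups" of 30 and test each one.
pvals = [p_value([random.gauss(0, 1) for _ in range(30)]) for _ in range(300)]

hits = [p for p in pvals if p < 0.05]
print(f"{len(hits)} of 300 noise subgroups reached 'significance'")
```

Publish the subgroups that “worked”, stay quiet about the other tests, and the paper writes itself.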

The lesson is that it takes a dull researcher not to be able to find statistical “significance” somewhere in his data.

**Boston Scientific**

About two years ago the *Wall Street Journal* (registration required) investigated the statistical practices of Boston Scientific, who had just introduced a new stent called the Taxus Liberte.

Boston Scientific did the proper study to show the stent worked, but analyzed their data using an unfamiliar test, which gave them a p-value of 0.049: statistically significant.

The *WSJ* re-examined the data using different tests (but the same model). Their tests gave p-values from 0.051 to about 0.054, which are, by custom, *not* statistically significant.

Real money is involved, because if “significance” isn’t reached, Boston Scientific can’t sell their stents. But the *WSJ* is quibbling, because there is no real-life difference between 0.049 and 0.051. P-values do not answer the only question of interest: does the stent work?
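This kind of straddle is easy to reproduce. The 2×2 counts below are invented (they are not Boston Scientific’s data): a chi-square test on the table comes in under 0.05 while Fisher’s exact test on the identical table comes in over it:

```python
from math import comb, erfc, sqrt

# Invented 2x2 trial counts (NOT the actual stent data):
#            success  failure
# treated:      8        4
# control:      3        9
a, b, c, d = 8, 4, 3, 9
n = a + b + c + d

# Test 1: chi-square with df = 1; its p-value is a normal tail, since chi2_1 = Z^2.
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_chi2 = erfc(sqrt(chi2 / 2))

# Test 2: Fisher's exact test, two-sided (sum every table as likely or less
# likely than the observed one, holding the margins fixed).
def hyper(k):
    return comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)

p_obs = hyper(a)
p_fisher = sum(hyper(k)
               for k in range(max(0, (a + c) - (c + d)), min(a + b, a + c) + 1)
               if hyper(k) <= p_obs + 1e-12)

print(f"chi-square p = {p_chi2:.3f}, Fisher p = {p_fisher:.3f}")
```

Same data, same question, and whether you may sell your product depends on which formula you happened to reach for.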

**The moral of the story**

No theory should be believed because a statistical model reached “significance” on a set of already-observed data. What makes a theory useful is that it can predict accurately *never-before-observed* data.

Statistics can be used for these predictions, but it almost never is.
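A toy illustration of the moral, with invented numbers: a model flexible enough to fit the observed data perfectly (here, a degree-4 polynomial through five noisy points near 1) can still predict new data absurdly:

```python
# Invented observations: five points that are just noise around 1.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 0.9, 1.2, 0.8, 1.1]

def fit(x):
    """Degree-4 Lagrange polynomial through all five points: zero in-sample error."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for xj in xs[:i] + xs[i + 1:]:
            term *= (x - xj) / (xi - xj)
        total += term
    return total

# In-sample, the "theory" fits the already-observed data perfectly...
in_sample_err = max(abs(fit(x) - y) for x, y in zip(xs, ys))
# ...but asked to predict at x = 6, where the truth is presumably still near 1:
prediction = fit(6)
print(f"max in-sample error = {in_sample_err:.1e}, prediction at x=6 = {prediction:.1f}")
```

Any stats package will happily report an excellent fit for the first half of this exercise; only the second half tells you whether the theory is any good.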

I think predictions are avoided on the principle that where ignorance is bliss, ’tis folly to know that your theory can’t be published.

Incidentally, we statisticians have heard every version of “liars figure”, “damned lies”, etc., so you’ll pardon me for not chuckling when in response you whip out your Disraeli.

**Update** If you thought this post was bad, you might try watching this video (I can think of at least two good reasons to): A Strange Tale About Probability.

**Comments**

You’re right that there is little difference between 0.049 and 0.051 p-factors. They’re just a matter of judgment anyway, so any value should suffice. In fact, I once came across an NIH study that used 0.2. Must be that Bayesian stuff — anything goes. You forgot to mention this as a last-straw technique.

I conducted many tests as a youngster and determined that jellied bread falls jelly side down a significant number of times (p-factor 0.5). In subsequent tests, this occurred at the same rate as predicted by my model so I know it’s not just an illusion. Pretty darn good, huh?

hmmm … one chance in a bouillon and a model with two modes? I presume ‘Latch’ is Paul’s stage name although he did look quite at home.

So now we see why you’ve been so “busy”, eh? All that surfin’ and formulae-ala-Twyla selection must have taken an enormous amount of time away from real work. You have our sympathies.

Neural networks and the closely related PCA method can be used to fit almost any dataset. Getting them to properly predict things is a “trick” much more difficult than the ones used by the researchers at CRU. I use MLP networks as classifiers and have found that a full understanding of the statistics, and of what is in the datasets, is necessary to get good results. I’ve seen PhDs run away from the MLP because it was too tough to tame. I say not so, if you are careful how you use it. I find it interesting that most of the textbooks I’ve read don’t go into some of the “tricks” I’m using, as if they don’t care about real-world accuracy and are more interested in theory than in practical aspects.