October 26, 2008

The book is *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives* by Deirdre Nansen McCloskey and Steve Ziliak.

From the description at Amazon:

*The Cult of Statistical Significance* shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.

This is part of a theme I’ve long been pushing. McCloskey and Ziliak are shocked, perplexed, and bewildered that classical statistics and **p-values** are still being used.

I’m not so shocked. They want people to abandon p-values and start using *effect sizes*. A fine first step, but one that doesn’t solve the whole problem.

I say we should drop p-values like Obama dropped Rev. Wright, eschew effect sizes like Joe Biden did reality, and return to observables. Let me, as they say, illustrate with a (condensed) example from my book.

Suppose there are two advertising campaigns A and B for widget sales. Since we don’t know how many sales will happen under A or B, we quantify our uncertainty in this number using a probability distribution. We’ll use a normal, since everybody else does, but the example works for any probability distribution.

Now, a normal distribution requires two unobservable numbers, called *parameters*, to be specified so that you can use it. The names of these two parameters are μ and σ. Both ad campaigns need their own, so we have μ_{A} and σ_{A}, and μ_{B} and σ_{B}. Current practice more or less ignores σ_{A} and σ_{B}, so we will too.

Here is what “statistical significance” is all about.

Actual sales data are collected under the two campaigns A and B. A *statistic* is calculated: call it T. It is a function of the differences in the observed sales under the two campaigns. Never mind how it’s calculated. T is not unique, and for any problem dozens are available. With T in hand, the classical statistician makes this mathematical statement:

μ_{A}=μ_{B}

and then the infamous p-value is calculated, which is

Probability(Another T > Our T given that μ_{A}=μ_{B})

where the “Another T” is the statistic we would get if we were to repeat the entire experiment again. Do we repeat it again? No, so we are already in deep waters. But never mind.

If the p-value is less than the magic number of 0.05, then the results are said to be *statistically significant*.
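For the curious, here is a rough sketch of that calculation. Since T is not unique, I have to pick one: this version assumes T is the difference in average sales divided by its standard error, and it uses a large-sample normal approximation for the distribution of “Another T”. The function name and the sales figures are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sided_p_value(sales_a, sales_b):
    """Probability(Another T > Our T given mu_A = mu_B), approximately.

    Assumption: T is the difference in sample means divided by its
    standard error, and "Another T" is treated as standard normal when
    mu_A = mu_B (a large-sample approximation, one choice among many).
    """
    n_a, n_b = len(sales_a), len(sales_b)
    std_err = sqrt(stdev(sales_a) ** 2 / n_a + stdev(sales_b) ** 2 / n_b)
    our_t = (mean(sales_a) - mean(sales_b)) / std_err
    return 1.0 - NormalDist().cdf(our_t)


# Hypothetical daily sales under each campaign
campaign_a = [20, 21, 22, 23]
campaign_b = [10, 11, 12, 13]
p = one_sided_p_value(campaign_a, campaign_b)
print("statistically significant" if p < 0.05 else "not significant")
```

Notice that nothing in the function mentions future sales. It only asks how surprising T would be if the two unobservable means were equal.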

Quick readers will have spotted the major difficulty. What does equating two unobservable parameters in order to calculate some weird probability have to do with whether the campaigns are different from one another?

The words are *not much*, which is why McCloskey and Ziliak call the dependence on p-values a cult.

They recommend, in its place, estimating the effect size, which is this:

μ_{A} – μ_{B}.

Eh. It’s part way there, but it’s still a statement about unobservable parameters (and it still ignores the other unobservable parameters σ_{A} and σ_{B}).

What people really want to know is this:

Probability(Sales A > Sales B given old data).

Or they’d like to estimate the actual sales under A or B. There are newer methods that calculate these probabilities of interest directly. However, you won’t learn these methods in any but the most esoteric statistics class.
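To give a flavor of what such a calculation can look like, here is a minimal sketch. It is one standard approach and not necessarily the one in my book: it assumes the normal model with the usual noninformative prior, under which a future observation follows a Student-t distribution with n - 1 degrees of freedom, centered at the sample mean, with scale s·sqrt(1 + 1/n). The probability is then estimated by simulation. The function name, the number of draws, and the seed are all illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev


def prob_a_beats_b(sales_a, sales_b, draws=20000, seed=1):
    """Monte Carlo estimate of Probability(Sales A > Sales B given old data).

    Assumption: normal model, noninformative prior, campaigns independent.
    A future sale then follows a Student-t with n - 1 degrees of freedom,
    location = sample mean, scale = s * sqrt(1 + 1/n).
    """
    rng = random.Random(seed)

    def make_sampler(data):
        n = len(data)
        df = n - 1
        loc = mean(data)
        scale = stdev(data) * sqrt(1.0 + 1.0 / n)

        def draw():
            # Student-t variate: standard normal over sqrt(chi-square / df);
            # a chi-square with df degrees of freedom is Gamma(df / 2, scale 2)
            chi_square = rng.gammavariate(df / 2.0, 2.0)
            return loc + scale * rng.gauss(0.0, 1.0) / sqrt(chi_square / df)

        return draw

    draw_a, draw_b = make_sampler(sales_a), make_sampler(sales_b)
    wins = sum(draw_a() > draw_b() for _ in range(draws))
    return wins / draws


# Hypothetical old data for each campaign
print(prob_a_beats_b([20, 21, 22, 23], [10, 11, 12, 13]))
```

The answer is a statement about sales you could actually go out and observe, and both σ_{A} and σ_{B} are used rather than ignored.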

And *that* is what should change.

Because, I am here to tell you, you can have a p-value as small as you like, you can have an effect size as big as you like, but it can still be the case that

Probability(Sales A > Sales B given old data) ~ 50%!

which is the same as just guessing. Yes, the actual, observable numbers, the real-life stuff, the physical, measurable, tangible decisionable reality can be no different at all. At least, we might not be able to tell they are any different.
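If you doubt it, here is a small simulation sketch. Every number in it is hypothetical: campaign A truly sells 1 more widget on average than B, sales vary with a standard deviation of 20, and we observe an absurdly large 100,000 sales periods under each campaign. I read “Sales A > Sales B” as one future period under A against one under B, and because the sample is so large I simply plug the estimates into the normal model rather than carry the parameter uncertainty along.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def summarize(data):
    """Return (mean, variance), using the n - 1 variance denominator."""
    m = fmean(data)
    var = sum((x - m) ** 2 for x in data) / (len(data) - 1)
    return m, var


rng = random.Random(2008)
n = 100_000  # hypothetical, deliberately huge sample per campaign

# Hypothetical truth: A beats B by 1 sale on average, spread of 20
sales_a = [rng.gauss(101.0, 20.0) for _ in range(n)]
sales_b = [rng.gauss(100.0, 20.0) for _ in range(n)]

mean_a, var_a = summarize(sales_a)
mean_b, var_b = summarize(sales_b)

# Classical route: z statistic and one-sided p-value under mu_A = mu_B
z = (mean_a - mean_b) / sqrt(var_a / n + var_b / n)
p_value = 1.0 - NormalDist().cdf(z)

# Observable route: chance that one future A period outsells one future
# B period (plug-in normal approximation, reasonable only because n is huge)
prob_a_wins = NormalDist().cdf((mean_a - mean_b) / sqrt(var_a + var_b))

print(f"p-value = {p_value:.2g}")
print(f"Probability(Sales A > Sales B) = {prob_a_wins:.3f}")
```

Under this setup the p-value is minuscule while the probability stays close to a coin flip. Change the units so that every sale is multiplied by 1,000 and the raw effect size grows a thousandfold, yet both numbers stay exactly where they were.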

And that’s the point. The old ways of doing things were set up to make things too certain.

I wouldn’t go so far as to say reliance on the old ways was *cultish*. Most people just don’t know of the alternatives.