The book is *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives* by Deirdre Nansen McCloskey and Steve Ziliak.

From the description at Amazon:

The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.

This is part of the theme I’ve long been pushing. McCloskey and Steve Ziliak are shocked, perplexed, and bewildered that classical statistics and **p-values** are still being used.

I’m not so shocked. They want people to abandon p-values and start using *effect sizes*. A fine first step, but one that doesn’t solve the whole problem.

I say we should drop p-values like Obama dropped Rev. Wright, eschew effect sizes like Joe Biden did reality, and return to observables. Let me, as they say, illustrate with a (condensed) example from my book.

Suppose there are two advertising campaigns A and B for widget sales. Since we don’t know how many sales will happen under A or B, we quantify our uncertainty in this number using a probability distribution. We’ll use a normal, since everybody else does, but the example works for any probability distribution.

Now, a normal distribution requires two unobservable numbers, called *parameters*, to be specified so that you can use it. The names of these two parameters are μ and σ. Both ad campaigns need their own, so we have μ_{A} and σ_{A}, and μ_{B} and σ_{B}. Current practice more or less ignores the σ_{A} and σ_{B}, so we will too.

Here is what “statistical significance” is all about.

Actual sales data under the two campaigns A and B is taken. A *statistic* is calculated: Call it T. It is a function of differences in the observed sales under both campaigns. Never mind how it’s calculated. T is not unique, and for any problem dozens are available. With T in hand, the classical statistician makes this mathematical statement:

μ_{A}=μ_{B}

and then the infamous p-value is calculated, which is

Probability(Another T > Our T given that μ_{A}=μ_{B})

where the “Another T” is the statistic we would get if we were to repeat the entire experiment again. Do we repeat it again? No, so we are already in deep waters. But never mind.
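Since nobody actually repeats the experiment, here is a minimal Python sketch of what that repetition would look like if a computer did it for us. Everything in it is an assumption made for illustration: the function names `t_statistic` and `simulated_p_value`, the particular choice of T (a difference in mean sales divided by its standard error), and the decision to give both pretend campaigns the same σ. It is one way to approximate the probability above, not the way any particular software package does it.

```python
import random
import statistics


def t_statistic(sales_a, sales_b):
    """One of many possible T's: difference in mean sales over its standard error."""
    se = (statistics.variance(sales_a) / len(sales_a)
          + statistics.variance(sales_b) / len(sales_b)) ** 0.5
    return (statistics.mean(sales_a) - statistics.mean(sales_b)) / se


def simulated_p_value(sales_a, sales_b, n_repeats=2000, seed=1):
    """Approximate Probability(Another T > Our T given mu_A = mu_B) by simulation."""
    rng = random.Random(seed)
    our_t = t_statistic(sales_a, sales_b)

    # Pretend mu_A = mu_B: both campaigns draw from one normal distribution.
    # Sharing a single sigma as well is a simplifying assumption of this sketch.
    pooled = list(sales_a) + list(sales_b)
    mu = statistics.mean(pooled)
    sigma = statistics.stdev(pooled)

    exceed = 0
    for _ in range(n_repeats):
        # "Repeat the entire experiment" with the same sample sizes.
        fake_a = [rng.gauss(mu, sigma) for _ in sales_a]
        fake_b = [rng.gauss(mu, sigma) for _ in sales_b]
        if t_statistic(fake_a, fake_b) > our_t:
            exceed += 1
    return exceed / n_repeats
```

Feed it two lists of observed sales and it returns the fraction of pretend experiments whose T beat ours. Notice that nothing in the calculation asks whether A will actually outsell B.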

If the p-value is less than the magic number of 0.05, then the results are said to be *statistically significant*.

Quick readers will have spotted the major difficulty. What does equating two unobservable parameters in order to calculate some weird probability have to do with whether the campaigns differ from one another?

The words are *not much*, which is why McCloskey and Ziliak call the dependence on p-values a cult.

They recommend, in its place, estimating the effect size, which is this:

μ_{A} – μ_{B}.

Eh. It’s part way there, but it’s still a statement about unobservable parameters (and it still ignores the other unobservable parameters σ_{A} and σ_{B}).

What people really want to know is this:

Probability(Sales A > Sales B given old data).

Or they’d like to estimate the actual sales under A or B. There are new ways that can calculate these actual probabilities of interest. However, you won’t learn these methods in any but the most esoteric statistics class.

And *that* is what should change.

Because, I am here to tell you, you can have a p-value as small as you like, you can have an effect size as big as you like, but it can still be the case that

Probability(Sales A > Sales B given old data) ~ 50%!

which is the same as just guessing. Yes, the actual, observable numbers, the real-life stuff, the physical, measurable, tangible decisionable reality can be no different at all. At least, we might not be able to tell they are any different.
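To see how this can happen, here is a rough Python sketch with made-up sales figures. The numbers (5,000 observations per campaign, true means of 103 and 100, a σ of 20) and the function names `p_value` and `prob_a_beats_b` are all assumptions chosen for illustration. The predictive probability uses a large-sample normal approximation in which each new sale has variance s²(1 + 1/n), and it treats a new sale under A as independent of a new sale under B. Treat it as a sketch under those assumptions, not as the method from the book.

```python
import random
from statistics import NormalDist, mean, stdev

# Made-up "old data": lots of observations, a small difference in means,
# and a large spread from sale to sale.
rng = random.Random(42)
n = 5000
sales_a = [rng.gauss(103, 20) for _ in range(n)]
sales_b = [rng.gauss(100, 20) for _ in range(n)]


def p_value(a, b):
    """One-sided p-value for mu_A = mu_B, using a large-sample normal approximation."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    t = (mean(a) - mean(b)) / se
    return 1 - NormalDist().cdf(t)


def prob_a_beats_b(a, b):
    """Approximate Probability(Sales A > Sales B given old data) for one new sale each."""
    # Predictive variance of a new observation: s^2 * (1 + 1/n).
    var_a = stdev(a) ** 2 * (1 + 1 / len(a))
    var_b = stdev(b) ** 2 * (1 + 1 / len(b))
    # The difference of two independent normals is itself normal.
    diff = NormalDist(mean(a) - mean(b), (var_a + var_b) ** 0.5)
    return 1 - diff.cdf(0)


print("p-value:", p_value(sales_a, sales_b))
print("Probability(Sales A > Sales B):", prob_a_beats_b(sales_a, sales_b))
```

The p-value shrinks as the sample grows, because the standard error of the difference in means shrinks with it. The predictive probability does not: it depends on the difference in means relative to the spread of the individual sales, and collecting more data does nothing to reduce that spread.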

And that’s the point. The old ways of doing things were set up to make things too certain.

I wouldn’t go so far as to say reliance on the old ways was *cultish*. Most people just don’t know of the alternatives.

Ok, but what is the alternative you suggest?

SteveBrookline,

I suggest directly computing probabilities like

Probability(Sales A > Sales B given old data)

In the biz, this goes under the name “predictive inference.” Not the best name in the world, kind of misleading, but that’s the historical one. Maybe we can come up with a new one.

How about observable statistics? Doesn’t quite roll off the tongue.

I have some of these methods posted here, and of course, I go on and on in my book (which will be coming soon).

There is surely a simpler way of looking at this problem. The two sales campaigns A and B were done to decide which one should be rolled out nationally or otherwise scaled up, and which should be dropped (that’s the assumption I’m making, because otherwise we don’t need any inference at all).

But if you _have_ to make a decision, you just choose the one that worked best. The probability that it’s better doesn’t matter until it comes to explaining later on why it wasn’t successful and how it wasn’t your fault 😉

William,

For the most part you are right: just pick the one that worked best. That is often, absolutely, the right answer.

But complications come in. What if we were only able to test A over two weeks but we could do B over a month? What if it costs 50% more for B than for A? What if, over the area or time covered by B, the people who fell under the campaigns’ sway were quantifiably different?

These, and other, considerations force you to think about probability.

William,

You said, “There are new ways that can calculate these actual probabilities of interest.” Please elaborate with some references.

Thanks

Bill R,

There are some books out there, but unfortunately they are all at a level beyond most introductory texts.

A decent proceedings is *Modelling and Prediction: Honoring Seymour Geisser* by Jack Lee, Wesley Johnson, and Arnold Zellner. It’s a mixed book containing papers from a conference.

There is Geisser’s own work, naturally: *Predictive Inference*. This is very hard to read unless you have a lot of probability under your belt. Geisser was a great mathematician, but not a great explainer.

There are one or two others (search for “predictive inference”), but not many.

Then there will be (look for an announcement this week!) my book, which is at the intro level, and which contains code you can try. Code is free and fits into R, which is also free.

I’ll also be giving more examples soon of some of the major differences. This week, certainly.

William,

Thanks for the references. I’ve ordered Geisser’s book. While I was at UNC in the ’70s, Dana Quade talked about using pair charts, U-statistics and matched sets as a vehicle for inference (the latter similar to Rubin’s work) in a non-parametric context. Not too surprising, since he was one of Hoeffding’s students. I moved off in a different direction once I left. Should have paid more attention!

Bill

Bill,

Lot of good non-parametric stuff out there. I often like the idea of non-parametric because it is based, in part, on logic, and that’s the right starting point.