Thanks to the many readers who sent me Johnson’s paper, which is here (pdf). Those who haven’t will want to read “Everything Wrong With P-values Under One Roof”, the material of which is assumed known here.
Johnson’s and Our Concerns
A new paper¹ by Valen Johnson is creating a stir. Even the geek press is weighing in. Ars Technica writes, “Is it time to up the statistical standard for scientific results? A statistician says science’s test for significance falls short.” Johnson isn’t the only one. It’s time for “significance” tests to make their exit.
Why? Too easy, as we know, to claim that the “science shows” the sky is falling. Johnson says the “apparent lack of reproducibility threatens the credibility of the scientific enterprise.” Only thing wrong with that sentiment is the word “apparent.”
The big not-so-secret is that most experiments in the so-called soft sciences, which—I’m going to shock you—philosopher David Stove called the “intellectual slums”, are never reproduced. Not in the sense that the exact same experiments are re-run looking for similar results. Instead, data is collected, models are fit, and pleasing theories generated. Soft scientists are too busy transgressing the boundaries to be bothered to replicate what they already know, or hope, is true.
I’ve written about how classical (frequentist) statistics works in detail many times and won’t do so again now (see the Classic Post page under Statistics). There is only one point to remember. Users glop data into a model, which accommodates that data by stretching sometimes into bizarre shapes. No matter. The only thing which concerns anybody is whether the model-data combination spits out a wee p-value, defined as a p-value less than the magic number.
Nobody ever remembers what a p-value is, and nobody cares that they do not remember. But everybody is sure that the p-value’s occult powers “prove” whatever it is the researcher wanted to prove.
Johnson, relying on some nifty mathematics which ties certain frequentist and Bayesian procedures together, claims the magic number is too high. He advises a New & Improved! magic number ten times smaller than the old magic number. He would accompany this smaller magic number with a (Bayesian) p-value-like measure, which says something technical, just like the p-value actually does, about how the data fits the model.
This is all fine (Johnson’s math is exemplary), and his wee-er p-value would pare back slightly the capers in which researchers engage. But only slightly. Problem is that wee p-values are as easy to discover as “outraged” Huffington Post writers. As explained in my above linked article, it will only be a small additional burden for researchers to churn up these new, wee-er p-values. Not much will be gained. But go for it.
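A rough sketch of why the wee-er number pares things back only slightly. This is not Johnson’s calculation. It assumes the old magic number is 0.05 and the new one 0.005, that a researcher gets several independent tries at pure noise, and that each try is a two-sided z-test with known standard deviation. The function names are made up for the illustration.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: true mean is 0, with known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def chance_of_wee_p(alpha, tries):
    """Chance that at least one of `tries` independent tests on noise gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** tries


def simulate(alpha, tries, n=30, studies=1000, seed=1):
    """Fraction of simulated studies in which pure noise yields some p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # Every "try" is a fresh sample of noise: there is nothing to find.
        p_values = [
            z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(tries)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / studies


if __name__ == "__main__":
    for alpha in (0.05, 0.005):  # assumed old and new magic numbers
        print(alpha, chance_of_wee_p(alpha, 20), simulate(alpha, 20))
```

Under these assumptions, twenty tries at noise turn up a wee p-value about 64% of the time with the old number and about 10% of the time with the new one. The smaller number lowers the rate per try; it does nothing to stop anyone from trying more.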
What should happen
What’s needed is not a change in mathematics, but in philosophy.
First, researchers need to stop lying, stop exaggerating, restrain their goofball stunts, quit pretending they can plumb the depths of the human mind with questionnaires, and dump the masquerade that small samples of North American college students are representative of the human race. And when they pull these shenanigans, they ought to be called out for it.
But by whom? Press releases and news reports bear little relation to what happened in the data. The epidemiologist fallacy is epidemic. Policy makers are hungry for verification. Do you know how much money government spends on research? Scientists are people too and no better than civilians, it seems, at finding evidence contrary to their beliefs. Though they’re much better at confirming their opinions.
This is all meta-statistical, i.e. beyond the model, but it all affects the probability of questions at hand to a far greater degree than the formal mathematics. (Johnson understands this.) The reason we give abnormal attention to the model is that it is just that part of the process which we can quantify. And numbers sound scientific: they are magical. We ignore what can’t be quantified and fix our eyes on the pretty, pretty numbers.
Second: remember sliding wooden blocks down inclined planes back in high school? Everything set up just so and, lo, Newton’s physics popped out. And every time we threw a tiny chunk of sodium into water, festivities ensued, just like the equations said they would. Replication at work.
That’s what’s needed. Actual replication. The fancy models fitted by soft scientists should be used to make predictions, just like the models employed by physicists and chemists. Every probability model that spits out a p-value should instead spit out guesses about what data never seen before² would look like. Those guesses could be checked against reality. Bad models unceremoniously would be dumped, modest ones fixed up and made to make new predictions, and good ones tentatively accepted.
“Tentatively” because scientists are people and we can’t trust them to do their own replication.
The technical name for predictive statistics is Bayesian posterior predictive analysis, where all memories of parameters disappear (they are “integrated out”). There are no such things as p-values or Bayes factors. All that is left is observables. A change in X causes this change in the probability of Y, the model says. So, we change X (or look for a changed X in nature) and then see if the probability of Y accords with the actual appearance of Y. Simple!
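A minimal sketch of what “integrated out” means, in the simplest case I can think of: no X at all, just a count of successes. The Beta-Binomial model, the flat prior, and every number below are assumptions chosen for the illustration, not anything from Johnson’s paper. The unknown success rate never appears in the output; only probabilities of counts we might actually observe.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_predictive(successes, trials, new_trials, prior_a=1.0, prior_b=1.0):
    """P(y successes in new_trials | old data), with the unknown rate integrated out."""
    # Beta(prior_a, prior_b) prior updated by the old data.
    a = prior_a + successes
    b = prior_b + trials - successes
    # Beta-Binomial probabilities for y = 0, 1, ..., new_trials.
    return [
        comb(new_trials, y) * exp(log_beta(a + y, b + new_trials - y) - log_beta(a, b))
        for y in range(new_trials + 1)
    ]


if __name__ == "__main__":
    # Made-up numbers: 12 successes in 40 old trials, predicting 10 new trials.
    predictive = posterior_predictive(12, 40, 10)
    observed_new = 7  # hypothetical count seen later
    tail = sum(predictive[observed_new:])
    print(f"P(Y >= {observed_new} in new data) = {tail:.3f}")
```

If the count that later turns up sits where the model put almost no probability, the model has some explaining to do. With an X in the picture the same idea holds: one predictive distribution per value of X, each checked against what Y actually does.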
This technique isn’t used because (a) the math is hard, (b) it is unknown except by mathematical statisticians, and (c) it scares the hell out of researchers who know they’d have far less to say. Even Johnson’s method will double current sample sizes. Predictive statistics requires a doubling of the doubling—and much more time. The initial data, as before, is used to fit the model. Then predictions are made and then we have to wait for new data and see if the predictions match.
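The “see if the predictions match” step can be sketched with a scoring rule. The Brier score used here is my choice of yardstick, one of several that would do, and the predictions and outcomes are invented purely to show the mechanics. The one thing that matters is that the probabilities were written down before the new data arrived.

```python
def brier_score(probabilities, outcomes):
    """Mean squared gap between predicted probabilities and 0/1 outcomes (lower is better)."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need one prediction per outcome")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


def skill(model_probs, baseline_probs, outcomes):
    """Above 0: the model beat the baseline on the new data; 0 or below: it did not."""
    return 1.0 - brier_score(model_probs, outcomes) / brier_score(baseline_probs, outcomes)


if __name__ == "__main__":
    # Made-up illustration only: predictions issued before the new data arrived.
    model_probs = [0.8, 0.3, 0.7, 0.6, 0.2]
    new_outcomes = [1, 0, 1, 1, 0]              # what later turned up
    baseline_probs = [0.5] * len(new_outcomes)  # a know-nothing guess
    print(skill(model_probs, baseline_probs, new_outcomes))
```

A model that cannot beat the know-nothing guess on data it has never seen is a candidate for the dumpster, however wee its p-value was.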
Right, climatologists? Ain’t that so, educationists? Isn’t this right, sociologists?
Caution: even if predictive statistics are used, it does not solve the meta-statistical problems. No math can. We will always be in danger of over-certainty.
¹ Actually a summary paper. See his note 21 for directions to the real guts.
² This is not cross validation. There we re-use the same data multiple times.