Johnson’s Revised Standards For Statistical Evidence

Valen Johnson: Notice how his shirt matches his hair.

Thanks to the many readers who sent me Johnson’s paper, which is here (pdf). Those who haven’t will want to read “Everything Wrong With P-values Under One Roof”, the material of which is assumed known here.

Johnson’s and Our Concerns

A new paper1 by Valen Johnson is creating a stir. Even the geek press is weighing in. Ars Technica writes, “Is it time to up the statistical standard for scientific results? A statistician says science’s test for significance falls short.” Johnson isn’t the only one. It’s time for “significance” tests to make their exit.

Why? Too easy, as we know, to claim that the “science shows” the sky is falling. Johnson says the “apparent lack of reproducibility threatens the credibility of the scientific enterprise.” Only thing wrong with that sentiment is the word “apparent.”

The big not-so-secret is that most experiments in the so-called soft sciences, which—I’m going to shock you—philosopher David Stove called the “intellectual slums”, are never reproduced. Not in the sense that the exact same experiments are re-run looking for similar results. Instead, data is collected, models are fit, and pleasing theories generated. Soft scientists are too busy transgressing the boundaries to be bothered to replicate what they already know, or hope, is true.

What happens

I’ve written about how classical (frequentist) statistics works in detail many times and won’t do so again now (see the Classic Post page under Statistics). There is only one point to remember. Users glop data into a model, which accommodates that data by stretching sometimes into bizarre shapes. No matter. The only thing which concerns anybody is whether the model-data combination spits out a wee p-value, defined as a p-value less than the magic number.

Nobody ever remembers what a p-value is, and nobody cares that they do not remember. But everybody is sure that the p-value’s occult powers “prove” whatever it is the researcher wanted to prove.

Johnson, relying on some nifty mathematics which tie certain frequentist and Bayesian procedures together, claims the magic number is too high. He advises a New & Improved! magic number ten times smaller than the old magic number. He would accompany this smaller magic number with a (Bayesian) p-value-like measure, which says something technical, just like the p-value actually does, about how the data fits the model.

This is all fine (Johnson’s math is exemplary), and his wee-er p-value would pare back slightly the capers in which researchers engage. But only slightly. Problem is that wee p-values are as easy to discover as “outraged” Huffington Post writers. As explained in my above linked article, it will only be a small additional burden for researchers to churn up these new, wee-er p-values. Not much will be gained. But go for it.

What should happen

What’s needed is not a change in mathematics, but in philosophy.

First, researchers need to stop lying, stop exaggerating, restrain their goofball stunts, quit pretending they can plumb the depths of the human mind with questionnaires, and dump the masquerade that small samples of North American college students are representative of the human race. And when they pull these shenanigans, they ought to be called out for it.

But by whom? Press releases and news reports have little bearing on what happened in the data. The epidemiologist fallacy is epidemic. Policy makers are hungry for verification. Do you know how much money government spends on research? Scientists are people too and no better than civilians, it seems, at finding evidence contrary to their beliefs. Though they’re much better at confirming their opinions.

This is all meta-statistical, i.e. beyond the model, but it all affects the probability of questions at hand to a far greater degree than the formal mathematics. (Johnson understands this.) The reason we give abnormal attention to the model is that it is just that part of the process which we can quantify. And numbers sound scientific: they are magical. We ignore what can’t be quantified and fix our eyes on the pretty, pretty numbers.

Second: remember sliding wooden blocks down inclined planes back in high school? Everything set up just so and, lo, Newton’s physics popped out. And every time we threw a tiny chunk of sodium into water, festivities ensued, just like the equations said they would. Replication at work.

That’s what’s needed. Actual replication. The fancy models fitted by soft scientists should be used to make predictions, just like the models employed by physicists and chemists. Every probability model that spits out a p-value should instead spit out guesses about what data never2 seen before would look like. Those guesses could be checked against reality. Bad models unceremoniously would be dumped, modest ones fixed up and made to make new predictions, and good ones tentatively accepted.

“Tentatively” because scientists are people and we can’t trust them to do their own replication.

The technical name for predictive statistics is Bayesian posterior predictive analysis, where all memories of parameters disappear (they are “integrated out”). There are no such things as p-values or Bayes factors. All that is left is observables. A change in X causes this change in the probability of Y, the model says. So, we change X (or look for a changed X in nature) and then see if the probability of Y accords with the actual appearance of Y. Simple!
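
To make “integrated out” concrete, here is a minimal sketch under assumptions that are not in this post: a yes/no observable, a flat Beta(1, 1) prior, and made-up counts. The unknown success rate never appears in the answer; what comes out is a probability for each possible count of successes in data not yet seen, which can then be checked against what actually turns up.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_predictive(k_new, n_new, successes, failures, a=1.0, b=1.0):
    """P(k_new successes in n_new future trials | old data).

    Beta-binomial form: the success rate is integrated out analytically,
    so only observables remain. The Beta(a, b) prior is an assumption.
    """
    a_post = a + successes
    b_post = b + failures
    log_ratio = log_beta(a_post + k_new, b_post + n_new - k_new) - log_beta(a_post, b_post)
    return comb(n_new, k_new) * exp(log_ratio)


# Hypothetical old data: 7 successes and 3 failures.
old_successes, old_failures = 7, 3
n_new = 10

# Predicted probability of each possible outcome in 10 new trials.
pred = [posterior_predictive(k, n_new, old_successes, old_failures) for k in range(n_new + 1)]

for k, p in enumerate(pred):
    print(f"P({k} successes in {n_new} new trials) = {p:.4f}")
```

The model is then judged by how much probability it gave to what really happened in the new trials, not by any statement about a parameter.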

This technique isn’t used because (a) the math is hard, (b) it is unknown except by mathematical statisticians, and (c) it scares the hell out of researchers who know they’d have far less to say. Even Johnson’s method will double current sample sizes. Predictive statistics requires a doubling of the doubling—and much more time. The initial data, as before, is used to fit the model. Then predictions are made and then we have to wait for new data and see if the predictions match.

Right, climatologists? Ain’t that so, educationists? Isn’t this right, sociologists?

Caution: even if predictive statistics are used, it does not solve the meta-statistical problems. No math can. We will always be in danger of over-certainty.

——————————————————————-

1Actually a summary paper. See his note 21 for directions to the real guts.

2This is not cross validation. There we re-use the same data multiple times.

Comments

Johnson’s Revised Standards For Statistical Evidence — 17 Comments

  1. “We are not worthy!”

    “We are not worthy!”

    Create a model of reality and check to see if it is actually a model of reality. Be damn wary of your statistical glasses comparing the new data to the prediction.

    Going back to GCMs. “So did the temperature for Seattle on March 3rd, 1998 match the temperature the GCM said it would?”

    I.e. you create your model, populate it with inputs and drivers and check to see if the model matches the reality that was.

    But you start asking questions along these lines and you find out:

    1. There is no Seattle in the GCM.
    2. There is no match for the day, week or month for any region on the planet.
    3. The only match is that “Statistically” the spread of temperatures matched the observed temperatures to within some magical… 95%…

  2. Even Johnson’s method will double current sample sizes. Predictive statistics requires a doubling of the doubling—and much more time. The initial data, as before, is used to fit the model. Then predictions are made and then we have to wait for new data and see if the predictions match.

    Not really. All that’s needed is to withhold some of the data when building the model and test against the withheld data. The drawback is the need to generate up to N models where N is the current number of samples. Not much different than building a model and waiting for more data.

    But if the test is whether the model produces the same distribution of Y as has been observed then you’re stuck with the problem of determining the sameness of two distributions and then — oops — p-values start reappearing. Is there any way out of this?

  3. DAV,

    Withhold data? Would you trust the people we routinely criticize here to do this? Or do you believe that they would cheat a little?

    All it takes is one look at the results and subsequent tweaking of the model to invalidate the idea.

    There already exist p-value/Bayes factor “substitutes”, which allow you to see how your model would have done assuming the old data is new data. These are called “proper scores”. These can surely be done, and are better than p-values/Bayes factors because they give statements in terms of observables, but they are always highly preliminary. Because—of course!—the old data is not the new data.

    In short, we act as physicists do (or used to, before they started making metaphysical arguments).
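
The “proper scores” mentioned in this comment can be illustrated with two standard examples, the Brier score and the logarithmic score. This is only a sketch: the forecasts and outcomes below are invented, and nothing here is taken from Briggs’s or Johnson’s own code. Both scores compare probability forecasts with what was observed, and lower is better.

```python
from math import log


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes):
    """Mean negative log of the probability given to what actually happened."""
    return -sum(log(p if y == 1 else 1.0 - p) for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical observations (1 = event happened, 0 = it did not).
outcomes = [1, 0, 1, 1, 0]

# Two hypothetical forecasters: one commits, one always says 50/50.
sharp = [0.9, 0.2, 0.8, 0.7, 0.1]
vague = [0.5, 0.5, 0.5, 0.5, 0.5]

for name, probs in (("sharp", sharp), ("vague", vague)):
    print(name, round(brier_score(probs, outcomes), 4), round(log_score(probs, outcomes), 4))
```

As the comment says, a score computed on the data used to fit the model is only preliminary; the same functions apply unchanged once genuinely new outcomes arrive.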

  4. Hey – even I understand and I am certainly not worthy nor a statistical wizard. How do we get this very instructive essay into the hands of politicians and policy makers who are bedazzled by the junk predictions?

  5. Briggs,

    Leaving the problem of cheating aside, conceptually, withholding is indistinguishable from building a single model on all of the data at hand then testing against new data (except, of course, you have less data for building the model). It has the added advantage of repeating this N times — a kind of self-replication. The problem of new data matching what you already have will exist no matter how much you have. Physicists also face this problem.

    Then predictions are made and then we have to wait for new data and see if the predictions match.

    It seems the real problem is what is meant by predictions matching. The new data rarely will exactly match the predictions. When can we say the match is close enough and how do we go about determining the closeness? I suspect it is this problem which led Fisher, et al. down the road of p-values.

  6. DAV,

    Actually, Fisher only danced around the idea and never came to it directly. You know what he thought of Bayesian ideas (fiducial probability, anybody?). But Fisher did think of the idea of scores, though. His work in ESP prediction card matching is still worth reading. From this Persi Diaconis was able to come to some fascinating mathematics. Marry that with some other work, and we arrive at the idea of proper scores. That just is the best way (you can prove) how to match (probabilistic) predictions to observations. I.e., every other method is inferior.

    I guess I should write about this.

    Re: waiting for new data. Tough luck. You can’t always solve a problem in one step. Believing that we can is what accounts for a large portion of over-confidence.

  7. You ratfink! You could have kept quiet about this, or loaded it up with some kind of spin about “more Bubba Bloviating from some dinky ag school in Texas,” or hinted this guy was a closet Tea Partier or Baptist. But NO! you gotta tell it straight, and now me and my colleagues all look like a bunch of data-fudging monkeys.

    I told my students they could keep their confirmed research hypotheses with p-values just slightly less than 0.05, and now they’re going to find out that they need way more evidence, way larger samples, to get those wee p-values down in the 0.001 neighborhood. And all their previous work is “no longer covered” with those p-values of 0.049937, etc. I’m hoping there will soon be a Federal grants program to subsidize larger sample sizes for low-income research assistants, otherwise there will be rioting in the streets.

    Worse yet, I’ve gotta rewrite a bunch of lectures and case studies to incorporate the New, Improved, P-Values. AND come up with a plausible explanation of why 5% is bad, but 0.5% is good. I might even have to read Dr Johnson’s article.

  8. I guess I should write about this.

    Please do. I’m looking forward to it.

    Re: waiting for new data. Tough luck. You can’t always solve a problem in one step.

    Agreed but I see advantage in running up to N experiments vs. only one then waiting for N someone elses to repeat it. No matter how many experiments have been run, one can never be certain the next time will yield the same results.

  9. “Bayesian posterior predictive analysis … isn’t used because (a) the math is hard, (b) it is unknown except by mathematical statisticians, and (c) it scares the hell out of researchers who know they’d have far less to say.”

    Computers can do hard math if the mathematical statisticians tell them how. Might there be an open source package that lets us investigate without having to be math savants?

  10. Gary,

    I use JAGS. But the problem is that this requires, for every problem, substantial coding, and in-depth knowledge of statistics.

    (Its side problem is that it uses the idea of MCMC, i.e. pretending it has made up “random” numbers. It’s not strictly necessary to do this. Shhhh.)

  11. Great post, thanks Briggs.

    “double current sample sizes. Predictive statistics requires a doubling of the doubling—and much more time”. That’s gonna be a problem, good luck with fixing that up! Finding enough patients for your trial is already a problem.

    What advice would you give someone trained in the statistical approach (p value-hunter, frequentist etc.) to learn the correct way of doing statistics in their research? Where do they start to retrain themselves? Here is a relevant quote from Marcus Aurelius:

    “A man should always have these two rules in readiness; the one, to
    do only whatever the reason of the ruling and legislating faculty
    may suggest for the use of men; the other, to change thy opinion,
    if there is any one at hand who sets thee right and moves thee from
    any opinion. But this change of opinion must proceed only from a certain
    persuasion, as of what is just or of common advantage, and the like,
    not because it appears pleasant or brings reputation.”

  12. Briggs,
    Thanks for the ref, but it looks worse than I thought (tm climate science). This is what I loathe about math — nobody knows how to make it accessible to the moderately intelligent curious layman.

  13. “the shirt matches his hair”.

    There’s something going on. To do with the ripped pants in ‘frisco. Weird things can happen anywhere but it might have been the original Glenn Miller band that you heard playing.

  14. I can’t see getting worked up about Johnson’s work. The 0.05 standard is arbitrary. If we feel it’s too high, let’s lower it. But no amount of math will make the choice of level any less arbitrary.

    I have another philosophy for hypothesis testing. Given possible events {E_n}, and the occurrence of one E_k, and a hypothesis of corresponding probabilities “H0: {p_n},” reject H0 if sum {p_n : 0.05*p_n < p_k} < 0.05. Thus, reject when a true H0 would imply there exist relatively much more likely events, and collectively one of these events was quite likely to have occurred.

    If you work this out for the standard normal, you see it corresponds to a p-value of about .0017, much lower than the usual .05. But so what? It's just an arbitrary choice of level, and you could work backwards to find a level in Method A that corresponds to a given level in Method B.
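
The “about .0017” figure in this comment can be checked with a few lines, under one reading of the rule: events become points on the real line, probabilities become the standard normal density, and the sum becomes an integral. That reading is an assumption, since the commenter did not spell out the continuous version.

```python
from math import log, sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
level = 0.05

# A point y is "much more likely" than the observed x when
# level * pdf(y) >= pdf(x), which for the normal density means
# y^2 <= x^2 - 2 * ln(1 / level).
# The rule rejects when those points carry more than 1 - level of the
# probability, i.e. when sqrt(x^2 - 2 * ln(1 / level)) >= c, where
# P(|Y| <= c) = 1 - level.
c = std_normal.inv_cdf(1 - level / 2)

# Smallest |x| at which the rule rejects.
x_crit = sqrt(c ** 2 + 2 * log(1 / level))

# The ordinary two-sided p-value at that point.
p_equiv = 2 * (1 - std_normal.cdf(x_crit))

print(f"critical |x| = {x_crit:.4f}, equivalent two-sided p-value = {p_equiv:.5f}")
```

Under that reading the rejection boundary sits a little beyond three standard deviations, and the matching two-sided p-value agrees with the commenter’s figure.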

  15. SteveBrookline,

    I disagree. Johnson’s math says many important and useful things. It ties together neatly what many suspected could be tied together. His effort could have the effect of removing even more people from the frequentist continuum, and that is a good thing.

    For example, did you know, that standard linear (normal) regression is exactly the same in both frequentist and flat-prior Bayesian theories? Interesting that. Why? And so on and so on.

    Your suggestion, like Johnson’s, of merely lowering the p-value is only a minor patch for all the reasons I’ve laid out above, and in hundreds of articles over the years.

  16. Pingback: Johnson’s Revised Standards For Statistic...

  17. Agree on your comments re: soft sciences and the ease with which one can create meaningless low p-values.

    For example, today I read about a study which purports to show that households with guns increase the probability of suicides, relative to the total historical number of suicides, by .5%-.9% for every 1% increase of the number of households with guns. No p-values were given, but the study’s very premise (that such a tiny absolute distinction would be observable in the number of suicides based on one variable, i.e., “owning guns”, holding all else constant), to me at least, was transparently absurd. I like to read of these studies because I learn of new things which are not true.

    But…….I think you are, to some degree, shooting a straw man. Yes, many studies in soft science fields, economics being one, are being done to get published and to be published they need low p-values. And it is shocking how many economists fit data in multi-variable models to get low p-values and think that their conclusions are de facto evidence of something meaningful.

    The Federal Reserve used to target M1 as one of the predictors of employment. They had a model which they used for over 30 years. Each year it was “updated” to adjust for new data. Each year it failed to predict. After 30 years they gave up. M1, once one of the most watched numbers on Wall Street, is now irrelevant.

    I don’t see, however, why that makes p-values, per se, stupid or non-sensical. I think having a single “cut-off” point may be non-sensical. Depends on the circumstances. I think poorly constructed experiments are non-sensical. I think soft science may sometimes seem like an oxymoron—but not always. But if I want to bet whether a coin is weighted or cards are being fixed, p-values work fine. Yes, they are “known” distributions. But sometimes straightforward previously “unknown distributions” are observed experimentally which are so unexpected and so unlikely to be random, that p-values can be useful in determining just how unlikely and whether to pursue further analysis.

    Your critique of p-values is a form of the multiple comparison critique. That is likely why many have suggested decreasing significance thresholds by large factors. But then we just introduce more type 2 errors.

    I guess I am just saying that p-values can be helpful. Took me long enough.

    (I just discovered this website. It is great. These are “off the cuff” comments which help me think.)