All Those Warnings About Models Are True: Researchers Given Same Data Come To Huge Number Of Conflicting Findings

Seventy-some research groups were given identical data and asked to investigate an identical question. The groups did not communicate. Details are in the paper “Observing Many Researchers Using the Same Data and Hypothesis Reveals a Hidden Universe of Uncertainty”, by some enormous number of authors.

As is the wont of sociologists, each group created several models, about 15 on average. There were 1,253 different models from the seventy-some groups. Each was examined after the fact, and it was discovered that no two models were the same.

The question was this: Whether “more immigration will reduce public support for government provision of social policies.”

The answers are yes, no, or can’t tell. Only one group said they could not investigate the question. All other groups went to town.

The answer was standardized across the models, and called the “Average Marginal Effect” (AME). Clever idea. Here are all the quantified answers, plotted from smallest to largest AME. The 95% “confidence interval” of each model’s AME is also shown.
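
For concreteness, an AME is just the average change in a model’s predicted probability (or predicted value) of the outcome when the immigration measure is nudged, averaged over the observed rows. Here is a minimal sketch of that computation in Python, on made-up data with made-up column names (support, immigration, age) and a plain logistic regression; the paper’s teams used all manner of other specifications, so take it only as an illustration of the idea:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Toy stand-in for the survey data: 'support' is the binary outcome,
    # 'immigration' the measure of interest, 'age' one of many possible controls.
    rng = np.random.default_rng(0)
    n = 2000
    df = pd.DataFrame({
        "immigration": rng.normal(size=n),
        "age": rng.normal(50, 15, size=n),
    })
    eta = -0.1 * df["immigration"] + 0.01 * (df["age"] - 50)
    df["support"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

    # One of the many possible specifications.
    fit = smf.logit("support ~ immigration + age", data=df).fit(disp=False)

    # AME: nudge 'immigration', average the change in predicted probability
    # of 'support' over every observed row.
    eps = 1e-4
    hi, lo = df.copy(), df.copy()
    hi["immigration"] += eps
    lo["immigration"] -= eps
    ame = float(np.mean((fit.predict(hi) - fit.predict(lo)) / (2 * eps)))
    print(f"AME of immigration on support: {ame:+.4f}")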

About half the models were somewhere in the middle, about a quarter said the effect was negative, and about a sixth said positive.

There were many, many, many wee p-values. There were wee p-values galore! Each “confirming” the researchers had the right answer, and that all the other researchers were wrong. Further, those CIs were nice and tight, “proving”, just like p-values, each model was right on the money.

Now, I don’t know about you, but when I saw this, I laughed and laughed and laughed and then laughed some more. I am laughing now. Later, I will laugh again.

There are many warnings about models that we have examined over the years, you and I, dear readers. Two that should have stuck by now are these:

1. All models only say what they are told to say.

2. Science models are nothing but a list of premises, tacit and explicit, describing the uncertainty of some observable.

The first warning is easy to see, and it goes some way in removing the mysticism of “computer” models (that a model was computed still impresses many civilians). Every one of those 1,253 models was a computer model.

The second warning I can’t make stick. Let me try again. By premises I mean all the propositions, or assumptions, observational or otherwise, that speak of the observable. This also includes all premises that can be deduced from the premises.

If notation helps, here is every model ever:

     Pr(Y | P_1, P_2, …, P_q),

where the P_i are some enormously long list of propositions (synonymous with premises). One might be P_j = “I observed x_j = (2,5,2,3,4…)”, i.e. some measure thought by the modeler to modify the probability of the observable Y in the presence of all the other premises. Both phrases, “thought by the modeler” and “in the presence of all the other premises”, are crucial. (The probabilities of Y can be extreme, i.e. 0 or 1, as in many physics models; e.g. a gravity model F = GmM/r^2.)
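
A toy illustration of how the premise list, and not the data alone, fixes Pr(Y | P_1, …, P_q): the same invented data fed through two different premise lists (different link, different controls) gives two different probabilities for the same Y. Nothing here is from the paper; the names and numbers are made up:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "x": rng.normal(size=n),   # the measure of interest
        "z": rng.normal(size=n),   # a control one team includes, another omits
    })
    df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(0.2 * df["x"] + 0.8 * df["z"]))))

    # Two premise lists, identical data, identical question about Y.
    model_a = smf.logit("y ~ x", data=df).fit(disp=False)       # premises: logit link, x only
    model_b = smf.probit("y ~ x + z", data=df).fit(disp=False)  # premises: probit link, x and z

    new = pd.DataFrame({"x": [1.0], "z": [1.0]})
    print("Pr(Y | data, premises A):", round(float(model_a.predict(new).iloc[0]), 3))
    print("Pr(Y | data, premises B):", round(float(model_b.predict(new).iloc[0]), 3))

Same Y, same data; the only thing that changed is what the models were told.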

Too many think writing down, say, the statistics math is the model, the whole of it. That math, such as in a regression, is only a small, a very small, piece of any model. Writing down only equations because that is easy (and it is) leads to the Deadly Sin of Reification, which afflicts not only sociologists, but every scientist, from physicists on down.

The temptation to say the math part of the model is Reality (or “close enough” to it) is overpowering. The temptation is almost never resisted. I guarantee you it wasn’t resisted by the researchers in the paper above.

That’s where that “thought by the modeler” comes into play. He might toss in an “x” to see how it does, because that’s what modelers are trained to do. But that carries with it tacit premises about the strength of the relationship between that premise and Y, and all the other premises in the model at that time. Since that isn’t easy, or is even impossible, to quantify, it doesn’t show up in the math. And the premise is lost from view. It’s still there, though.

Also hidden is the humongous number of tacit premises that accompany data collection (where, who, how, when, etc.). How many unmeasured things about a measured “x” affect the model of Y? Many. But because we can’t quantify these, we forget they are there.

Incidentally, the reason for the prejudice toward math is that researchers often believe there is a “true model”. Of course, there will be true causes of Y, the truest model of all. But researchers weaken “true model” to mean “true probability model”. And there isn’t one.

There’s always a deducible (though perhaps not quantitative) locally true model given the modeler’s premises. But that does not mean the model is universally true, as causes are. (More on that here.)

I know I’ve lost a great many of you. Suffice it to say the model is more, much more, than the math that is written down or coded. That picture above, which is the best-case scenario in which the data were identical for all, proves it.

It’s worth wondering whether, if these 1,253 models were converted to their predictive form, we’d still have the hilarious result.

We would, but not to the same extent. We’d likely drop to, as a wild guess, maybe 8–10% of models insisting with high probability that the answer was no, and others that it was yes. Predictive methods cannot escape those hidden and tacit model premises.
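
By “predictive form” I mean stating each model as a probability of new observables and checking those probabilities against data the model never saw, instead of staring at parameter p-values. A hedged sketch of that kind of comparison, again on invented data and invented specifications (not the paper’s):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 3000
    df = pd.DataFrame({"x": rng.normal(size=n), "z": rng.normal(size=n)})
    df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * df["x"] + 0.5 * df["z"]))))

    # Fit on one half, judge predictions on the other half.
    train, test = df.iloc[: n // 2], df.iloc[n // 2 :]

    for name, formula in [("A: y ~ x", "y ~ x"), ("B: y ~ x + z", "y ~ x + z")]:
        fit = smf.logit(formula, data=train).fit(disp=False)
        p = fit.predict(test)  # predictive Pr(y = 1) for data never used in fitting
        log_score = np.mean(np.log(np.where(test["y"] == 1, p, 1 - p)))
        print(f"{name}: p-value on x = {fit.pvalues['x']:.3f}, "
              f"out-of-sample log score = {log_score:.4f}")

The point isn’t which specification wins; it’s that the comparison is made on predictions of the observable, where the hidden premises can actually be caught out.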

Meaning that if you tried to replicate, or verify the predictions, you’d surely get some of the tacit premises, those that existed only in the modelers’ heads, wrong. Your predictions would be weakened.

The gist? The authors say researchers “should exercise humility and strive to better account for the uncertainty in their work.”

Humility is not known in science these days.

Therefore, my conclusion is: Stop trusting models so damned much.

Buy my new book and learn to argue against the regime: Everything You Believe Is Wrong.

Subscribe or donate to support this site and its wholly independent host using credit card click here. For Zelle, use my email: matt@wmbriggs.com.

15 Comments

  1. fergus

    Several decades ago a statistician I was working with told me there was a well known theorem/proposition/principle (I don’t remember the exact claim) which was that the strength/likelihood/power etc. of a model (some kind of “goodness”), given whatever statistics one was using, eg. p values, should be reduced by a function of the number of models one tried before the final one was accepted. I have always been curious if that was in fact something rigorously established or if it was just a colloquial rule of thumb, or if he was just jiving me. Perhaps someone who visits this site is aware of the notion and its basis, if any.

  2. Peter Morris

    I’m surprised the conclusion wasn’t, “More research is needed to understand these surprising results.”

  3. Dieter Kief

    fergus – it would be for you to decide what your statistician colleague had in mind, because it is you who knew him.
    That said: He might well have tried to make you aware of the possibility of a regressus ad infinitum in case somebody tried to solve the problem of model fitness (or “goodness”, as you called it) by using a formal strategy. – The thing is: the acknowledgement of the usefulness of a model can’t be formalised, because this property of any model is by its very nature a .v.a.l.u.e. .j.u.d.g.e.m.e.n.t. and thus falls into the hermeneutical realm (see Hans-Georg Gadamer’s classic study Truth and Method – or, a bit more rigorous/concise, Jürgen Habermas: Between Naturalism and Religion – about the neo-Kantian three value-spheres that constitute our modern*** worldview, and how all three of them follow different ways of reasoning: 1) the subjective (religion, aesthetics), 2) the social (the law, norms, customs, ethics, morals, etc.), and 3) the objective (the nomological field of (all kinds of) measurements)).

    *** modern as being opposed to postmodern

    btw. – thx. for noticing my hint and writing in such a great way about this paper above, Matt M. Briggs! – I’m delighted!

  4. Ye Olde Statistician

    I thought you were commenting on models in science, then I realized they were only models in sociology.

    Back in the 1880s, Pierre Duhem noted that if you presented the same experiment involving pressure to two physicists, one of whom followed the pressure theory of Laplace and the other that of Lagrange, one would find the hypothesis confirmed and the other would find it rejected. Each would formulate different equations, perform different calculations on the data, and most critically interpret the results differently. [Data have meaning only in the context of a theory.]

    So the warning is an old one, and no one has heeded it yet.

    I discussed some of this a few years ago. Find the section entitled “When models go bad.”
    https://tofspot.blogspot.com/2014/03/americas-next-top-model-part-ii.html

  5. Hagfish Bagpipe

    “These results call for epistemic humility and clarity in reporting scientific findings.”

    Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!-Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!-Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!-oh, oh, oh my…

    HUMILITY!!

    Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!-Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!-Ha-ha!-HAh!-ho-hee-AHH-ha-ha!-Oh-oh-ohHHH!-Ah-ha!-hee-hee!-HO-Ho-HaH!-Wee-hoo!…

  6. brad

    We had to model a cross flow heat exchanger in an air conditioner for a building. We used Excel to do this. 30 people in the class, we had 30 different answers. I really wish I had all 30 of those models. There was a paper buried in them. Hindsight and more years of experience with Excel, modeling, and programming make me point at certain parts of the calculations. Nusselt, Prandtl, Reynolds, and Welty, Wicks and Wilson all collided.

    I suspect, if I were to look at the array of processes, that many of the solutions had denominators too early. Where the denominators were first used can easily have varied from model to model. The joyful process of guesstimating Nusselt and Prandtl can’t be ignored though. Anyone waving away those two blokes when modeling any climatic system should be sent to the back of the line. The challenge of convection can quickly get lost in the boundary layer. Laminar flow is wonderful. It happens in nature. Lenticulars are a beautiful thing to behold. The backsides of those lenticulars are rotors. Rotors are not completely unrelated to tornadoes.

    But a guy with a computer and excel can make a million row by XLZ column matrix and populate it with a few clicks and make a matrix that looks like it changes like nature changes. Awesome… Can it predict 10 minutes from now? (Ladies are free to do this also. I am perfectly happy to disparage ladies doing this as much as I will men). I only disparage the ones who think they can predict 50 years from now when they can’t predict 50 minutes.

  7. brad tittle

    In Covid news: Sorry OT. My local mortuary is still doing a banging business. Running 80% over pre kerfuffle levels. Mystery deaths.

    Apparently another mortuary in our area is also doing very well. He purchased the mortuary just before Covid hit. He has now paid his mortgage off.

    Also Mystery Deaths.

    I hear hints of this from other locals. I see hints of it in Steve Kirsch. I see hints of this in the reports from the Life Insurance folks.

  8. That’s the most “send to other people” paper I’ve seen in years. Thanks for finding it!

    P.S. So what part of “social science” survives this?

  9. JH

    Reasons that 90% of the total variance remains unexplained include that the model employed may not be adequate, that the features chosen/measured are not relevant to the target (using the language of statistical machine learning), and that too many features are included in the model. For example, if I use foot size and hair color to explain the student exam score, the total variance unexplained would probably be more than 90%.

    Anyhow, diagnoses (a nice way of saying finding faults) are usually easier than coming up with solutions. Ascertaining data of quality would require financial resources and brain power.

    I wish the first three authors had attempted to answer the hypothesis themselves and reported the difficulties in reaching an acceptable answer. They sure have done impressive analyses, be they adequate or not. I cannot tell if they exercise humility because they simply report the results they obtain.

    Well, if you are a social scientist, consult a competent statistician!

  10. Back in the “dark ages”, this was drilled into me: “Don’t make vast conclusions from half-vast data”, and one of the take-aways is that one should be humble and honest enough to think “I am the easiest person to fool”, here.

    Excellent, our most gracious host.

  11. There’s an old accounting joke (yes, they exist), that explains these results.

    Interview question: a company buys widgets at $4/unit and sells them at $5/unit; if they sell 50,000 widgets/year and have $10,000 fixed costs per annum, what’s their income?

    Young accountant 1: $40,000
    Young accountant 2: $40,000
    Young accountant 3: Those fixed costs are missing the [incomprehensible jargon for 20 minutes], so the company has a loss of $20,000, to be pro-rated against future profits.

    Young accountant 3 is a promising candidate. But the job goes to…

    Experienced accountant: Is this income being reported to banks ($100,000), management ($40,000), tax authorities ($50,000 loss), or potential investors ($250,000 and growing at 20%/annum)?

    The only surprise is that sociologists seem to have caught on to accounting practices.

  12. JohnK

    Kip Hansen’s take on the same study is of interest. He notes that a multiplicity of independent analyses did NOT converge on an answer, let alone The Answer.

    Pal Review — love that phrase, Kip — clearly doesn’t work, either. His implicit question: then what are we left with? Is there a magic formula (a dehistoricized, ever-applicable Meta-Model) we — or at least, The Anointed — can apply to decide between many differing models and their many differing results?

    Mike Tyson may have said it best: Everybody has a Model until they get hit in the face.

  13. DAA

    JohnK: good point with “Mike Tyson may have said it best: Everybody has a Model until they get hit in the face.”

    What would one trust more before going out to sea: a model or a fisherman with decades of experience?
