What is
A crime has been committed. Evidence—probative background information—points to a list of suspects, a list which, as always, might be incomplete. A most likely candidate is made to stand trial. Evidence for his guilt is provided by a vigorous prosecutor. His defense is conducted by a meek public defender. The jury must estimate the probability that the accused is guilty, and if they feel this probability exceeds reasonable doubt, they must vote to convict him. Before any evidence is heard, many jurors reason that the accused is likely guilty, else he would not be on trial. Thus, instead of considering the question of whether the accused is guilty, the jury decides to opine on the likelihood of the evidence given the accused were innocent. Since the evidence nearly always appears rare or unusual assuming the defendant were innocent, most trials result in a conviction.
That is what classical statistics—both in its frequentist and Bayesian incarnations—is like. Some thing has happened, and a hypothesis is put forth which, based on probative background information, the investigator thinks likely caused the thing. Evidence—mostly in the form of “data”, but non-quantitative information too—for this hypothesis is put forth. Very little is offered in rebuttal. The hypothesis is also only one of many possible, the list of which is probably incomplete; that is, the true hypothesis might not be considered in a given trial.
The jury consists of mathematical formulae whose duty is to report solely on the likelihood of the evidence given the hypothesis is false. And since the evidence nearly always appears rare assuming the hypothesis is false, most trials result in a conviction—meaning, the investigator’s viewpoint is confirmed.
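In code, the trial runs something like this. Here is a toy sketch in Python, with invented data, of the only question the formulae answer: how probable is evidence this extreme, assuming the null (the "innocent" hypothesis) is true?

```python
# A toy sketch of the "trial" logic of a classical significance test.
# All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=30)  # the "innocent" baseline
treated = rng.normal(loc=11.0, scale=2.0, size=30)  # the suspected effect

# The jury reports only Pr(evidence this extreme | no effect), i.e. the
# p-value, computed under the assumption that the null is true.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Verdict:", "convict (reject the null)" if p_value < 0.05
      else "acquit (fail to reject)")
```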
What should be
The problem is that most events cannot, or should not, be coerced into the format of a criminal trial. It is often less important to pin down the exact cause of non-unique events than to understand and predict them. Anyway, events—the “things” above—are rarely unique in the way particular crimes are. We do not always want to say that this variable caused this unique, singular, one-of-a-kind event. Instead, most events of interest are part of a larger structure: a stream of similar things.
For example, the weather or climate. Trying to fit carbon dioxide into the frame as the culprit for a historically observed increase in temperature is vastly less interesting and useful than being able to accurately predict what will happen. It is easy enough to find enough evidence to convict our poor gas classically. But if it were part of a gang of gases, the others would go free. Meanwhile, in celebrating our conviction, we would remain ignorant of what will happen, or we would issue poor predictions because we were so intent on assigning blame.
Or take comparing fertilizers. The classical way would be to say, seemingly authoritatively, that the hypothesis that fertilizer A is the “same” as fertilizer B has been “rejected.” What is more important is to say how much better fertilizer B is than A, and under what circumstances.
Just think: in the data we have collected, in the particular historical circumstances which led to their collection, we know everything there is to know. We know whether fertilizer A was better than B: just count yield for both brands! I use the word “know” in the sense of rigorous proof, and do not mean “likely”—I mean certain. Classical statistics summons its forces to say something about the cause of this past data, when we should be trying to say something about data we have not yet seen, and about which we still are in the dark.
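To see how little machinery the observed data require, consider a minimal sketch (the yields are invented): within the collected data, which fertilizer did better is settled by counting, not by inference.

```python
# Within the data already collected, "which did better" is arithmetic.
# The yields below are invented for illustration.
import numpy as np

yield_A = np.array([52.1, 49.8, 55.0, 51.3, 50.7])
yield_B = np.array([54.2, 56.1, 53.8, 57.0, 55.5])

# No model, no p-value: in these plots, B out-yielded A, and we know it with certainty.
print(f"Mean yield A: {yield_A.mean():.1f}")
print(f"Mean yield B: {yield_B.mean():.1f}")
print(f"B beat A in the observed data: {yield_B.mean() > yield_A.mean()}")
```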
If investigators reported statistical results in terms of explicit predictions, instead of blank announcements of culpability (tables of variable names and p-values), then we would easily be able to see whether the culprits they fingered were guilty. It’s easy to do this, too. A paper in sociology could announce, “Input the values of these variables—chosen from this and that source—and then the outcome of interest is likely to be in such and such bounds.” (The technicalities of this we can discuss later.)
Those interested in the model would quickly discover whether the predictions had any value—because they could check them on new and independent data. If the models worked, then the causes asserted by the authors would carry more weight. But if the models failed—and many would—then the authors’ theories could be rejected. Which itself is a tremendous service.
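The technicalities can wait, but a minimal sketch of the idea, with invented data and a deliberately simple normal model, looks like this: state bounds for future observations, then score those bounds against data the model has never seen.

```python
# A sketch of reporting a result as a checkable prediction. The normal
# model and all numbers here are assumptions for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
past = rng.normal(loc=100.0, scale=15.0, size=50)    # data already seen
future = rng.normal(loc=100.0, scale=15.0, size=20)  # data not yet seen

# "The outcome of interest is likely to be in such and such bounds":
# an approximate 90% predictive interval from the past data.
m, s = past.mean(), past.std(ddof=1)
scale = s * np.sqrt(1 + 1 / len(past))
lo, hi = stats.norm.ppf([0.05, 0.95], loc=m, scale=scale)
print(f"Predicted 90% bounds for a new observation: [{lo:.1f}, {hi:.1f}]")

# Anyone with new data can now check: did about 90% land inside?
coverage = np.mean((future >= lo) & (future <= hi))
print(f"Observed coverage on new data: {coverage:.0%}")
```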
Once more, we’ve reached our limit of words, and I have not done an adequate job explaining this. But stick around; the words might come.
We know whether fertilizer A was better than B: just count yield for both brands!
You attributed all the difference in yields to the differences between fertilizers, leaving out all other factors. How did you know you could do that?
Amen, Brother, to the criminal trial analogy; that’s exactly how I explain hypothesis testing to my undergrads. However, I hadn’t taken that next step to the “gang” problem; looks like a great lead-in to multiple regression.
My favorite example of the limits of classical hypothesis testing is the chi-square goodness-of-fit test, which is no such thing. The only conclusive result supported by evidence is when the null hypothesis (the null distribution) is rejected, making it a “badness” of fit test. The test doesn’t tell you what a distribution is, only one thing that it isn’t.
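A sketch of the commenter’s point, with invented counts: a small p-value lets you reject the candidate distribution, while a large one tells you only that it was not ruled out.

```python
# The chi-square test as a "badness of fit" test: it can reject a
# distribution but never certify one. Counts are invented for illustration.
import numpy as np
from scipy import stats

observed = np.array([18, 22, 25, 15, 10, 30])   # e.g. counts of die faces
expected = np.full(6, observed.sum() / 6)       # the fair-die null

chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
# Small p: "this is not a fair die" (badness of fit established).
# Large p: fairness was merely not ruled out; the test never
# says what the distribution actually is.
```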
Really cool analogy.
Rich: The previous sentence: “in the particular historical circumstances which led to their collection, we know everything there is to know.” I think the following sentence could perhaps have been more accurately worded to read, We know whether fertilizer A fared better than B… but the current wording is more fun and emphatic 🙂
Though I agree with the underlying point, some of my personal experience as a juror interfered with the analogy. For instance, in two separate trials at which I was an alternate (alternates have to sit through the entire trial but then –unless somebody gets sick or something– are required to wait in another room while the ‘real’ jury deliberates), the evidence was so strong –the guy was caught with the goods, on the scene– that about all the defense could offer was the classic “the other dude did it.” In one case the ‘other dude’ was simply a fabrication by the criminal (just like on ‘Cops’); in the other, the other guy happened, by some wild coincidence, to be traveling in the same drug-delivering car.
But in neither case did the jury consider the likelihood that ‘the other guy’ could actually have done it, or done it alone. If my fellow jurors had spent even a minute examining this question, they would have concluded, as I did, that no ‘other guy’ could have acted like or done what the putative ‘other guy’ would have had to act like and do in order to make the guy we were trying innocent.
For instance: in one trial, the guy was accused of stealing a ‘dune buggy’ (a modified VW bug, with only 2 seats). He claimed to have ‘gotten a ride’ from ‘the other guy’ (no name, nowhere to be seen) who actually stole the car. Yet the police had taken a Polaroid of the vehicle interior on the scene. Not one person, even the (assistant-assistant) DA, noted that there could not have been more than one person in that vehicle, since the passenger seat was absolutely filled with leftover Fritos wrappers and other detritus too terrible to mention. Nobody would have sat on that seat!
As it was, one of my juries (that one) hung, and the other (a serious drug case) let the guy go (after all, he had a family, and had cried on the stand). Way to protect society, guys!
So in my cases, seemingly all the defense had to do was offer any hypothesis whatever, and it served as ‘reasonable doubt’.
You can see why my experience would interfere with your analogy!
This concept should apply to new law. If a law is proposed that claims it will achieve some effect, then the law should also include a test, with metrics, to ensure within some time period that the effect is indeed produced. Failure to attain the promised effect after the agreed-upon time period should cause the law/regulations to become null and void.
“The jury consists of mathematical formulae whose duty is to report solely on the likelihood of the evidence given the hypothesis is false.”
I think you mean “…is true” there and in the following sentence.
The choice is between who wins- the defense or the prosecution. The truth is incidental.
The other guy did it.
A friend of mine had his car stolen. Some time later, it was used as a getaway vehicle in a robbery. The thief was captured, the car destroyed. The thief said the owner had lent him the car. My friend was cited as an accomplice in the robbery. After a call with the DA, the charges were dismissed.
The legal analogy for statistics / research — the researcher is investigator, prosecutor, and judge. Peer review would then serve as the ‘appeal.’ Appeals are based on the evidence sent to the appeals court. This is the ‘gold standard’? No wonder so much junk gets published.
“It is easy enough to find enough evidence to convict our poor gas classically. But if it were part of a gang of gases, the others would go free. ”
Actually, it is a gang of gases. Water vapor is the major (so-called) greenhouse gas, but it is ignored by the models.
Meh.
I think you missed the point here, William. [As an aside, I’ve decided that from here on out, when I agree with you I’ll say something like “knock the rock, Matt”, and when I disagree with you, I’ll express my disappointment and call you “William”.] And once again, I think you’ve sold classical statistics short.
Deming was fond of saying, “The only useful function of a statistician is to make predictions and thus to provide a basis for action.” For all his quirks and unconventionalities, Deming was very much a classical statistician.
And every competent statistical practitioner I know (and even some marginally competent ones) knows that the purpose of the fertilizer study is NOT to pick a winner in this field, in this season, with this crop, but rather to determine which fertilizer should be used in the future, under what conditions, and what results to expect from it. There may be hacks advocating for the former, but any decent classical statistician knows that the purpose of the study is the latter.
Regarding your fertilizer example, I believe this is where the Bayesian approach is more natural. The emphasis in Bayesian statistics is on the uncertainty in the parameters, not the data. Said another way, Bayesians condition on what is known — the data — rather than what is unknown — the parameters.
Still, this is minor compared to your larger point that the test of any analysis — Bayesian or frequentist — is how well it predicts new data. I’d have more confidence in a dubious method that routinely and accurately predicts the future than in a rigorous analysis resting on unexamined modeling assumptions.
“Innocent until proven guilty” may be a myth, but in classical statistical hypothesis testing, before any conclusion is made, the data evidence is evaluated under the assumption that the null (innocent) is true.
I disagree. The purposes of statistical modeling, classical or Bayesian, are to understand the uncertainty in data and the relationships among variables, to make a decision or prediction about what we haven’t yet seen, and so on. Just because some bad statistical practitioners do certain things, it doesn’t mean that “classical statistics” does those things.
Concerning your fertilizer explanation, I would say that in practice we are trying to estimate the size of the effect of fertilizer B, and that the hypothesis is that the effect is 0. The conclusion would be valid under the experimental circumstances.
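To illustrate that framing, here is a minimal sketch (reusing the invented yields from above): estimate the size of B’s effect with an interval, rather than only testing whether it is zero.

```python
# Estimating the size of the effect, not just testing "effect = 0".
# Yields are invented; the pooled degrees of freedom are a rough choice
# for this sketch (Welch's correction would be an alternative).
import numpy as np
from scipy import stats

yield_A = np.array([52.1, 49.8, 55.0, 51.3, 50.7])
yield_B = np.array([54.2, 56.1, 53.8, 57.0, 55.5])

diff = yield_B.mean() - yield_A.mean()
se = np.sqrt(yield_A.var(ddof=1) / len(yield_A)
             + yield_B.var(ddof=1) / len(yield_B))
df = len(yield_A) + len(yield_B) - 2
lo, hi = diff + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, df) * se
print(f"Estimated effect of B over A: {diff:.2f} (95% CI: {lo:.2f} to {hi:.2f})")
```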
RE: Cross Validation is not an independent test.
https://www.wmbriggs.com/blog/?p=3272&cpage=1#comment-32646
It’s not a test, technically; it’s a model selection method (cv). It’s a test in a way, because one can use cv to decide whether adding an extra variable to the model yields better predictions.
I think I know what you mean by independent test. To fit your point of using new (and independent) data, one can use the new or most recent data as the validation set. Let’s note that how to partition the data in cv is an unprincipled choice, and there are many other issues to be considered. For example, how many “most recent” data points would one need to conclusively determine whether the model predicts well?
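One way to picture the distinction, as a sketch with invented data: ordinary k-fold CV re-partitions the same old data, while holding out the most recent observations at least mimics a test on data the model has not seen.

```python
# Cross-validation as model selection vs. a "most recent data" holdout.
# Data, model, and split sizes are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=1.0, size=100)

model = LinearRegression()

# Standard 5-fold CV: a (somewhat arbitrary) partition of the same old data.
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold CV mean R^2: {cv_scores.mean():.2f}")

# Temporal holdout: fit on the first 80 points, test on the "newest" 20.
model.fit(X[:80], y[:80])
print(f"R^2 on the held-out most recent data: {model.score(X[80:], y[80:]):.2f}")
```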
Adam H: ah yes, it does say that. I guess I’ve just got used to editing out claims like, “we know all there is to know”.
For another good presentation of the analogy between legal and statistical tests, see
http://www.intuitor.com/statistics/T1T2Errors.html
I disagree with your claim
“That is what classical statistics—both in its frequentist and Bayesian incarnations—is like.”
Your description
“The jury consists of mathematical formulae whose duty is to report solely on the likelihood of the evidence given the hypothesis is false [the accused were innocent]. And since the evidence nearly always appears rare assuming the hypothesis is false [the accused were innocent], most trials result in a conviction—meaning, the investigator’s viewpoint is confirmed.”
is valid for frequentist statistics only.
In the Bayesian paradigm one has to compute P(evidence) = P(evidence|the accused is guilty)P(the accused is guilty) + P(evidence|the accused is innocent)P(the accused is innocent), so, according to your words, the “jury consists of mathematical formulae whose duty is to report the likelihood of the evidence given *both* the hypotheses [the accused is innocent and the accused is not innocent]”, *explicitly*.
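With invented numbers, the computation the commenter describes looks like this; the prior and both likelihoods are assumptions for illustration only.

```python
# The Bayesian "jury" weighs the evidence under BOTH hypotheses explicitly.
# The prior and likelihoods below are invented for illustration.
p_guilty = 0.5                      # prior Pr(guilty)
p_innocent = 1 - p_guilty
p_ev_given_guilty = 0.8             # Pr(evidence | guilty), assumed
p_ev_given_innocent = 0.1           # Pr(evidence | innocent), assumed

# Total probability of the evidence, mixing over both hypotheses:
p_evidence = p_ev_given_guilty * p_guilty + p_ev_given_innocent * p_innocent

# Bayes' theorem then answers the question the jury actually cares about:
p_guilty_given_ev = p_ev_given_guilty * p_guilty / p_evidence
print(f"Pr(evidence) = {p_evidence:.3f}")
print(f"Pr(guilty | evidence) = {p_guilty_given_ev:.3f}")
```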