# The Truth of Things: When Probability & Statistics Cannot Be Used

*I am always struggling with (my limited ability of) finding ways to describe the philosophy behind logical probability, especially to people who have a difficult time unlearning classical frequentist theory. This post is more for me to test a sketch of an explanation than to be a complete explication of that theory. I am writing to those who already know statistics.*

If a theory—or hypothesis, argument, whatever—cannot be deduced it remains uncertain. Formally or informally, probability is used to quantify this uncertainty.

Consider *your* trial for murder. Your guilt must be established in the collective mind of the jury “beyond reasonable doubt.” That phrase acknowledges that the *certainty* of your guilt or innocence is unattainable—*in their minds*—but its probability can still be established.

Incidentally, whether the phrase “beyond reasonable doubt” is a historical accident or the result of careful logical reasoning is irrelevant. Its common meaning is enough for us.

Through an obviously informal process, each jury member begins the trial with an idea of your guilt, which is modified as each new piece of evidence arises, and through discussions with other jury members. There are mathematical ways to model this process, but at best these models are crude idealizations.

Bayes’s formula can illustrate how jurors update their evidential probabilities, but since nobody—even the jurors—knows how to verbalize how each piece of evidence relates to the probability of guilt, these models aren’t much help.
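To make the idealization concrete, here is a minimal sketch of one juror updating by Bayes’s formula. Everything numeric in it is invented for illustration: the starting belief of 0.10 and the three pairs of evidence probabilities are assumptions, and the function name `update` is hypothetical. It shows only the mechanics, and it quietly assumes the pieces of evidence are independent given guilt or innocence, which is exactly the kind of thing no juror can verbalize.

```python
def update(prior, p_evidence_if_guilty, p_evidence_if_innocent):
    """Return P(guilty | evidence) from P(guilty) using Bayes's formula."""
    numerator = p_evidence_if_guilty * prior
    denominator = numerator + p_evidence_if_innocent * (1.0 - prior)
    return numerator / denominator


# Hypothetical juror: starts the trial at 0.10, then hears three pieces of evidence.
# Each pair is (P(evidence | guilty), P(evidence | innocent)); all values are invented.
evidence = [(0.8, 0.4), (0.6, 0.3), (0.9, 0.3)]

belief = 0.10
history = [belief]
for p_guilty, p_innocent in evidence:
    belief = update(belief, p_guilty, p_innocent)
    history.append(belief)

# Belief after each piece of evidence, rounded for display.
print([round(b, 3) for b in history])
```

With these made-up inputs the juror’s belief climbs from 0.10 to roughly 0.57. Change any of the invented numbers and the answer changes with it, which is the point: the arithmetic is easy, the inputs are not.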

A probability model can be written in the form of a statement: “Given this background evidence, my uncertainty in this *observable* is quantified by this equation.”

The background evidence is crucial: it is *always* there, usually (unfortunately!) tacitly. All statements made, no matter how far downstream in analysis, are conditional on this evidence.

You possess different background evidence than does the jury. Your evidence allows you to state “I did it” or “I did not do it.” In other words, this special case allows you—and only you—to say, “Given *my* experience, the probability that I did the deed is zero”; or one, as the case might be.

Deduction, then, is a trivial form of probability model.

The observable in this case is the crime: in statement form, “You committed the murder.” The truth of that observable is not knowable with certainty to the jurors given any probability model.

The probability model itself is where the confusion comes in. It cannot exist for this, or for any, unique situation.

A classical model a statistician might incorrectly use to analyze the jury’s collective mind is “Their uncertainty in your guilt is quantified by a Bernoulli distribution.” This model has a simple mathematical form: a single parameter θ, where 0 < θ < 1. Notice that those bounds are strict.

People mistakenly call the parameter θ of the Bernoulli “the probability” (of your guilt). It is not, *unless* the parameter θ has *been deduced* and is equal to a precise number (not a range). If we do not know the value of θ, then it is just a parameter. In itself, it means nothing.

The probability (of your guilt) can be found by accounting for the uncertainty in the parameter. This is accomplished by integrating out the uncertainty in θ—essentially, the probability (of guilt) is a weighted average of possible values of θ given the evidence and background information.
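A small sketch of what “integrating out” means, under assumptions that are mine and not part of the argument: a flat Beta(1, 1) weighting on θ to start, and invented data of 7 guilty verdicts in 10 trials of some exactly defined kind. The function names are hypothetical. The first function computes the weighted average of θ by brute-force numerical integration; the second is the known closed form for the Beta-Bernoulli case, (a + k) / (a + b + n).

```python
def predictive_probability(k, n, a=1.0, b=1.0, grid=20_000):
    """Approximate the integral of theta * p(theta | data) by the midpoint rule.

    k successes in n observations; Beta(a, b) weighting on theta (an assumption).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        # Unnormalized posterior density of theta given the data.
        weight = theta ** (a - 1 + k) * (1 - theta) ** (b - 1 + n - k)
        weighted_sum += theta * weight
        total_weight += weight
    return weighted_sum / total_weight


def closed_form(k, n, a=1.0, b=1.0):
    """Exact predictive probability for the Beta-Bernoulli case."""
    return (a + k) / (a + b + n)


# Invented data: 7 guilty verdicts observed in 10 trials of a defined kind.
print(round(predictive_probability(7, 10), 4))
print(round(closed_form(7, 10), 4))
```

Both lines give about 0.6667. Notice that θ never appears in the answer: it is averaged away, and what remains is a probability for the next observable.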

But just think: we do not need θ. We have a unique situation—your trial!—and the jury, not you, ponder your probability of guilt; they certainly do not invoke an unobservable parameter. The trial evidence modifies actual probabilities, not parameters.

Now, if we wanted to analyze *specific* kinds of trial*s*—note the plural: that “s” changes all—then and only then, and as long as we can be exact in the kinds of trials we mean, we can model trial outcomes.

This model is useless for outcomes of trials we have already observed. And why? Because the evidence we have is the outcomes of those trials—whose outcomes we know! Silly to point out, right? But surprisingly, this easy fact, and its immediate consequences, is often forgotten.

Another way to state this: We only need to model uncertainty for events which are uncertain. We can model your trial, but only assuming it is part of a set of (finite!) trials, the nature of which we have defined. The nature of the set-model tells us little, though, about your trial with your unique evidence.

The key is that there does not exist in the universe a unique definition for kinds of trials. We have to specify the definition in all its particulars. This, of course, becomes that background information we started with. Under which definition—*exactly!*—does your trial lie?

It is the uncertainty of ascriptions of guilt in those future trials that is of interest, and not of unobservable parameters.

Oh, remind me to tell you how mathematical notation commonly used in probability & statistics interferes with clear thinking.

It appears that you view the difficulty of applying statistics and probability to the trial or any other complex event in the same way that Fisher came to view it. One cannot use probability to establish the truth or falsity of a complex hypothesis because such is not like a simple Bernoulli trial. Rather, prior information and the consequences of being wrong enter into the assessment.

About ten years ago I found, in the University of Pennsylvania bookstore, a book by Richard Royall entitled *Statistical Evidence*. This monograph not only clarified the thinking behind classical and Bayesian statistical methods for me, but gave me a new tool through which to examine data: Likelihood. I worked backward from Royall to A.W.F. Edwards, and at some point I’ll regress so far as Laplace I suppose.

One of the most insightful things in Royall’s work was that the various statistical methods answer different questions. Classical tests of significance answer the question “What should I do?” Bayes theorem answers “What should I believe?” and Likelihood answers “What does the evidence say?”

This is not a sophisticated idea, but I’ve decided that I need all these tools, and more, to work my way through complicated problems.

Good morning, Mr. Briggs,

(anyone old enough to remember the early Mission Impossible show)

I read your blog with interest. I’ve struggled with statistical issues many times. I remember when I was in school I had to take the course “Advanced Physics Lab”. The textbook for the course was one on statistical methods. I elected to determine the velocity of second sound in He II. This involved measuring the propagation velocity of entropy waves in superfluid helium. I was harangued several times for my statistical analysis of my data. If I remember correctly, the value was about 22 m/s.

Later in my career I participated in determining the installation errors of guided missiles on aircraft. We had only a few aircraft to measure. After much hand wringing, I observed that we had made an error by looking at the data as pitch, roll and yaw. It turned out we were overestimating the random error by not seeing a tooling bias. By looking at caster, camber and toe out, it was obvious that there were symmetrical tooling errors that were not random. The left hand missile stations were toed out to the left, the right hand stations toed out to the right. These could be corrected, reducing the three sigma error quite a bit. It turned out not to be very important; the errors were all small anyway.

Matt:

I am not sure I understand what you are trying to explain.

There are a limited set of situations that are defined so tightly that context is by definition not relevant. For this type of situation frequentists can apply their tools. For a vast array of other situations where efforts to discount context have no basis and where more than one distinct outcome is possible, the significance of the context in determining the actual outcome is important.

In your murder trial example, the prosecution has a model for presenting the context – means, motive and opportunity. This model is by no means perfect/complete/relevant but it serves to organize the context and evidence thereof the prosecution wants the jury to consider. The defence presumably has a similar model plus the issue of additional suspects, whether a murder had actually been committed, mitigating circumstances, etc.

Individual jury members may adopt either, both or neither of the explicitly or implicitly proffered models and they may accept or reject the evidence based on their personal assessment of its relevance or credibility that supports or contradicts the relevance of part of the jury member’s tacit model (which may or may not reflect the proffered models.).

The inherent complexities of such a decision process mean that we should not be surprised that juries get things wrong. (Somebody later admits and provides proof positive that they did it.) Lawyers know that juries are unpredictable.

Is this on point?

Bernie, John Galt,

Will respond later today…

Years ago I attended some A.W.F. Edwards lectures delivered to fresher Natural Scientists at Cambridge. Clarity thy name is Edwards! It reinforced my diagnosis that a common difficulty that students face with elementary probability and statistics is that they tend to be subjected to muddled posing of problems by their lecturers – I know I was. Once I’d realised that I had to intuit what the lecturer “really meant” from what he said, I got into the habit of rewriting the problem so that it made sense, and then solving it. I tried, however, to be more diplomatic than that when answering examination questions. Often I succeeded.

dearieme:

Cambridge. Which college? I was at Queens’ – 1969 to 1972. I didn’t hear Edwards. The econometrics lecturers were incredibly – how shall I put it – opaque or I was dimmer than I thought. Come to think about it – the only really stimulating lectures I attended were by George Steiner. The economics lecturers – Kaldor, Kahn, Robinson, Galbraith – seemed to be caught in some kind of time warp.

@ Bernie, I went to Edwards’ lectures because I was supervising natsci maths, and I thought I’d go to the lectures to learn where the arguments started from, what was held to constitute a proof, what notations were used, and so on. Some of the lecturing wasn’t much cop – an account of Green’s Functions one Saturday morning was such rubbish that I angrily stalked to my lab, sat down and wrote the lecture that should have been given, photocopied it, cycled to the lecturer’s college and put it in his pigeonhole. The next lecture, on the Tuesday morning, started with “I’ve had second thoughts about Green’s Functions” after which he delivered the lecture I’d written for him. Prat! I also remember the first couple of lectures of the year. “Have you met vector addition?” Of course they bloody had, but he plodded methodically through it. “Have you met dot products?” Ditto. “Cross products?” Ditto. Then “Have you met vector triple products?” A handful of the hundreds present admitted that they hadn’t, so the bugger accelerated!!!!

Anyway, Edwards shone against that background. And he did reach Likelihood.

dearieme:

You have great recall. Green’s Functions are but dim and ill-comprehended memories for me. Saturday morning lectures also tended to be anathema to most Econ students. You must not have spent much time at the Eagle and the Fort St. George!

Bernie,

Sorry for slow response.

I, and other (subjective and objective) Bayesians, say that frequentism is irrecoverably flawed. One reason is that philosophy’s disregard for context, or its failure to realize that context, i.e. information, is all.

Instead, in frequentism, “events” are what matters. An “event” *must* be envisioned to be a measure embedded in an infinite (not very large, but infinite) sequence of “similar” events.

Take the stock coin flip example. In what sequence is that flip embedded? Before answering, don’t forget that a coin flip is predictable *conditional* on the knowledge of the coin’s and physical environment’s characteristics, etc. In other words, the outcome is not random (unknown) when conditioned on one set of information, but it can be when conditioned on another.

But still, what kind of flips are we talking about? Those which occur on Tuesdays? By right-handed flippers? When it is 70°F outside? Well, these are all context. In logical probability (LP) they are either explicitly used as conditioning information or they are not. There is no mechanism in frequentism to condition on relevant or irrelevant information. What matters, once more, is the sequence. But defining that in terms of real objects is an impossibility. The least problem is the requirement of infinite numbers of events, which are nowhere physically possible. The most damning problem is *uniquely* defining a sequence.

The jury trial example is not meant to offer insights into whether or how likely juries err. It’s meant to show that unique events cannot be embedded in infinite sequences. Probabilities of unique events in LP can be had, but *models* cannot. In LP, it is possible, using information thought relevant (it matters not whether it actually is; another strong point), to define a finite *set* of events. That is, sets of events can be modeled. (I don’t necessarily mean “set” in the usual mathematical sense; I just mean finite.)

I have often used the presidential election metaphor. What is the probability that Hillary Clinton gains the Democrat nomination for president next election? In LP, the best we might do is bound this (less than 0.5), but it cannot be modeled at all in frequentist theory. What possible *unique* sequence can that event be embedded in? All women running for office? All rivals to sitting Democrat presidents? All people whose last names start with “C” running for office?

For the unique event of Mrs Clinton gaining nomination, no model is possible, even in LP, unless you can extend that event to be part of a set, or series.

And we haven’t even begun to talk of parameters versus observables!

Matt:

Let me see if I can restate your argument in my own words.

Frequentists presumably argue that there are a large number of interesting events that are not too singular or unique, i.e., they are similar to other events. For example, one toss of a coin is similar to other tosses of the same coin or one murder trial with a given set of facts is similar to other murder trials with the same set of facts.

There is obviously something different in the two examples – the coin toss and the murder trial. The latter is a trickier empirical assertion and presumes a model of murder trials that specifies all relevant facts – where relevance is defined in terms of having a direct or indirect effect on the outcome. The former also assumes a model of coin tossing, but I would assert that the model contains far fewer and more controllable set of relevant facts.

That, under defined conditions, one can usefully apply frequentist tools for a coin toss does not mean that the same tools can be applied to all coin tosses or to any murder trials. The contexts have to be the same.

This seems to boil down to making assertions about the completeness of the model and the measurability of the relevant factors.

But then this line of reasoning creates a possible conundrum: the more complete the model of an event the more one should be able to predict the actual outcome of the event with a given set of facts with certainty, i.e., there is a Law. Not to be able to do so, means that the model is incomplete or false and, therefore, one cannot assert that the events are similar.

For example, I see no logical reason why one could not predict with 100% accuracy whether a coin would end up heads or tails given a correct model of coin tossing and a knowledge of the values of the relevant factors, i.e., a Coin Toss Law. If one argues that there are a set of facts that influence the outcome but cannot be controlled or known prior to the coin toss and yet have no predictable impact on the outcome, then we are involved in a logical contradiction.

The same logically should be true for murder trials.

So frequentists seem to argue that what they do not know does not matter.

Am I making progress?

Bernie,

Not too bad. Your conclusion is partially right.

Except that there is no difference between a coin flip and a murder trial, in the matter of unique events. For coin flips, I think we fool ourselves with their similarity and by repetition of the example. It’s easy to imagine the differences between trials, and not so easy to do so for coin flips. That is, we readily mentally embed coin flips—coins of any kind under any circumstance—into imagined infinite sequences. But for trials this is more difficult.

However, the situations are the same. For *this* coin that I will flip now and only once (and then destroy in a vice or on a grinder), LP can offer a probability, but only based on specifically articulated evidence. And what’s that? Well, our past experiences, etc. This evidence may be sufficient to let LP specify a precise number (say 0.5), or it might be more ambiguous and only allow a range.

I think I should skip early introduction of coin flip examples, and stick to situations like Mrs Clinton’s nomination, an example which is far more difficult to imagine belonging to some *infinite* sequence. Once more, “very large” won’t do: it must be infinite or nothing for the mathematics of frequentism to obtain their interpretation.

Thus far, all I say is agreed to by all objective Bayesians and students of LP. I differ slightly in my theory of measurement, by which all observables are finite in extent and resolution. De Finetti had ideas similar to this. (The logical “theory of measurement” differs from probabilistic *measure theory*, which is a perfectly fine branch of mathematics; however, just because something is a branch of mathematics does *not* mean that it has relevance to the world.)

Matt:

Conceptually, I agree that there is no difference between a coin toss and a murder trial. I was thinking in terms of simple and complex experiments. Both types of experiments require the systematic control of factors that may impact the outcomes. I would simply assert that the more complex the experiment, the harder it is to know and to control moderating variables. Hence it is harder to control or standardize a murder trial.

Avoiding infinite sequences is double-edged. It seems to me that replication has both definitional and conceptual implications. For example, how likely is it that a woman would become President by 2020? This can be seen as a series of experiments or trials where the outcome of an earlier trial tells us whether we are paying sufficient attention to all the right factors, i.e., our model changes with each experimental result. Of course, these trials are also not independent.

Your last point about “relevance to the world” introduces the notion of utility. This suggests another question to pose: for what types of real world questions are the frequentists’ assumptions benign? If the range of real world questions is very narrow, then frequentists are more of a distraction than they are worth.

I was convicted by a magistrate (no jury) and spent three months in prison. The magistrate said that I was guilty “on the balance of probability”. Witnesses (3) for the prosecution said in turn that they were 90%, 80% and 70% certain that I was the man wearing the black hat on the night in question. (These people knew me and hated me. Or at least my existence threatened their sense of status quo.)

I haven’t studied probability since high school. I was allowed to speak before I was sentenced. I told him that the probabilities testified to by the supposed witnesses, multiplied together (0.9 × 0.8 × 0.7), amounted to 50.4%, so that in some bizarre way he might be correct. But of course I’d already told him that I was not wearing a black hat and was not where the witnesses had perjured themselves to say. In short, my estimate of probability was zero.

Probability of an event after the fact (or in my case non event) doesn’t enter into it. If I, before the fact, select the winning numbers of a lotto draw, I’ll be pretty pissed if they refuse to pay on the grounds of the “balance of probability” that I could choose the winning numbers.