Uncertainty & Probability Theory: The Logic of Science
Video
Links:
Bitchute (often a day or so behind, for whatever reason)
The Infamous Coin Flipping Machine!
HOMEWORK: We did Pr(M_6|E) = 1/6, and we did Pr(M_6|E + “fair”) = 1/6 (a circular argument!). Now give us Pr(M_6|E + “unfair”) = ?
Lecture
The homework used as evidence:
E = M_1 or M_2 or M_3 or … or M_n (n < infinity)
One and only one M_i must obtain
n = 6
Pr(M_6|E) = 1/6. Which we conclude from the statistical syllogism. But that syllogism is not beloved everywhere, and so various proofs of it, operating from simpler premises, have been offered. Two such approaches, by Jaynes and Diaconis, I think are failures, because they end up being circular arguments. Stove is best, which we’ll do next time. This time we go over Jaynes and mention Diaconis.
Meanwhile, adding “fair”, as we’ll see in the excerpt below, does nothing for us. There are no such things as “fair” dies, or coins, or anything. Adding “fair” to a list of premises assumes what it sets out to prove; that is, that each outcome is equally likely. You do NOT get to add information to this E. Somebody commented that one side of the die may be shaved, or whatever. That applies to that die. This die does not have that information: that is, we do not have it. We cannot willy-nilly add to E. We take it as it is. If we do add information to E, then yes indeed, we change the probability.
Here, again, is the one single lesson of this entire class: change the evidence, change the probability.
All we have done so far mathematically, apart from the epistemology, is prove probability has a certain mathematical form (Bayes), and that we can give it two numbers: 0, for locally or conditionally false, and 1, for locally or conditionally true. We want, however, to see if other numbers work, too. Above, Pr(M_6|E) = 1/6. Which, again, we got from the statistical syllogism. Let’s now see how we try to prove the validity of the SS.
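The arithmetic of the statistical syllogism above can be sketched in a few lines of Python. This is my own minimal illustration, not part of the original lecture: it simply encodes the rule that, given only E (one and only one of n possibilities obtains, and nothing more), each possibility gets probability 1/n.

```python
from fractions import Fraction

# E: one and only one of M_1, ..., M_6 obtains; that is ALL we know.
possibilities = [f"M_{i}" for i in range(1, 7)]

def pr(outcome, evidence):
    """Statistical syllogism: with evidence listing n mutually
    exclusive, exhaustive possibilities and nothing more, each
    possibility gets probability 1/n; anything not listed gets 0."""
    if outcome in evidence:
        return Fraction(1, len(evidence))
    return Fraction(0)

print(pr("M_6", possibilities))  # Fraction(1, 6)
```

Change the evidence (say, drop a possibility from the list) and the probability changes with it, which is the one lesson of the class in executable form.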
This is an excerpt from Chapter 4 of Uncertainty. All the references have been removed.
Fairness
Any premises about “fairness” are superfluous to probability, which is to say, to the epistemology of the situation, though they might be important to the ontology. Saying that a Metalunan die is “fair”, if it means anything to the epistemology, is no more than a restatement that each side is “equally likely”, a conclusion we had already reached with the proportional syllogism. That is, given the premise that a device is “fair”, the probability of each outcome is uniform; a circular definition. It is like saying, “Given the probability of X is $p$, the probability of X is $p$“, which is tautological.
But to the ontology, to call a Metalunan—or any Earthly—die “fair”, what else can it mean but to claim that each side is perfectly symmetric, even down to the quantum level (or whatever, if anything, is below that)? To call an object fair, symmetric, balanced, equally weighted or whatever is to say that no inspection would reveal any conceivable asymmetry. What a remarkable claim! This pristine state of proportionality, which I suppose might exist in some fanciful physics experiment of the future, is impossible in practice to verify. How do you know, except by great expense and effort, whether any device is symmetric across all its constituents? How can you ensure any die or coin toss is “symmetric” or “fair”? How can you even define what that means? How can a toss be “fair” except that it is designed to produce equal numbers of heads and tails, or equal numbers of sides, etc.? The answer is obvious; and contra others.
Now it is a separate question whether the manner in which any particular, necessarily physically real, device produces more or less uniform outcomes. There can be no real tosses of a Metalunan interocitor, but that does not stop us from learning its probabilities. But for a real device, we have to do a lot more thinking. To say a device is “fair” says nothing about the mechanism of how that device will register a state. In a real die toss, even if we claim the die itself is “fair”, i.e. perfectly symmetric, we have said nothing about how it will be tossed. These are ontological matters. There will be a gravitational field, perhaps varying. There will be air at a certain density, temperature, and moisture content through which the die flies. The die will leave some person’s hand, perhaps coated with traces of sweat and skin, with a certain spin and momentum; it will have begun its position in the hand in a certain orientation. It will hit the floor or table or whatever at a certain angle, and the floor itself will be more or less elastic and will give some level of frictional resistance. And this does not exhaust the characteristics of the physical environment of the real toss. Indeed, the number of things which might influence the outcome is very large (but not infinite). Experience tells us that most of these things will have scant or negligible effect, but perhaps, for this toss, something happens which gives more weight to a previously unconsidered dimension. Who knows? Let’s have no more talk about tossing “fair” dice.
We can extend fair to include not only symmetries of the device but also the environment where the device will be “activated”, but as you can now see, this is to say a lot. To the epistemology, nothing changes: we still have a circular definition of the probability. But to the ontology, it is everything. Perhaps, as in highly controlled experiments, we will have a lot of evidence about the physical set up. But often we do not, especially when investigating the behavior of people, who do not act as predictably as dice. It is boastful to say even of a simple coin or die toss that the environment is “fair.” But we do not have anything like that level of omniscience when it comes to people. Of course, experience over a great many actual dice tosses shows us which environments produce uniform outcomes. Casinos rely on this! That experience feeds into our premises and is then used to deduce probabilities.
So what do we say about the chance this real object comes up this or that number? Well, that is the subject of modeling, which we will do later. A brief summary: we begin with whatever clear evidence (premises) we have, judging that some characteristics are important and others ignorable, and then move forward to either make predictions or to experimentation, and after experimentation we produce more predictions.
Details
The statistical syllogism cannot be escaped, and neither can the symmetry of individual constants from which the syllogism is derived. Yet some authors have attempted escapes. The most noteworthy are Jaynes, Diaconis, and Stove. All were interested in assigning equi-probability to events like die tosses. But since assigning equi-probability, or uniformity, has historically been seen as dogmatic, each author tried to derive the assignment of equi-probability from what they saw as different, less dogmatic premises. These attempts are ultimately failures, as I demonstrate below. Stove’s comes closest, and indeed has the answer hidden in his effort. This section is necessarily mathematical and could be skipped by those already convinced of the statistical syllogism’s utility; though all should at least skim Stove’s effort. Our first notion of “parameters” arises in these proofs, too.
The following arguments start with the definite knowledge $E$ that $M$ is contingent and can be decomposed into a finite number of possibilities (like sides in coin flips or states of interocitors, or whatever) $M_1, M_2,\dots,M_n$, $n<\infty$.
Jaynes gives a permutation argument in an attempt to deduce the statistical syllogism (he does not call it that), but which relies on an unacknowledged assumption. Introduce evidence $E$ which states that either $M_1$ or $M_2$ or etc. $M_n$ can be true, but that only one of them can be true. In the case where $M$ is a coin flip, the result can be either $M_1$=”head” or $M_2$=”tail”. Thus, $\Pr(M_1\vee M_2\vee\dots\vee M_n|E)=\sum_{i=1}^n \Pr(M_i|E)=1$. At this point, there is no assertion that each of these probabilities is equal, only that the sum is 1. We want to assign the probabilities $\Pr(M_i|E)$ for $i=1\dots n$. The set of possibilities is $M=\{M_1,M_2,M_3,\dots,M_n\}$. Let $\pi$ be a permutation on the set $\{1,2\}$. Let $M'=\{M_{\pi(1)},M_{\pi(2)},M_3,\dots,M_n\}$. That is, the sets $M$ and $M'$ are the same except the first two indexes have been swapped in $M'$. The evidence $E$ is fixed. Therefore, it must be that $\Pr(M_1|E)_M=\Pr(M_{\pi(2)}|E)_{M'}$ and $\Pr(M_2|E)_M=\Pr(M_{\pi(1)}|E)_{M'}$. Jaynes then makes a crucial step, which is to add to $E$ evidence which states that the total evidence is “indifferent” to $M_1$ and $M_2$, i.e.
if it [the evidence] says something about one, it says the same thing about the other, and so it contains nothing that would give [us] any reason to prefer one over the other (p. 39, emphasis mine).
Accepting this for the moment, $E$ then says that our state of knowledge about $M$ or $M'$ is equivalent, including the order of the indexes. Thus, (note the change in indexes) $\Pr(M_1|E)_M=\Pr(M_{\pi(1)}|E)_{M'}$, $\Pr(M_2|E)_M=\Pr(M_{\pi(2)}|E)_{M'}$ and $\Pr(M_j|E)_M=\Pr(M_j|E)_{M'}, j=3,\dots,n$. Which implies $\Pr(M_1|E)_M = \Pr(M_2|E)_M$: that is to say, equi-probable or uniform prior assignment.
We seem to have proven equi-probability. And this argument is fine if what Jaynes says in the quotation holds. But we can see in it the presence of two tell-tale phrases, “indifferent” and “no reason”, which are used, and are needed, to justify the final step. This is just begging the question all over again, for how else could the evidence $E$ be “indifferent”? It cannot mean non-probative or irrelevant. That is, Jaynes has assumed uniform probability (and thus, the statistical syllogism) as part of the evidence $E$, which is what he set out to prove.
De Finetti has a famous “exchangeability” theorem which states that if an “infinite series” of “variables” exists and the order in which the variables arise is not probative, then a “prior” probability of the states exists. The form of the prior is not given by the theorem; that is, how the probabilities are assigned is not stated by the theorem; we know only that it exists. Diaconis investigated finite exchangeability in an attempt to see how assignment might arise.
This argument is more mathematically complicated. De Finetti’s theorem, which can be found in many places, states that in an infinite sequence of exchangeable 0-1 variables there is hidden, if you like, a formal (induced) representation as a probability model with a unique measure of the probability model’s parameters. The key, of course, is that the sequence must be infinite. Diaconis, after showing that some finite exchangeable sequences fail to be represented as probability models with unique measures, goes on to offer a proof for certain other finite exchangeable sequences that do. The word “hidden” was apropos, for in exchangeability arises the concept of parameters (in parameterized probability models), a concept which relies on the existence of infinite sequences. I investigate this important topic in Chapter on probability models.
Here, I follow Diaconis (1977) as closely as possible, almost copying the theorem as it stands but using my notation; interested readers should consult the original if they desire the details, particularly since the original uses graphical notions which I ignore. Let $\mathcal{P}_n$ represent all probabilities on $M=\prod_{i=1}^n M_i$ where $M_i=\{0,1\}, \forall i$, where $M$ is a finite ($n<\infty$) sequence of 0-1 variables. $\mathcal{P}_n$ may be thought of as the probability models on $M$: it may be written in coordinate form by $p=(p_0,p_1,\dots,p_{2^n-1})$ where $p_j$ represents the outcome $j$, with $0\le j< 2^n$ written in its binary expansion with $n$ binary digits. Diaconis gives the example: if $n=3$, $j=1$ refers to the point $001$. Let $M(m,n)$ be the set of $j$ with exactly $m$ ones. The number of elements in $M(m,n)$ is ${n \choose m}$: this much is true regardless of what the actual probabilities of any outcomes are.
Now, let $\mathcal{E}_n$ be the exchangeable measures in $\mathcal{P}_n$: $\mathcal{E}_n$ will take the place of the measure on $\mathcal{P}_n$’s “parameters”. The theorem is stated thus: $\mathcal{E}_n$ has $n+1$ points $e_0,e_1,\dots,e_n$, where $e_m$ is the measure putting mass $1/{n \choose m}$ at each of the coordinates $j\in M(m,n)$ and mass 0 at the other coordinates. (Uniqueness of each point in $\mathcal{E}_n$ is also covered, but not of interest here.) How is this theorem proved?
$e_m$ represents the measure of drawing $n$ balls without replacement from an urn with $n$ balls, $m$ of which are marked 1, and $n-m$ marked 0, so each $e_m$ is exchangeable. If $e_m$ can be written as a proper mixture of other exchangeable points, it has the form $e_m=pg_1+(1-p)g_0$, where $0<p<1$: also, $g_1, g_0$ must assign 0 probability to the outcomes which $e_m$ assigns 0 probability. But because of exchangeability of the coordinates $j\in M(m,n)$, $g_1$ and $g_0$ must be equal. And because the probabilities over $j\in M(m,n)$ must sum to 1—and here is the big assumption used in the proof—the mass of each coordinate is $1/{n \choose m}$.
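The measure $e_m$ can be written out concretely. The sketch below (my own illustration of the definition in the theorem, with $n=4$, $m=2$ chosen arbitrarily) enumerates the binary $n$-sequences with exactly $m$ ones and puts mass $1/{n \choose m}$ on each, mass 0 elsewhere.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, m = 4, 2  # illustrative values, not from the theorem

# The coordinates in M(m, n): binary n-sequences with exactly m ones.
points = [seq for seq in product((0, 1), repeat=n) if sum(seq) == m]
assert len(points) == comb(n, m)  # C(4, 2) = 6 such sequences

# e_m puts mass 1/C(n, m) on each of them (0 on everything else),
# matching the urn model: n draws without replacement, m balls marked 1.
e_m = {seq: Fraction(1, comb(n, m)) for seq in points}
assert sum(e_m.values()) == 1
```

Exchangeability is visible directly: permuting the positions of any sequence in `points` yields another sequence in `points` with the same mass.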
Clearly, the intuition that gave rise to these particular masses asserted in the proof came from the fact that the number of elements in $M(m,n)$ is ${n \choose m}$. However, other masses work too, as long as they sum to one and assign a probability of 0 to coordinates not in $M(m,n)$. For example, for $j\in M(m,n)$ assign $1/(2m)$ for the first $m$ coordinates and $1/(2({n \choose m}-m))$ to the remaining ${n \choose m}-m$ coordinates.
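The arithmetic of the alternative assignment can be checked. This sketch is my own, again using the illustrative values $n=4$, $m=2$: the first $m$ coordinates of $M(m,n)$ get $1/(2m)$ and the remaining ${n \choose m}-m$ get $1/(2({n \choose m}-m))$, and the total is still 1.

```python
from fractions import Fraction
from math import comb

n, m = 4, 2                  # illustrative values
size = comb(n, m)            # C(4, 2) = 6 coordinates in M(2, 4)

# The alternative (non-uniform) assignment from the text:
# 1/(2m) on the first m coordinates, the remainder split evenly.
masses = ([Fraction(1, 2 * m)] * m
          + [Fraction(1, 2 * (size - m))] * (size - m))

assert sum(masses) == 1      # a valid probability assignment...
assert len(set(masses)) > 1  # ...but not the uniform one
```

Here the masses are $1/4, 1/4, 1/8, 1/8, 1/8, 1/8$: a perfectly good probability, which is the point — summing to 1 alone does not single out $1/{n \choose m}$.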
The reason that the $1/{n \choose m}$ mass was chosen is understandable, but there was no explicit reason for it other than having the probabilities sum to 1 and the desire for symmetry and the equi-probable assignment. So again, the statistical syllogism/equi-probability is tacitly assumed.
“There are no such things as “fair” dies, or coins, or anything.”
Good god, man — depriving men of their fair dies and coins?!
That’s cold, Briggs.
Tangentially, can you recommend a good probability primer? — a Statistics for Dummies sort of thing? A friend who writes for magazines often sees statistics and studies used in making claims and so wants to better understand how to evaluate them. Not sure your Uncertainty is written for that audience, though I will pass along my copy.
Hold on — bit of searching has turned up your book, Breaking the Law of Averages: Real-Life Probability and Statistics in Plain English. That’s the ticket. Any other recommends? Thanks.
Hagfish,
So what you’re saying is that I should write another, simpler version of either of those two books, and go on to great fame and riches. Right?
1 – re hagfish response: for fame and riches you need to say stuff lots of people believe in ways that help them think they’re right.
2 – to be fair about fair dies… imagine making your argument about the use of “unfair” dice. “Fair” just means there is no info about the distribution of outcomes – what Jaynes said, above.
3 – no one can prove p=1/n for n possible outcomes about which we have no info because it is not true. P(any of the possible outcomes | no info) = {1,0}; 1/n is a frequentist estimate, a guess, not p(n). Info can help us make better guesses, but the numeric value in every such case should be understood as an expression (a word) describing our knowledge, or lack of it, about p(n), and not as p(n) itself.
More? What more fame and riches could there be for the Statistician to the Stars?! No sir, what I mean is your material often strikes me as pitched to students who already have some familiarity with the topic and are looking to hone their skills. Do you consider this current series of Lessons to be entry level? They may be, I’ve never taken a class in the subject.
Searching around some more I see an entire shelf of textbooks offered in the “Statistics for Dummies” genre, but I fear you’d rubbish the lot if you read them. If you did do a Briggs for Bozos book would you design it any differently than either Uncertainty or Breaking the Law of Averages or these online lessons?
(By the way, this tl;dr version of Uncertainty (chapter abstracts) is quite good, like distilled spirits):
https://www.wmbriggs.com/post/18724/
Hagfish,
Yes, I’d say entry level, for those who have had some usual math training. Which is not most people, of course. Maybe the first graduate course in the philosophy of uncertainty.
An excellent question about the book. I don’t know. What kind of book would you like to see?
Breaking the Law is a 101-type course, but it’s aimed at statistics. Not for the public per se.
Maybe you’d like a “general audience”, meaning intelligent readers, version? No real math. I mean, no proofs.
I would love to know what to do about this.
3 – no one can prove p=1/n for n possible outcomes about which we have no info because it is not true.
My understanding of statistics is very limited (at best), but I don’t understand this comment. Assuming a six-sided die as an example, then n=6. Isn’t this in and of itself info? And if you say, “no one can prove…because it is not true” isn’t that circular reasoning, or assuming the..something-or-other (not sure proper terminology)?
My writer friend describes a memorable lesson in statistics, given in her office where she was health editor at a major fitness magazine, given by some visiting academic researcher babe. This is 1991, in LA, and a study had just come out saying women taking hormone replacement therapy had a 25% greater risk of breast cancer, or something, and women were freaking out. HRT was the miraculous new thing all post-menopausal women were supposed to be doing. This visiting academic researcher babe (VARB) walked her through the study showing how the 25% bombshell was arrived at by statistical legerdemain; comparing apples and oranges, low sample sizes, focusing on older age cohorts, that sort of thing. WF says it opened her young eyes to the potential pitfalls in health claims and the studies supporting them. Ironically, the VARB was not even a supporter of HRT but just a seasoned critic of poor statistical models who had, it seems, at some point run afoul of her academic institution for having little talent at keeping her mouth shut. (I wonder if her name was Briggs?)
So afterwards for any stories relying on studies WF was always interested in looking over the actual study trying to figure out if the data supported the conclusions. Not an easy task for the non-professional given the often technical nature of the beast, obscure terms defined by more obscure terms, math, algebra, notation, et cetera. If only there was a book, or course, or whatever that could show the reasonably intelligent how to read a study, understand the terminology, how data is typically represented in charts and graphs, questions to ask yourself, and what to watch out for — red flags — how studies typically go astray. Instruction aimed at people who encounter statistics in their professional lives but have no training in the field and need to quickly and accurately assess the information’s value.
No small task. I spent an interesting morning looking over Wiki entries for various statistical terms; “statistical hypothesis test”, “null hypothesis”, “significance test”, “inductive inference”, “contingency”, “p-value” — interesting stuff but you can really get in the weeds if you don’t have a good guide. Funnily enough, the entry for “p-value” links to the entry for “p-hacking”, a term coined, according to the article, by a trio of statistical sleuths who run a blog called Data Colada. Let’s check out Data Colada… wo-ho! Briggs you know about these guys, for sure — their hobby is exposing bad studies! Ha! In doing so they walk you through the bad study showing exactly where and how it goes bad. Forensic Statisticians. The top story there now is, “Harvard’s Gino Report Reveals How A Dataset Was Altered”. Fascinating read, and highly educational. Who are these guys? I click on one of them and see a link to something he’s calling his “treadmill desk” —
http://urisohn.com/sohn_files/desk.html
Shazzam. I need one a them. All this time I spend chair-bound hunting and pecking I could be getting in some healthy exercise. Briggs, you need a treadmill desk. You can write your new book on it: Lies, Damned Lies, and Statistics — How to Read Scientific Studies for the Layman. Of course, I have no idea if that would be a worthwhile endeavor compared to other things you might do. Your work shoring up the philosophical foundations of a sounder scientific edifice is important. But you’ve worked in the pop style before with Everything You Know is Retarded, and you have a talent for it; compact, witty, clear. The hard part, as always, is making the complex simple. But not too simple. There’s an art to that.
Thanks for doing this Briggs 🙂
Am (partially) following…
(by which I mean that I’m getting some or most of it)
Received Breaking the Law of Averages, read the preface, started first chapter — this is great, looks like the thing I was describing, as far as entry level. Highly readable.
If nothing else, this lecture is giving me pause to reconsider some of my presuppositions. I’m not too satisfied with some of the explanations, however. I don’t understand why the discussion of “fairness” went immediately into physical properties of physical dice. We were dealing with an abstraction: an abstraction which might act as a useful model for a real thing, but an abstraction nonetheless. What does it mean for an abstraction to be “fair”? If I’m to apply the term at all, I have to think in terms of not showing partiality towards any outcome, such as a nondeterministic process with all possible output states equally weighted. That is also an abstraction, and one which meets what I understand by “fair”. I could also model this in terms of a binary noise sequence. It seems like I’m adding to the ontology relative to the initial deductive proof of Bayes, however, but then the Statistical Syllogism seems to be smuggling in its own rules. I guess I’ll have to hold off until I’ve seen Stove’s proof. If we can derive the Statistical Syllogism from the original premises which gave us Bayes, I’ll be surprised and delighted.
Anyhow, if we have an X with six mutually-exclusive states, and that’s all the information (E) we have, then I’m not sure how to compute Pr(C|E) for C of “X is in state N” for some valid N. It’s definitely true that Pr(C|E) is bounded by (0,1). If we add “fair” (F) to our givens, then Pr(C|EF) = 1/6, based on my interpretation of “fair”. If I grant the Statistical Syllogism, then Pr(C|E) = 1/6 also, but I have questions. Being told nothing about the system’s proclivity for particular states is not the same as being told it has an equal affinity for each state: being told nothing is compatible with both 1:1:1:1:1:1 weighting and 1:2:3:4:5:6 weighting, as well as every other possible weighting.
On the other hand, I can’t escape the conclusion that a rational actor tasked with handling such a system has no better option than to assume equal probabilities. It’s all very well if you’re a mathematician willing to deal in boundary conditions, but an engineer has to make decisions, preferably rational ones. There’s a big difference between “fair” as a given and strictly “unknown”: the latter case has a broad range of possible behaviour, and a system might need to start with some assumptions, then adjust its behaviour based on incoming evidence (by applying Bayes) to track reality. Even if the two alternatives (with and without F) reduce to the same value, they are qualitatively different: the one without the F needs an asterisk. It’s fundamentally a two-factor problem: what is the actual behaviour of the system, and how much do we know about it? Given F, we know the actual ratio with certainty; without F we have effectively zero knowledge of the ratio, whatever it is. In principle, we could have some suspicions about the ratio that land us between these two points, and there would be a rational way to express this (which I think is pretty obvious).
As for “unfair,” I can only interpret this as “not fair,” and if “fair” implies equal weight for the states, unfair implies not-equal. That excludes the 1:1:1:1:1:1 weighting, but every other possibility still exists. The mathematician still says (0,1) and the engineer still says 1/6 with an asterisk, because excluding that one case makes no difference to the range or the average (because you’re excluding the one case which is dead-on average), and you still don’t know what the actual distribution of outcomes looks like. As such, “unfair” adds no consequential information.