Jaynes’s book (first part): https://bayes.wustl.edu/etj/prob/book.pdf
Permanent class page: https://www.wmbriggs.com/class/
Uncertainty & Probability Theory: The Logic of Science
HOMEWORK: Find the correct probabilities for Sally’s GPA using G (I did it using G’, which is described below).
Lecture
Stuck with the shiny whiteboard. Still working on it.
In the lecture and below I talk about the discrete and finite nature of all scientific measurements. But I made a mistake in the book! I correct it here.
We’ll talk about this many times. Here it is simple. Suppose Sally takes a college class that is graded on a scale 0, 1, 2, 3, 4, and no other numbers are possible. She takes one class. We deduce her GPA could be, and only could be, 0, 1, 2, 3, 4. There are no other possibilities.
Now suppose she takes two classes. The GPA possibilities are now 0, 0.5, 1, 1.5, …, 4. There are no other possibilities. The mistake in Uncertainty is assuming each of these is equally likely. They are not, given G. Given G’ = “The possibilities are 0, 0.5, 1, 1.5, …, 4”, which is not G, the probabilities of each GPA are equal. Given G, they are not, because, for instance, there are two ways to get 0.5 (a 0 and a 1, in either order) but only one way to get 0, a fact we deduce from G but not from G’. If you read the book substituting G’ for G, all works. Homework is to figure out the correct probabilities.
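One way to check the homework is to enumerate every ordered combination of grades. The sketch below leans on an assumption the text does not state outright: that, given G, every ordered tuple of grades is equally probable (the symmetry argument applied to individual grades rather than to GPAs). The name `gpa_probabilities` is mine, for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

GRADES = (0, 1, 2, 3, 4)  # the only grades possible under the grading rules


def gpa_probabilities(n_classes):
    """Pr(GPA = g | G) for every possible g.

    Assumption (not in the original text): every ordered tuple of
    grades is equally probable given G.
    """
    # Count how many grade tuples produce each GPA.
    counts = Counter(
        Fraction(sum(grades), n_classes)
        for grades in product(GRADES, repeat=n_classes)
    )
    total = len(GRADES) ** n_classes
    return {gpa: Fraction(ways, total) for gpa, ways in sorted(counts.items())}


if __name__ == "__main__":
    for gpa, prob in gpa_probabilities(2).items():
        print(f"GPA {float(gpa):.1f}: {prob}")
```

Under that assumption, two classes give nine GPAs whose probabilities rise from 1/25 at 0 to 5/25 at 2 and fall back to 1/25 at 4. A different premise about the individual grades would give different numbers.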
Now suppose she takes three classes. The GPA could be 0, 0.33, …, 4. There are no other possibilities.
However many classes she takes, given the grading rules and the formula for GPA, there is a fixed, finite, deduced number of possibilities. This is so no matter how many students we consider. And this is so for whatever we measure, our limitations usually being a function of the measurement apparatus.
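The deduction can be sketched in a few lines of Python. I assume every class carries equal weight, so the GPA is the plain average of the grades (my reading of “the formula for GPA”); `possible_gpas` is a hypothetical helper, not anything from the book.

```python
from fractions import Fraction

GRADES = (0, 1, 2, 3, 4)  # the grading scale from the text


def possible_gpas(n_classes):
    """Every GPA deducible from the grading rules for n_classes classes.

    Assumption: all classes are weighted equally, so the GPA is the
    grade total divided by the number of classes.
    """
    lowest_total = min(GRADES) * n_classes
    highest_total = max(GRADES) * n_classes
    # Grades are consecutive integers, so every integer total in
    # between the lowest and highest is reachable.
    return [
        Fraction(total, n_classes)
        for total in range(lowest_total, highest_total + 1)
    ]


if __name__ == "__main__":
    for n in (1, 2, 3, 40):
        print(n, "classes:", len(possible_gpas(n)), "possible GPAs")
```

Exact fractions are used so that a value like 1/3 is not confused with its rounded form 0.33. Even at 40 classes the set has only 161 members.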
The set of possible measurements may eventually be large, but most will be trivial to store on any computer. We could, and should, use just this set to categorize or quantify our uncertainty: never uncertainty in what was measured, but uncertainty in what we might measure.
Ordinary practice is to model what was seen, not to make predictions, but to say something about “true distributions” of probabilities, which are somehow supposed to be causal. This makes no sense, as we shall see below.
This is an excerpt from Chapter 6 of Uncertainty. References have been removed for easier coding.
People sometimes speak as if random variables “behave” in a certain way, as if they have a life of their own. Thus “X is normally distributed”, “W follows a gamma”, “The underlying distribution behind $y$ is binomial”, and so on. To behave is to act, to be caused, to react. Somehow, it is thought, these distributions are causes. This is the deadly sin of reification, perhaps caused by the beauty of the mathematics where, due to some mental abstraction, the equations undergo biogenesis. The behavior of these “random” creatures is expressed in language about “distributions.” We hear, “Many things are normally (gamma, Weibull, etc., etc.) distributed”, “Height is normally distributed”, “Y is binomial”, “Independent identically distributed random variables”.
I have seen someone write things like “Here is how a normal distribution is created by random chance.” Wolfram MathWorld writes, “A statistical distribution in which the variates occur with probabilities asymptotically matching their ‘true’ underlying statistical distribution is said to be random.” There is no such thing as a “true” distribution in any ontological sense. Examples abound. The temptation here is magical thinking. Strictly and without qualification, to say a thing is “distributed as” is to assume murky causes are at work, pushing variables this way and that knowing they are “part of” some mathematician’s probability distribution. To say “X is normal” is to ascribe to X, or to something, a power to be “normal” (or “uniform” or whatever). It is to say that forces exist which cause X to be “normal,” that X somehow knows the values it can take and with what frequency. If this curious power notices we have latterly had too many small X, it will start forcing large ones so that the collective exhibits the proper behavior. This is akin to the frequentist errors we earlier studied.
To say a thing “has” a distribution is false. The only things we are privileged to say are things like this: “Given this-and-such set of premises, the probability X takes this value equals that”, where “that” is calculated via a probability implied by the premises. (Ignore that the probability X takes any value for continuous distributions is always 0; this is discussed much later under measurement.) Probability is a matter of ascribable or quantifiable uncertainty, a logical relation between accepted premises and some specified proposition, and nothing more.
Observables also do not “have” means. Nor do they have variances, autocorrelations, partial or otherwise, nor moments; nor do they have any other statistical characteristic you care to name. Means and all the rest can be calculated of observables, of course, but the observables themselves do not possess in any metaphysical sense these characteristics. This goes for observables of all kinds. Time series are, in some analyses, supposed to be “stationary”. A stationary process, it is said, has the property that the mean, variance and autocorrelation structure do not change over time. Actual functions of observables such as means do change over time, as all know. Premises from which we deduce probabilities, if they include observable propositions, can also change, and thus so can the probabilities. Specific model premises which hold fixed various parameters (about which much more later) can be assumed or not. That is all stationarity means epistemologically. Causes of observables can and often do change, but since probability is never a cause, neither can stationarity nor any other statistical characteristic be a cause.
Back to Sally and her grade point. We had S = “Sally’s grade point average is $x$”. Suppose we have the premise G = “The grade point average will be some number in this set”, where the set is specified. Given our knowledge that people take only a finite number of classes and are graded on a numeric scale, this set will be some discrete finite collection of numbers from, say, 0 to 4; the number of members of this set will be some finite integer $n$. Call the numbers of this set $g_1, g_2,\dots,g_n$. [Here the mistake begins: in your mind, swap G’ for G, and it all works. Homework is to redo this section using the real G.]
As said above, the probability of S given G does not exist. This is because $x$ is not a number; it is a mere placeholder, an indication of where to put the number once we have one in mind. It is at this point the mistake is usually made of saying $x$ has some “distribution”, usually normal or perhaps uniform (nearly all researchers I have seen in applications of GPA say normal). They will say “$x$ is normally distributed.” Now if this is shorthand for “The uncertainty I have in the value of $x$ is quantified by a normal distribution”, the shorthand is sensible—but unwarranted. There are no premises which allow us to deduce this conclusion. The conclusion is pure subjective probability (and liable to be a rotten approximation).
Evidently, many do not intend this meaning, and when they say “$x$ is normally distributed” they imply that $x$ is itself “alive” in some way, that there are forces “out there” that make, i.e. cause, $x$ to take values according to a normal distribution. Maybe the central limit theorem lurks and causes sums of individual grades, which form the GPA, to take certain values. This is incoherent. Each and every grade Sally received was caused, almost surely by a myriad of things, probably too many for us to track; and there is no indication that the same causes were at work for every grade. But suppose each grade was caused by one thing and the same thing. If we knew this cause, we would know the value of $x$; it would be deduced from our knowledge of the cause. And the same is true if each grade were caused by two known things; we could deduce $x$. But since each grade is almost surely the result of hundreds, maybe thousands—maybe more!—causes, we cannot deduce the GPA. The causes are unknown, but they are not random in any sense where randomness has causative powers.
What can we say in this case? Here is something we know:
$$\Pr(x = g_1 \mid G) = \Pr(x = g_2 \mid G),$$
where $x = g_1$ is shorthand for S = “Sally’s GPA is $g_1$” (don’t forget this!). This is the symmetry of individual constants, as seen in Chapter 5. G is equivalent to “We have a device which can take any of $n$ states, $g_1,\dots,g_n$, and which must take one state.” From this we deduce
$$\Pr(x = g_i \mid G) = \frac{1}{n}, \quad i = 1, 2, \dots, n.$$
There are no words about what caused any $x$; merely deduced information that the chance we see any value is as likely as any other value in the set of possible values. We could say that the uncertainty in $x$ is quantified by a uniform distribution over $g_1,\dots,g_n$, but since that leads to sin, it is better to say the former. Incidentally, a natural objection is that GPAs don’t seem to be equally likely to be any number between 0 and 4, but that is because we mentally add to G evidence which is not provided explicitly. (I’m not claiming G is a good model.)
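Spelled out, the step from symmetry to $1/n$ uses only two facts: the symmetry holds for every pair of states, and exactly one state must obtain. Writing $p$ for the common value (my notation, not the book’s), and reading G as G’ per the correction noted earlier:

$$
\begin{aligned}
\Pr(x = g_i \mid G) &= \Pr(x = g_j \mid G) = p \quad \text{for all } i, j, \\
\sum_{i=1}^{n} \Pr(x = g_i \mid G) &= n\,p = 1, \\
\Pr(x = g_i \mid G) &= \frac{1}{n}, \quad i = 1, 2, \dots, n.
\end{aligned}
$$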
Can propositions have “true” distributions? Only in a limited sense. So-called random variables do not have to represent the “outcome” of the event from some experiment. Suppose X = “The color of the dragon is x”; if we let D = “Dragons can be green, black, or puce”, the probability of “x” is easily computed, but we will never see the event. And there will be no real cause, either. This is the true probability, or true distribution, if you like. Any time we can deduce the “model”, as it were, we have a true probability. But it is never the proposition that “has” a distribution or probability, it is only our understanding that does.
Lastly, when people think variables have “true” distributions, they are likely to blame data which does not conform to their expectations. Thus we see people tossing out “outliers”. And since current practice revolves around model fit, tossing out data which does not fit improves the fit of what is left, leading to over-certainty.
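A small sketch of the mechanism, using invented numbers: discarding the observation that sits farthest from the rest shrinks the measured spread, so anything fit to what remains looks more certain than the full data warrant. The data and the helper `drop_farthest` are hypothetical, and the sample standard deviation stands in for “fit”, which is a simplification.

```python
from statistics import mean, stdev

# Hypothetical GPAs, invented purely for illustration; 0.5 plays the "outlier".
gpas = [2.9, 3.1, 3.0, 3.2, 2.8, 0.5]


def drop_farthest(values):
    """Return a copy of values without the point farthest from the mean."""
    center = mean(values)
    farthest = max(values, key=lambda v: abs(v - center))
    trimmed = list(values)
    trimmed.remove(farthest)
    return trimmed


trimmed = drop_farthest(gpas)
spread_before = stdev(gpas)    # spread using everything that was observed
spread_after = stdev(trimmed)  # spread after tossing the "outlier"

if __name__ == "__main__":
    print(f"spread with all data:    {spread_before:.2f}")
    print(f"spread after discarding: {spread_after:.2f}")
```

The discarded value was a real observation. The smaller spread says nothing about what might be measured next; it only records what was thrown away.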