
ET Jaynes, Chance Master
(All the stuff in this series is, in a fuller form, in my new upcoming book, which is tentatively called
Logical Probability and Statistics—but I’ve only changed the title 342 times, so don’t count on this one sticking. Incidentally, I’m looking for a publisher, so if you have a
definite contact, please email me.)
Read Part IV.
From where do parameters emerge? And what is the difference between treating them objectively and subjectively? A lot of controversy here. The difficulty usually begins because the examples people set themselves to discuss these subjects are so far advanced, and have so many (hidden) assumptions built in, that understanding becomes impossible. From our previous examples, we have seen it is better to start small and build slowly. We can’t avoid confusion this way, but we can lessen it.
If you haven’t reminded yourself of the last two posts on the statistical syllogism, do so now, for it is assumed here.
Premise: “There are N marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture of these.” Proposition: “I pull out a white marble.”
What is the probability this proposition is true given this and only this premise? Well, what do we know? Black marbles are a possibility, and so are white. Green ones are not, nor are any other colors. We also know that the number of white marbles may be 0, 1, 2, …, N. And the same with black, but with a rigid symmetry: if there are j white marbles then there must be N – j black ones. (There may be other things in the bag, but if so, we are ignorant of them.)
This is not like the example where all we knew was that “Q was a possible outcome”, and where we assigned (0, 1] to the probability “A Q shows”, because there the number of different kinds of possibilities was unknown, and could have been infinite. Here there are two known possibilities. And N is finite.
Suppose N = 1. What can we say? “There are 2 states which could emerge from this process, just one of which is called white, and just one must emerge.” The phrasing is ugly, and doesn’t explicitly show all the information we have, but it is written this way to show the continuity between this premise and the one from last time (i.e. “There are n states which could emerge from this process, just one of which is called Q, and just one must emerge” and “A Q emerges”).
The probability of “I pull out a white marble” is, via the statistical syllogism, 1/2—again, when N = 1. This accords with intuition and with the definite, positive knowledge of the color and number of marbles we might find.
Now suppose N = 2. The number of possibilities has surely increased: there could be 0 whites, just 1 white, or 2 whites. But we’re still interested in what happens when we pull out just one and not more than one. That is, our premise is “There are 2 marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture.” Proposition: “I pull out a white marble.”
If there were 0 whites, the probability of pulling out a white is clearly 0. If there were 1 white, we have direct recourse to the statistical syllogism and conclude the probability of pulling out a white is 1/2. If both marbles were white, the probability of pulling out a white is 1. Agreed?
But what, given our premise, is the probability that neither marble is white? That just 1 is? That both are? Easy! We just use the statistical syllogism again. As: “There are 3 states which could emerge from this process, and just one of these three must emerge, which are 0 whites, 1 white, 2 whites” and “The state is 0 whites” or “1 white” or “2 whites”, and the “process” is whatever it was that put the marbles in the bag. Evidently, the probability is, via the statistical syllogism, 1/3 for each state.
It now becomes a simple matter to multiply our probabilities together, like this (skipping the notation):
1/3 * 0 + 1/3 * 1/2 + 1/3 * 1 = 1/6 + 2/6 = 1/2,
where this is the probability of each state of the number of white marbles times the probability of pulling a white marble given this state, summed across all the possibilities (this is allowed because of easily derived rules of mathematical probability).
With N = 2, the answer for “I pull out a white marble” is again probability 1/2. The same is true (try it) for N = 3. And for N = 4, etc. Yes, for any N the probability is 1/2.
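For readers who would rather check than trust, here is a minimal sketch of the "try it" step. The function name `prob_first_white` is made up for illustration; the only inputs are the premise itself: each of the N + 1 possible compositions gets probability 1/(N + 1), and a bag holding j whites yields a white with probability j/N.

```python
from fractions import Fraction

def prob_first_white(N):
    """Probability that the one marble drawn is white, for a bag of N marbles.

    Hypothetical helper: weights each state j = 0..N whites by 1/(N+1)
    (the statistical syllogism over states) and multiplies by j/N, the
    probability of drawing a white given j whites in the bag.
    """
    state_prob = Fraction(1, N + 1)
    return sum(state_prob * Fraction(j, N) for j in range(N + 1))

for N in (1, 2, 3, 4, 50):
    print(N, prob_first_white(N))  # prints 1/2 for every N
```

Exact fractions are used so that the 1/2 is an identity and not a rounding accident.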
This is the same answer Laplace got, but he made the mistake of using the chance the sun would rise tomorrow for his example. Poor Pierre! Not a thing wrong with his philosophy, but every time somebody heard the example he turned into a rogue subjectivist and would not let Laplace’s fixed premises be. Readers could not, despite all pleas, stop themselves from saying, “I know more about the sun than this bare premise!” That is a true statement, but it is cheating to make it, as all subjectivist substitutions are.
Emphasis: we must always take the argument as it is specified. Adding or subtracting from it is to act subjectively, that is, it changes the argument.
We don’t have to use marbles in bags. We can say “white marbles” are “successes” in some process, and “black marbles” failures. As in, it is a success if the sun rises tomorrow, or if a patient survives past 30 days, or if more than X units of this product are sold, and on and on. If we know nothing about some process except that only successes and failures are possibilities, and that there may be none of either (but there must be some of at least one), and that we have N “trials” ahead of us, and N is finite, then the chance the first one we observe is a success is 1/2.
A “finite” N can be very, very, most very large, incidentally, so there is no practical limitation here. Later we’ll let it pass to the limit and see what happens.
This is the objectivist answer, which takes the argument as given and adds nothing about physics, biology, psychology, whatever. Of course, in many actual situations we have this kind of knowledge, and there is nothing in the world wrong with using it, provided it is laid out unambiguously in the premises and agreed to. To repeat: if all we had was the simple premise (as we have been using it), then the answer is 1/2 regardless of N.
The emergence of parameters and models
Since we have the full objectivist answer to the argument, it’s time to change the argument above to something more useful. Keep the same premise but change the conclusion/proposition to “Out of n < N observations, j will be successes/white,” with j = 0,1,2,…,n. In other words, we’re going to guess how many successes we see from the first n (out of N). Did we say same premise? Emphasis: same premise, no outside information. Resist temptation!
If n = 1, we already have the answer (the probability is 1/2 for j = 0, and j = 1). If n = 2, then j can be 0, 1, or 2. The statistical syllogism comes to our aid as before and in the same way. Before we’ve seen any “data”, i.e. taken any observations, i.e. have any knowledge about how many successes and failures lie ahead of us, etc., the probability of j equaling any number is 1/3 because there are 3 possibilities: no successes, just one, both. The result also works for any n: the probability is 1 / (n+1), and this works with n all the way up to N.
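The 1 / (n+1) claim can be checked by brute force. The sketch below assumes, as in the marble examples, that every bag composition has probability 1/(N + 1) and that the n observations are drawn without replacement, so the count of whites given the composition is hypergeometric. The function name `prob_j_successes` is invented for this illustration.

```python
from fractions import Fraction
from math import comb

def prob_j_successes(j, n, N):
    """P(j whites among the first n draws) from a bag of N marbles.

    Hypothetical helper. Sums over w, the unknown number of whites in the
    bag, each composition weighted 1/(N+1).
    """
    total = Fraction(0)
    for w in range(N + 1):
        # Hypergeometric count: pick j of the w whites, n - j of the N - w blacks.
        ways = comb(w, j) * comb(N - w, n - j)
        total += Fraction(1, N + 1) * Fraction(ways, comb(N, n))
    return total

n, N = 3, 10
print([prob_j_successes(j, n, N) for j in range(n + 1)])  # each entry is 1/4
```

Changing N (keeping n no larger than N) leaves every entry at 1 / (n+1).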
We’re done again. So now we must pose new arguments. Keep the same premise but augment it by the knowledge that “We have already seen n1 successes out of the n observations” where, of course, n1 is some number between 0 and n. We need a new conclusion/proposition. How about “The next observation will be a success/white marble”, where “next” means the “n+1”th?
In plainer words, we have some process where we’re expecting successes and failures (N of them, where N may be very exceptionally big), and we have already seen n1 successes out of n observations, and now we want to know the probability that the next observation will be a success. Make sense?
Notice once more we are taking the objectivist stance. There are no words, and no evidence, about any physicality, nor any about biology, nor anything. Just some logical “process” which produces “successes” and “failures.” We must never add or subtract from the given argument! (Has that been mentioned yet?)
Proving the answer now involves more math than is economical to write here. But it comes by recognizing that we have taken n marbles out of the bag, leaving N – n. Of these first n, some might be successes, some failures, with n1 + n0 = n. We’ll use these observations as new evidence, that is, new premises, for ascertaining the probability that the next (as yet unknown) observation is a success.
The probability of seeing a new success (given our premise) is (n1 + 1) / (n + 2). If we only pulled n = 1 out and it was a success, then the probability the next observation is a success is 2/3; if the first observation was a failure then the probability the next observation is a success is 1/3. Again, this is independent of N! But it does assume N is finite (though there is no practical restriction yet).
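The math may be uneconomical to write, but it is cheap to compute. What follows is a sketch under the same assumptions as the marble bag: compositions equally probable beforehand, draws without replacement. The helper name `prob_next_success` is hypothetical. It reweights each possible composition by how well it accounts for the n observations, then asks how many whites remain among the N – n marbles still in the bag.

```python
from fractions import Fraction
from math import comb

def prob_next_success(n1, n, N):
    """P(draw n+1 is white | n1 whites seen in the first n draws), bag of N > n.

    Hypothetical helper, not a library routine.
    """
    n0 = n - n1
    # Unnormalised weight of each composition w (whites in the bag) given the
    # data; the uniform 1/(N+1) factor cancels in the normalisation.
    weights = {w: comb(w, n1) * comb(N - w, n0) for w in range(N + 1)}
    norm = sum(weights.values())
    # Given w, there are w - n1 whites left among the N - n remaining marbles.
    return sum(Fraction(wt, norm) * Fraction(w - n1, N - n)
               for w, wt in weights.items())

print(prob_next_success(1, 1, 5))    # 2/3
print(prob_next_success(0, 1, 5))    # 1/3
print(prob_next_success(3, 7, 20), prob_next_success(3, 7, 200))  # 4/9 both times
```

The last line is the point: swapping N = 20 for N = 200 changes nothing, and the result matches (n1 + 1) / (n + 2).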
This answer is also the same as Laplace got. Where he went wrong was to put in an actual value for n: what he thought was the number of days that the sun had already come up successfully, and then used the calculation to find the probability (given our premise) the sun would come up tomorrow. Well, this was some number not nearly equal to 1, because Laplace didn’t think the world that old. Since the probability was not close to 1, and people thought it should be, logical probability took a big hit from which it is only now recovering. His critics did not understand they were changing his argument and acting like subjectivists.
Let’s play with our result some more. It’s starting to get good. Suppose n is big and all n observations are successes (n1 = n); then (n1 + 1) / (n + 2) approaches 1. If we let the “process” go to the limit, then the probability becomes 1, as expected. The opposite is true if we never see any successes: at the limit, the probability becomes 0. Or, say, if we saw n1 = n/2, i.e. our sample was always half successes, half failures, then at the limit, the probability goes to 1/2.
It is from this limiting process that parameters emerge. Parameters are those little Greek letters written in probability distributions. But before we start that, let’s push our argument just a little further and ask, given our premise (which includes the n observations), not just what the next observation will be, but what the next m observations will be. Obviously, the number of successes among those m can be 0 or 1 or 2, etc., all the way up to m.
The probability that this number of successes takes any of those values turns out to be what is called a beta-binomial distribution, whose “parameters” are m, n1 + 1, and n0 + 1. Parameters is in quotes because here we know their exact values; there is no uncertainty in them; they have been deduced via the original premise in the presence of our observations. Notice that this result is independent of N, too (except for the knowledge that n and n + m are less than or equal to N).
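Here is a sketch comparing the two routes to the same numbers: direct enumeration over the finite bag, and the textbook beta-binomial formula with parameters m, n1 + 1 and n0 + 1. All function names are invented for illustration, and the Beta function is computed from factorials, which is valid only because the arguments here are positive integers.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(x, y):
    """Beta function B(x, y) for positive integers x, y, as an exact fraction."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))

def beta_binomial_pmf(k, m, a, b):
    """Beta-binomial probability of k successes in m trials, integer a, b."""
    return comb(m, k) * beta_int(k + a, m - k + b) / beta_int(a, b)

def finite_bag_predictive(k, m, n1, n, N):
    """P(k whites in the next m draws | n1 whites in the first n), bag of N.

    Hypothetical helper; requires n + m <= N. Compositions that cannot have
    produced the data (w < n1 or w > N - n0) carry zero weight and are skipped.
    """
    n0 = n - n1
    norm = 0
    acc = Fraction(0)
    for w in range(n1, N - n0 + 1):
        weight = comb(w, n1) * comb(N - w, n0)      # how well w explains the data
        norm += weight
        # After the data, w - n1 whites and N - w - n0 blacks remain in the bag.
        ways = comb(w - n1, k) * comb(N - w - n0, m - k)
        acc += weight * Fraction(ways, comb(N - n, m))
    return acc / norm

n1, n, m, N = 2, 5, 4, 30
for k in range(m + 1):
    print(k, finite_bag_predictive(k, m, n1, n, N),
          beta_binomial_pmf(k, m, n1 + 1, n - n1 + 1))
```

The two columns agree exactly for each k, and they keep agreeing if N is changed, provided n + m stays at or below N.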
The beta-binomial here is called a “predictive” distribution, meaning it gives the probability of seeing new “data” given what we saw in the old data.
In other words, we don’t need subjective (or objective) priors when we can envision a process which is analogous to a bag with a fixed number of successes and failures in it (with total N), and where we take observations n (or m) at a time. That’s it. This is the answer. We have deduced a model in which unobservable parameters are not needed. No Greek letters here. The answer is totally, completely objective Bayes.
There is no subjectivity, no talk of “flat” or “uninformative” priors, no need for frequentist notions, no hypotheses, nothing but one bare premise, some rules of mathematical probability accepted by everybody (because they are true), and we have an answer to a question of real interest. Ain’t that slick!
Where It Starts To Go Wrong
Everything worked because N was finite. We lived in a discrete, finite world, where everything could be labeled and counted. Happily, this world is just like our real world, a hint that difficulties arise when we leave our real world and venture to the abstract world of infinities.
We could let N go to the limit immediately (before taking observations). Now no real thing is infinite. There never has been, nor will there ever be, an infinite number of anything, hence there cannot be an infinite number of observations of anything. In other words, there just is no real need for notions of infinity. Not for any problem met by man. Which means we’ll go to infinity anyway, but only to please the mathematicians who must turn probability into math.
If N goes to the limit immediately, the number of successes and failures, or rather the ratio of the number of successes (in our imaginary bag) to N becomes an unobservable quantity, which we can call θ. If N passes to the limit, θ can no longer ever equal 0 or 1, as it could in real life, but will take some value in between.
We now have a brand-new problem in front of us. If we want to calculate the probability that the first observation is a success, we have to specify a value for θ by fiat, i.e. arbitrarily and completely subjectively, or we have to let it take each possible value it can take and then multiply the probability it takes this value by the probability of seeing a success given this θ. And then add up the results of each multiplication for each value of θ.
That’s easy to do with calculus, of course. But it leaves us the problem of specifying the probability θ takes each of its values (the number of which is now infinite). There is no other recourse but to do this subjectively. We could say (subjectively) that “Each value of θ is equally likely” or maybe “Let’s assume a flat prior for θ” which is the same thing, or we could say, “Let’s pick a non-informative prior for θ” which is also the same thing, but which involves invoking other probability distributions to describe the already unknown one on θ.
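For concreteness, here is the "multiply and add up" step written out for the flat prior. The density symbol p(θ) is notation introduced only for this sketch, and the flat choice p(θ) = 1 on (0,1) is the subjective assumption just described, not something deduced.

```latex
\[
\Pr(\text{success})
  = \int_0^1 \Pr(\text{success} \mid \theta)\, p(\theta)\, d\theta
  = \int_0^1 \theta \cdot 1 \, d\theta
  = \frac{1}{2}.
\]
```

The 1/2 matches the finite bag, but only because the flat density was put in by hand.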
Is there any wonder that frequentists marvel at this sort of behavior? The “prior” appears as it is: arbitrary and subject to whim. (Of course, the frequentist after getting this argument right falsely assumes that frequentism must therefore be true; the fallacy of a false dichotomy.)
It turns out, in this happy case, that if a “flat” prior is assumed, the final answer is the same as above; i.e. we end with a beta-binomial predictive distribution. We start with a “binomial” and get a beta “posterior” on θ, but since we can never observe θ, it is another wonder that anybody ever cares about it. People do, though. Care about θ, to an extent that borders on mania. But that people do is a holdover from frequentist days, where “estimation” of the unobservable is the be-all and end-all of statistics. Another reason to cease teaching frequentism.
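A sketch of that agreement, assuming the flat prior and independent binomial sampling given θ. Here B(·,·) is the Beta function, k is the number of successes in the next m observations, and p(θ | n1, n0) is notation introduced for this illustration.

```latex
\[
p(\theta \mid n_1, n_0)
  = \frac{\theta^{n_1} (1-\theta)^{n_0}}{B(n_1 + 1,\; n_0 + 1)},
  \qquad 0 < \theta < 1,
\]
\[
\Pr(k \mid m, n_1, n_0)
  = \int_0^1 \binom{m}{k} \theta^{k} (1-\theta)^{m-k}\,
      p(\theta \mid n_1, n_0)\, d\theta
  = \binom{m}{k}
    \frac{B(n_1 + 1 + k,\; n_0 + 1 + m - k)}{B(n_1 + 1,\; n_0 + 1)}.
\]
```

Setting m = 1 and k = 1 gives B(n1 + 2, n0 + 1) / B(n1 + 1, n0 + 1) = (n1 + 1) / (n + 2), the same number the finite bag produced, with θ integrated away at the end.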
But the agreement of the finite-objectivist answer with the mathematical-subjectivist one in this case is why subjectivism seems to work: it does, but only when the parameters are themselves constrained. Here θ can “live” only in some demarcated interval, (0,1). But if the parameter is itself without limit, as in the case of e.g. “normal” distributions, then we can quickly run into deep kimchee.
The problem comes from going to infinity too quickly in the process, a dangerous maneuver. Jaynes highlighted this to devastating effect in his book, destroying a “paradox” caused by beginning at infinity instead of ending with it. (See his Chapter 15 on the “marginalization paradox.”) If there is any blemish in Jaynes’s work, it is that he failed to apply this insight to every problem. In his derivation of the normal, for example, he began with the slightly ambiguous premise that measurement error can be in “any” direction. Well, if “any” means any of a set of directions measurable (by human instrument), then we are in business. But if it meant any in an infinite number of directions, then we begin where we should end.
All probability and statistics problems can be done as above: starting with a simple premise, staying finite until the end, after which passing to the limit for the sake of ease in calculation may be done, and proceeding objectively, asking questions only about “observable” quantities. That is the nature of objective, or logical, probability.
Questions?
If there are a lot of questions, I might add an addendum post to this series. But please to read all the words above (at least once) before asking. On the other hand, I may just wait for the book.