Replacements for Representation: Bayes From the Ground Up

A primary justification for Bayesian probability is De Finetti’s representation theorem, which is stated like this.

You are to observe a sequence of 0s and 1s, “failures” and “successes” if you like. These 0s and 1s will necessarily come to you in a certain order, and you want to quantify the probability that you witness this order.

If you assume that the order in which the failures and successes arrive does not matter, but that what does matter is the total number of successes (and failures), and if this sequence is embedded in an infinite stream of failures and successes, then the probability distribution of the total successes can be represented as (the integral of) a binomial distribution with parameter θ multiplied by a prior distribution on the possible values of θ.
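In symbols, the theorem is usually written along the following lines. The notation is my own shorthand for illustration, not taken from any particular source: x_1, ..., x_n are the observed 0s and 1s, k is their sum, S_n is the total number of successes, and μ is the prior on θ.

```latex
P(X_1 = x_1, \dots, X_n = x_n)
  = \int_0^1 \theta^{k} (1-\theta)^{n-k} \, d\mu(\theta),
  \qquad k = \sum_{i=1}^{n} x_i ,

P(S_n = k)
  = \binom{n}{k} \int_0^1 \theta^{k} (1-\theta)^{n-k} \, d\mu(\theta).
```

The first line depends on the observations only through k, which is exchangeability at work; the second line is the "binomial times prior" form described above.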

Have all that? The assumption that the order doesn’t matter (called exchangeability in the parlance) is enough to prove the existence of both the binomial and its accompanying prior distribution. Ain’t that wonderful?

But it only works if there is an infinite stream of numbers coming at us. Let only a finite number arrive, and out goes representation. We know this through the work of Persi Diaconis, who discovered that approximate (not exact) representations can be had for finite data, but only if finite means very large.

If we only have one, two, or a small number of observations, then no representation theorems are possible. It’s not that we haven’t found them, but they cannot be found, an important distinction.

However, we need not despair, because we can still get where we need to go by turning the problem around: by seeing that the fundamental problem is representing uncertainty in finite streams of data, not infinite ones. Once we have the answer, we can let our data grow large. We will discover that, at the limit, the binomial representation pops out naturally.

Thus it is the binomial that is the approximation of finite situations, not the other way around. How do we start?

In front of you lies a box which, you are told, holds N items, M of which may (or may not) be labeled “success”. This implies that the other items may be anything but successes: we have no information, for example, that the “non-successes” are all identical in nature. These are our 0s and 1s.

How many successes are in the box? You don’t know, but you can quantify your uncertainty. Using a simple principle of logical probability, and the symmetry of individual constants, an axiom similar to the axiom of exchangeability, we can say that, given the evidence presented, the chance that there are no successes is the same as the chance that there is one, which is the same as the chance that there are two, and so on, up to N. Each of the N+1 possibilities therefore has probability 1/(N+1).

Suppose you take a handful of items from the box, where the handful is possibly smaller than N. It turns out that, given the evidence we have, your uncertainty in the number of successes in your handful is represented by the hypergeometric distribution.
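A minimal sketch of the handful calculation, under my own reading of the setup: for each possible number of successes M in the box the handful follows a hypergeometric distribution, and the equal chances on M = 0, ..., N are then averaged over. The function names and the example numbers (N = 10, handful of 4) are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(k successes in a handful of n | the box of N items holds M successes)."""
    if k < 0 or k > n or k > M or n - k > N - M:
        return Fraction(0)
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))


def handful_pmf(k, N, n):
    """P(k successes in the handful), averaging over equal chances on M = 0..N."""
    return sum(hypergeom_pmf(k, N, M, n) for M in range(N + 1)) / (N + 1)


# Example numbers are assumptions for illustration only.
N, n = 10, 4
for k in range(n + 1):
    print(k, handful_pmf(k, N, n))
```

Every quantity here is an observable count; no θ appears anywhere. Under the equal-chances assumption the averaged probabilities all come out to 1/(n+1).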

Unlike the binomial, which has unobservable parameter θ, the hypergeometric deals only with what can be or has been observed. Its parameters are all numbers you have seen.

In your hand now are a certain number of successes and non-successes, which is new evidence we can use to infer the likelihood that the remaining items in the box are also successes (or failures). We can work through the math and discover the representation of the probability distribution for the remaining items. This turns out to be a “beta-binomial” with fixed, observed, known parameters.
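Here is a sketch of that update, assuming equal chances on M = 0, ..., N to start and a handful of n items containing k successes. It computes the probability of j successes among the N − n unseen items two ways: directly by Bayes' rule, and with a beta-binomial whose parameters (N − n, k + 1, n − k + 1) are all observed counts. The function names and example numbers are mine, for illustration only.

```python
from fractions import Fraction
from math import comb, factorial


def remaining_pmf_direct(j, N, n, k):
    """P(j successes among the N - n unseen items | k successes in n drawn),
    by Bayes' rule starting from equal chances on M = 0..N."""
    def weight(m):
        # Likelihood of the handful given M = m, up to the constant 1 / C(N, n).
        if m < k or N - m < n - k:
            return 0
        return comb(m, k) * comb(N - m, n - k)

    total = sum(weight(m) for m in range(N + 1))
    return Fraction(weight(j + k), total)


def beta_fn(a, b):
    """Beta function for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def remaining_pmf_betabinom(j, N, n, k):
    """The same probability written as a beta-binomial(N - n, k + 1, n - k + 1)."""
    R, a, b = N - n, k + 1, n - k + 1
    return comb(R, j) * beta_fn(j + a, R - j + b) / beta_fn(a, b)


# Example numbers are assumptions for illustration only.
N, n, k = 10, 4, 3
for j in range(N - n + 1):
    print(j, remaining_pmf_direct(j, N, n, k), remaining_pmf_betabinom(j, N, n, k))
```

The two columns agree exactly, which is the sense in which the beta-binomial here has fixed, observed, known parameters.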

More data can be taken, and all the probability distributions can be updated systematically using just observed and known parameters.

What’s interesting is that as you let N grow to the limit, the standard binomial, beta, and beta-binomial results of Bayesian statistics are found. But then, as now makes sense, the parameters of these distributions become unobservable.
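A rough numerical check of the limit, under an assumption of mine: the fraction of successes in the box, M/N, is held near a fixed value (called theta below) while N grows. The hypergeometric probabilities for a handful of n then approach the binomial ones. Names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(k successes in a handful of n | the box of N items holds M successes)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def binom_pmf(k, n, theta):
    """P(k successes in n trials) under a binomial with parameter theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def max_gap(N, n, theta):
    """Largest pointwise difference between the two distributions."""
    M = round(theta * N)  # box composition matching theta as closely as possible
    return max(
        abs(hypergeom_pmf(k, N, M, n) - binom_pmf(k, n, theta))
        for k in range(n + 1)
    )


# Example numbers are assumptions for illustration only.
for N in (10, 100, 1000, 10000):
    print(N, max_gap(N, n=5, theta=0.3))
```

The gap shrinks as N grows, but note what happened along the way: M was an observable count, and theta is only the thing M/N is heading toward.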

In the finite case, the parameters were all known numbers, but in the infinite case we have to wait until, well, we have reached an infinite number of observations before we can claim to have observed all the facts.

In the finite case, given the evidence and previous observations, the probability distribution of future observations is always more spread out (it is more uncertain) than it is if you assume you will have an infinite amount of data. And since we never will see an infinite amount of data, standard results make us more certain than is warranted.
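One way to put numbers on this, and it is only my reading of the comparison: set the variance of the proportion of successes among the N − n unseen items (a beta-binomial with parameters N − n, k + 1, n − k + 1, which is what equal chances on M give) against the variance of the beta distribution on θ with parameters k + 1 and n − k + 1. Function names and example numbers are hypothetical.

```python
from fractions import Fraction


def finite_proportion_variance(N, n, k):
    """Variance of (successes among unseen items) / (N - n) in the finite case."""
    R, a, b = N - n, k + 1, n - k + 1
    # Standard beta-binomial variance of the count, then rescaled to a proportion.
    var_count = Fraction(R * a * b * (a + b + R), (a + b) ** 2 * (a + b + 1))
    return var_count / R**2


def beta_variance(n, k):
    """Variance of the beta(k + 1, n - k + 1) distribution on theta."""
    a, b = k + 1, n - k + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))


# Example numbers are assumptions for illustration only.
n, k = 4, 3
for N in (10, 100, 1000, 100000):
    ratio = finite_proportion_variance(N, n, k) / beta_variance(n, k)
    print(N, float(ratio))
```

The ratio works out to (N + 2)/(N − n), which is always above 1 and falls toward 1 only as N grows without bound.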

That’s the story in 750 words, but if you want to read more, and delve into the math, you can download this preprint. It’s a paper my friend Russ Zaretzki and I wrote, but it was rejected (by the American Statistician) for “poor writing,” a damning criticism for papers supposed to appear in the “Teaching” sections.

This shows you that peer review sometimes works. Because the paper is poorly written. We’re having another go at cleaning up the notation, which proliferated rather profusely.

10 Comments

  1. You could probably write this less clearly. Keep trying.

    An example: what exactly was the point in having M out of N items then mentioning (what seems to me) that M represents all ‘successes’ but the collection of M may be a superset? You definitely lost me around that time. It seems an unneeded introduction of a third variable X. Who really cares that the non-‘successes’ in N may or may not be equal and why? If you actually got around to stating it then I missed it. If it’s inconsequential then why mention it?

  2. DAV,

    Good catch! What you found me doing is what I dislike in others’ writing: needless notation. It proliferates wildly in mathematics.

    If I have a chance, I’ll re-write and eliminate what notation I can.

    Thanks.

  3. Hi Matt,
    I could not get the link to the Zaretzki and Briggs preprint to work.
    Thanks,
    Jon

  4. I don’t quite see why the uniform assumption on the number of successes is the most reasonable one. If we really know nothing about the subset of the population which are successes, then we might assume a prior on the success-subset as uniform on the collection of all 2^N subsets of the population. This gives a distribution on the number of successes M: P(M)=2^(-N)*(N choose M). For largish N, this puts a very strong prior on M being close to N/2.

    “Success” here is a loaded term, isn’t it? It evokes an image of coin flipping as we sample. In this case though, we have a static population of N, with each member already having or not having a certain characteristic. Maybe “positives” and “negatives” would be a better term.

  5. Steve,

    Yes, that’s the difference between Carnap’s c-sharp and c-dagger measures (the 2^N or, say, raw probability). And that is the main “paradox” which caused many to reject Carnap’s ideas.

    But I reject the “2^N” view as unnatural because the tautology, “There may be none, 1, up to N objects,” gives no information except that the event is contingent. And that, due to the axiom I mentioned, gives the uniform result. The uniform result, I emphasize, is not assumed. I’ll put this proof up when I have a chance. Probably in paper form; too much notation otherwise.

  6. Interesting! If I have two kids (N=2), and each either has (=1) or does not have (=0) some characteristic, then without other knowledge it seems that logical probability requires each possibility for the two kids: P(0,0)=P(0,1)=P(1,0)=P(1,1)=1/4. This is nice because the marginal probability for each kid is then P(0)=P(1)=1/2, which again seems consistent with logical probability. But then P(M=0)=P(M=2)=1/4 and P(M=1)=1/2, which is very different from P(M=0)=P(M=1)=P(M=2)=1/3.

    It seems that you are saying that logical probability can be applied to a statistic (M) that describes a phenomenon (the 1-subset), rather than applying it directly to the phenomenon itself and deriving probabilities for statistics from that. It seems we could cook up any number of descriptive statistics. How can we assume equal probabilities on the possible values of each of these statistics without reaching some contradiction? For example, if I define W=floor((M/N)^2), then W will be either 0 or 1, but assigning 1/2 probability to each of these won’t be consistent with P(M=0)=P(M=1)=…=P(M=N).

  7. Steve (and thanks for the reminder Tom),

    If I understand you, it depends on the evidence. All probability is conditional on stated premises, or evidence. The evidence, “It will happen or it won’t,” is a tautology, and the probability of a hypothesis given a tautology is the same as the probability of the same hypothesis given any tautology, for example, “Green is green.” So in your example, what we really have is more information than just the tautology. We have something like, “A child will have just 1 of two possible traits; M is a trait,” and given that, the probability of “The child has M” is 1/2. If we change that first premise to “n possible traits” then the probability of the conclusion (relative to that change) is 1/n. I would write this as Pr(M | Premises) = 1/n.

    I now invite you to frame the premises of the “groupings” conclusion, your P(M=0)=P(M=1)=P(M=2)=1/3. First note that we must write the probability as Pr(M | Evidence), where you must supply the evidence in a form similar to that above. You will quickly see the difficulty of reconciling the two conclusions, because, of course, they are based on different evidence.

    This is actually a fundamental point. We should expand on this.

  8. Hey, thanks for your reply.

    I am not sure you have understood me. I have N=2 kids. Each kid either has or does not have a particular characteristic, I know nothing else about the characteristic. Exactly one subset of my kids is the set of my kids who have the characteristic. Call this subset the “1-subset.” There are 4 subsets, so the logical probability of each being the 1-subset is 1/4. It’s not unlike rolling a 4-sided die, and dice we have discussed before.

    http://wmbriggs.com/blog/?p=2514

    M is the number of my kids who have the characteristic. It is a statistic describing the 1-subset. Thus M is one of three values: 0,1 or 2. To be consistent with my previous paragraph, we need to have P(M=1)=1/2. But according to your post associated with this comment, you would have P(M=1)=1/3, i.e. each value of M is equally likely. You also say this is based on logical probability. It seems that you have applied logical probability to the statistic M, rather than to the underlying phenomenon, i.e. the 1-subset.

    The trouble I see with applying logical probability to a statistic, rather than the underlying phenomenon it describes, is that it can lead to contradictions. Consider your post on the size of a cube

    http://wmbriggs.com/blog/?p=2599

    In that post, you have a finite number of possible cubes and the same number of possible values for certain statistics which describe cube size. We can use volume or edge-length, it doesn’t matter. This is because the values for the statistic and the cubes themselves are in 1-1 correspondence.

    But in the example with my kids, there are 4 possibilities for the 1-subset, but only 3 possible values for the descriptive statistic M. This opens a kettle of worms, since I can cook up another statistic (example in my previous post) with only 2 possible values. Putting a uniform probability on that is not consistent with a uniform probability on M.
