# A Probability Non-Paradox

Before you is a box in which is a slip of paper on which is written either ‘0’, ‘1’, ‘2’, or ‘3’. Given that premise, what is the probability of X = “‘3’ is written”?

Right: it’s 1/4.

Notice, incidentally, the probability does *not* tell us how that number came to be written on the paper, nor why the paper is there, nor how you will “draw” the paper from the box, information which in any case is irrelevant. All that we deduce from the premise is that writing exists and you are partially but not wholly ignorant of what it is.

Different set up. In a new box (or even the same!) are three marbles, each either white or black. Given that premise, what is the probability of X = “The number of white marbles is 0 (or 1, or 2, or 3)”? First note that we deduce there may be *no* white marbles, or no black ones, or any combination of the two, as long as there are 3 in total.

The probability may not be as obvious, and indeed the formal mathematical proof begins with the hypergeometric “distribution”, noting the logical equivalence of constants, and carrying all this forward. You can take my word—or Laplace’s, a more eminent authority, and the man who first derived it—that the calculations produce a probability of 1/4 for 0 white marbles (and 1/4 for each of 1, 2, or 3), which might be in line with your intuition, as suggested by the first example.

Again notice that there isn’t any word about “drawing” the marbles out, how the marbles got their color, or anything else to do with causes. Some thing or things must have caused the color and number of the marbles, but we know not what.

A third set up. In a new new box are three marbles, each either white or black, and we’re going to draw out three of them. Given that premise, what is the probability of X = “The number of white marbles is 0 (or 1, or 2, or 3)”?

Let’s enumerate. Given this premise, we could see any of the following sequences, and number of white marbles:

- W_{1}W_{2}W_{3}, 3
- W_{1}W_{2}B_{3}, 2
- W_{1}B_{2}W_{3}, 2
- B_{1}W_{2}W_{3}, 2
- W_{1}B_{2}B_{3}, 1
- B_{1}W_{2}B_{3}, 1
- B_{1}B_{2}W_{3}, 1
- B_{1}B_{2}B_{3}, 0

This indicates the probability (given the premises) of 0 whites is 1/8, of 1 white is 3/8, of 2 whites is 3/8, and of 3 whites is 1/8.
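The count above can be checked mechanically. The following is a minimal sketch, assuming (as the enumeration does) that each of the eight ordered sequences carries equal weight; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered sequence of three draws, each draw either White or Black.
sequences = list(product("WB", repeat=3))

# Tally how many sequences contain each number of white marbles.
counts = Counter(seq.count("W") for seq in sequences)

# Assumption: all 8 sequences are weighted equally, as in the list above.
white_probs = {k: Fraction(n, len(sequences)) for k, n in sorted(counts.items())}

for k, p in white_probs.items():
    print(f"{k} white: {p}")
```

Using exact fractions rather than floats keeps the result directly comparable with the 1/8, 3/8, 3/8, 1/8 figures in the text.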

Something has changed. The second example, conditional on very similar premises to the third, gave a probability of 1/4 for each possibility, while the third example gives varying answers. What gives? “Paradox!” answer some. Howson and Urbach, in their influential *Scientific Reasoning: The Bayesian Approach*, argue from the apparent paradox to a justification of subjective probability (p. 59-62 in the second edition; pdf first edition). Besides being wrong, this is defeatist. “We don’t know what the probability should be so we can make up whatever *feels* best” can scarcely be a satisfactory answer. (Though it is to all “subjective Bayesians”.)

Still, it’s odd. Similar premises give completely different answers. To see what’s happening, let’s change the premise in the first example so that the slip of paper has the marking “W_{1}W_{2}W_{3}”, or the marking “W_{1}W_{2}B_{3}”, and so on through all eight sequences. Given that premise, what is the probability of the marking “B_{1}W_{2}B_{3}”? Easy: 1/8, and it’s the same for any marking. The probability would also be 1/8 were the premise that the paper had the marking ‘1’, ‘2’, etc., ‘8’.

So what makes writing the labels 0 through 3 the same (in probability) as “In a new box are three marbles, each either white or black”? And why are they dissimilar to “In a new new box are three marbles, each either white or black, and we’re going to draw out three of them”?

Cause; rather, our knowledge of causes. All we require of the markings in the first example is that they be distinct. All? Well, not quite all. We also require that the label, whatever it is, be written in advance, by some (unknown, unspecified) cause. That cause *fixes* the labels or balls in advance. Our knowledge of cause in these cases is extremely limited: we only know that a cause, *or causes*, must have been present, and we know the outcome.

But there is no way to think of drawing out marbles without envisioning some kind of drawing-out cause. If you are practiced at simulation, this will make sense—and all of us are practiced at simulation. Think coin flips. There is no way to imagine, or rather to manufacture, a string of three flips, or three anythings with dichotomous outcomes, that does not make reference to physical causes.

The difference in our understanding of causes between the two situations is real, and huge, at least for very small samples. Once we get going and start taking observations, (it can be shown) the two views “collapse” to the same, especially for “large N” (many observations).

This is not the only system in which the measurement process dramatically changes our perspective, as the not inapt comparison to quantum mechanics reveals. *Anything-we-know-not-what* could have fixed the labels/constituents of the box, whereas not just any old thing could make a string of white/black (0/1, etc.) emerge from a box. Far from being a paradox, the differences in probabilities highlight the importance of measurement and the knowledge which comes with it.

Conclusion? There is no problem with logical probability, a.k.a. probability as argument.

This looks like Gibbs’ paradox or the difference between distinguishable and indistinguishable particles.

Am I missing something? I don’t see 1 out of 8 chances for drawing zero white marbles (or 1, or 2 or 3). I see only 4 possible results not 8. All white, all black, 2 white 1 black, 2 black 1 white.

No paradox. The action of drawing has no functional effect on outcomes 2-3-4 and 5-6-7 of the marbles example. Drawing does make a difference in the order of colors but not the ratio of white to black marbles or the fact that at least one white one is in each group. So the probability of 0 white still is 1/4, not 1/8. Drawing (ie, measurement) subtly introduces another condition that’s irrelevant to the question.

Premises here are stated somewhat loosely. In the first example, we’re not told how the number to be written on the slip of paper was selected. Or, put another way, we’re not told how we know that the slip of paper has 0, 1, 2, or 3 written on it. All the information we have is that there’s a natural number <4 written on a piece of paper. From that information, I don't think the probability that 'X=3' is 1/4. I don't think that premise allows any probability to be defined. In order for the probability to be 1/4, we'd have to know that the choice of 0, 1, 2, or 3 was made via some random process where these four numbers had equal probability.

In the next example, it's not stated that each marble has a 50-50 chance of being black or white, although that's evidently intended.

The next example doesn't say whether the sampling is done with or without replacement. I presume it's the former. If the sampling were done without replacement, then the sample of 3 marbles would be the entire box, and the probabilities would be 1/4 for each combination. The "paradox" seems to be simply the difference between sampling with or without replacement.

In examples 1 and 2 I’m going to agree with David in Cal and say that there is an issue with under-specification, or in your terms, lack of information about what the actual cause is. In these cases, we simply don’t know what the probability is unless we supply additional assumptions. We can only get the answer 1/4 if we make an assumption along the lines of the Principle of Indifference, or the MAXENT rule, which in effect assume we are in a state of maximum uncertainty. But as I’m sure you know well, these rules pertain to bayesian probability rather than logical probability. I think you owe us a little more explanation of how you can get to 1/4 without them.

Bumble,

Some confusion here, evidently owing to classical interpretations of these problems.

The principle of indifference is not used nor is Maxent. To get the first (well known) result, start with the hypergeometric, recognize the symmetry of individual constants, and then “integrate” (add) over the possibilities. This is what Laplace (and everybody) does. Taking it to the limit gives the standard, right-out-of-the-textbooks beta-binomial model. If I can find a link to the math, I’ll put it up.
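As a rough check on the beta-binomial remark, here is a small sketch that evaluates the beta-binomial probabilities for three marbles. It assumes the flat case with both parameters equal to 1, which is an assumption on my part about which beta-binomial is meant; the helper names `beta_int` and `beta_binomial_pmf` are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(x: int, y: int) -> Fraction:
    """Beta function B(x, y) for positive integers: (x-1)!(y-1)!/(x+y-1)!."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))


def beta_binomial_pmf(k: int, n: int, a: int = 1, b: int = 1) -> Fraction:
    """P(k successes out of n) for a beta-binomial with integer parameters a, b."""
    return comb(n, k) * beta_int(k + a, n - k + b) / beta_int(a, b)


# Assumption: a = b = 1 (flat). Probability of 0, 1, 2, 3 white marbles out of 3.
pmf = [beta_binomial_pmf(k, 3) for k in range(4)]
print(pmf)
```

With a = b = 1 the binomial coefficient cancels against the beta function, so every count from 0 to n gets probability 1/(n+1), which for n = 3 is the 1/4 quoted in the post.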

Update: This one from Jim Franklin is a good start (still missing one!): pdf

Symmetry of individual constants? It is from *that* simpler, more fundamental fact about logic from which are derived “indifference” and Maxent rules. I have a paper somewhere…(though this needs to be fixed in parts): http://arxiv.org/abs/math/0701331

David in Cal,

No. It matters not one whit how the paper is selected. How could it? The paper is there with the number on it! It doesn’t even have to be in a box. It just has to exist, and not even physically, only in your intellect. It is the very point that *we do not know* how the label came to be what it is. We do not know what it is, only *that* it is. We do know more about how items must “come into existence” in the third example.

No. It is *not* that each marble has a 50-50 chance of being white. That is *not* intended: that information is *deduced*, not fixed.

Obviously, “sampling” “without replacement.” You’re over-thinking here.

You must use *only* the information supplied in the premises (and that which can be *deduced* from it). You must *not* supply your own information. That’s cheating. And that’s what leads to paradoxes.

Gary,

Measurement is the difference.

Jim,

Yes, you’re missing something.

Scotian (for others, Scotian is a physicist),

Yes, something very like that. Exactly.

Shouldn’t the third probability statement be “The number of white marbles**, drawn out one at a time in order,** is 0 (or 1, or 2, or 3)”? Which clarifies the question, since now we speak of a cause, instead of implying it?

What if we say “The count of the number of white marbles, drawn out together, is 0 (or 1, or 2, or 3)”? Would this then be 1/4?

I draw three out together, 0 black, 3 white.

I draw three out together, 1 black, 2 white.

I draw three out together, 2 black, 1 white.

I draw three out together, 3 black, 0 white.

Nate,

Thank you! A very nice clarification. Though we could leave out “in order” (in your bold).

We don’t need to see his marbles

These are not the marbles we’re looking for

You can go about your business

Move along

Is this like the Monty Hall switcheroo

or

Like the vet who called the groomer to find out the sex(es) of the two dogs he left off … the dog the groomer was grooming was a male … what were the odds the other dog was a male as well?

Oh no not Monty Hall again. I have a headache.

Let p be a real-valued, non-negative function on the set {0,1,2,3} such that p(0)+p(1)+p(2)+p(3)=1. What is p(0)? I say the value of p(0) can’t be deduced from the information given. Mr. Briggs seems to say that p(0)=1/4 can be deduced. I seem to be missing something, as usual.

I bumped into some of this with a conditional probability problem. We assert this. Then, we assert that. The assumption is that math is done instantly, or with much concurrency. But, slowing this down, there is a moment in there when we emit a weak signal to infinity. Then, we finish asserting, and there we are back to the normal outcomes that we get graded for in statistics class.

Bayesians can accept any subjective probability, even something very wrong, because the Bayesian process keeps improving the priors until the Bayesians eventually get the correct answer. Feedback is key. If Frequentists want to assume that the prior is the answer, whose problem is that?

Briggs — Thanks for the link to Franklin’s paper on Logical Probability. It explains what you’re talking about and that there is a legitimate school supporting your POV.

David in Cal,

Well, even if all was a product of my fevered imagination, and no other person said what I’m saying, we’d still have to decide whether each proposition I asserted was true or false based on argument.

David W. Locke,

You’re stating the standard empirical bias that is built in to all classical stats. Quick example (I have another post in the works on this). Premise E = “A fairy, pixie, and gnome are in a room and only one will come out.” Proposition of interest, X = “The gnome comes out” has probability 1/3 relative to that evidence. A subjectivist can say, “I *feel* the probability is 1.3876%”. And you cannot *prove* him wrong.

Lastly, no Bayesian “updating” will ever save you here. There will never be any chance of “long runs”, or experiments of any kind.

This post seems very disturbing (yet enjoyable). I don’t really know what “cause” means here, but I wonder why it is deemed to be the culprit, rather than ordering. Ordering implies the ability to distinguish the three marbles from each other even if they have the same color, so it provides new information about the marbles. Without ordering information, the only cases are: BBB, BBW, BWW, and WWW.

Suppose we prepare a set up equivalent to set up 2 by starting with set up 1. We have an assistant first pick the number from 0 to 3, drop that number of white marbles into the box, followed by enough black marbles to make the total number of marbles equal 3. Then the box is shaken, randomizing the locations of the marbles. So long as we don’t know the number our assistant chose, then even without appealing to Laplace, this will yield the probability profile claimed for set up 2.

But now, if we draw the marbles from that box, we should find (from the symmetry of black and white) that the first marble has a 50% chance of being white. Suppose that it is white. Then this rules out only the case where our assistant chose 0, leaving 1, 2, and 3 as possibilities, though no longer equally probable: a white first draw is more likely the more white marbles there are, so Bayes' rule gives them probabilities 1/6, 1/3, and 1/2. (If the marble had been black, then it would have ruled out the case where our assistant chose 3, leaving 0, 1, and 2 as possibilities with probabilities 1/2, 1/3, and 1/6.)

In an information-theoretic sense, we have been inquiring sub-optimally about the number our assistant picked. An optimal first query would be something like “did you pick an even number”, which would also have a 50% probability of either answer, but which would reduce the number of possibilities from 4 to 2, rather than from 4 to 3 as was the case when our first “query” was to draw a marble from the box.

In our sub-optimal way of inquiring with draws, it takes us 3 queries (i.e. draws) rather than the optimal 2 queries to determine what number our assistant picked (i.e. how many white marbles were put into the box). But because each of our three queries has two possible answers, if these answers have equal probability we are getting enough information to decide a 1 of 8 problem, despite that our original set-up was only derived from a 1 of 4 problem.
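One way to put numbers on "sub-optimal" is to compare how many bits each yes/no query reveals about the assistant's number. The sketch below is an illustration, not part of the original argument: it assumes the assistant's number is uniform on 0 to 3 and measures each query by its mutual information with that number; `entropy` and `information_gain` are hypothetical helper names.

```python
from fractions import Fraction
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)


# Assumption: the assistant's number k is uniform on 0..3.
prior = {k: Fraction(1, 4) for k in range(4)}


def information_gain(p_yes_given_k):
    """Mutual information (bits) between a yes/no query and the number k."""
    p_yes = sum(prior[k] * p_yes_given_k[k] for k in prior)
    h_answer = entropy([p_yes, 1 - p_yes])
    h_answer_given_k = sum(
        float(prior[k]) * entropy([p_yes_given_k[k], 1 - p_yes_given_k[k]])
        for k in prior
    )
    return h_answer - h_answer_given_k


# Query A: "is the first marble drawn white?"  P(yes | k whites) = k/3.
draw_gain = information_gain({k: Fraction(k, 3) for k in range(4)})
# Query B: "did you pick an even number?"  The answer is certain given k.
parity_gain = information_gain({k: Fraction(int(k % 2 == 0)) for k in range(4)})

print(f"first draw: {draw_gain:.3f} bits, parity question: {parity_gain:.3f} bits")
```

Both queries have a 50% chance of either answer, but the parity question is fully determined by the number and so yields a full bit, while the colour of the first marble is only partly determined by it and yields less.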

So our method of using draws leads to an over-specification of the situation in the box, and that over-specification relates to our imposing information about permutations that was not present in the original set up. For example, with this set up, W1B2B3, B1W2B3, and B1B2W3 are all really the same case (assistant chose 1 white marble), but our drawing procedure has split them into three cases, actually generating information that was not present in the original set up.

Given our setup, the probabilities of the various draws are…

0 white marbles

P(B1B2B3) = 1/4

1 white marble

P(B1B2W3) = 1/12

P(B1W2B3) = 1/12

P(W1B2B3) = 1/12

2 white marbles

P(B1W2W3) = 1/12

P(W1B2W3) = 1/12

P(W1W2B3) = 1/12

3 white marbles

P(W1W2W3) = 1/4

So, although the probability of drawing a white marble in any of the three draws is 50%, the conditional probability of drawing a white marble, given the knowledge of prior draws and that the setup was independent of ordering, is different. For example,

P(W1) = 1/2

P(W2|W1) = 2/3

P(W3|W1W2) = 3/4
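These figures can be reproduced by brute force. The sketch below assumes the set up described in this comment (the number of white marbles is uniform on 0 to 3, and all three marbles are then drawn without replacement in a uniformly shuffled order); `sequence_probs` and `prob_prefix` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probs(n=3):
    """P(ordered colour sequence) when the white count k is uniform on 0..n
    and all n marbles are drawn without replacement."""
    probs = {}
    for k in range(n + 1):
        box = "W" * k + "B" * (n - k)
        orders = list(permutations(box))  # n! equally likely draw orders
        for order in orders:
            seq = "".join(order)
            weight = Fraction(1, n + 1) / len(orders)
            probs[seq] = probs.get(seq, Fraction(0)) + weight
    return probs


probs = sequence_probs()


def prob_prefix(prefix):
    """Probability that the draws start with the given colours."""
    return sum(p for seq, p in probs.items() if seq.startswith(prefix))


p_w1 = prob_prefix("W")
p_w2_given_w1 = prob_prefix("WW") / p_w1
p_w3_given_w1w2 = prob_prefix("WWW") / prob_prefix("WW")
print(probs["BBB"], probs["WBB"], p_w1, p_w2_given_w1, p_w3_given_w1w2)
```

Summing the sequence probabilities by number of whites recovers 1/4 for each count, which is the sense in which this set up stays equivalent to set up 2.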

I agree with you that the issue is measurement. I think this kind of thing is closely related to the statistics of identical particles in quantum physics, where swapping identical particles does not actually create a distinct physical state of the system and measurement often creates information that was not present in the system before the measurement took place.

Now if one marble was 10mm in diameter, one 12mm and one 14mm and you had to take them out in order, would it change anything?

Thanks for the reference to the Franklin paper: it is a good summary. But it seems to me that Franklin’s criticism of bayesianism is almost entirely aimed at the subjective variety that doesn’t mind where you get your priors. The kind that Jon Williamson calls empirically-based (where your priors are constrained to agree with known frequencies) seems to be most similar in practice to logical probability; and both have the drawback that in some cases there is no objective way to choose priors, so you are left with inequalities rather than crisp values. True objective bayesianism adds the constraint of assuming maximum uncertainty (or equivocation as Williamson calls it) and I still don’t see exactly how the logical probabilist justifies this assumption. The nearest Franklin gets to such a justification is the Dutch Book argument at the bottom of p292, but this may not fully generalize.

I’m still not convinced about case 1 (and equivalently 2). That there are four possible outcomes does not in and of itself say anything about the probability of those outcomes, except that Pr(A or B or C or D) = 1 and Pr(~(A or B or C or D)) = 0. We might *assume* that, in the absence of other information, all outcomes are equally likely, but that looks to me like we’re making up a minimally simple model in the absence of any other information, rather than making any useful statement about reality.

If I have understood correctly, you have stated several times that randomness is not a cause; Pr(A|all possible information) is either 1 or 0. Rather, probability is an attempt to model what we don’t know. As such, the utility of a probability estimate is determined by the accuracy of the model.

So, case 1 is not Pr(X = 0) = 1/4. It is Pr(X = 0 | all outcomes are equally likely) = 1/4. If we have information that allows us to be more accurate about the givens (as in case 3), then our model (and thus probabilities) changes accordingly.

No?

Andrew,

You seem to be favoring the frequentist notion of probability. Instead, think of probability as a measure of certainty. The proposition is Pr(X = 0 | four possible values). With only that given, your certainty (or uncertainty) of the value of X must be the same for all values so Pr(X = 0 | four possible values)=1/4.

Or, one could say that we don’t even know how much we don’t know, and thus any probability we assign is arbitrary (save that by definition the sum of all possibilities must be 1). There’s a world of difference between saying “the most pragmatic estimate is 1/4” and “the probability is 1/4”. Unless the latter is shorthand for the former?

Let’s say that I happen to know the number was written by someone who rather likes ‘zero’. I might then claim that the probability of a zero is 2/5, and the rest is 1/5. You, not knowing this, claim that it’s 1/4. My extra knowledge makes my model more likely to be useful (I am deliberately avoiding the word accurate), but there’s still information that neither of us is privy to. We’re both estimating based on what we know (or don’t know), and remain ignorant of other factors which might refine our model.

Here’s the kicker: what empirical method could we use to determine whether your estimate is better than mine, or not? Or are we purely in the realm of philosophy?

Andrew,

Thank you.

*All* probability is conditional. The premises I gave are *the* conditions, and the *only* conditions to be used in figuring the answers. When you say, “Let’s say that I happen to know the number was written by someone who rather likes ‘zero’”, etc., you are changing the conditions, changing the premises. Therefore, it is not surprising that you come to a different probability.

There is no empirical method we can use to determine many questions. Given E = “The USA did not enter WWI”, what is the probability of X = “There was no WWII in Europe”? This is a perfectly understandable counterfactual question, and one often debated. Probability by argument can answer. But not empirically.

The model in the first example is *deduced*, and is therefore accurate. Not all probabilities can be empirically checked.

More to come.

DAV,

Exactly.

Bumble,

See the words about empiricism. Not all probabilities are empirical. Williamson also adds to probability (to all lists of premises) by insisting that the given probability be “maximally” something-or-other. I forget the word he used. He insists that all probabilities be single numbers. Which is weird.

Shack Toms,

Don’t forget all probability is conditional, therefore all your equations as written are technically incorrect. Trivial, perhaps, but writing them correctly helps to remember exactly what premises we’re working with—and which we’re not.

Also, about cause: you’re right that I owe a more thorough explanation, which I’ll provide later. Briefly, cause has four aspects: formal, material, efficient, and final. All are involved.

(pushing this to see what I’m not understanding)

Consider the test: I shake a fair 6-sided die in a cup and then tip it out. What is the probability of each result? We would all say 1/6. We also know that this isn’t strictly true, as if one could model the starting position and physical forces on the die well enough we could accurately predict how it will land, but on any pragmatic model we treat all results as being equally likely.

Now consider your number test: In one sense, it is the same. In another, it is different, because you haven’t specified any form of modelling that we can use to predict the outcome beyond the existence of four distinct results.

With the die, we can argue from both experience and theory that we would need extremely accurate measuring tools to improve our prediction beyond 1/6. With the number sampler, we’re picking a model out of the air. We know that *some* mechanism existed by which the numbered paper got into the box, but without more information there’s no basis for arguing that a biased or unbiased method was chosen. The conditionals given serve to remove possibilities (the possibility that the result is 0,1,2 or 3 is 1), but they provide no information to give us confidence in any estimate beyond that.

Follow-up question: is there a pragmatic mathematical argument for assuming a 1/4 probability where no model is specified? I can think of situational arguments, such as a wager. Given the initial premise, one would be foolish to wager at less than 3-1 odds, in which case I am assuming that the estimate of 1/4 is minimally useful.

Andrew,

What does *probability* mean to you?

As used here it is a measure of certainty or level of knowledge. It’s not a statement about slips of paper or faces of a die or how often they may turn up. It’s not a model. It’s a statement (an assertion, even) about how much is known. In Pr(X=0|four values) only one thing is known and that is X can be one of four values. Given *only* that knowledge your certainty in X taking any particular value must be the same as that of X taking any other value.

One of the problems when using these examples is the tendency for people to insert information that’s not present such as:

“With the die, we can argue from both experience and theory.”

Andrew,

It may help to think of the problem in the following way. We are equally ignorant about the “writer’s affinity for numbers” or any other conditions. Therefore, our state of knowledge or belief must be that the probability is 1/4. This is a different perspective than thinking to yourself ‘What is the “random process” that generated these numbers?’ Instead, ask yourself how best to quantify your state of belief/information about the problem.

If that doesn’t help, then ignore what I said. I may be wrong.

Will

Premise / background information –

Question –

Let me state the question in a way that clearly differentiates the variable X and the values it can assume.

Solution #0 –

Some people say they don’t know, as there is no reason why a state of ignorance needs to be expressed as a probability distribution.

Solution #Mystery –

Given the background information, I am puzzled as to how Laplace could prove the equal probability by starting with a hypergeometric distribution (HG) involving some sort of draws and ending with the equal probability. No way! I can be proved wrong easily just by a reference link!!! Yes, I’d take Laplace’s word.

There is an urn drawing version of proof for Laplace’s law of succession that uses HG, but it is a totally different story.

~to be continued~

Solution #1 – A possible outcome of 0,1,2,3

Some people apply principle of indifference (http://en.wikipedia.org/wiki/Principle_of_indifference) due to ignorance. That is, using classical probability (not frequentist probability) approach to assign equal probability to each of the 4 outcomes over which we are indifferent. Hence, ¼ for each of the outcomes described.

(Note the equal probability assignment yields a situation of maximum uncertainty in prediction, i.e., with a maximum entropy.)

Solution #2 – A different description of the possible outcomes, no drawing of any sort.

Given that premise, let’s call the three marbles A, B, and C. There are 8 possibilities, just like those 8 listed in the post,

O1. A, B, and C are all black,

O2. A is white, B is white, C is black

…, (magical ellipsis)

O7. B is black, C is black, A is white

O8. A, B, and C are all white

An equal probability of 1/8 is assigned to each. Now, examining all 8 outcomes, the probability distribution of X is the one given in the post, the case in which Briggs inserts the unnecessary condition of “drawing”: a binomial probability distribution with n=3 and p=1/2.

Solution #1 and Solution #2 are not the same because of the re-description of the possible outcomes. Hence the well-known problem / paradox (?) of the principle of indifference. Bertrand’s paradox is another example: http://en.wikipedia.org/wiki/Bertrand_paradox_%28probability%29.

Another issue raised by philosophers of probability is the conflation between classical probability and logical probability.

I define “probability” as “a quantified estimate of the likelihood of a particular event occurring, given a model of how the event space is generated”.

The definition in use here seems to add another level of indirection, more along the lines of “a quantified estimate of how much information we have about the likelihood of a particular event occurring”.

The latter is a measure of information; the former a measure of the output of a model.

“a quantified estimate of how much information we have about the likelihood of a particular event occurring”

Close, but it’s not an estimate. In the Pr(X=0|four values) case it is exact. Nor is it another level of abstraction.

Note the absence of a model. How X got to be zero is irrelevant. If you ponder a while you may see that a model is just what you think you know (and/or maybe hope to prove) about something. Any Pr(event|model) is still the level of knowledge or certainty in the event. The “event” (e.g., X=0) is you discovering the value of X, i.e., acquiring knowledge and not about X becoming that value. It may or may not be based on what you think is driving X. In the example it is not based on causes of X values.

The expression really should be Pr(discovering X=0 | what we know) or, if an action culminating in a result, Pr(discovering the outcome|what we know or are pretending to know). The probability is all about the observer and not what’s being observed. In a sense, it is subjective as your knowledge might be different than someone else’s.

I’m having trouble seeing how the 2nd case analysis is using ALL of the provided information.

The 8 possibilities for the 3 draws follow from the known fact that there are only white and black balls in the box. However, the problem also specifies that there are only 3 balls in the box and that is incompatible with the 8 possibilities being equally likely.

If we call p(3w) the probability that there are 3 white balls in the box, then the fact that there are only 3 balls tells us that p(3w)=p(2w)=p(1w)=p(0w)=1/4 – that’s what was established in the first case. Given that, the probability of www should be:

p(www|3w)*p(3w) + p(www|2w)*p(2w) + p(www|1w) *p(1w) + p(www|0w)*p(0w) =

1*1/4 + 0*1/4 + 0*1/4 + 0*1/4 = 1/4.

I don’t see how this conclusion uses any information that wasn’t provided. It’s just using the information about the number of balls that was ignored in coming up with the 1/8 value.

Dr. Briggs,

I believe Shack Toms and Bob are right. The initial information (let’s call it I) is identical in the three setups, so you can’t just count the number of possible orderings in the third setup and divide by the total. He (Shack) may have used incorrect notation, but his results are right. As an example, if X = the number of white balls drawn, then:

P (W1W2B3 / I) = P (W1W2B3 / X =0, I) x P ( X = 0 / I) + P (W1W2B3 / X =1, I) x P ( X = 1/ I) +P (W1W2B3 / X =2, I) x P ( X = 2/ I) + P (W1W2B3 / X =3, I) x P ( X = 3 / I) = 1/12.

which is different from your answer of 1/8.

If you ask for P ( X = 2 / I) you’ll get:

P ( X = 2 / I) = P ( X =2 / W1W2B3, I ) x P ( W1W2B3 / I) + P ( X =2 / W1B2W3, I ) x P ( W1B2W3 / I) + P ( X =2 / B1W2W3, I ) x P ( B1W2W3 / I),

P ( X = 2 / I) = 1 x (1/12) + 1 x (1/12) + 1 x (1/12) = 3/12 = 1/4.

which is different from your answer of 3/8.

P.S.: calculations done assuming no replacement.