William M. Briggs

Statistician to the Stars!

Subjective Versus Objective Bayes (Versus Frequentism): Part Final: Parameters!

ET Jaynes, Chance Master

(All the stuff in this series is, in a fuller form, in my new upcoming book, which is tentatively called Logical Probability and Statistics—but I’ve only changed the title 342 times, so don’t count on this one sticking. Incidentally, I’m looking for a publisher, so if you have a definite contact, please email me.)

Read Part IV.

From where do parameters emerge? And what is the difference between treating them objectively and subjectively? A lot of controversy here. The difficulty usually begins because the examples people set themselves to discuss these subjects are so far advanced, with so many (hidden) assumptions built in, that understanding becomes impossible. From our previous examples, we have seen it is better to start small and build slowly. We can’t avoid confusion this way, but we can lessen it.

If you haven’t reminded yourself of the last two posts on the statistical syllogism, do so now, for it is assumed here.

Premise: “There are N marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture of these.” Proposition: “I pull out a white marble.”

What is the probability this proposition is true given this and only this premise? Well, what do we know? Black marbles are a possibility, and so are white. Green ones are not, nor are any other colors. We also know that the number of white marbles may be 0, 1, 2, …, N. And the same with black, but with a rigid symmetry: if there are j white marbles then there must be N – j black ones. (There may be other things in the bag, but if so, we are ignorant of them.)

This is not like the example where all we knew was that “Q was a possible outcome”, and where we assigned (0, 1] to the probability “A Q shows”, because there the number of different kinds of possibilities were unknown, and could have been infinite. Here there are two known possibilities. And N is finite.

Suppose N = 1. What can we say? “There are 2 states which could emerge from this process, just one of which is called white, and just one must emerge.” The phrasing is ugly, and doesn’t explicitly show all the information we have, but it is written this way to show the continuity between this premise and the one from last time (i.e. “There are n states which could emerge from this process, just one of which is called Q, and just one must emerge” and “A Q emerges”).

The probability of “I pull out a white marble” is, via the statistical syllogism, 1/2—again, when N = 1. This accords with intuition and with the definite, positive knowledge of the color and number of marbles we might find.

Now suppose N = 2. The number of possibilities has surely increased: there could be 0 whites, just 1 white, or 2 whites. But we’re still interested in what happens when we pull out just one and not more than one. That is, our premise is “There are 2 marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture.” Proposition: “I pull out a white marble.”

If there were 0 whites, the probability of pulling out a white is clearly 0. If there were 1 white, we have direct recourse to the statistical syllogism and conclude the probability of pulling out a white is 1/2. If both marbles were white, the probability of pulling out a white is 1. Agreed?

But what, given our premise, is the probability that both marbles aren’t white? That just 1 is? That both are? Easy! We just use the statistical syllogism again. As: “There are 3 states which could emerge from this process, and only one of these three must emerge, which are 0 whites, 1 white, 2 whites” and “The state is 0 whites” or “1 white” or “2 whites”, and the “process” is whatever it was that put the marbles in the bag. Evidently, the probability is, via the statistical syllogism, 1/3 for each state.

It now becomes a simple matter to multiply our probabilities together, like this (skipping the notation):

     1/3 * 0 + 1/3 * 1/2 + 1/3 * 1 = 1/6 + 2/6 = 1/2,

where this is the probability of each state of the number of white marbles times the probability of pulling a white marble given this state, summed across all the possibilities (this is allowed because of easily derived rules of mathematical probability).

With N = 2, the answer for “I pull out a white marble” is again probability 1/2. The same is true (try it) for N = 3. And for N = 4, etc. Yes, for any N the probability is 1/2.
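For readers who would rather check than trust, here is a minimal sketch of that sum in Python, using exact fractions. The function name `prob_first_white` is made up for illustration; the only assumptions are the ones already in the premise: each count of whites from 0 to N gets probability 1/(N+1), and the chance of drawing a white given j whites is j/N.

```python
from fractions import Fraction


def prob_first_white(N):
    """P(first draw is white) when each count j = 0..N of whites has probability 1/(N+1)."""
    state_prob = Fraction(1, N + 1)  # statistical syllogism over the N+1 possible states
    # Sum over states: P(state) * P(white | state)
    return sum(state_prob * Fraction(j, N) for j in range(N + 1))


for N in (1, 2, 3, 4, 10, 1000):
    print(N, prob_first_white(N))
```

Every line printed should show 1/2, whatever N is.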

This is the same answer Laplace got, but he made the mistake of using the chance the sun would rise tomorrow for his example. Poor Pierre! Not a thing wrong with his philosophy, but every time somebody heard the example he turned into a rogue subjectivist and would not let Laplace’s fixed premises be. Readers could not, despite all pleas, stop themselves from saying, “I know more about the sun than this bare premise!” That is a true statement, but it is cheating to make it, as all subjectivist substitutions are.

Emphasis: we must always take the argument as it is specified. Adding or subtracting from it is to act subjectively, that is, it changes the argument.

We don’t have to use marbles in bags. We can say “white marbles” are “successes” in some process, and “black marbles” failures. As in it is a success if the sun rises tomorrow, or that a patient survives past 30 days, or more than X units of this product will be sold, and on and on. If we know nothing about some process except that only successes and failures are possibilities, and that there may be none of either (but there must be some of one), and that we have N “trials” ahead of us, and N is finite, then the chance the first one we observe is a success is 1/2.

A “finite” N can be very, very, most very large, incidentally, so there is no practical limitation here. Later we’ll let it pass to the limit and see what happens.

This is the objectivist answer, which takes the argument as given and adds nothing about physics, biology, psychology, whatever. Of course, in many actual situations we have this kind of knowledge, and there is nothing in the world wrong with using it, provided it is laid out unambiguously in the premises and agreed to. To repeat: but if all we had was only the simple premise (as we have been using it), then the answer is 1/2 regardless of N.

The emergence of parameters and models

Since we have the full objectivist answer to the argument, it’s time to change the argument above to something more useful. Keep the same premise but change the conclusion/proposition to “Out of n < N observations, j will be successes/white,” with j = 0,1,2,…,n. In other words, we’re going to guess how many successes we see from the first n (out of N). Did we say same premise? Emphasis: same premise, no outside information. Resist temptation!

If n = 1, we already have the answer (the probability is 1/2 for j = 0, and j = 1). If n = 2, then j can be 0, 1, or 2. The statistical syllogism comes to our aid as before and in the same way. Before we’ve seen any “data”, i.e. taken any observations, i.e. have any knowledge about how many successes and failures lie ahead of us, etc., the probability of j equaling any number is 1/3 because there are 3 possibilities: no successes, just one, both. The result also works for any n: the probability is 1 / (n+1), and this works with n all the way up to N.
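The 1 / (n+1) result can also be reached the long way, as a check. The sketch below is one way to do it, not the only one: it assumes each possible number of whites K in the bag has probability 1/(N+1), as before, and that n marbles are drawn without replacement (so the chance of j whites given K is hypergeometric). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_j_successes(N, n, j):
    """P(j whites among the first n draws), averaged over the N+1 equally probable bag contents."""
    total = Fraction(0)
    for K in range(N + 1):  # K = number of whites in the bag
        # Chance of j whites in n draws without replacement from a bag with K whites
        draw = Fraction(comb(K, j) * comb(N - K, n - j), comb(N, n))
        total += Fraction(1, N + 1) * draw
    return total


# Each j from 0 to n should come out as 1/(n+1), here 1/5
print([prob_j_successes(10, 4, j) for j in range(5)])
```

Changing N while holding n fixed should leave the answers untouched, which is the independence from N claimed in the text.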

We’re done again. So now we must pose new arguments. Keep the same premise but augment it by the knowledge that “We have already seen n1 successes out of the n observations” where, of course, n1 is some number between 0 and n. We need a new conclusion/proposition. How about “The next observation will be a success/white marble”, where “next” means the “n+1”th.

In plainer words, we have some process where we’re expecting successes and failures (N of them, where N may be very exceptionally big), and we have already seen n1 successes out of n observations, and now we want to know the probability that the next observation will be a success. Make sense?

Notice once more we are taking the objectivist stance. There are no words, and no evidence, about any physicality, nor any about biology, nor anything. Just some logical “process” which produces “successes” and “failures.” We must never add or subtract from the given argument! (Has that been mentioned yet?)

Proving the answer now involves more math than is economical to write here. But it comes by recognizing that we have taken n marbles out of the bag, leaving N – n. Of these first n, some might be successes, some failures, with n1 + n0 = n. We’ll use these observations as new evidence, that is new premises, for ascertaining the probability that the next (as yet unknown) observation is a success.

The probability of seeing a new success (given our premise) is (n1 + 1) / (n + 2). If we only pulled n = 1 out and it was a success, then the probability the next observation is a success is 2/3; if the first observation was a failure then the probability the next observation is a success is 1/3. Again, this is independent of N! But it does assume N is finite (though there is no practical restriction yet).
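The full proof is skipped here, but the result can at least be checked numerically. The following is a sketch under the same assumptions as before: a bag of N, each count of whites K equally probable beforehand, draws without replacement, and n < N so that a next marble exists. The names are invented for the example.

```python
from fractions import Fraction
from math import comb


def prob_next_success(N, n, n1):
    """P(draw n+1 is white | n1 whites seen in the first n draws), for a bag of N with n < N."""
    n0 = n - n1
    numerator = Fraction(0)
    denominator = Fraction(0)
    # Only bags with at least n1 whites and at least n0 blacks are compatible with the data
    for K in range(n1, N - n0 + 1):
        # Proportional to P(K | data); the constant factors 1/(N+1) and 1/C(N, n) cancel
        weight = comb(K, n1) * comb(N - K, n0)
        # After the n draws, K - n1 whites remain among N - n marbles
        numerator += weight * Fraction(K - n1, N - n)
        denominator += weight
    return numerator / denominator


print(prob_next_success(2, 1, 1))     # one success seen
print(prob_next_success(50, 1, 0))    # one failure seen
print(prob_next_success(100, 10, 7))  # 7 successes in 10
```

The three results should agree with (n1 + 1) / (n + 2), namely 2/3, 1/3 and 2/3, and rerunning with a different N should not move them.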

This answer is also the same as Laplace got. Where he went wrong was to put in an actual value for n: what he thought was the number of days that the sun had already come up successfully, and then used the calculation to find the probability (given our premise) the sun would come up tomorrow. Well, this was some number not nearly equal to 1, because Laplace didn’t think the world that old. Since the probability was not close to 1, and people thought it should be, logical probability took a big hit from which it is only now recovering. His critics did not understand they were changing his argument and acting like subjectivists.

Let’s play with our result some more. It’s starting to get good. Suppose n is big and all the observations are successes (n1 = n), then (n1 + 1) / (n + 2) approaches 1. If we let the “process” go to the limit, then the probability becomes 1, as expected. The opposite is true if we never see any successes: at the limit, the probability becomes 0. Or, say, if we saw n1 = n/2, i.e. our sample was always half successes, half failures, then at the limit, the probability goes to 1/2.

It is from this limiting process that parameters emerge. Parameters are those little Greek letters written in probability distributions. But before we start that, let’s push our argument just a little further and ask, given our premise (which includes the n observations), not just what the next observation will be, but what the next m observations will be. Obviously, the number of successes among them can be 0 or 1 or 2 etc., all the way up to m.

The probability that the number of successes takes any of those values turns out to be what is called a beta-binomial distribution, whose “parameters” are m, n1 + 1, and n0 + 1. Parameters is in quotes because here we know their exact values; there is no uncertainty in them; they have been deduced via the original premise in the presence of our observations. Notice that this result is independent of N, too (except for the knowledge that n and n + m are less than or equal to N).

The beta-binomial here is called a “predictive” distribution, meaning it gives the probability of seeing new “data” given what we saw in the old data.
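To see that the beta-binomial really does fall out of the finite bag, one can compute the predictive probabilities both ways and compare. The sketch below does so with exact fractions. It assumes the standard beta-binomial form C(m, k) B(k + a, m - k + b) / B(a, b) with a = n1 + 1 and b = n0 + 1, and the same bag premise as before (N marbles, each count of whites equally probable beforehand, draws without replacement, n + m no larger than N). All function names are illustrative only.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def beta_binomial(k, m, a, b):
    """Beta-binomial probability of k successes in m trials with integer parameters a, b."""
    return comb(m, k) * beta_fn(k + a, m - k + b) / beta_fn(a, b)


def bag_predictive(N, n, n1, m, k):
    """P(k whites in the next m draws | n1 whites in the first n), bag of N, n + m <= N."""
    n0 = n - n1
    num = Fraction(0)
    den = Fraction(0)
    for K in range(n1, N - n0 + 1):
        weight = comb(K, n1) * comb(N - K, n0)  # proportional to P(K | data)
        # What remains in the bag: K - n1 whites and N - K - n0 blacks, N - n in all
        num += weight * Fraction(comb(K - n1, k) * comb(N - K - n0, m - k), comb(N - n, m))
        den += weight
    return num / den


N, n, n1, m = 30, 6, 4, 5
for k in range(m + 1):
    print(k, bag_predictive(N, n, n1, m, k), beta_binomial(k, m, n1 + 1, n - n1 + 1))
```

The two columns should match for every k, and setting m = 1 recovers the (n1 + 1) / (n + 2) result for the single next observation.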

In other words, we don’t need subjective (or objective) priors when we can envision a process which is analogous to a bag with a fixed number of successes and failures in it (with total N), and where we take observations n (or m) at a time. That’s it. This is the answer. We have deduced a model in which unobservable parameters are not needed. No Greek letters here. The answer is totally, completely objective Bayes.

There is no subjectivity, no talk of “flat” or “uninformative” priors, no need for frequentist notions, no hypotheses, nothing but one bare premise, some rules of mathematical probability accepted by everybody (because they are true), and we have an answer to a question of real interest. Ain’t that slick!

Where It Starts To Go Wrong

Everything worked because N was finite. We lived in a discrete, finite world, where everything could be labeled and counted. Happily, this world is just like our real world, a hint that difficulties arise when we leave our real world and venture to the abstract world of infinites.

We could let N go to the limit immediately (before taking observations). Now no real thing is infinite. There never has been, nor will there ever be, an infinite number of anything, hence there cannot be an infinite number of observations of anything. In other words, there just is no real need for notions of infinity. Not for any problem met by man. Which means we’ll go to infinity anyway, but only to please the mathematicians who must turn probability into math.

If N goes to the limit immediately, the number of successes and failures, or rather the ratio of the number of successes (in our imaginary bag) to N becomes an unobservable quantity, which we can call θ. If N passes to the limit, θ can no longer ever equal 0 or 1, as it could in real life, but will take some value in between.

We now have a brand-new problem in front of us. If we want to calculate the probability that the first observation is a success, we have to specify a value for θ by fiat, i.e. arbitrarily and completely subjectively, or we have to let it take each possible value it can take and then multiply the probability it takes this value by the probability of seeing a success given this θ. And then add up the results of each multiplication for each value of θ.

That’s easy to do with calculus, of course. But it leaves us the problem of specifying the probability θ takes each of its values (the number of which is now infinite). There is no other recourse but to do this subjectively. We could say (subjectively) that “Each value of θ is equally likely” or maybe “Let’s assume a flat prior for θ” which is the same thing, or we could say, “Let’s pick a non-informative prior for θ” which is also the same thing, but which involves invoking other probability distributions to describe the already unknown one on θ.

Is there any wonder that frequentists marvel at this sort of behavior? The “prior” appears as it is: arbitrary and subject to whim. (Of course, the frequentist after getting this argument right falsely assumes that frequentism must therefore be true; the fallacy of a false dichotomy.)

It turns out, in this happy case, that if a “flat” prior is assumed, the final answer is the same as above; i.e. we end with a beta-binomial predictive distribution. We start with a “binomial” and get a beta “posterior” on θ, but since we can never observe θ, it is another wonder that anybody ever cares about it. People do, though. Care about θ, to an extent that borders on mania. But that people do is a holdover from frequentist days, where “estimation” of the unobservable is the be-all and end-all of statistics. Another reason to cease teaching frequentism.
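The agreement with the finite answer can be checked for the single next observation. With a flat prior, the posterior on θ after n1 successes and n0 failures is proportional to θ^n1 (1 – θ)^n0, and the probability of a success next is the ratio of two integrals over (0, 1). The sketch below assumes only the standard closed form for such integrals with whole-number exponents, s! f! / (s + f + 1)!; the function names are made up.

```python
from fractions import Fraction
from math import factorial


def beta_integral(s, f):
    """Exact integral of theta^s * (1 - theta)^f over (0, 1), for integers s, f >= 0."""
    return Fraction(factorial(s) * factorial(f), factorial(s + f + 1))


def flat_prior_next_success(n1, n0):
    """P(next is a success | n1 successes, n0 failures) under a flat prior on theta."""
    # Numerator carries one extra factor of theta: the chance of a success given theta
    return beta_integral(n1 + 1, n0) / beta_integral(n1, n0)


print(flat_prior_next_success(0, 0))  # no data yet
print(flat_prior_next_success(1, 0))  # one success seen
print(flat_prior_next_success(7, 3))  # 7 successes, 3 failures
```

These should print 1/2, 2/3 and 2/3, the same values (n1 + 1) / (n + 2) gives in the finite case.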

But the coincidence of the finite-objectivist answer with the mathematical-subjectivist one in this case is why subjectivism seems to work: it does, but only when the parameters are themselves constrained. Here θ can “live” only in some demarcated interval, (0,1). But if the parameter is itself without limit, as in the case of e.g. “normal” distributions, then we can quickly run into deep kimchee.

The problem comes from going to infinity too quickly in the process, a dangerous maneuver. Jaynes highlighted this to devastating effect in his book, destroying a “paradox” caused by beginning at infinity instead of ending with it. (See his Chapter 15 on the “marginalization paradox.”) If there is any blemish in Jaynes’s work, it is that he failed to apply this insight to every problem. In his derivation of the normal, for example, he began with the slightly ambiguous premise that measurement error can be in “any” direction. Well, if “any” means any of a set of directions measurable (by human instrument), then we are in business. But if it meant any in an infinite number of directions, then we begin where we should end.

All probability and statistics problems can be done as above: starting with a simple premise, staying finite until the end, after which passing to the limit for the sake of ease in calculation may be done, and proceeding objectively, asking questions only about “observable” quantities. That is the nature of objective, or logical, probability.

Questions?

If there are a lot of questions, I might add an addendum post to this series. But please to read all the words above (at least once) before asking. On the other hand, I may just wait for the book.


69 Comments

  1. Briggs:

    The situation that you describe can be summarized by stating that Bayesian parameter estimation suffers from the lack of uniqueness of the non-informative prior with consequent violation of the law of non-contradiction. The prior that is associated with the limiting relative frequency does not suffer from this shortcoming thus offering up a solution to the problem of induction.

  2. “(There may be other things in the bag, but if so, we are ignorant of them.)” Really? According to the premise, we have no idea if anything else may be in the bag.

  3. “There are N marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture of these.”

    Precision in language is important. While later context makes your intent clear, it is not clear from the premise itself whether “may be all black, all white, or a mixture of these” applies only to the marbles as a group or to individual marbles.

    This premise could be read to allow marbles that are:

    White on one side, black on the other
    White with black dots or black with white dots
    Grey

  4. Not mentioning the infinite case there are already problems in the finite case you propose.

    When you set up the experiment with “Premise: “There are N marbles in this bag into which I cannot see, from which I will pull out one and only one. The marbles may be all black, all white, or a mixture of these.” and proposition: “I pull out a white marble.”

    What you are doing is to fabricate a situation where you have already imprinted the 1/2 probability in it to then calculate it and voila! 1/2!! isn’t it amazing? Any frequentist would reach the same conclusion even after being banned from teaching frequentist ideas.

    How about if instead of a bag with marbles you have a box with a button and a speaker that says “black” or “white” every time you press it? What would be the probability to observe a success in the first pressing? 1/2? Sure, only if you are a Bayesian and you just make this up.

    Objectivists assign probabilities and then they are in awe when the probabilities they assign in their experiments pop up. I disagree on a philosophical level with Subjective Bayesians, but Objective Bayesians are simply wrong in their premises.

  5. Briggs

    21 May 2013 at 5:33 pm

    Gary,

    Just so, which is why I say we are ignorant about other items.

    MattS,

    I take “white marble” to mean “a marble which is white and not any other color.” If you worry about marbles, which is perhaps a fair criticism, call them “successes” and “failures”, or “As” and “Bs”.

    Fran,

    No, sir, this is false. The derivation of the 1/2 has been laid out, step-by-step, over a series of three separate posts. On the other hand, you merely claim the probability is 1/2. Why? Merely claiming the objectivist answer is wrong is not a proof.

    As I said above, forget the marbles. Make them “As” and “Bs”, or “Whattzits” and “Whozits”, or whatever else you like that makes the two items unambiguously dichotomous.

    Incidentally, about two posts back I laid out several known proofs showing where frequentism fails. You might try your hand at these. There are several more I didn’t include (I have a paper around which, if memory serves, gives 17 proofs frequentism fails; I’ll try and see where I put it; it’s from the philosophical literature, not statistics, so it’s not surprising it’s not well known).

  6. William Sears

    21 May 2013 at 5:51 pm

    There is clearly more to statistics than I, a mere statistical mechanic, have realized. The details look very complicated but maybe someday I will delve into them and add to my course. But then again probably not, as my students are confused enough as it is. In any case, an excellent series of posts, slow to get off the ground but a bang up finish. Cheers, Briggs.

  7. Briggs,

    I simply don’t get why having an unambiguous dichotomy as the only information makes Objectivists believe that the probability for either is 1/2.

    This is not mathematical, nor logical, nor real. This 1/2 is nowhere in the problem and a completely made up assumption just like if I say one option’s probability is 1/3 for no other particular reason than that Baby Jesus told me so.

    About the proof showing how frequentism fails, thank you for the references, I’ll have a look at them yet, in the meanwhile, I leave you here evidence on how Objective Bayesianism fails in theoretical physics.

    http://euroflavour06.ifae.es/talks/Charles.pdf

    Too bad you are going to ban frequentism in physics too.

  8. Briggs,

    Even if you substitute “success” and “failure” for “white” and “black” in your original premise that doesn’t change the fact that the wording can be interpreted as any individual result can be a mixed case, neither a complete success nor a complete failure.

    The problem is structural to the premise and not an issue with the specific state words used.

  9. Just checking in.
    Thank you.
    Will need rest and coffee to digest this.

  10. Sander van der Wal

    22 May 2013 at 2:36 am

    Isn’t this an exercise in building theories that are able to make predictions with as little information as possible? So that with less information it is not possible to build a theory at all? An attempt to see if we understand the theory of theory building as good as we think we do?

    And therefore NOT a way to build theories that will be capable of making very good predictions for a wide variety of circumstances?

    Take Laplace and his Sun example. If you live in France, then the Sun comes up every day and it also sets every day. But if you live close to one of Earth’s poles, you will see that at certain times the Sun stays above the horizon, at other times that it will rise and set, and later, that it will not rise at all, but stay below the horizon.

    In France, therefore, a simple theory for sunrise and sunset is sufficient. It might even be the objective Bayesian minimal theory, if you do not care about the length of the day, or the positions at the horizon where the sun rises and sets. But close to the poles, you need a much better theory, and the one that says that the earth is a rotating globe, with its axis of rotation inclined at a certain angle to its orbital plane, is a much better one.

  11. Briggs,

    So when there are two marbles in the bag which are white or black then the system has three possible states.

    But when I toss two pennies and say there are three possible outcomes my maths teacher gleefully declares, “Aha! No! There are four possible states. Head and tail is distinct from tail and head!”

    So why are they different?

  12. Briggs,
    “Just so, which is why I say we are ignorant about other items.”

    But the premise says nothing, precisely no thing ;-), about any other possible contents. Whether or not we are ignorant is irrelevant. You said there may be other things in the bag, but that is not allowed.

    OK, I’m being picky. Just trying to learn from the master.

  13. Briggs

    22 May 2013 at 9:09 am

    Gary,

    That we are ignorant is the point: we can say nothing except about what we know.

    Rich,

    The coins are tossed in sequence. This toss has two possibilities. The next toss has two. There are four possible end sequences: H1H2, H1T2, T1H2, T1T2.

    Sander,

    Not quite, no. It’s about using the exact information we do have precisely, and to show probability is a matter of logic, the quantification or specification of uncertainty. As I say in the text, all information can be used. But it’s best not to start with complicated examples before we master the simple ones.

    MattS,

    In plain English, saying a device can only produce a “success” or “failure” and nothing else does not imply, in any way, “partial success” or “complete success.” That’s why I said call them “A” and “B”.

    Fran,

    Very well. What is the frequentist answer to this:

    Premise: “We have a device that can only produce As and Bs: it may produce all As, all Bs, or some mixture of As and Bs. It will operate N times and then be destroyed forevermore.”

    Proposition: “The first output is an A.”

    What is the frequentist probability of the proposition?

  14. Briggs,

    Premise: “We have a device that can only produce As and Bs: it may produce all As, all Bs, or some mixture of As and Bs. It will operate N times and then be destroyed forevermore.”

    Proposition: “The first output is an A.”

    What is the frequentist probability of the proposition?

    There is no answer from a frequentist point of view; either the premise tells us the probability for As and Bs (e.g. we have a fair dice) or we have to wait for data to do inductive inference (e.g. we got 4 As and 2 Bs)

  15. Briggs

    22 May 2013 at 9:40 am

    Fran,

    Exactly so! No answer. Thus, frequentism fails, just because there is no “frequency.” See the two posts back for more examples of how frequentism fails (you must refute all of these examples to save frequentism).

    Here we have a perfectly interpretable, commonsense position (let N = 1 for an even clearer premise) which any would say has probability 1/2, but which has no frequentist probability.

    Think this through—I’m about to convert you—there are only a fixed number of outputs; N of them. Suppose N = 2 and we have seen the first observation; let it be an A. Now there is still no frequentist reply to the probability the next and last for all time output will be an A. Yet, of course, intuition suggests we should be able to answer. And we can, using logical probability.

    Frequentism always gets things backwards, when it doesn’t assert by circularity, as you just admitted but probably didn’t realize. You said, did you not, that the premise must either tell us what the probability is (circularity, or fiat, or raw subjectivity), or we have to wait for an infinite repetition of trials before we know with certainty what the probability for the next (next after infinity?) outcome is.

    No, no, no. I think the mistake comes from thinking of physical gambling devices, whose frequencies approximate the probabilities. It’s easy to swap these in your mind, switching the order.

    I contributed to this error, to some extent, by using marbles in the bag (this confused MattS). I should have always left it as a process with outputs. Implies the logical nature so much better.

  16. Well, no, it’s not at all necessary that the coins be tossed in sequence. Let’s say I have two assistants who simultaneously toss a coin each and who contribute one ball to the bag each, white if he got a head, black if he got a tail.

    How is this bag different to yours? Because knowing the protocol for adding balls to the bag we have extra information? But what is it?

  17. Briggs

    22 May 2013 at 10:38 am

    Rich,

    Yes, it’s equivalent to tossing one coin twice, or two coins once.

    The difference of the bag, if I understand your question correctly, is that marbles are coming out one by one. Once out, they are out and gone forever, and the only uncertainty we have left is what remains in the bag.

    And then maybe I don’t understand your question. Maybe put it another way?

  18. Laplace’s formula for the probability that the sun rises tomorrow is a consequence of the principle of entropy maximization: the prior probability density function over the various possibilities for the limiting relative frequency of the way in which the outcome of an event will occur maximizes the entropy (the missing information per event) of the prior PDF. Maximization of the entropy assigns equal numerical values to the probabilities of the various limiting relative frequency possibilities. A result is for the prior PDF to be flat.

    In the case that information is available about the limiting relative frequency, this information is expressed as a constraint on entropy maximization. This constraint yields a prior PDF that is not flat.

    The constraint provides the model builder with an opportunity for insertion of information from one or more mechanistic models into calculation of the probability values. The possibility for doing so successfully addresses some if not all of the objections that have been lodged over the years against Laplace’s formula.

  19. Briggs

    22 May 2013 at 11:09 am

    Terry,

    Rather, that the entropy is maximized is a result of the logic, and not a premise of it. We derive the concept of maximized entropy, we do not start with it.

  20. Briggs,

    Well, maybe you will convert me after all, I do not consider myself a fanatic person and if you give me convincing (to me) arguments I will bow and call you master but, so far, your Kung-Fu is not that strong. So, a few remarks:

    1 – Frequentism can only fail if there is a problem to be solved. Your premise poses no problem.

    2 – Probabilities only make sense when dealing with random variables. If you ask me a real world problem dealing with uncertainty I will apply an imaginary mathematical artifact named random variable to solve it.

    If you ask me for the probability of you owning a motorbike a frequentist should answer “you are an individual the question makes no sense” or put in a different way, the question makes as much sense as to wonder about the probability of 34… or probability of TRUE.

    This does not mean we cannot approach this problem at all. An appropriate frequentist answer would be “Though calculating the probability of you owning a motorbike makes no sense, I can tell you that 5% of male Bayesian statisticians own a motorbike”.

    And I explain this point to introduce my next point and answer to your problem.

    3 – Taking on your problem with N=2, we got A and you ask me about the next and last for all time output. The frequentist answer will depend on what model we use to answer the question you pose.

    If we assume a model where the outputs come from a pool of possibilities AA, AB, BA, BB and I observe A then I can tell the next output in this process will be A with a p=1/2 (but you call this circularity)

    If we assume a model where we have no information where A and B come from, then it is true a frequentist cannot tell anything about the next outcome. So if we have the outcome AAAAAAAAAAAAAAAAAAAB we could have this conversation:

    you: “what is the probability of the next outcome being A?”
    me: “I don’t know, it can’t be calculated”,
    you: “Aha! intuition suggests we should be able to answer”
    me: “All right, I can tell you that 19 out of 20 outcomes so far were A. We can assume a random model and talk about it. How about a Bernoulli random variable? time series?”
    you: “What is the probability of the Sun rising tomorrow?”
    me: “I don’t know, it can’t be calculated”.
    you: “Aha!”

    And so on…

  21. I was thinking about Rich’s question of 22 May 2013 at 4:13 am about what the difference is between (a) the 2 marbles in the bag and (b) 2 coin tosses, which Briggs explained by the coins being tossed in sequence.

    An illustration is the application to counting in quantum mechanics: if 2 particles, each with 2 possible states of the same energy, are distinguishable, then together they can occupy 4 different states, each with probability 1/4. Like the coin toss. But if the 2 particles are indistinguishable, then together they can only occupy 3 different states, each with probability 1/3. Like the marbles in the bag. And nature actually works in this way.
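    The two ways of counting can be sketched in a few lines of Python. This is only an illustration: the state labels "a" and "b" are made up, and the equal weighting of whatever states are counted is the assumption under discussion, not something the code proves.

```python
from fractions import Fraction
from itertools import product

STATES = ("a", "b")  # two single-particle states; the labels are illustrative

# Distinguishable particles: ordered pairs, like two coins tossed in sequence.
ordered = list(product(STATES, repeat=2))

# Indistinguishable particles: only the combination of occupied states matters,
# so ("a", "b") and ("b", "a") collapse into one joint state.
unordered = sorted({tuple(sorted(pair)) for pair in ordered})

# Equal weight on each counted state (the assumption being debated).
p_ordered = Fraction(1, len(ordered))
p_unordered = Fraction(1, len(unordered))

print(len(ordered), "ordered states, each with probability", p_ordered)
print(len(unordered), "unordered states, each with probability", p_unordered)
```

    The only design choice is whether the pair is stored with or without its order; everything else follows from counting.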

  22. Briggs,

    “In plain English, saying a device can only produce a “success” or “failure” and nothing else does not imply, in any way, “partial success” or “complete success.””

    The problem is that your premise DOES NOT say the device can produce only an A or a B and nothing else.

    “The marbles may be all A, all B, or a mixture of these.”

    This can legitimately be read to say that the system can produce an A, a B or an AB.

  23. SteveBrooklineMA

    22 May 2013 at 3:08 pm

    If we have a bag of 1000 balls, each either black or white, and we “deduce” via the SS that the probability of drawing a white ball is p=1/2, do we also “deduce” that there are 500 white balls in the bag? Is what you are calling probability not p=#white/1000? When you say p=1/2, isn’t this simply an estimate of #white/1000, the latter being what I would call the (unknown) probability of drawing a white ball? It seems to me that all you are doing here is providing a principled way to estimate an unknown probability based on “zero trials.”

  24. SteveBrooklineMA

    22 May 2013 at 3:19 pm

    I can apply the statistical syllogism to the Sun Rising problem and get a completely different answer.

    Let N+1 be the number of days from the start of history through tomorrow. As Laplace did, suppose at first that we know nothing about the sun, neither the related physics nor the past history of its rising. Suppose we only know that the sun rises on some subset of the N+1 days in question. Thus we have, via the statistical syllogism, that all 2^(N+1) possible subsets of the N+1 days are equally likely to be the sun-rising subset, each having probability 2^(-N-1). Now suppose we are given additional knowledge, specifically that the sun rose on the first N days. There are then only two possibilities for the sun-rising subset: the set of all N+1 days, and the set that contains only the first N days. A trivial application of Bayes Rule shows that each of these possibilities now has the posterior probability of 1/2.

    Thus the probability of the sun rising tomorrow is 1/2.
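    A brute-force check of this argument, as a sketch under the stated assumption that every rise/no-rise history over the N+1 days is equally likely beforehand. The function name is made up for illustration.

```python
from fractions import Fraction
from itertools import product


def p_rise_uniform_on_subsets(n_days):
    """P(sun rises on day n_days + 1 | it rose on days 1..n_days).

    Assumption: every one of the 2**(n_days + 1) rise/no-rise histories
    is equally likely before anything is observed.
    """
    histories = product((0, 1), repeat=n_days + 1)  # 1 means the sun rose
    # Keep only histories in which the sun rose on each of the first n_days days.
    consistent = [h for h in histories if all(h[:n_days])]
    rises_tomorrow = [h for h in consistent if h[-1] == 1]
    return Fraction(len(rises_tomorrow), len(consistent))


for n in (1, 5, 10):
    print(n, p_rise_uniform_on_subsets(n))
```

    Only two histories survive the conditioning whatever N is, which is why the answer does not move with the amount of data.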

  25. SteveBrooklineMA:

    The probability is not #white/1000 but rather is the expected value of the relative frequency of drawing white in an infinite number of draws with replacement into the bag of the drawn balls, the so-called “limiting relative frequency.” In addition to having an expected value, the limiting relative frequency has an uncertainty. The uncertainty is great as no balls have been drawn.

  26. SteveBrooklineMA:

    I don’t follow your derivation of a probability value of 1/2. The derivation at http://en.wikipedia.org/wiki/Rule_of_succession finds the probability of a success on the next trial, given s successes in n trials, to be (s + 1)/(n + 2). Under the condition that s and n are identical, the probability is (n + 1)/(n + 2). This is Laplace’s probability that the sun will rise tomorrow.
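    For anyone who wants to check the formula rather than take it on trust, here is a small sketch. It assumes a flat prior on the unknown success probability and a binomial model, and uses the closed form of the Beta function for integer arguments; the helper names are illustrative.

```python
from fractions import Fraction
from math import factorial


def beta_int(a, b):
    """Beta function B(a, b) for positive integers: (a-1)!(b-1)!/(a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def p_next_success(s, n):
    """P(success on trial n + 1 | s successes in n trials), flat prior on p.

    The ratio of integrals of p**(s+1) * (1-p)**(n-s) and p**s * (1-p)**(n-s)
    over [0, 1] reduces to a ratio of Beta functions.
    """
    return beta_int(s + 2, n - s + 1) / beta_int(s + 1, n - s + 1)


# Compare with the closed form (s + 1)/(n + 2) on a few cases.
for s, n in [(0, 0), (3, 5), (5, 5), (100, 100)]:
    print(s, n, p_next_success(s, n), Fraction(s + 1, n + 2))
```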

    Though Laplace assigns a value to the probability that the sun will rise, this assignment provides one with no information about whether the sun will rise. This is a consequence of Laplace’s use of an uninformative prior probability density function. Through the use of an informative prior PDF, a modern scientist could correct this shortcoming thus yielding a model that was usable in decision making.

  27. SteveBrooklineMA

    22 May 2013 at 11:14 pm

    Terry-

    I don’t see how the probability is not #white/1000… surely the expected value of the relative frequency that you describe is #white/1000. If the uncertainty depends on the number of balls drawn, then it seems that what Briggs is describing is an estimate of #white/1000.

    As for the sun problem, I get a different answer from Briggs and Laplace even though I also use an uninformative prior via the statistical syllogism. The difference is that Briggs utilizes a uniform distribution on the possible number of sun-risings, while I use a uniform distribution on the possible subsets of sun-rising days.

  28. Briggs,

    I’m not sure how to say it another way. C de V’s answer seems helpful. I see that, while you simply declare that a bag with two marbles exists with certain constraints on their colours, I’m questioning how the bag is put together.

    In your case, if you will, Hermione has waved her wand and produced a bag with two marbles in it: “Maybe one is white, maybe none are or maybe both are”.

    In my case, two friends have tossed a coin each and contributed a marble to the bag according to the outcome of the toss so: “Maybe one is white, maybe none are or maybe both are”.

    So, knowing nothing about magic, we decide Hermione’s bag has three possible states; but as we know how coin tosses work, we say there are three possible states for the bag, but one has twice the probability of occurring as the other two.
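    To make the difference concrete, here is a rough sketch comparing the two bags. The dictionaries `hermione` and `two_coins` are my own labels for the two weightings described above (1/3 each for Hermione, 1/4, 1/2, 1/4 for the coin-tossing friends), and marbles are assumed to be drawn without replacement.

```python
from fractions import Fraction

N = 2  # marbles in the bag


def draw_probs(prior):
    """prior maps the number of white marbles (0, 1, 2) to its probability.

    Returns (P(first draw is white), P(second is white | first is white))
    for draws without replacement.
    """
    p_first = sum(p * Fraction(w, N) for w, p in prior.items())
    # Both draws white: w/N for the first, (w - 1)/(N - 1) for the second.
    p_both = sum(p * Fraction(w, N) * Fraction(w - 1, N - 1) for w, p in prior.items())
    return p_first, p_both / p_first


hermione = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
two_coins = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print("Hermione :", draw_probs(hermione))
print("Two coins:", draw_probs(two_coins))
```

    Both bags give the same chance of white on the first draw; they only come apart on what the first draw tells you about the second.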

    There seem to be endless probability problems with this theme. There’s “There are three cards, one red on both sides, one blue on both sides and one with one red side and one blue side. A card is dealt and it is red. What is the probability that the other side is red?” To get the right answer you have to realize that the unseen side could be either of the red sides of the all-red card. Same tricksy problem.
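    The card puzzle can be settled by listing the faces rather than the cards. A minimal sketch, assuming each card and each of its sides is equally likely to land face up:

```python
from fractions import Fraction

# Each card is a pair of side colours: R = red, B = blue.
cards = [("R", "R"), ("B", "B"), ("R", "B")]

# Six equally likely outcomes: (colour showing, colour hidden).
outcomes = [(card[i], card[1 - i]) for card in cards for i in (0, 1)]

# Condition on the showing side being red.
red_showing = [(up, down) for up, down in outcomes if up == "R"]
p_other_red = Fraction(
    sum(1 for up, down in red_showing if down == "R"), len(red_showing)
)

print(p_other_red)
```

    Counting cards instead of faces would give 1/2, which is exactly the trap.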

    Something about “distinguishable states” seems right but I have no confidence that I’d get it right left to my own devices.

  29. Wish you had proper references. The book Chance and Choice Memorabilia by Kai Lai Chung, a famous Chinese probabilist/mathematician, comes to mind.

  30. Briggs

    23 May 2013 at 2:54 pm

    Fran,

    The premise poses no problem? Surely you can’t be serious…and I’m not calling you Shirley.

    Guy walks up to you on the street, holds out a bag in which, he claims, there is either a white marble or a black marble, and asks you the probability that you reach in (assuming you don’t have a heart attack before completing the task, etc.) and pull out a white. And you say this is not a well posed problem?

    Well, that’s the danger of holding onto a theory too tightly. Those of us who have escaped from the grasp of the “long run” can see the problem, and even answer it.

    Frequentism was dead, dead, dead on arrival.

    Notice, too, you haven’t gone back to the other refutations and attempted to un-refute them. Don’t be lazy, now.

  31. Briggs,

    The premise poses no problem? Surely you can’t be serious…and I’m not calling you Shirley.

    I am serious, and don’t call me… oh… Shirley, it poses no problem in the sense that there is no answer to be given.

    Not being able to answer something is not the same as 50%. For example, let’s say you grow a daisy on your head, a phenomenon never seen before in medical history, and you go to see a Frequentist doctor and an Objective Bayesian doctor.

    Then you ask the following question: Doctor! Is it serious? What is the probability I will die tomorrow because of this? (your bag with the marbles dead or alive options)

    Frequentist doctor: “I cannot possibly answer that. This case is unique.”

    O. Bayesian doctor: “Sorry pal, 50%, be ready to kick the bucket”

    Then you go on and ask “And how about the probability I die because of this in the next two days?”

    Frequentist doctor: “I cannot possibly answer that. This case is unique.”

    O. Bayesian doctor: “Sorry pal, 50%, be ready to kick the… wait a minute, if I told you 50% when the problem posed one day… I cannot say 50% when you ask me the question again with two days without contradicting myself, can I? mmmm”

    Then the poor O. Bayesian realizes that he has no fracking clue either about the possibility you are going to die tomorrow, or the day after tomorrow, or any other day, after you grew a daisy on your head.

    At this point the Bayesian doctor turns Frequentist, buys his Frequentist colleague a drink and introduces his sister to him.

    Frequentism was dead, dead, dead on arrival.

    You forgot the WUHAHAHA after the third “dead”.

    Notice, too, you haven’t gone back to the other refutations and attempted to un-refute them. Don’t be lazy, now.

    The refutations in part II of this series of posts, right? I plead guilty. Just tell me your favorite one to start with. But anyhow, talking about being lazy, you sneaked away from the claim of theoretical physicists about O. Bayes failing big time for some problems (I gave you the link for one of them).

    What are these poor physicists going to do when you take their frequentist toys away? uh?

  32. Fran, I love the conversations between the two doctors!

    If there is an equal number of white and black balls, then the objective probability of drawing a white or black ball is ½.

    If I don’t know the number of white and black balls, the principle of symmetry (POS) says that I don’t have more reason to assume (or I can’t be more confident) that the drawn ball is white than that it’s black. Hence the probability of ½ is assigned to the event of drawing a white ball.

    I don’t know how Briggs uses statistical syllogism to assign a probability of 1/3 to each of the 3 states (0, or 1 or 2 white balls.) Well, more appropriately, I can’t defend the probability of 1/3 without appeal to POS!

    I simply see non-informative priors (which are also called objective priors in Bayesian statistics), such as uniform, as a formal representation of ignorance. A non-informative prior can produce bad results!

  33. You visit a town where every home is occupied by a family and every family has exactly two children. You approach a door and knock on it. A boy answers the door. What’s the probability that his sibling is a boy too?

    There, another one.

  34. SteveBrooklineMA

    24 May 2013 at 10:55 am

    Rich- The probability p=1/2. If, on the other hand, the mother answers the door and says she has at least one boy, then p=2/3. No?

  35. SteveBrooklineMA

    24 May 2013 at 11:05 am

    Hey Fran- Would you call yourself a “Frequentist”? I have to say I don’t recall ever meeting such a person. I’ve met plenty of self-proclaimed Bayesians however. My experience is that the Bayesian battle against “Frequentists” is a battle against straw men. Does anyone really define probability in terms of limits of ratios of coin flips and such anymore?

    I laughed aloud at your daisy bit.

  36. @JH

    I simply see non-informative priors (which is also called objective priors in Bayesian statistics), such as uniform, as a formal representation of ignorance.

    There is no such thing as a non-informative prior. In the example I gave with the doctors, how would you represent your ignorance? Which among the many principles O. Bayesians have would you choose? If you use a uniform for the days of life Briggs might have, how many days would you use? And let’s say you use 100,000; would you then tell him the probability of dying tomorrow from daisy head growth is 1/100,000? Isn’t it obvious we can’t really tell anything about it?

    @SteveBrooklineMA

    Hey Fran- Would you call yourself a “Frequentist”?

    I don’t call myself a “Frequentist”, nor an “Atheist”, but this is what Bayesians and Christians call non-believers.

    I have to say I don’t recall ever meeting such a person.

    Oh! but there are many “Frequentists”! Here is my favorite:

    “You can, for example, never foretell what one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant.”
    Sherlock Holmes (The Sign of Four)

    On the other hand there are also famous Objective Bayesians around, here is my favorite:

    “Those are my principles, and if you don’t like them I have others.” Groucho Marx

    But if you have never seen a legendary Frequentist you are welcome to Barcelona! I do tricks for a beer and you can take a photo with me.

  37. Fran:

    For the days of remaining life for Briggs, the flat prior is uninformative. So are non-flat priors that are of infinite number. To have more than one uninformative prior is to violate the law of non-contradiction. The necessity for avoidance of violation of this law has the consequence that the days of remaining life for Briggs is not a variable upon which prior or posterior probability density functions can be erected.

    There is a variable for which the flat prior is unique. This variable is the limiting relative frequency of the outcome in a sequence of trials of an experiment. It follows from the uniqueness that both prior and posterior probability density functions can be erected on this variable.

  38. Fran,

    I am not sure how much statistics you know. Anyway, the doctor can obviously tell Briggs (B) that he doesn’t know the probability that B will NOT die tomorrow. (I don’t like to ponder on the probability that Briggs or anyone will die tomorrow.) But… the doctor is a Bayesian.

    A Bayesian model includes (1) a probability model for data, (2) a prior that quantifies the uncertainty of parameters, (3) determination of posterior distribution and (4) making inference.

    Let me give a hand-waving explanation that tries to establish a connection with the content of this post.

    The probability that B won’t die is the parameter of interest in (2). Since the doctor is a Bayesian, to proceed with Bayesian analysis, he is basically forced to quantify his ignorance on Day one when he has no information about the probability. A non-informative prior is used.

    On Day 2, B goes to see the doctor and the doctor notices that B is as alive and jolly as usual. Given he knows that B is not dead, at step (3), he would use Bayes theorem to update his probability assessment that B won’t die tomorrow.

    Now, think of the Laplace succession probability (LSP) in this post with “pulling a white marble” (or “the sun will rise tomorrow“) replaced by “B not dead.” So, many many days later, if B is repeatedly not dead, plugging in the LSP formula, sooner or later, the Bayesian doctor would update his assessment by concluding that the sun will always rise and B will live forever. ^_^

    (Note I am skipping the importance of step (1).)

  39. Coin tossing is a commonly used example. You’d like to assess the probability of heads (the parameter). Prior to any collection of data, you have no idea about the probability. A Bayesian would employ a non-informative prior to assess the uncertainty of the parameter values at step (2). That is, a uniform prior on [0, 1], with mean ½, can be assigned to the probability of heads.

    Once you have collected data, Bayesian statistics allows you to update your assessment based on the postulations at steps (1) and (2). In this case, with proper premises, a binomial model will be used in (1). At step (3), you update your assessment of the probability using Bayes theorem and the data at hand. As you toss the coins more times, (3) can be updated accordingly. Finally, one can predict the outcome of the next toss.

    In this case, frequentist analysis can work.

  40. SteveBrooklineMA

    26 May 2013 at 12:14 am

    It seems the formula (n1+1)/(n+2) is the maximum likelihood estimator for p if you pretend to have collected 2 data points, 1 success and 1 failure, before you collect n real data points. So you have n1 real successes + 1 pretend success and n real trials + 2 pretend trials.

    I would guess that for small n this would have a regularizing effect on the estimate of p, pulling the estimate towards 1/2, though for large n it should become less and less different from the standard ML estimate n1/n. It seems you could generalize to (n1+k)/(n+2*k) or even (n1+k)/(n+k+m), the latter being k fictitious successes and m fictitious failures.
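    A quick sketch of the pseudo-count idea, with k fictitious successes and m fictitious failures added to n1 real successes in n real trials, giving (n1+k)/(n+k+m). The function and parameter names are made up for illustration.

```python
from fractions import Fraction


def smoothed_estimate(successes, trials, pseudo_successes=1, pseudo_failures=1):
    """Estimate p after adding fictitious successes and failures to the data."""
    return Fraction(
        successes + pseudo_successes,
        trials + pseudo_successes + pseudo_failures,
    )


# With all successes, the plain ML estimate n1/n is 1 for every n;
# the smoothed estimate starts at 1/2 and creeps towards 1 as n grows.
for n in (0, 1, 3, 30, 300):
    print(n, smoothed_estimate(n, n), float(smoothed_estimate(n, n)))

# A heavier prior (k = m = 5) pulls harder towards 1/2.
print(smoothed_estimate(3, 3, pseudo_successes=5, pseudo_failures=5))
```

    Setting both pseudo-counts to 1 recovers (n1+1)/(n+2); setting them to 0 recovers the standard estimate whenever n is at least 1.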

  41. Cees de Valk

    26 May 2013 at 3:15 am

    @ Rich on 23 May 2013 at 3:26 am said: “C de V’s answer seems helpful.”

    After some thought, I don’t find it helpful anymore myself :). Marbles are classical particles, not bosons (and for fermions, the answer would be very different again).

    In agreement with your suggestion, I would assign probability 1/2 to the “mixed” state. Because even though (White,Black) and (Black,White) are not distinguished as individual outcomes (as they would be when the marbles are drawn in succession), they can be distinguished without extending the premise. They must exist, and therefore, must be counted. It does not matter whether you draw the two marbles together or in succession.

  42. SteveBrooklineMA: If I remember the official answer correctly there are three possibilities not two: the other child is a girl or the other child is a boy or the other child is the other boy. So 2/3. But as I said I have no confidence that I would get any of these right. Maybe there’s a professional around who could help.

    Cees de Valk: yes, plausible but according to our host wrong. For reasons that remain obscure to me.

    “Two paths in a wood and I
    Tossed a coin to choose and went
    The wrong way”.

  43. Rich, SteveBrooklineMA:

    You visit a town where every home is occupied by a family and every family has exactly two children. You approach a door and knock on it. A boy answers the door. What’s the probability that his sibling is a boy too? There, another one.

    As such the problem has no solution since you didn’t say anything about the gender distribution.

    But let’s say that any combination BB, BG and GG is equally likely.

    1 – If the mother says “I have at least one boy” this amounts to saying that the only equally likely options left are BB and BG and, therefore, the probability of the other sibling being a boy is 1/2.

    2 – If a boy opens the door we have more information than in case 1. We not only know that the only two equally likely options left are BB and BG, but we also have the result of a little “experiment”: letting one of the members of these two groups BB and BG open the door.

    If we assume that any of the B or G can equally likely open the door then we have that the possible situations for the outcome of this experiment are: B1 in BB opens the door, B2 in BB opens the door and B in BG opens the door.

    So 2 out of the 3 outcomes are boys, and therefore the probability for the other sibling to be a boy is p=2/3.

    there.

  44. Briggs

    26 May 2013 at 8:13 am

    Fran,

    Haven’t been keeping up with this. But notice you have committed the usual error in your latest derivation:

    “But let’s say that any combination BB, BG and GG is equally likely.”

    You assume by fiat and do not deduce from first principles. This works sometimes, as when your assumption matches the deduction. But that only happens in the simplest of situations.

    Just to tweak the spring: Premise: “More than 90% of Xs are Fs and x is an X.” Conclusion: “x is F.” Any mind not previously addled by frequentism (notice this doesn’t blame you but those who transmitted the virus to you) would say the probability of the conclusion is (0.9, 1]. The frequentist must just insist, while blushing to retain his sanity, that there is no answer.

  45. SteveBrooklineMA

    26 May 2013 at 11:31 am

    Fran et al.: I get a different answer, but the problem with this sort of problem is often the language used, which is typically ambiguous. When I said “the mother answers the door” I meant only she answers, and we have seen none of her children. If she tells us then that she has at least one boy, then p=2/3 that she has two. If a boy alone answers the door, then the probability he has a brother is 1/2. It seems pretty clear I disagree with Fran on this.

  46. SteveBrooklineMA

    26 May 2013 at 1:27 pm

    Here is a Frequentist program that shows why an interpretation of the door-answer problem gives 1/2:

    1) Set Trials=0.
    2) Set CountTwoBoys=0.
    3) Choose a random gender for child 1 (M or F, Prob(M)=1/2).
    4) Choose a random gender for child 2 (M or F, Prob(M)=1/2).
    5) Choose a random child to answer the door (child 1 or child 2, Prob(child 1)=1/2).
    6) If child chosen in step 5 is a girl, go to step 3.
    7) If sibling of child chosen in step 5 is a boy, set CountTwoBoys=CountTwoBoys+1.
    8) Set Trials=Trials+1.
    9) Print CountTwoBoys/Trials.
    10) Go to step 3.
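    The same steps translate almost line for line into Python. This is a sketch, not a faithful copy: it stops after a fixed number of counted trials instead of looping forever, and the seed and trial count are arbitrary choices.

```python
import random


def simulate_boy_answers(trials=100_000, seed=0):
    """Fraction of two-boy families among families where a boy answers the door."""
    rng = random.Random(seed)
    counted = 0
    two_boys = 0
    while counted < trials:
        # Steps 3-4: each child is M or F with probability 1/2.
        children = [rng.choice("MF"), rng.choice("MF")]
        # Step 5: either child is equally likely to answer the door.
        who = rng.randrange(2)
        # Step 6: discard the family if a girl answers.
        if children[who] == "F":
            continue
        # Steps 7-8: count the trial, and count it as two boys if the sibling is a boy.
        if children[1 - who] == "M":
            two_boys += 1
        counted += 1
    return two_boys / counted


print(simulate_boy_answers())
```

    The printed value should hover near 0.5, with the usual simulation noise.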

  47. SteveBrooklineMA,

    You have it wrong; but I agree that human language leads to ambiguities and we end up answering different math questions in our minds which, as we set them, are all correct, but we get lost in translation. Let me rephrase the problem:

    1 – The mother says there is at least one boy

    This information reduces scenarios to the equally likely BB and BG. That’s all you have. So half of the time the mother gives you this info the other sibling will be a boy, thus p = 1/2.

    What is true, though, is that 2/3 of the mothers will give that answer.

    2 – You peek inside and you see one boy (aka opens the door)

    this information reduces scenarios to:

    a – you saw brother B1 (other sibling is B2)
    b – you saw brother B2 (other sibling is B1)
    c – you saw brother B (other sibling is G)

    In your code you are merging a and b into one scenario, thereby losing some information you have about the sampling method. But if you account for everything p = 2/3.

    What is true, though, is that 1/2 of the times you peek inside you will see a boy.

    If you still don’t get it try to think about it in terms of information and entropy. p=1/2 has higher entropy than p=2/3. Would it make sense that having less information (mother) yields a lower entropy result? Nope.

  48. Briggs,

    You know, to me it is depressing/exciting that statisticians cannot agree on this.

    Just like when physicists bitterly debated whether light was a particle or a wave, every time I see this kind of dichotomy in science I have the feeling that the truth must lie somewhere else.

    Seems to me common sense that given a problem there must be a unique best solution and, when more than one solution is offered, then the problem contains some ambiguity.

    I have no problems using Bayes when a properly informed prior is in place, in fact, in your “showdown” Objective vs Subjective vs Frequentists you forgot to add Empirical Bayes. I am really curious to know your opinion about this approach because this is the one no Frequentist should have a problem with.

    People you call Frequentist are more eclectic when facing a problem than Bayesians: we let the problem choose the tool. Yet it seems that you only root for one tool (O. Bayes) and you’re so convinced about its merits that you do not hesitate to ask for a mass boycott of anything that is not your way.

    So, if somebody asks me “probability of x?” I usually say p*100% despite the fact that, first, it is not a probability but a percentage and, therefore, a frequentist proportion and, second, it makes no mathematical sense.

    Saying “x is not a random variable therefore probabilities cannot be applied to it but I can tell you that given an X that behaves so and so then P(X=x)=p” is waaaay too long and absurd to communicate in a natural language (English, Spanish…) but all this is what I imply every time I say p*100% to someone.

    But this is just semantics, not really important; what I consider important is you pushing priors when the problem does not justify doing so, based on arcane principles that must be accepted on faith.

  49. SteveBrooklineMA, Briggs

    But notice you have committed usual error in your latest derivation: “But let’s say that any combination BB, BG and GG is equally likely.”

    Yes, you are right. It makes sense for this problem to assume BB,BG,GB,GG equally likely. Which changes the answer p when the mother says: “I have at least one boy”. Now the scenarios left are BB, BG, GB and the p of the other sibling being also a boy is p=1/3.

    But if one boy opens the door, the p for the other sibling to be a boy is still p=2/3 for the reasons explained above.

  50. SteveBrooklineMA

    27 May 2013 at 11:56 am

    Fran- Yes, I agree p=1/3 is right, I erred in saying it was 2/3 (mother answers door scenario).

  51. SteveBrooklineMA

    27 May 2013 at 12:09 pm

    Mother scenario:

    older,younger
    ————-
    B,B
    B,G
    G,B
    G,G

    In the mother scenario, 3/4 of mothers will tell you they have at least one boy. If she does tell you that, p=1/3 she has 2 boys.

    As for the boy answers scenario:

    older,younger (and * indicating who answered the door)
    ————-
    B*,B
    B*,G
    G*,B
    G*,G
    B,B*
    B,G*
    G,B*
    G,G*

    Of the eight possibilities, 4 have a boy answering. Two of those 4 have 2 boys in the family. Thus p=1/2.
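    Both tables can be checked by exact enumeration. A minimal sketch, assuming the four (older, younger) families are equally likely and that either child is equally likely to answer the door:

```python
from fractions import Fraction
from itertools import product

# (older, younger): four equally likely families.
families = list(product("BG", repeat=2))

# Mother scenario: condition on "at least one boy".
at_least_one_boy = [f for f in families if "B" in f]
p_mother = Fraction(
    sum(f == ("B", "B") for f in at_least_one_boy), len(at_least_one_boy)
)

# Boy-answers scenario: eight equally likely (family, answering child) pairs.
answers = [(f, i) for f in families for i in (0, 1)]
boy_answers = [(f, i) for f, i in answers if f[i] == "B"]
p_door = Fraction(sum(f[1 - i] == "B" for f, i in boy_answers), len(boy_answers))

print("mother says at least one boy:", p_mother)
print("boy answers the door:", p_door)
```

    The two answers differ only because the conditioning event differs: three families survive the mother's statement, while four (family, child) pairs survive the boy at the door.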

    Fran, I think you need to split your case c into two… one where the younger sibling is a boy who answers the door and has an older sister and one where the older sibling is a boy who answers the door and has a younger sister.

  52. Fran- Yes, I agree p=1/3 is right, I erred in saying it was 2/3 (mother answers door scenario).

    Right but, you know, I erred too when assuming BB,GG,BG,GB equally likely would not change the p in the “Boy opens door” scenario for the other sibling being a Boy. It is actually p=1/2 because now the two boys from BB can open the door but so can the boys from BG and from GB. So this gives 4 scenarios, 2 of them with girls.

    So much for my entropy argument… mind twisting problem this is, ha! :)

  53. SteveBrooklineMA,

    I just saw the second post. The older, younger argument seems unnecessary though… You simply get a symmetric situation for all the calculations. Don’t you think?

  54. SteveBrooklineMA

    27 May 2013 at 3:32 pm

    Older/younger is probably unnecessary, yes. Sometimes it helps me in thinking about the problem though. Have you been reading this blog for a while? Some of us have been arguing all this for a while, but it doesn’t seem like we’ve gotten anywhere :)

    http://wmbriggs.com/blog/?p=1928

  55. SteveBrooklineMA,

    No, I recently discovered Briggs’ blog which I really like (despite the fact that he wants to quarantine me and my frequentist virus). In fact I am fairly new to blogging in general, but anyhow, thanks for the link; that was a very good post from Briggs and I really like the example you give here

    http://wmbriggs.com/blog/?p=1928&cpage=1#comment-15303

    with the two coins.

    The quote about Keynes

    For Keynes, probability was a branch of logic. He divided statements into three rough, overlapping categories. Statements which could have quantitative probability assessed, those that were only comparative, and those which are impossible to quantify.

    makes it easier for me to explain why I just don’t get Bayesian arguments.

    If I let you pick one coin from a bag of infinitely many coins whose p’s are uniformly distributed and I ask you about its p, we, Frequentists and Bayesians alike, agree on E(p)=1/2 so we estimate p=1/2. So this example goes into the first category.

    But if now it is I who gives you a coin, telling you nothing about its p, then, for some reason beyond me, Bayesians still place this example in the first category and say p=1/2 whereas a Frequentist would place it in the third category. That is, impossible to quantify until you give me more data.

    So far every argument from Bayesians I’ve heard trying to justify their “uninformative” priors sounds to me like this step two.

    http://alt255.com/miracle_cartoon.jpg

    The examples from jl, with his students betting on the fair/unfair coin to make them understand that probability is a belief, I find totally unconvincing and along the lines of what I have heard/read so far. No wonder his students show so much resistance to “believe”.

    But I would like to ask this to Bayesians here; why don’t you go Empirical Bayes? You still get to use absolutely all your Bayesian goodies and the conflict would be pretty much over. Those that do not accept probability as a belief would not have problems using your techniques and you still can claim superiority over the non-Bayesian techniques based on grounds of the clarity when talking about probabilities instead of p-values… or whatever. So why don’t you?

    Sometimes I wonder if there is a correlation between the interpretation of probability and the degree of religious belief. I’d love to make a survey on that.

  56. Fran:

    Here’s why, in Bayesian statistics, the probability of heads is 1/2 when one knows nothing about the outcomes of coin flips:

    One knows that in a flip of a coin, the relative frequency of heads will be 0 or 1. In two flips, the relative frequency will be 0 or 1/2 or 1. In three flips, the relative frequency will be 0 or 1/3 or 2/3 or 1. In N flips the relative frequency will be 0 or 1/N or 2/N or … or 1. Note that the relative frequency possibilities are evenly spaced in the interval between 0 and 1 and that the difference between adjacent possibilities is 1/N, a constant.

    Now, let N increase without limit. The relative frequency becomes known as the “limiting relative frequency.” The limiting relative frequency possibilities are, as before, 0 or 1/N or 2/N or … or 1. In the period prior to the first flip, information about the outcome from a flip is completely missing. Thus, in assigning probability values to the various limiting relative frequency possibilities, one maximizes the missing information about the limiting relative frequency. This results in the assignment of equal probability values to the various limiting relative frequency possibilities. That these values are equal has the consequence that the function which maps the limiting relative frequency possibilities to their probability densities is flat. This function is the prior probability density function for the limiting relative frequency.

    In Bayesian statistics, the probability of heads is defined as the expected value of the limiting relative frequency. This value can be extracted from the flatness of the prior probability density function. It is 1/2.
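    The last step can be checked for any finite N. A small sketch, assuming equal probability on each of the N+1 possible relative frequencies 0, 1/N, …, 1 (the function name is illustrative):

```python
from fractions import Fraction


def expected_relative_frequency(n_flips):
    """Mean of the possible relative frequencies k/n_flips under equal weights."""
    possibilities = [Fraction(k, n_flips) for k in range(n_flips + 1)]
    weight = Fraction(1, len(possibilities))  # flat: every possibility equally likely
    return sum(weight * f for f in possibilities)


for n in (1, 2, 3, 10, 1000):
    print(n, expected_relative_frequency(n))
```

    The mean is 1/2 for every N, so it stays 1/2 as N grows; this says nothing, of course, about whether the flat assignment is the right one to start from.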

    In the phrase “probability of heads,” the Bayesian definition of “probability” differs from the frequentist definition of “probability.” Under the Bayesian definition, the probability is the expected value of the limiting relative frequency. Under the frequentist definition, the probability is the limiting relative frequency. Both definitions are consistent with probability theory.

    In concert with information theory, the Bayesian definition yields a solution to the problem of how to assign a value to the probability of heads. By itself, the frequentist definition yields no such solution. Often, frequentists have favored the straight rule. Under this rule, the probability of heads is the relative frequency of heads in a sequence of flips. The straight rule, however, violates entropy maximization. Entropy maximization is a principle of reasoning. In contrast, the Bayesian solution is consistent with entropy maximization. Thus, logic forces one to accept the Bayesian solution.

  57. Terry Oldberg,

    Thanks Terry for the explanation, the MAXENT principle is just one of the many principles/priors Objective Bayesians use regardless whether the problem says so.

    In fact, to me make more sense the Subjective Bayesian approach. At least the openly claim everything is a belief and they don’t try to justify their beliefs beyond philosophical grounds.

    Actually, if Subjective Bayesians would talk about Minimum Indifference Values instead Probabilities I’d be okay with that. They would be simply modeling their indifference and step one in their inference process would be to determine how indifferent they are to the outcome and how their indifference has been updated.

    But they call it probability, then Frequentists say the problem offers no grounds to claim the prior they are using and here we go again. The only thing that separates me from Subjective Bayesians is semantics.

    But when Objective Bayesians claim that there are mathematical grounds to express probability priors when no information is available… Well, I haven’t found any convincing argument, not least because among Objective Bayesians themselves there are arguments about it too! MAXENT is not the only principle in Objective Town; that is why they remind me of Groucho Marx with his “These are my principles, but if you don’t like them I have others”.

    Well, I guess that until a Statistical Moses takes us all to the promised land I will keep praying to the God that better suits my problems.

  58. Briggs

    28 May 2013 at 12:17 pm

    Lot of bold words, there, Fran old son. But your frequent attempts have been fruitful in allowing me to think up other examples which prove the old ways of thinking have expired.

    Premise #1 (P1): “More than 90% of Xs are Fs and x is an X”.
    Premise #2 (P2): “Fewer than 10% of Xs are Fs and x is an X”.
    Proposition: “x is F.”

    Now the objective, i.e., true, probabilities are Pr( “x is F” | P1) = (0.9, 1] and Pr( “x is F” | P2) = [0, 0.1). Frequentists of course have to say “No answer” to both. As before, we hope they at least blush while doing so.

    But we also have Pr( “x is F” | P1) > Pr( “x is F” | P2), which anybody not infected with theory would agree to. I’ll let readers come up with pertinent examples from this schema. (Do it! It’s fun!) Frequentists also must deny this truth.

    But they would probably not do so in person. There is only so much embarrassment a frame can take.

  59. Fran:

    Thanks for listening! I’m going to try to give you a sense of why I don’t like the subjective Bayesian approach.

    In selecting the inferences that will be made by a model, the builder has to choose between using heuristics and using optimization. Usually, model builders select heuristics but this approach has a logical shortcoming: on each occasion in which a particular heuristic selects a particular inference, a different heuristic selects a different inference. In this way, the method of heuristics violates Aristotle’s law of non-contradiction.

    The violations of non-contradiction yield variability in the quality of the model in which the achieved quality depends upon luck in the selection of the heuristics that are employed. A properly constituted optimization, on the other hand, does not violate non-contradiction and consistently yields a model of the highest possible quality. The MAXENT is a facet of such an optimization. Subjective selection of the prior probability density function is an example of a heuristic.

  60. Lot of bold words, there, Fran old son.

    So The Empire Strikes Back and now I know how Luke Skywalker must have felt when he lost his hand.

    About the bold thing, I just follow Thomas Jefferson’s advice when he said “Question with boldness even the existence of a God”… Not least, it makes the reading easier.

    Frequentists of course have to say “No answer” to both. As before, we hope they at least blush while doing so… But they would probably not do so in person. There is only so much embarrassment a frame can take.

    But of course we have The Return of the Jedi and Luke Sky… sorry, a frequentist would rightly say that what you just wrote makes no sense because probabilities can be only applied to random variables, not to premises or constants, and would simply rewrite what you just said in the following way:

    Let X, P1, P2 be random variables
    and F be a value within the sample space of X.

    given P(X=F|P1)>0.9 and P(X=F|P2)P(X=F|P2)

    But, as I said this is just semantics, where we are going to really disagree is how on Earth did you know about those probabilities being 0.9 and 0.1.

    So, do I take your mask off now… Father?

  61. Briggs,

    The formatting went to hell and I can’t edit it so… sorry Dad:

    given P(X=F|P1)>0.9 and P(X=F|P2)P(X=F|P2)

  62. Briggs,

    The formatting went to hell and I can’t edit it so… sorry Dad II:

    given P(X=F|P1)>0.9 and 0.1>P(X=F|P2) then P(X=F|P1)>P(X=F|P2)

  63. SteveBrooklineMA

    28 May 2013 at 3:29 pm

    A frequentist could say P(x in F|P1) \in (.9,1]. I would bristle at seeing an “=” instead of “\in” since probability to me should be a measure.

  64. SteveBrooklineMA

    28 May 2013 at 3:39 pm

    Terry (9:24 am)- It seems odd to me that you would apply an entropy principle to a derived statistic like #heads/N. There may be any number of statistics obtainable from a data set, and it’s not clear that the same principle applied to them would yield compatible results. If N=2 flips, you could get the outcomes (H,H), (H,T), (T,H) or (T,T). Would these be a-priori equally likely? For large N, you would have 2^N possible flip N-tuples, and if these were all equally likely you would get a distribution for #heads/N which looks like a Dirac delta at 1/2, not a uniform distribution on [0,1].
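
    The large-N claim can be checked numerically. The sketch below assumes all 2^N flip sequences are equally likely, as in the comment, and measures how much probability #heads/N places within a small window around 1/2. The window width of 1/20 is an arbitrary illustrative choice, and the function names are not from the thread.

```python
from fractions import Fraction
from math import comb

def heads_fraction_distribution(n):
    """Distribution of #heads/n when all 2**n flip sequences are equally likely."""
    total = 2 ** n
    return {Fraction(k, n): Fraction(comb(n, k), total) for k in range(n + 1)}

def mass_near_half(n, width=Fraction(1, 20)):
    """Probability that #heads/n lies within `width` of 1/2."""
    half = Fraction(1, 2)
    dist = heads_fraction_distribution(n)
    return float(sum(p for f, p in dist.items() if abs(f - half) <= width))

# The mass near 1/2 grows toward 1 as n increases, whereas a uniform
# distribution on [0, 1] would keep roughly a tenth of its mass there.
for n in (10, 100, 1000):
    print(n, mass_near_half(n))
```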

  65. SteveBrooklineMA:

    Thanks for giving me the opportunity to clarify. The MAXENT states that “the missing information about the way in which the outcome will occur is maximized, under constraints expressing the available information.” Each limiting relative frequency possibility in the sequence 0, 1/N, 2/N, … , 1 is a way in which the outcome can occur, given that the outcome is the limiting relative frequency.

    Prior to observation of the outcomes of any trials of the associated experiment, the available information is nil and thus the MAXENT reduces to maximization of the missing information about the way in which the outcome will occur. This has the consequence that equal numerical values are assigned to the probabilities of the various limiting relative frequency possibilities. It follows that the prior probability density function over the set of limiting relative frequency possibilities is uniform.
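
    A small sketch of the unconstrained case described here: among probability assignments over a fixed, finite list of possibilities, the equal assignment has the largest Shannon entropy. The choice N = 4 and the two comparison distributions are made up for illustration only.

```python
from math import comb, log

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)

n = 4  # illustrative number of flips, giving n + 1 relative frequency values

uniform = [1 / (n + 1)] * (n + 1)
binomial = [comb(n, k) / 2 ** n for k in range(n + 1)]
peaked = [0.05, 0.10, 0.70, 0.10, 0.05]  # hypothetical, for comparison only

for name, dist in (("uniform", uniform), ("binomial", binomial), ("peaked", peaked)):
    print(name, entropy(dist))
```

    The uniform assignment reaches the upper bound log(N + 1). Note that the outcome depends on which list of possibilities the principle is applied to, which is the point under dispute in this exchange.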

  66. SteveBrooklineMA

    28 May 2013 at 9:33 pm

    Terry- I don’t really understand the reasoning there. If N=2, and without any additional a-priori information P(H,H)= P(H,T)= P(T,H)= P(T,T)= 1/4, then the distribution of p=#heads/N is P(p=0)=1/4, P(p=1/2)=1/2, P(p=1)=1/4. That’s not a uniform distribution on p.
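
    The N = 2 arithmetic can be verified by brute-force enumeration. This sketch assumes each flip sequence is equally likely, as stated in the comment; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_fraction_pmf(n):
    """Enumerate every H/T sequence of length n, each with equal probability,
    and accumulate the probability of each value of #heads/n."""
    sequences = list(product("HT", repeat=n))
    weight = Fraction(1, len(sequences))
    pmf = Counter()
    for seq in sequences:
        pmf[Fraction(seq.count("H"), n)] += weight
    return dict(pmf)

for value, prob in sorted(heads_fraction_pmf(2).items()):
    print(f"P(p={value}) = {prob}")
```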

  67. Acts 26:24

  68. SteveBrooklineMA:

    It sounds as though you’re confusing yourself by mixing Bayesian ideas with frequentist ones. The idea of a prior probability density function is Bayesian. Frequentism is a consequence of the development of a mathematical statistics in which a prior PDF plays no role. The founders of frequentism were logicians who pointed out, with nearly perfect accuracy, that the choice of prior PDF was arbitrary. A logical consequence of this arbitrariness was violation of the law of non-contradiction.

    There is an exception to the rule that the choice of prior PDF is arbitrary. I’ve exploited this exception in my proof of the existence and uniqueness of a non-informative prior PDF over the set of limiting relative frequency possibilities.

Comments are closed.

© 2014 William M. Briggs
