What Is A True Model? What Makes A Good One?: Part I

The rules of logic are simple. Use whatever evidence you have, and no other, to figure the probability of a conclusion. These rules are all we need to understand what is a model and what makes it good.

Given just this (our model) E1 = “All Martians wear hats and George is a Martian”, what can we say about C = “George wears a hat”? We deduce that C is true, that it has probability 1. Change the evidence upon which we base our conclusion and we can change the probability. Given E2 = “No Martians wear hats and George is a Martian”, C is false (probability 0); given E3 = “Half of all Martians wear hats and George is a Martian”, the probability of C is 0.5.

Nowhere in this analysis is there word one about whether E1, E2, or E3 (our possible models) are themselves true. We cannot know whether they are true or false unless we supply evidence upon which to judge them. In the arguments just used, we accepted that each was true in turn.

Suppose we have not yet met George and we want to guess whether he will wear a hat. We have three sets of evidence; that is, we have three different models. Models are just the stuff that is given, the evidence we accept as relevant or probative to the conclusion.

Well, we have three models E1, E2, E3. We might have learned each E from three friends who have visited Mars, each swearing to its veracity. We might have dreamed them. It is (now, anyway) irrelevant how we learned them; we only need to know that it is these three models, and no others, that we must consider. Which model is the true model, and which two are false?

We have asked another probability question, which we already know how to answer. What is our evidence? M = “We have three models, only one of which is true.” Given this M(eta evidence), what is the probability that E1 is true? One-third, and also one-third for each of the other two. This quantification can actually be derived and is the result of what is called the statistical syllogism (see this paper). But I hope it is an obvious answer.

That’s it. We’re done, at least as far as judging the probability of each model. Our only evidence was M, and we have said all there is to say given M.

But what about old George? Before we meet him, we’d like to know the probability that he wears a hat. Accepting in turn the truth of each model, we know that this probability is 1, 0, or 0.5. But since we’re not sure which model is true, we must account for this uncertainty. Skip the derivation if you like; the answer turns out to be a weighted sum of the individual probabilities. Thus

     Pr(C | M) = Pr(C | E1 & M)Pr(E1 | M) + Pr(C | E2 & M)Pr(E2 | M) + Pr(C | E3 & M)Pr(E3 | M)

or

     Pr(C | M) = 1 * 1/3 + 0 * 1/3 + 1/2 * 1/3 = 1/2.

And that’s the best we can say given only the evidence in M and each of the E. In learning about C, we are well and truly finished.
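
For readers who like to see the arithmetic spelled out, here is a minimal sketch of the calculation in Python (the variable names and the dictionary layout are mine, purely for illustration; only the numbers come from the example above):

     # Probability of C ("George wears a hat") under each model, taken as given.
     pr_C_given_E = {"E1": 1.0, "E2": 0.0, "E3": 0.5}

     # Priors from M ("three models, exactly one of which is true"): the
     # statistical syllogism assigns each model probability 1/3.
     pr_E_given_M = {E: 1.0 / 3.0 for E in pr_C_given_E}

     # Rule of total probability: Pr(C | M) = sum of Pr(C | Ei & M) * Pr(Ei | M).
     pr_C_given_M = sum(pr_C_given_E[E] * pr_E_given_M[E] for E in pr_C_given_E)
     print(pr_C_given_M)  # 0.5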

Now George comes along and he’s either wearing a hat or he isn’t (a tautology). If he is, then what can we say about the individual models? If he isn’t, what then?

If George is wearing a hat (Borsalino snap-brim fedora, fur felt, brown), then, adding this new evidence to M, we know that the probability of E2 is 0. We have falsified E2. Falsification is rare for models concerning the contingent (physical and counterfactual events like Martians wearing hats). To falsify a model, the model must say an event is impossible—not unlikely, but impossible—and the event must then happen. Model E3, for instance, can only be falsified if we line up all Martians simultaneously and discover that the proportion wearing hats is not exactly one half.

Since George has manned up and is wearing a hat, it could be that E1 or E3 was the true model; E2 was false. Each model is, of course, entirely useless to us now, since we have seen George and already know that he wears a hat. Why do we care which is the true model? Well, we don’t really. Maybe we want to hand out a reward to the model that did “the best” in predicting George’s hat status. E2 is out of the money, but E1 or E3 might take the prize.

Well, E1 predicted George would certainly wear a hat, and he did. Pretty good performance! But E3 said it was essentially the flip of a coin, which isn’t too bad, though not as good as E1. If you had beforehand bet that George would wear a hat (that C would be true), you would have won using either E1 or E3, but probably won more using E3: a bookie who actually believed E1 would not accept any money except from people betting that C would be false.

The criterion of goodness is then a subjective matter, depending on the use to which you put the probability. Different folks can come to different conclusions about model goodness based on different uses.

We can go further, however, and ask a different question. We could turn the probabilities around and ask for the probability that a model is true (say E2) given C (George wears a hat) and M (there are three models, etc.). In notation (using Bayes’s rule and the rule of total probability; these rules are deduced truths from simpler axioms):

     Pr(E2 | C & M) = Pr(C | E2 & M)Pr(E2 | M) /
            [Pr(C | E1 & M)Pr(E1 | M) + Pr(C | E2 & M)Pr(E2 | M) + Pr(C | E3 & M)Pr(E3 | M)]

This follows from Pr(C | M) = Pr(C | E1 & M)Pr(E1 | M) + Pr(C | E2 & M)Pr(E2 | M) + Pr(C | E3 & M)Pr(E3 | M).

Plugging numbers in lets us deduce Pr(E2 | C & M) = 0. We also know that Pr(E1 | C & M) = 2/3 and Pr(E3 | C & M) = 1/3. Please note, and please assimilate to the depths of your soul, that these probabilities have nothing to do with future instances (“random trials”) of seeing George. C states that George wears a hat and that is that. Given that we have seen that C is true, C is true and will stay true (given the evidence we have). If C were false (if ~C were true), then we would have falsified E1; E3 would have been just as likely as before (1/3), and E2 would have (given the evidence) probability 2/3.
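
These posteriors can be checked with a few more lines of the same illustrative Python sketch (again, the names are mine; the numbers are as above):

     # Pr(C | Ei & M) under each model, and the uniform priors Pr(Ei | M) = 1/3.
     pr_C_given_E = {"E1": 1.0, "E2": 0.0, "E3": 0.5}
     pr_E_given_M = {E: 1.0 / 3.0 for E in pr_C_given_E}

     def posteriors(likelihood, prior):
         # Bayes's rule: Pr(Ei | D & M) = Pr(D | Ei & M) * Pr(Ei | M) / Pr(D | M),
         # where the denominator comes from the rule of total probability.
         total = sum(likelihood[E] * prior[E] for E in likelihood)
         return {E: likelihood[E] * prior[E] / total for E in likelihood}

     # George shows up wearing a hat, i.e. C is true:
     print(posteriors(pr_C_given_E, pr_E_given_M))
     # {'E1': 0.666..., 'E2': 0.0, 'E3': 0.333...}

     # Had he shown up hatless (~C), use Pr(~C | Ei & M) = 1 - Pr(C | Ei & M):
     pr_notC_given_E = {E: 1.0 - p for E, p in pr_C_given_E.items()}
     print(posteriors(pr_notC_given_E, pr_E_given_M))
     # {'E1': 0.0, 'E2': 0.666..., 'E3': 0.333...}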

That, dear reader, is finally final. When probability is (rightly) seen as logic, every problem is treated the same. There is no difference between the treatment of these simple models and more complex ones. In Part II, we’ll look at more complex ones.

9 Comments

  1. I object to the use of simple uniform probabilities for the Bayesian probabilities, P(E_i|M) = 1/m. We could just as easily add more models, say E_i: P(C|E_i) = 1 - i/1000 for i = 1, 2, 3, …, n, with n whatever number we want. If n is reasonably small, we can drive P(C|M) as close to 1 as we want. Similarly, if P(C|E_i) = i/1000, we can drive P(C|M) as close to 0 as we want.

    There is nothing in your arguments to prevent either situation I describe. Clearly, P(C|M) means nothing if it can be arbitrarily close to 0 and 1 (and anything in between). We simply don’t have enough information to assign meaningful probabilities to each model. Just because we have m models does not mean we should assign probability 1/m to each model.

    If we make enough observations of Georges wearing hats or not, we can get reasonably accurate probabilities. The initial Bayesian probabilities become less critical.

  2. Briggs

    Charlie B.,

    Great comment, thanks. The use of these probabilities has been derived and is not arbitrary. It comes from the logical truth of the symmetry of individual constants. It also, serendipitously, is the maximum entropy answer. It also is the principle of indifference answer (though I am not as big a fan of that one).

    If your meta information added a new model, then you would update your probabilities accordingly. Each model would, a priori, have a different probability.

    What you perceive as a detriment is actually the consequence of the logical truth that you judge every argument by the evidence provided and nothing more. I have discovered, anecdotally, that this is the hardest thing to get used to.

    Another example, from left field. What is the probability that C = “OJ Simpson is guilty”? There isn’t one. There is no unconditional probability that this is true or false. C could be true given some evidence, false given other evidence. How and by whom the evidence is supplied, and which evidence is supplied—what model is used—is in this case entirely subjective. Logic and probability only come in after the evidence is given. And we all know that different people have come to radically different probabilities for C, because they count radically different sets of evidence as probative.

  3. I’m reading your model_logic paper (and Jaynes’s logical probability book). I’m sympathetic with Bayesian methods. In many engineering problems assuming a prior is reasonable (even if the practitioner doesn’t always realize he or she is doing it).

    Some sort of “uniformity” prior seems most reasonable when one has little information, as in your example. I deliberately chose non-uniform models to illustrate that “uniformity” doesn’t necessarily mean all are equally likely. My models are very close together. It doesn’t make sense to give each one the same prior as another model that’s “farther” away.

    If we start with a reasonable prior and make enough observations, then the posterior distribution will depend mainly on the likelihood function. That’s a good thing. We do this a lot in engineering (signal processing): start with a reasonable prior, then keep updating the posterior distribution with each new sample. Over time, our estimates depend less and less on the prior.

    Re your last example, P(C | all we know) = 0.999 (or some close approximation).

  4. If I wasn’t clear at the end, C={OJ is guilty}, P(C|everything we know) = 0.999. That may be my subjective evaluation, but it’s consistent with the known facts.

    But why can’t C be assigned a probability? Consider a simple experiment: you flip a fair coin and don’t tell me the answer. You know whether it’s heads or tails, so it makes no sense for you to assign a probability. But I don’t know the answer. As far as I know, P(heads)=0.5.

    The same argument holds if there were a thousand mes, each of us ignorant of the flip’s outcome. Having one person know the answer shouldn’t mean it’s not random to the rest of us.

    If you and I were to wager on the outcome, you’d be at a huge advantage. But that’s a different problem.

    BTW, I appreciate your insights and discussions into probability and statistics. I’m sympathetic to Bayesian ideas and am trying to understand their underpinnings. I also tend to favor likelihood methods.

  5. Briggs

    Charlie B.,

    When you supply “everything we know,” you estimate Pr(C|“everything we know”) = 0.999. However, when others supply their own versions of “everything they know,” Pr(C|“everything they know”) can be different, even, as we know, much less (for example, his jury’s). The only thing that I’m asking you to “buy” here is that if you change the evidence/model you change the probability.

    Your coin flip is another excellent example. I flip and peek; you do not see. Let C = “Heads”. To you, Pr(C|“everything Charlie knows”) = 0.5. To me, Pr(C|“everything Briggs knows”) = 0 or 1, depending on what my peek determined. Same C, two different sets of evidence, two different probabilities.

    Another example: C = “Briggs is wearing a white shirt at the time of writing this.” The probability of C given my observational evidence is extreme (0 or 1; to me it is simply true or false), but to you the probability of C might not even be quantifiable: it depends on the evidence/model you supply, which is of course different from mine.

    The only provision I ask you to accept in this part of the essay is that no proposition is unconditionally true or false; it is only true, false, or probable with respect to explicitly stated evidence. This, of course, is not at all controversial. But statisticians sometimes forget it.
