What Is A True Model? What Makes A Good One?: Part II

The models in Part I might not have “felt right” to you. But if so, it is because your diet of probability examples has been too constricted. This is natural if you’ve received the regular course of statistical training, which consists of meals of data cooked with fixed recipes, but with little chemistry—to stretch a metaphor past the breaking point—about where the recipes arise.

Recall that we wanted to know the truth of the proposition C = “George wears a hat.” We can’t know the truth of this, or any, proposition without respect to some evidence. We had three possible models, or three sets of evidence, and these gave us three different probabilities that C was true (this was coincidental; they could all have given the same probability).

But where did the models/evidence come from; how did they arise? I made them up. But that is no objection: the rules of probability work regardless of the provenance of the evidence or models. Logic and probability are concerned only with the connections between statements, not with the statements themselves. This is often forgotten.

Most day-to-day statistical models deal with “data” collected in experiments or incidentally. These data are modeled, which is a dangerous shorthand for saying that our uncertainty in the values of certain propositions related to the data is quantified given some evidence (like our E). It’s dangerous because the shorthand can be, and sometimes is, used to reify the models. People say “X is normally distributed”, which is a reification of the proper sentence, “Our uncertainty in the values of X is characterized by a normal distribution.” Something caused X to take the values it did. Normal distributions do not cause anything.

Even in cases where reification is not suspected, the so-called normal model is used. Where did it arise? Well, we can always use any model we want, just as when we made up the Martian syllogism. If, for instance, our M(eta evidence) = “The normal model is true”, then (via circularity, but still validly) the normal model is true. Because of habit and ease this M is used more than any other. This does not mean that EN (the evidence that the normal model holds) is always the best or most useful, or that, given some other M, EN is even probable.

It is often possible to deduce a true model given simpler known-to-be-true facts. Suppose we know that there are N balls in an urn. It’s always balls in urns, but use your imagination to substitute other examples. The N balls can be labeled only “0” or “1” (red or blue, success or failure, etc.). We do not know how many of the N are 0 and how many 1; it could be that none are 0 or none are 1.

The proposition of interest—which I am making up, as we make up all propositions of interest—is C = “There are M 1s” where I will substitute a number between 0 and N for M. I could, say, choose M = 1000 > N, but then I would be able to deduce, given the evidence, that C is false. Suppose I wanted C = “There are no 1s” (i.e. M = 0).

Given the evidence we have—and accepting no other: a key proscription to remember—we use the statistical syllogism and deduce that the probability C is true is 1/(N+1). Perhaps this is intuitive: there are N balls and thus N + 1 possibilities for the number of 1s (none, just 1, just 2, and so on up to N). This model is the true model given the evidence that there are N balls and that each can be labeled only 1 or 0.
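
For concreteness, here is a minimal sketch of the deduced model in Python. The code, the function name, and the choice N = 10 are my illustrations only; the deduction does not depend on them.

```python
from fractions import Fraction

def prob_m_ones(m: int, n_balls: int) -> Fraction:
    """P(C | E), where C = "there are m 1s" and E = "there are
    n_balls balls, each labeled 0 or 1, and nothing else is known".
    The statistical syllogism assigns equal probability to each of
    the n_balls + 1 possible counts of 1s."""
    if not 0 <= m <= n_balls:
        return Fraction(0)          # e.g. m = 1000 > N: C is deduced false
    return Fraction(1, n_balls + 1)

N = 10
assert sum(prob_m_ones(m, N) for m in range(N + 1)) == 1  # the counts exhaust the possibilities
print(prob_m_ones(0, N))      # P("there are no 1s" | E) = 1/11
print(prob_m_ones(1000, N))   # 0: deduced false, since 1000 > N
```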

Once we are comfortable writing out the full evidence and probability statements we can use simplifications, like saying “the probability M = 1 is 1/(N+1).” Or we might say that “M has a uniform distribution.” For now, we stick with the original language, which is cumbrous but ever accurate.

Now suppose we take out n1 + n0 = n < N balls and notice that n1 are labeled 1 and n0 are labeled 0. This new evidence can and should modify our model about the remaining N – n balls. There’s some math involved (see this paper), but the deduced model for our uncertainty in the number of 1s left is called a negative hypergeometric or beta-binomial. If we take out still more balls, leaving some in the urn, the probability that the remainder are labeled 1 is given by the same model (updated to account for our new information about the labels on these new balls).
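
A sketch of that deduction, again in Python and again with numbers and function names of my own choosing: the closed form below is the negative hypergeometric form just mentioned, and the brute-force check derives it directly from the uniform-over-counts model by Bayes’ theorem.

```python
from fractions import Fraction
from math import comb

def remaining_ones_dist(N, n1, n0):
    """Probability of k 1s among the R = N - (n1 + n0) balls still in
    the urn, after drawing n1 1s and n0 0s: the negative
    hypergeometric (beta-binomial) form."""
    R = N - (n1 + n0)
    return {k: Fraction(comb(n1 + k, n1) * comb(n0 + R - k, n0),
                        comb(N + 1, R))
            for k in range(R + 1)}

def remaining_ones_brute(N, n1, n0):
    """The same thing the long way: uniform prior over the total
    number of 1s, conditioned on the draws (without replacement)."""
    n = n1 + n0
    post = {}
    for m in range(N + 1):                       # m = total 1s in the urn
        like = Fraction(comb(m, n1) * comb(N - m, n0), comb(N, n))
        if like:
            post[m - n1] = post.get(m - n1, Fraction(0)) + like / (N + 1)
    total = sum(post.values())
    return {k: p / total for k, p in post.items()}

assert remaining_ones_dist(10, 3, 2) == remaining_ones_brute(10, 3, 2)
```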

Once we have taken all the balls out, the probability that any remain labeled 1 or 0 is 0: this is deduced given our model. We obviously no longer need the model for future use, just as we no longer needed a model after we saw George wearing a hat.

Incidentally, those familiar with statistical lingo will note the complete absence of (continuously valued) parameters, the little Greek letters that are usually necessary to fully specify a model. Parameters weren’t needed in the Martian hat example either. We don’t need them because everything is written in terms of observable evidence. Interestingly, the urn example can show us how parameters arise.

If we let N grow large, towards the limit, then the distribution which characterizes our uncertainty in the number of remaining 1s is still beta-binomial, but suddenly parameters are present which take the place of observational evidence. We could have just said “N will be large” and used the beta-binomial with continuous-valued parameters from the start, taking “priors” on the parameters and so forth, but these actions would be an approximation to the model we deduced as true when N was finite—and N will always be finite for real-world examples.
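
A sketch of the limit, with made-up numbers: as N grows, the deduced finite model for the fraction of 1s left in the urn approaches a continuous Beta(n1 + 1, n0 + 1) distribution, whose parameters now stand in for the observations. The comparison below is my illustration, not a proof.

```python
from math import comb

def finite_cdf_half(N, n1, n0):
    """P(fraction of 1s among the remaining balls <= 1/2) under the
    deduced finite-urn model (see the earlier sketch)."""
    R = N - (n1 + n0)
    num = sum(comb(n1 + k, n1) * comb(n0 + R - k, n0)
              for k in range(R + 1) if 2 * k <= R)
    return num / comb(N + 1, R)

def beta_cdf(x, a, b):
    """Beta(a, b) CDF for integer a, b, via the standard binomial
    identity I(x; a, b) = P(Binomial(a + b - 1, x) >= a)."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

n1, n0 = 3, 2
for N in (10, 100, 10_000):
    print(N, finite_cdf_half(N, n1, n0))      # marches toward the limit
print("limit:", beta_cdf(0.5, n1 + 1, n0 + 1))  # Beta(4, 3): 0.34375
```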

Ideally, all statistical models should begin the “urn” way: stating what is finitely observable and working out the math for the finite case, taking the limit only at the end to see if useful approximations can be made (Jaynes warned us about the dangers of misapplying limits prematurely). It would also end arguments about the influence of “priors.” Priors wouldn’t be needed, except as they arise as the deduced natural limits of properly described observational processes.

Update: How this all relates to climate models is coming!

15 Comments

  1. DAV

    Discrete is nice; unfortunately, nature rarely provides data packaged as neat little balls. Not a bad idea to think of the data as such, though.

    I’m not particularly sure how the parameters magically appear as N gets large. It looked to me as if you were using them from the start and updating them on every draw. If not, could you please clarify?

  2. Briggs

    DAV,

    Actually, nature gives nothing but neat little balls. We impose the continuous-valued math.

    The (barely readable) paper gives the details of how parameters arise for this model. It would be too much to go into here. I’m not happy with the paper and should probably re-write it to eliminate all the extra words.

  3. DAV

    Briggs,

    I suppose nature does give us little balls of data but, just like dealing with grains of sand, handling them by the bucket is lots more convenient than pawing the individual grains. Unfortunately, that’s also the source of the bigger headaches — at least with data.

  4. DAV

    OK, I see where my confusion arose. I start with the beta (more often, Dirichlet) distribution, assuming N is eventually going to get large enough. I initially set the parameters to 1, 1 and update them with counts from the samples. It looked like you were basically doing the same thing.
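
    A minimal sketch of the updating DAV describes, with invented counts for illustration:

    ```python
    # Conjugate updating as described: start at Beta(1, 1) (flat),
    # add the observed counts to the parameters.
    a, b = 1, 1              # Beta(1, 1) starting point
    n1, n0 = 3, 2            # invented counts of 1s and 0s in the sample
    a, b = a + n1, b + n0    # posterior is Beta(4, 3)
    print(a / (a + b))       # posterior mean fraction of 1s: 4/7
    ```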

    Nothing wrong with the paper (I think you know that).

  5. Dan Hughes

    Dr. Briggs,

    We need a nomenclature that clearly illuminates the extremely significant differences between a stats “model” and all those actual models that are based on fundamental mathematical Laws of Nature. Phenomenological and mechanistic approaches, typically encountered in Engineering Science, and based on the fundamental Laws and/or empirical observations, are a second class of actual models. Causality is the foundational hallmark of all actual models.

    Stats “models” are at the same level of ad hoc-ness, and as divorced from causality, as the parameterization-laden “models” encountered in Climate Science.

    A good model will be based on the fundamental Laws and will invoke causal spatial and temporal scales so as to encompass the response of interest in one functional expression having a single parameter that is to be determined from observations. And the magnitude of the parameter will be of the order of unity.

    We need some adjectives to prepend to the word “model” whenever the subject is in fact causality-free stats “models”.

  6. JH

    Why use the word model or evidence? The word premise or assumption works beautifully… in mathematics and statistics.

    The proposition of interest—which I am making up, as we make up all propositions of interest—is C = “There are M 1s” where I will substitute a number between 0 and N for M. I could, say, choose M = 1000 > N, but then I would be able to deduce, given the evidence, that C is false. Suppose I wanted C = “There are no 1s” (i.e. M = 0).

    Let me rephrase this. IF A = { there are N balls in an urn, where N < 1000 }, THEN B = { “There are M = 1000 1s in the urn” is false }.
    Given the assumption A, B holds.

    Yes, you have defined the well-known probability distribution that can be employed in statistical modelling. So what is a true (statistical) model?

    Again, there are differences among true, assumed to be true, believed to be true… and accepted to be true.

    That Buddhists believe in reincarnation doesn’t mean the belief is true. IF reincarnation is true, then my brothers and I would probably be pigs in the next life, at least according to Grandma, since we were naughty. Assuming reincarnation (is true) is no evidence (that it’s true).

    I understand that the choice of a prior distribution is accepted, more or less agreed upon by a group of people based on the evidence; hence it’s called objective Bayesian analysis. The choice is not believed to be true, is it? At best, the choice can be seen as an approximation to the truth.

  7. JH

    I’m going to stop using boldface.

  8. Briggs

    DAV,

    Quite, quite true that handling them in bulk is easier than individually. But, and I know you would agree, just because a thing is easy does not make it right. It does turn out, of course, that in this case individually or by the bucket gives the same answer. But it doesn’t always turn out so. And it is in those cases that we have to be more careful. If you have it, Chapter 15 (I’m going from memory as I’m still on the road) in Jaynes is a wonderful cautionary tale.

  9. Briggs

    Dan Hughes,

    I come to this discussion at the end. But stats models, as I hope I have shown here, are not always ad hoc. Many are, though, and their number is legion.

    But let me answer one of your comments, about the “fundamental mathematical Laws of Nature.” Just what are these? If you mean theorems deduced from simpler accepted-as-true axioms, then I am with you. These are models in just the same sense I use the word. If you mean seen-to-be-true-in-most-cases physical predictions, then you are back in the realm of probability. But that’s a model too.

    Climate models are surely models, but their predictions are not very good. This doesn’t make them not-models, however.

  10. Briggs

    JH,

    Very good.

    (1) Just what are the differences among “true”, “assumed to be true”, “believed to be true”, and “accepted to be true”? I think I am close to convincing you, especially given your reincarnation example. But ignore that and answer this question.

    (2) You say you don’t like my nomenclature on what a model is—but I don’t actually see you claiming it is false. If you do say it is false, then what is your definition?

  11. Briggs

    DAV,

    You are too kind re: the paper. It was supposed to be a teaching tool, and therefore should have been clear and easy to read. It failed on both counts.

  12. JH

    Ideally, all statistical models should begin the “urn” way: stating what is finitely observable and working out the math for the finite case, taking the limit only at the end to see if useful approximations can be made (Jaynes warned us about the dangers of misapplying limits prematurely). It would also end arguments about the influence of “priors.” Priors wouldn’t be needed, except as they arise as the deduced natural limits of properly described observational processes.

    Think of our real-life experience. What this means is that the influence of our prejudice, i.e., the subjective choice of a prior, would be diminished in our decision making if we had vast amounts of (good) data/evidence at hand, which we don’t normally have in reality. With limited information, our prejudice will influence our decisions.
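
    A made-up numerical illustration of this point: two quite different Beta priors, updated with the same counts, nearly agree once the data dominate (all numbers invented).

    ```python
    # Posterior mean of the fraction of 1s under two different priors.
    for n1, n0 in [(3, 2), (3000, 2000)]:
        flat   = (1 + n1) / (2 + n1 + n0)      # Beta(1, 1) prior
        biased = (50 + n1) / (60 + n1 + n0)    # strongly biased Beta(50, 10) prior
        print(n1 + n0, round(flat, 3), round(biased, 3))
    # Small sample: 0.571 vs 0.815.  Large sample: 0.6 vs 0.603.
    ```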

    No, it doesn’t mean it would end arguments about the influence of priors/prejudice. Such a Xmas wish won’t be granted.

  13. JH

    Dear Mr. Briggs,

    Please note that I do my best to address your questions point by point.

    Must I tell you the differences among them? I thought I did. I guess I wasn’t clear, but I am not going to do it again.

    No, I didn’t say that your nomenclature on “what a model is” is false. You can name or define it whatever way you want. I can’t say it’s false, but I don’t like it. After reading Part I, it seems to me the commonly used term premise/assumption would work better, just my opinion.

    I was hoping that you would answer the question by telling me the advantages of using the words model/evidence over premise/assumption.

    Well, just as I had hoped that you would give me straightforward answers to my questions about statistical modeling here and here. Point by point.

    I don’t need to show whether A is true, empirically or abstractly, when I want to prove or disprove a conditional statement of the form “If A then B,” where A is the premise/assumption and B is the conclusion. I work under the assumption of A and then draw logical conclusions accordingly, or demonstrate whether B follows logically. In this case, I see no reason that I should assess the probability of A being true, empirically or abstractly. We are neither discussing what a statement knowable a priori is nor making up probability exam problems for an intro class.

  14. Briggs

    JH,

    You did not make clear what the differences between “true”, “accepted as true”, etc. are. If you are “not going to do it again” then we are at an impasse.

    Your question about nomenclature—the least interesting part of this discussion—is answered partly in Part III, partly in Part IV, and finally in Part V.

    For your other comment: you have misunderstood. Using the finite, discrete approach, there is no “prior” as that word is used in its technical sense. I never claimed, and do not claim, that using this method of statistics will eliminate prejudice.

  15. JH

    OK, I did misunderstand you.

    OK, irrelevant point, least interesting question, unclear differences among those terms, … all words for silencing a discussion. Gotcha!
