Model Selection and the Difficulty of Falsifying Probability Models: Part I

These next posts are in the way of being notes to myself.

Logic is the study of the relation between statements. For example, if “All green men are irascible, and Bob is a green man”, then we know that “Bob is irascible” is certainly true. It isn’t true because we measured all green men and found them irascible, for there are no green men; it’s true because syllogisms like this produce valid conclusions.

We know that “there are no green men” because we know that “all observations of men have produced no green ones”, and we know that based on further evidence, extending in a chain to the a priori. As it is not necessary for what follows, this proof is left for another day.

Probability is no different than logic in that it is the study of the relation between statements. Given the premises assumed above, the probability the conclusion is true is 1. Modify the first word in the first premise (“All”) to “Most”, then the probability of the conclusion is less than 1 and greater than 0 (the range depending on the definition of “Most”).
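As a small illustration of how the premise fixes the probability, here is a sketch in Python. It assumes, purely for the example, a finite population of green men, reads “Most” as “more than half but not all”, and treats Bob as equally likely to be any one of them. None of these assumptions is part of the argument above, and the function name is made up for the illustration.

```python
from fractions import Fraction

def conclusion_probability_range(n_green_men):
    """Range of Pr(Bob is irascible | premises) when the premise says 'Most'.

    Assumption (not from the post): 'Most' means more than half, but not all,
    of a finite population of n_green_men, and Bob is equally likely to be any
    one of them.
    """
    # Counts of irascible green men that are compatible with 'Most'.
    counts = [m for m in range(n_green_men + 1)
              if 2 * m > n_green_men and m < n_green_men]
    if not counts:
        raise ValueError("no count satisfies 'Most' for this population size")
    return Fraction(min(counts), n_green_men), Fraction(max(counts), n_green_men)

low, high = conclusion_probability_range(10)
print(low, high)  # 3/5 9/10
```

Change the definition of “Most” and the range changes with it, which is the point: the probability belongs to the pairing of premises and conclusion, not to the conclusion alone.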

In either case, logic or its generalization probability, we cannot know the status of a conclusion without reference to a specific set of premises. We cannot know the probability of so simple a statement as “This coin will show a head when tossed” without reference to some set of premises—which might include observational statements. Thus, all probability is, just as all logic is, conditional.

This background is necessary to emphasize that we cannot know whether a given model or theory is true, regardless of whether that model is wholly probabilistic, deterministic, or in between, without reference to some list of premises. In classical (frequentist) statistics, that premise is (eventually) always “Model M is true”, therefore we know with certainty, given that premise, that “Model M is true”. This premise is usually adopted post hoc, in that many models may be tried, but all are discarded except one.

The “p-value” is the probability of getting a larger (in absolute value) ad hoc statistic than the one actually observed given the premises (1) the observed data, (2) a statement about a subset of the parameter space, and most importantly (3) the model, which is assumed true. If the model is false, the p-value still makes sense because, like our green men, it only assumes the model is true. “Making sense” is not to be confused with being useful as a decision tool. It makes sense in just the same way our green men argument makes sense, but it has no bearing on any real-world decision.

Importantly, despite perpetual confusion, the p-value says nothing about whether (3) the model is true; nor does it say anything about (2) whether the statement about the subset of the parameter space is true. The theory or model is always assumed to be true: not just likely, but certain.
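To make the conditioning concrete, here is a minimal sketch in Python. It assumes a binomial coin-tossing model, takes the distance of the head count from its expected value as the ad hoc statistic, and uses the common “at least as extreme” convention. The function names are hypothetical and chosen only for this illustration.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Pr(k heads in n tosses | binomial model with head-probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def two_sided_p_value(heads, n, theta):
    """Pr(|K - n*theta| >= |heads - n*theta|), computed as if the model were true.

    The model (binomial) and the parameter statement (theta) are premises;
    nothing in this calculation checks whether either one holds.
    """
    observed = abs(heads - n * theta)
    return sum(binomial_pmf(k, n, theta)
               for k in range(n + 1)
               if abs(k - n * theta) >= observed)

# Same data (8 heads in 10 tosses), two different assumed premises.
p_fair = two_sided_p_value(8, 10, 0.5)
p_biased = two_sided_p_value(8, 10, 0.8)
print(round(p_fair, 4), round(p_biased, 4))
```

The data are identical in both calls; only the premise changes, and the p-value changes with it. In neither call does the number say how probable the premise itself is.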

I leave aside here the argument that a theory leads to a unique model: my claim is that the two words are synonymous. Whether or not this is so, a model is a unique, fixed construct (e.g., every addition or deletion of a regressor in a regression is a new theory/model). The ad hoc statistic or hypothesis test of frequentist statistics forms part of the theory/model (in this way, there are always two theories under frequentist contention, with one being accepted as true, the other false).

In Bayesian statistics, there is a natural apparatus for assessing the truth of a model. There is always the element of post hoc model selection in practice, but I’ll assume purity for this discussion. If we begin with the premise, “Models M1, M2, …, Mk are available”, and join it with “Just one model is labeled Mi”, then the prior probability “Model Mi is true” given these premises is 1/k. It is important to understand that if the premise were merely “Model Mi is either true or false”, then the probability “Model Mi is true” is greater than 0 and less than 1, and that is all we can say. This makes sense (and it is different from the frequentist assertion/premise that “Model Mi is true”) because again all logic/probability is concerned with the connections between statements, not the statements themselves (this is the major mistake made in frequentism).

That last assertion means that the list of models under contention is always decided externally; that is, by premises which are unrelated to whether the models are true, or even good or useful. There might be some premise which says, “Given our previous knowledge of the subject at hand, these models are likely true”; that premise might go on to assign prior probabilities different than 1/k for each model under consideration. But it is of the utmost importance to understand that it is we who close the universe on acceptable models. In practice, this universe is always finite: that is, even though we can make statements about them, we can never consider an infinity of models.
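A minimal sketch of this apparatus in Python, under assumptions that are mine and not part of the argument: the closed universe holds three hypothetical coin models, each a binomial with a fixed head-probability, and the data are 8 heads in 10 tosses. The function name is invented for the example.

```python
from math import comb

def posterior_model_probabilities(thetas, heads, n, priors=None):
    """Posterior Pr(M_i | data, premises) over a closed, finite list of models.

    Each hypothetical model M_i says the tosses are binomial with
    head-probability thetas[i]. With no further premise, each prior is 1/k.
    """
    k = len(thetas)
    if priors is None:
        priors = [1 / k] * k
    # Likelihood of the observed head count under each model.
    likelihoods = [comb(n, heads) * t**heads * (1 - t)**(n - heads)
                   for t in thetas]
    # Bayes: posterior is proportional to prior times likelihood.
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]

posterior = posterior_model_probabilities([0.2, 0.5, 0.8], heads=8, n=10)
print([round(p, 3) for p in posterior])
```

The posterior probabilities sum to 1 only because we closed the universe at these three models. Nothing in the calculation says that any of them describes the actual coin.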

In Part II, model selection and what falsifiability is.


  1. Big Question for you William:

    Since you don’t like p-values (or at the very least, the way they’re typically used), what alternative analysis approach do you propose to classical ANOVA (a special case of GLM)? Do you have a problem with the classical approach to the partitioning of sum of squares (other than that it is hard to do correctly)? Or do your concerns go no further than the use of p-values?

    (Assume please a well fashioned experimental design, also difficult to do correctly).

    Thanks in advance.

  2. Mike B,

    Excellent question, which I hope you don’t mind me deferring because I plan on answering it in the context of this post.

    If people used p-values as they are meant to be used, they wouldn’t like them, either. They’re only retained because they make decisions easier: not necessarily correct, just easier. I also do not like hypothesis testing. Your point about a well-fashioned experimental design, i.e. control, is key. Even the key.

  3. Thanks. Looking forward to it as always.

    I also hope that you will address the issues of appropriate partitioning of the sum of squares, as I’ve seen many, many cases where engineers, scientists, and even (gasp) statisticians were fooling themselves with these really small p-values because they were using an error term that was inappropriate (and too small) for the effect they were testing.

    In an earlier post you (almost completely) pooh-poohed the entire concept of randomization, which I also hope you’ll address again within this larger context.

    Good stuff to come, no doubt!

  4. OK, I’m confused.

    I thought models were not simply assumed to be true. Rather, isn’t the goal to determine if the model is true or false?

    I had thought the advantage of the Bayesian view is that it allows for a complete range of perspective on the model, from 0-100%, whereas the other view says it is either correct or disproven… and p values are an indicator of which.

    If this is not the case, then whatever happened to falsifiable hypotheses? If nothing is ever proven false… seems I’m missing something here!

  5. “Be less certain” was your suggestion for the recent ‘cognitive toolkit’ post. Yet here you are defending the thesis that models must be either true or false. Have you considered the possibility that models may be neither true nor false? I see models (and scientific theories) as tools – they may be fit for the job in hand or not, but they are no more true or false than is a spanner or a screwdriver.

  6. This is how I see it. Mathematically, proofs and calculations are performed under the assumption that the premises (the underlying models) are true. While classical analysis assumes the null model is true when calculating the p-value, Bayesian analysis computes the posterior probability of a parameter assuming a certain prior distribution is true. Which is quite different from saying whether a model is useful.

    Just as in real life, we often make a judgment based on what should have happened if something is true. If the evidence points in another direction, then it’s probably not true.

  7. Stephen P:

    Models ARE either true or false. There are no other options… a false model can still be useful, though. Kind of like using a coin as a screwdriver, perhaps?

  8. Adam H

    If, as you say, models are true or false, and no other options allowed, could you tell me what status you ascribe to models that have not yet been falsified? Are they True or False?

  9. Oh… let me add the following.

    Define X to be the number of heads observed when tossing a coin 10 times. It does make sense to say that fitting a normal distribution model to the random variable X is wrong and the model is false. I see many practitioners fit a linear model incorrectly or falsely to integer-valued responses. However, when it comes to the statistical regression model of real data, a statistician would be interested in whether an appropriate model is useful, hence the topic of model selection.

  10. Stephen P:

    I do not know whether they are true or false; however, they are still either true or false. Are there any mugs on my desk? You do not know, yet the answer is either yes or no.

    I suppose we are debating philosophy at this point and I certainly am not up to the challenge. I’ve never understood why theories like Schrödinger’s cat are legitimately entertained… but many smart people will agree with you on this so I’ll leave it at that.

  11. @StephenP

    A model (theory) either predicts all relevant observations correctly, or it does not. It is true if and only if it predicts all observations correctly, and it is false otherwise.

    At any given time there are two sets of known competing models, one set with models that have predicted a subset of all observations correctly, and the other set with models that failed at least once for that subset.

    The second set are models that are known to be false. The first set contains models that are not known to be false.

    Because not all possible observations have been made, in particular no observations in the future, it is not yet known which models that are in the first set are false. There can however be at most one true model of all the competing models in the first set.

  12. @Sander van der Wal

    A very clear and concise way of putting things!

    However, I do not understand why you felt the need to include the last sentence: “There can […] be at most one true model of all the competing models in the first set.” Who cares, since there is no way to confirm it? (There will always be future observations.)

  13. @ Adam H
    OK, let’s leave it there. I hope that there IS a mug on your desk and that it is filled with your favorite brew.

    @ Sander van der Wal

    Your statement of the criteria for deciding which models are true and false is very clear, but it does raise a few questions:

    First, is it really possible to be sure that a model has made a correct prediction? If the prediction only concerns the number of mugs on a desk, then I have no problem, but often the prediction will be a non-integer numerical value. What standard should be applied concerning numerical accuracy of the model and accuracy of measurement of the observation? This is the main reason I find a true/false judgment problematical. I would be much more comfortable saying that the model is characterized by a particular degree of accuracy and has a certain range of application.

    Second, it would seem that at any given time (e.g. now) it is impossible to know if any of the good (i.e. unfalsified) models is true or false. I don’t see the relevance of the true/false criterion if it cannot be applied, i.e. to decide which of two competing unfalsified models should be used. If the true/false criterion has no predictive value, it is not scientific, and Occam’s razor says it should go.

    Third, the idea that there can only be “one true model” sounds rather theological! Is there any evidence for this? As a counter example, although it is not my field, I understand that there are several alternative mathematical models of general relativity that all give equally good results.

  14. @StephenPickering

    1) It depends on the model and the accuracy of the measurements. Observe enough different situations and all wrong models will start to make false predictions.

    2) If you have more than one good model, then it will happen at some point that they will make different predictions for a certain situation. Whether that is important to somebody in particular is a different question. If you are going to use the models for well-known situations in which the models are known to be equivalent, you can use any model. If you venture outside the well-known situations and the models start to make different predictions, then you can start to check which models make the right prediction.

    3) Let’s assume that there is more than one true model. All these models will give identical predictions.
    Assume these models are formula-based (like E = M * C-squared). The formulae used are different, otherwise the models would be identical. Different formulae will give different answers at some point, unless they are equivalent (like E = C * M * C = M * C-squared).

    More generally, a model is a mapping of specific input states to specific output states (which are observed). All true models will have the same mapping of specific input states to output states. That means there is no way to tell the difference between the different true models, and that makes them the same.

  15. @ Sander van der Wal

    I was trying to say the same thing concerning accuracy and range of application (e.g. interpolation vs extrapolation) as you mention in your points 1 and 2, so I think we agree concerning the way models should be used in practice. As far as I can see, we differ only in that you put the label ‘false’ on models that I would label ‘for limited use only’. So be it.

    On the third point, I agree that it might well be possible to show that some models with different formulae are mathematically equivalent – and that they are therefore essentially the same model. However, I am not sure that it will always be the case, and so the identical mapping of inputs to outputs may not mean that there is never any way to distinguish between ‘true’ models. After all, models are not black-boxes – we can also look inside and distinguish them according to the way they work. I would think, for example, that it is possible to build a Monte-Carlo model that can give the same result and to the same degree of accuracy as a formula-based model. I would regard those as different models.

  16. @StephenPickering

    Would you consider two differently named functions with the same mapping the same function, or different functions?

    For example, sin(x) can also be written as its Taylor expansion. Do we now have two different functions with the same mapping, or one function that can be written in different ways?

  17. @ Sander van der Wal

    For the example you give, if a function can be written in two different ways, then for me it is the same function, same model. It is no different from the case of two different computer program implementations of a particular model – it is still the same model.

    No, I was thinking of (possibly hypothetical) cases where we cannot identify a function that is common to two models that have the same mapping. Now this is the interesting bit. Does that mean that no such function exists, or simply that we haven’t discovered it yet? I suppose if you believe in the “one true model”, then there must be some common function, or some demonstration that the different functions in the two models are equivalent, waiting to be discovered. I am agnostic on that issue.

  18. @StephenPickering

    One can always think of a new name.

    How about this? A river can be modelled by a function, or by a physical scale model. Would you consider these two different models, given they are both true models? I would.
