Phil Dawid is a brilliant mathematical statistician who introduced (in 1984) the theory of *prequential probability*^{1} to describe a new-ish way of doing statistics. We ought to understand this theory. I’ll give the philosophy and leave out most of the mathematics, which is not crucial.

We have a series of past data, x = (x_{1}, x_{2}, …, x_{n}), for some observable of interest. Each x_{i} can be quite a general proposition, but for our purposes suppose its numerical representation can take only the values 0 or 1. Maybe x_{i} = “The maximum temperature on day i exceeds W°C”, etc. The x can also have “helper” propositions, such as y_{i} = “The amount of cloud cover on day i is Z%”, but we can ignore all these.

Dawid says, “One of the main purposes of statistical analysis is to make *forecasts for the future*” (emphasis original) using probability. (Its only other purpose, incidentally, is explanation: see this for the difference.)

The x come at us sequentially, and the probability forecast for time n+1 Dawid writes as Pr(x_{n+1} | x^{n}). “Prequential” comes from “probability forecasting with se*quential* *pre*diction.” He cites meteorological forecasts as a typical example.

This notation suffers a small flaw: it doesn’t show the model, i.e. the list of probative premises of x which must be assumed or deduced in order to make a probability forecast. So write p_{n+1} = Pr(x_{n+1} | x^{n}, M) instead, where M are these premises. The notation shows that each new piece of data is used to inform future forecasts.
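To fix ideas, here is a minimal sketch of the prequential setup in Python. The “model” M used here, a smoothed running frequency, and the little data series are invented purely for illustration; they are not anything Dawid proposes. The point is only the bookkeeping: each forecast is issued from M and the data seen so far, before the next observation arrives.

```python
# Minimal prequential sketch: p_{n+1} = Pr(x_{n+1} = 1 | x^n, M) is computed
# from past data only, then x_{n+1} is revealed and folded into the data.

def forecast(past):
    """Hypothetical M: a Laplace-smoothed running frequency of 1s (illustrative only)."""
    return (sum(past) + 1) / (len(past) + 2)

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # made-up sequence of 0/1 observations

forecasts = []
for n in range(len(x)):
    p = forecast(x[:n])     # uses only x^n, never x_{n+1}
    forecasts.append(p)
    # x[n] is then observed and becomes part of the data for the next forecast

for p, obs in zip(forecasts, x):
    print(f"p = {p:.3f}, x = {obs}")
```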

How good is M at predicting x? The “weak prequential principle” is that M should be judged only on the p_{i} and x_{i}, i.e. only on how good the forecasts are. This is not in the least controversial. What is “good” sometimes is. There has to be some measure of closeness between the predictions and outcomes. People have invented all manner of scores, but (it can be shown) the only ones that should be used are so-called “proper scores”. These are scores which require p_{n+1} to be given conditional on just the M and old data and nothing else. This isn’t especially onerous, but it does leave out measures like R^2 and many others.
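For the curious, here is what scoring with two familiar proper scores, the Brier and logarithmic scores, looks like; the forecast-outcome pairs are made up for the sketch.

```python
import math

# Invented (forecast, outcome) pairs; in practice these come from the prequential loop.
pairs = [(0.7, 1), (0.6, 0), (0.8, 1), (0.3, 0), (0.5, 1)]

def brier(p, obs):
    """Brier score for one forecast: (p - obs)^2; lower is better. A proper score."""
    return (p - obs) ** 2

def log_score(p, obs):
    """Logarithmic score: minus the log probability given to what happened. Strictly proper."""
    return -math.log(p if obs == 1 else 1 - p)

print("mean Brier score:", sum(brier(p, o) for p, o in pairs) / len(pairs))
print("mean log score:  ", sum(log_score(p, o) for p, o in pairs) / len(pairs))
```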

Part of understanding scoring is calibration. Calibration has more than one dimension, but since we have picked a simple problem, consider only two. Mean calibration is when the average of the p_{i} equaled (*past tense*) the average of the x_{i}. Frequency calibration is when, for every forecast value q, x_{i} = q in q*100% of the instances where p_{i} = q. Now since x can only equal 0 or 1, frequency calibration is impossible for any M which produces non-extreme probabilities. That is, the first p_{i} that does not equal 0 or 1 dooms the frequency calibration of M.
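A small sketch of both checks, on invented forecasts and outcomes: mean calibration compares averages, and the binned table is the comparison usually computed when people speak of frequency calibration (which, in the strict sense above, non-extreme forecasts can never satisfy).

```python
from collections import defaultdict

# Invented forecasts and binary outcomes, purely for illustration.
forecasts = [0.2, 0.7, 0.7, 0.3, 0.9, 0.7, 0.2, 0.9]
outcomes  = [0,   1,   0,   0,   1,   1,   0,   1  ]

# Mean calibration: compare the average forecast with the average outcome.
print("mean forecast:", sum(forecasts) / len(forecasts))
print("mean outcome: ", sum(outcomes) / len(outcomes))

# Binned check: group by forecast value, compare to the observed frequency of x = 1.
groups = defaultdict(list)
for p, obs in zip(forecasts, outcomes):
    groups[p].append(obs)
for q in sorted(groups):
    freq = sum(groups[q]) / len(groups[q])
    print(f"forecast {q}: observed frequency {freq:.2f} over {len(groups[q])} cases")
```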

*Ceteris paribus*, fully calibrated models are better than non-calibrated ones (this can be proven; they’ll have better proper scores; see Schervish). Dawid (1984) only considers mean calibration, and in a limiting way; I mean mathematical limits, as the number of forecasts and data head out to infinity. This is where things get sketchy. For our simple problem, calibration is possible finitely. But since the x are given by “Nature” (as Dawid labels the causal force creating the x), we’ll never get to infinity. So it doesn’t help to talk of forecasts that have not yet been made.

And then Dawid appears to believe that, out at infinity, competing mean-calibrated models (he calls them probability forecasting systems) are indistinguishable. “[I]n just those cases where we cannot choose empirically between several forecasting systems, it turns out we have no need to do so!” This isn’t so, finitely or infinitely, because two different models which have the same degree of mean calibration can have different levels of frequency calibration. So there is still room to choose.
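A toy illustration of the point, with invented numbers: two systems with identical mean calibration, only one of which is frequency calibrated, and they are easily told apart.

```python
# Two invented forecasting systems with the same mean calibration but different
# frequency calibration, so there is still something to choose between them.
outcomes = [1, 0, 1, 0, 1, 0]
p_a = [0.5] * 6             # always 0.5: mean forecast 0.5, same as the mean outcome
p_b = [1, 0, 1, 0, 1, 0]    # extreme forecasts: also mean 0.5, and every forecast equals its outcome

for name, ps in (("A", p_a), ("B", p_b)):
    mean_p = sum(ps) / len(ps)
    mean_brier = sum((p - obs) ** 2 for p, obs in zip(ps, outcomes)) / len(outcomes)
    print(f"model {name}: mean forecast {mean_p:.2f}, mean Brier {mean_brier:.3f}")
```

Note that the frequency-calibrated system B also earns the better proper score, which is the Schervish point mentioned above.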

Dawid also complicates his analysis by speaking as if Nature is “generating” the x from some probability distribution, and that a good model is one which discovers Nature’s “true” distribution. (Or, inversely, he says Nature “colludes” in the distribution picked by the forecaster.) This is the “strong prequential principle”, which I believe does not hold. Nature doesn’t “generate” anything. Something *causes* each x_{i}. And that is true even in the one situation where our best knowledge is only probabilistic, i.e. the very small. In that case, we can actually deduce the probability distributions of quantum x in accord with all our evidence. But, still, Nature is not “generating” x willy-nilly by “drawing” values from these distributions. Something *we-know-not-what* is causing the x. It is our knowledge of the causes that is necessarily incomplete.

For the forecaster, that means, in every instance and for any x, the true “probability distribution” is the one that takes only extreme probabilities, i.e. the best model is one which predicts without error (each p_{i} would be 0 or 1 and the model would automatically be frequency and mean calibrated). In other words, the best model is to discover the cause of each x_{i}.

Dawid also has a technical definition of the “prequential probability” of an “event”, a game-theoretic-like construction which need not detain us, because of our recognition that the true probability of any event is 0 or 1.

**Overall**

That models should be judged ultimately by the predictions they make, and not by exterior criteria (which unfortunately include political considerations, and even p-values), is surely desirable but rarely implemented (how many sociological models are used to make predictions in the sense above?). But which proper score does one use? Well, that depends on exterior information; or, rather, on evidence which is related to the model and to its use. Calibration, in all its dimensions, is scandalously underused.

Notice that in Pr(x_{n+1} | x^{n}, M) the model remains fixed and only our knowledge of more data increases. In real modeling, models are tweaked, adjusted, improved, or abandoned and replaced wholesale, meaning the premises (and deductions from them) which comprise M change in time. So this notation is inadequate. Every time M changes, it is a different model, a banality which is not always remembered. It means model goodness judgments must begin anew for every change.
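A trivial bookkeeping sketch of what that implies, with invented forecasts: once the premises change, the score tallies are kept per version of M rather than pooled into one running total.

```python
# Invented (forecast, outcome, model-version) triples; the version label marks
# which set of premises M produced the forecast.
stream = [(0.7, 1, "M1"), (0.6, 0, "M1"), (0.8, 1, "M2"), (0.4, 0, "M2"), (0.9, 1, "M2")]

scores = {}
for p, obs, version in stream:
    scores.setdefault(version, []).append((p - obs) ** 2)   # Brier contribution

# Each version of the model gets its own judgment.
for version, s in scores.items():
    print(f"{version}: mean Brier {sum(s) / len(s):.3f} over {len(s)} forecasts")
```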

A true model is one that generates extreme probabilities (0 or 1), i.e. one that identifies the causes, or else gives the “tightest” probabilities deduced from the given (restricted by nature) premises, as in quantum mechanics. Thus the ultimate comparison is always against perfect (possible) knowledge. Since we are humble, we know perfection is mostly unattainable, thus we reach for simpler comparisons and gauge model success by its success over simple guesses. This is the idea of skill (see this).
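As a sketch of skill, assuming the Brier score and a naive running-base-rate reference forecast (both choices mine, for illustration, on made-up data): the skill score is positive only when the model improves on the simple guess.

```python
# Brier skill score: 1 - (model score / reference score); positive means the
# model beats the naive guess. All data and forecasts below are invented.

def brier_mean(ps, xs):
    return sum((p - obs) ** 2 for p, obs in zip(ps, xs)) / len(xs)

outcomes = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
model_p  = [0.5, 0.4, 0.6, 0.7, 0.5, 0.7, 0.8, 0.7, 0.6, 0.8]              # made-up model forecasts
naive_p  = [(sum(outcomes[:n]) + 1) / (n + 2) for n in range(len(outcomes))]  # simple running base rate

bs_model = brier_mean(model_p, outcomes)
bs_naive = brier_mean(naive_p, outcomes)
skill = 1 - bs_model / bs_naive
print(f"model Brier {bs_model:.3f}, naive Brier {bs_naive:.3f}, skill {skill:.3f}")
```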

Reminder: probability is a measure of information, an epistemology. It is not the language of causality, or ontology.

—————————————————————————–

*Thanks to Stephen Senn for asking me to comment on this.*

^{1}The two papers to read are: Dawid, 1984, “Present position and potential developments: some personal views: statistical theory: the prequential approach,” *JRSS A*, 147(2), 278–292; and Dawid and Vovk, 1999, “Prequential probability: principles and properties,” *Bernoulli*, 5(1), 125–162.