The Predictive Way Is The Only Way In Statistics: Sick Cow Edition

What is the purpose of modeling? There can be only two.

The first is to say what happened, to explain. But in order to do this one must first assume what one wants to show. If you want your model to show that the drug cured more people than the placebo, you must assume this is true.

Just think: any data collected will show either that the placebo or the drug cured more patients, conditional on whatever other measures or subsets of other measures that you like. We can stop right there! If the data shows the drug cures more or less (in whatever subsets) we are done. A model may, it’s true, invert this observation, but only if you build this reversal into it. Models can only show what they are told to show.

The explanatory model can thus only be confirmed when its makes skillful predictions, which is the second and true purpose of any model: prediction. Buy this award-eligible international seller book for more on this important subject.

Enter sick cows.

In the R package lme4, the dataset cbpp, taken from the paper “Within-herd spread of contagious bovinepleuropneumonia in Ethiopian highlands” by Lesnoff and others, in Preventive Veterinary Medicine. From the data description:

Contagious bovine pleuropneumonia (CBPP) is a major disease of cattle in Africa, caused by a mycoplasma. This dataset describes the serological incidence of CBPP in zebu cattle during a follow-up survey implemented in 15 commercial herds located in the Boji district of Ethiopia. The goal of the survey was to study the within-herd spread of CBPP in newly infected herds. Blood samples were quarterly collected from all animals of these herds to determine their CBPP status. These data were used to compute the serological incidence of CBPP (new cases occurring during a given time period). Some data are missing (lost to follow-up).

These are incidence counts in 15 different herds at four consecutive time period (n = 56). Mixing code and output:


head(lme4::cbpp)
  herd incidence size period
1    1         2   14      1
2    1         3   12      2
3    1         4    9      3
4    1         0    5      4
5    2         3   22      1
6    2         1   18      2

For example, herd 15 started in period 1 with 19 head and had one infection, and at period 2 there were only 15 head and 1 new infection, and so on.

Now we could look at the infection rate at each time period and for each herd and report on that. Conditional on the belief that this disease is the only cause of the observations we are done. Some herds at different periods have different infection rates than others. We have no measurements and no idea why the differences exist. We can only deduce that differences must exist. Which is to say we have no assumptions about causes of these differences, but we know different causes exist.

What good would a model do now except to tell its what we have already seen? No, we never need a model to show us what happened. All we have to do is look. We only want to model to make predictions about other herds and periods that are like ours.

What do I mean by like? Only this: that the as-yet-unmeasured herds and periods are subject to the same, and only the same, causes as the herds and periods we have already measured. This is yet another assumption, a prior if you will, that we bring to the problem.

Very well, let’s assume new herds and periods will be like ours in this casual sense. Then we are still not done with our casual assumptions. Is each herd the same? I mean is herd 1 the same as herd 2 and so on? The veterinarians didn’t think so. They picked a model to let the herds be different. So we will too. And if herds are different, so are periods.

Here is the model in code form, take from this source and suitably modified:


library(rstanarm)

fit <- stan_glmer(cbind(incidence, size - incidence) ~ size + period + (1|herd),
             data = lme4::cbpp, family = binomial, QR = TRUE)

This is a standard linear binomial model, where the intercept is varied by herd, an acknowledgement of the assumption that different herds have different causes operating on them.

We can look at the model parameters (code and output mixed):


summary(fit)

Estimates:
                                      mean   sd     2.5%   25%    50%    75%    97.5%
(Intercept)                           -1.6    0.6   -2.8   -2.0   -1.5   -1.1   -0.4 
size                                   0.0    0.0    0.0    0.0    0.0    0.0    0.1 
period2                               -1.0    0.3   -1.6   -1.2   -1.0   -0.8   -0.3 
period3                               -1.1    0.4   -1.8   -1.4   -1.1   -0.9   -0.4 
period4                               -1.6    0.5   -2.6   -1.9   -1.6   -1.3   -0.7 
b[(Intercept) herd:1]                  0.6    0.4   -0.2    0.3    0.6    0.9    1.5 
b[(Intercept) herd:2]                 -0.4    0.4   -1.3   -0.6   -0.3   -0.1    0.4 
b[(Intercept) herd:3]                  0.4    0.4   -0.3    0.1    0.4    0.6    1.1 
b[(Intercept) herd:4]                  0.0    0.5   -1.0   -0.3    0.0    0.4    1.0 
b[(Intercept) herd:5]                 -0.3    0.4   -1.1   -0.5   -0.2    0.0    0.6 
b[(Intercept) herd:6]                 -0.5    0.5   -1.4   -0.8   -0.4   -0.2    0.4 
b[(Intercept) herd:7]                  1.0    0.4    0.1    0.7    0.9    1.2    1.9 
b[(Intercept) herd:8]                  0.5    0.5   -0.5    0.2    0.5    0.9    1.5 
b[(Intercept) herd:9]                 -0.3    0.6   -1.4   -0.6   -0.2    0.1    0.8 
b[(Intercept) herd:10]                -0.6    0.4   -1.6   -0.9   -0.6   -0.3    0.2 
b[(Intercept) herd:11]                -0.2    0.4   -1.0   -0.4   -0.1    0.1    0.6 
b[(Intercept) herd:12]                -0.1    0.5   -1.1   -0.4   -0.1    0.3    1.0 
b[(Intercept) herd:13]                -0.8    0.5   -1.8   -1.1   -0.8   -0.5    0.1 
b[(Intercept) herd:14]                 1.0    0.5    0.2    0.7    1.0    1.3    2.0 
b[(Intercept) herd:15]                -0.6    0.5   -1.7   -0.9   -0.6   -0.3    0.3 
Sigma[herd:(Intercept),(Intercept)]    0.6    0.4    0.2    0.4    0.5    0.8    1.6 

Use summary(fit, digits=3) to get better resolution; too many numbers here look ugly on this screen.

These are the parameter posteriors found using the priors assumed in the model. There is nothing special about them, and neither is there anything curious.

What about the conjugate normal priors on the parameters? Well, change the ad hoc chosen-for-convenience prior, change the posterior. Feel free to change this one to see how it affects the results. But also, change the ad hoc chosen-for-convenience model, change the posterior. There is no mystery here! All probability at all times is conditional on the assumptions we make. Nothing has a probability. Probability is a state of mind deduced from assumptions. (A secret is that weird premise of infinite parameters and observations, which are impossible in practice, and which we'll see later.)

Now these posteriors are in the form of log odds, a screwy thing to think of. Of course, they can easily be transformed into odds by exponentiation. Still weird to think of, though, because these would not be odds of the observable, but of some strange multiplier. In any case, it would be a capital mistake to assume the certainty we have in these parameters is equal to the certainty of the observable, which is the only uncertainty we care about. Repeat that last sentence before continuing.

We have to pick a herd. How about 15 at period 1? We know with certainty that the incidence number was 1 with herd size 19. We are done! Unless we want to say something about as yet unseen herds that are causally like herd 15.

So let's look at the predictive posterior:


newdata = data.frame( herd = "15", size = 19, incidence = 1, period = "1")
p = posterior_predict(fit,newdata)
  
plot(table(p)/length(p),ylab='Probability', xlab='Incidence', main='Like Herd 15, Period 1')

Herd size was 19, so the maximum possible number of incidents is also 19. The probabilities, conditional on all the assumptions and data we have made so far, of every possible incident number are in this pic. Incidence numbers greater than 9 are pretty rare. These are not the probabilities: they are the probabilities conditional on all the assumptions we made to this point.

And we're done again! That picture is the answer.

"But what about the prior?" Well, what about the model? Okay. Change the prior above to see what a different prior does. If you're not sure which to use, then use both! Combine the models.

Like this:


  
fit2 <- stan_glmer(cbind(incidence, size - incidence) ~ size + period + (1|herd),
       data = lme4::cbpp, family = binomial, QR = TRUE, prior_intercept = student_t(4, 0, 10, autoscale = FALSE))

# 4 is degrees of freedom; i.e. a wider prior

p2 = posterior_predict(fit2,newdata)
  
# examine the new model
plot(table(p2)/length(p),ylab='Probability', xlab='Incidence', main='Like Herd 15, Period 1, Model 2')

# combined
plot(table(c(p,p2))/length(c(p,p2)),ylab='Probability', xlab='Incidence', main='Like Herd 15, Period 1\nCombined Model')

This combination is based on the statistical syllogism, with is the deduction based on the premise that we have two models and only two, therefore the prior probability of either is deduced as 1/2. Why this prior? No reason, except that it's one that's in common use. Why this model? No reason, except that it's one that's in common use.

What if we wanted to look at the same herd at period 2? I mean herds casually similar to ours (15) at period 2. We can, of course, but it requires more assumptions. From period 1 to 2 herd 15 lost 4 head. There was only 1 infection in period 1, which we can assume led to the cow's culling or death. But what about the other 3? These losses had some cause or causes--- which we must assume will be the same for our as yet unmeasured herd that is causally like 15!


# increase iterations to get better convergence
fit <- stan_glmer(cbind(incidence, size - incidence) ~ size + period + (1|herd),
             data = lme4::cbpp, family = binomial, QR = TRUE, iter=10000)

newdata = data.frame( herd = "15", size = 15, incidence = 1, period = "2")
p = posterior_predict(fit,newdata)

plot(table(p)/length(p),ylab='Probability', xlab='Incidence', main='Like Herd 15, Period 2', lwd=7, xlim=c(0,8))
  axis(1,at=0:8,labels=0:8)

newdata = data.frame( herd = "15", size = 18, incidence = 1, period = "2")
p = posterior_predict(fit,newdata)

lines(table(p)/length(p), col=2)

I reuse the p, and rebuild the model for better convergence (recalling there is no magic or "randomness" in MCMC methods; they are just numerical approximations to integrations). It's still a bit off for the smallest probabilities. Increase it yourself for tighter results.

Obviously, the probabilities for number of incidents change because the assumption on herd size changes. Which is "the" assumption to use? There is none. It depends on the situation you think will obtain in the future.

"I'm still concerned about the prior. I'm a frequentist and priors make me nervous."

You're not concerned about the model, which is vastly more influential than the prior assumption, because as a frequentist you believe things "have" probabilities. That is, after all, what frequentist theory insists. Things don't have probabilities. All probability is deduced based on the assumptions we make. The ideal situation, which is not this one, and that observation is key, is when we start with true premises and deduce a model.

Here, we start with ad hoc premises based on past experience. These suggest a model, which may or may not be any good, and all models which use infinite parameters must have priors.

"But how do I tell if the model and priors are any good?"

You can't. Not now. There is no way to tell for certain, just as there is no way to tell for certain in any frequentist analysis. The only way to know is to use the model to make predictions of as-yet-unseen observations. If these predictions are useful---where useful depends on more assumptions which can vary from person to person---then the model is good. If not, then not.

Simple as it is, that is the answer.

To support this site and its wholly independent host using credit card or PayPal (in any amount) click here

9 Thoughts

  1. Two points:

    1. You seem to have skipped over the use of statistics as part of a process. Two forms I’m familiar with: control and development. In control we use sampling and statistics to ensure the process is currently within specs, and in product development statistics is used to tentatively eliminate possibilities (via p-values) and to steer future studies (e.g. a “Ranking and Selection” approach)

    2.

    “But how do I tell if the model and priors are any good?”

    You can’t. Not now…

    Model criticism is the whole point of p-values (classical Fisherian, prior predictive, and posterior predictive.) They are generalized goodness of fit summary statistics. If the data don’t fit, the model/prior pair needs to be modified/rejected.

  2. “But how do I tell if the model and priors are any good?
    You can’t. Not now…”

    Model criticism is the whole point of p-values …

    You’ve missed the point which is how well the model predicts.
    P-values are useless in that regard.

    It should go without saying that a model that performs poorly on its training set will likely perform poorly on new data.

    While lack of correlation may indicate a bad model, it’s unfortunately too easy to find some correlation in almost every dataset. Another reason to avoid using it as a measure or indication of performance.

  3. @Dav

    Hi DAV

    You’ve missed the point which is how well the model predicts.
    P-values are useless in that regard.

    Different points, I guess. My point is not how well, but how much better it predicts. If the model is no better than a simpler model on the training data or on the test data, then its predictive capabilities are rather moot, aren’t they? It doesn’t matter how pretty it is, if I can do just as well with the simpler model. Occam’s razor and all that.

    How often do you start with a completely null model? In my experience there is always a simpler existing model (which can be physical thing or method) and we are trying to improve on that.

    As you note, it can be easy to find something that has a partial correlation not equal to zero. The question is always how much better that new-found relationship is. The first step to answering that is internal, the second step is external.

  4. “Things don’t have probabilities”

    Define “have”.

    Casinos with their games and rules, playing cards, dice, life insurance companies, physicists who might think the jury is still out on if the universe is ‘quantum’, and scientists and school kids who observe the obvious convergence of relative frequency in coin flipping, and the sheer number of things that follow a normal distribution, etc., might disagree.

    Not to mention, when going from the mathematical to the real world, Kolmogorov himself noted in his “On Tables of Random Numbers” the contribution of von Mises

    “…the basis for the applicability of the results of the mathematical theory of probability to real ‘random phenomena’ must depend on some form of the frequency concept of probability, the unavoidable nature of which has been established by von Mises in a spirited manner.”

    As well as in his “Foundations of the Theory of Probability”

    “In establishing the premises necessary for the applicability of the theory of probability to the world of actual events, the author has used, in large measure, the work of R. v. Mises”

    Justin

  5. Bill R.,
    Sorry but no. Fitting the model better doesn’t necessarily lead to one with more predictive power. In fact, it usually leads to one that has been overfit and thus generalizes more poorly. Of course, that assumes anyone tried to determine the predictive power in the first place. Never seems to happen in fields that rely on p-value the most (sociology, epidemiology, psychology, the list goes on).

  6. Bill_R,

    Again, no. Goodness of fit and predictive power aren’t the same thing. Read up on how this difference is addressed in machine learning then maybe you’ll understand.

    https://www.iro.umontreal.ca/~lisa/pointeurs/bengio+al-decisiontrees-2010.pdf

    Here’s one attempt to improve predictive power:
    https://biostats.bepress.com/ucbbiostat/paper266/

    All methods suffer from dataset shift
    http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf
    Dataset shift (aka concept drift, et al.) is one reason why it is necessary to continue to find new data to test against.

    Some domains (such as vision) aren’t bothered so much by drift. Others, such as attempting to predict the outcome of sporting events or the effectiveness of a drug on a particular disease, are affected more deeply.

    Justin,

    Probability is a measure of confidence in outcome based upon what you know (and should actually have stated), which makes it somewhat subjective. It is not an inherent property.

    It may involve counts but those things while convenient are often misleading examples. If you have a six-sided object — and that is all that you know — then Pr(side=a | a tossed six-sided object) must be 1/6. Anything else assumes facts not in evidence. The probable outcome of a basketball game must be 50/50 if all that is known there are two teams.

    The latter should point out why events and things don’t inherently have a probability. Under frequentist rules it is a nonsensical calculation because it is a one time event — particularly if neither team has played the other in the past — there’s nothing to count. You can try adding assumptions (such as W/L ratio has predictive power) but then you are changing the premises. In any case, the probability is based upon the information at hand. Your information may be different than mine.

  7. “If you have a six-sided object — and that is all that you know — then Pr(side=a | a tossed six-sided object) must be 1/6. Anything else assumes facts not in evidence.”

    I think assuming a uniform distribution, while often reasonable, is assuming facts not in evidence. All we know here is Pr(side=a | a tossed six-sided object) = p, for some p in [0,1].

    For example, I tossed some tacks thousands of times (see my flippingtacks page). I didn’t know anything about the tacks, but generally knew they could land 2 ways. Why would this mean that the probability is 50-50 just because there are two sides? It turns out, it is more like 61/39.

    “The probable outcome of a basketball game must be 50/50 if all that is known there are two teams.”

    Again, you’re assuming a uniform distribution (which again I don’t think is too unreasonable). All we can know from this is that the probability is p, for some p in [0,1], without making any other assumptions.

    “The latter should point out why events and things don’t inherently have a probability. Under frequentist rules it is a nonsensical calculation because it is a one time event — particularly if neither team has played the other in the past — there’s nothing to count. You can try adding assumptions (such as W/L ratio has predictive power) but then you are changing the premises. In any case, the probability is based upon the information at hand. Your information may be different than mine.”

    What you call premise and logic, I can call the reference set and sample space.

    Also, in regards to some things not having a probability, some of these things a frequentist wouldn’t even call a probability anyway (much like geologists don’t specialize in bones and archaeologists don’t specialize in rocks), but something more like ‘uncertainty’, ‘chance’, etc. Frequentism purposely limits probability to mean a certain thing (convergence of relative frequency, irrelevance of place selection) like many branches of science limit what they are studying.

    Often what are impossible one time events (by critics) often do in fact make sense in frequentist paradigm and/or are equally troublesome for non-frequentist approaches. For example, each person is unique (one time event), but life insurance tables work just fine. A one time event like what is P(world war III) as is not a deal breaker for frequentism. In some cases a frequentist can assign probability to single events using a prediction rule. For example, P(A_n+1) = xbar_n, where it is just a matter of choosing an appropriate statistical model, as Spanos notes, and making your assumptions known. For example, one could model the economies of the world, the relation between countries, measures of aggression, weapon supplies, war-like activities, and so on. There is also a “many worlds” interpretation of frequentism that I will mention just to be complete, and that refers to the “sci-fi” idea that say for a one-time event with probability p = X/N, the event occurred in X worlds out of the N worlds, and this one-time event just happened to occur in our world. The frequentist could also simply use Bayes rule, which is fully in the frequentist domain when it involves general events and not probability distributions on parameters.

    No approach will be great for n=1, small sample, or one-time event. But I think frequentism and Bayes do better than Briggs’ approach, whatever that is !

    Justin

  8. I think assuming a uniform distribution, while often reasonable, is assuming facts not in evidence.

    That’s because you insist on thinking of probability as a distribution of counts while I am saying it is the confidence level of an outcome. Without any additional evidence other than the number of sides (n), one cannot have any confidence in any outcome other than 1/n. Anything else has a built-in bias.

    There is also a “many worlds” interpretation of frequentism that I will mention just to be complete, and that refers to the “sci-fi” idea that say for a one-time event with probability p = X/N, the event occurred in X worlds out of the N worlds …

    But then, they don’t occur in X/N worlds. Talk about assuming facts not in evidence!

    But I think frequentism and Bayes do better than Briggs’ approach
    Not surprising as you don’t seem to see the fundamental difference between the frequentist and Bayesian viewpoints.

    Frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general. This is a very important point that you should carefully examine. Ignoring it often leads to misinterpretations of frequentist analyses.

    In contrast, Bayesians view probabilities as a more general concept. As a Bayesian, you can use probabilities to represent the uncertainty in any event or hypothesis. Here, it’s perfectly acceptable to assign probabilities to non-repeatable events, such as Hillary Clinton winning the US presidential race in 2016. Orthodox frequentists would claim that such probabilities don’t make sense because the event is not repeatable.

Leave a Reply

Your email address will not be published. Required fields are marked *