How To Do Predictive Statistics: Part X: Verification 1

We are finally at the most crucial part of the modeling process: proving whether the damned thing works.

Not “works” in the sense that we get good unobservable parameter estimates, but works in the sense that the model makes useful, skillful predictions. No model should ever be released, by any researcher, until it has been verified at least using old data, and preferably using never-before-seen-or-used-in-any-way data. I’ll later have a paper on this subject, but I go on and on about it in this award-eligible book.

We’re going to use the Boston Housing data available in the mlbench package in R.


install.packages("mlbench")   # only needed once
require(mlbench)
data(BostonHousing)
?BostonHousing

The last line of code will explain the dataset, which contains 506 observations of median housing prices (in $1,000s) in different Boston neighborhoods, taken from the 1970 Census. The key measure is nox, atmospheric “nitric oxides concentration (parts per 10 million)”, which we will take as having been measured without error. Which is not likely. Meaning that if we could take into account whatever uncertainty exists in that measurement, we should, and the results below would be even less certain.
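
A quick look at the data never hurts (a sketch; nothing later depends on it):


dim(BostonHousing)            # 506 neighborhoods, 14 measures
summary(BostonHousing$medv)   # median prices, in $1,000s
summary(BostonHousing$nox)    # nox, in parts per 10 million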

The idea was that high nox concentrations would be associated with lower prices, where “associated” was used as a causal word. To keep the example simple yet informative, we only use some of the measures: crim, per capita crime rate by town; chas, Charles River border indicator; rm, average number of rooms per dwelling; age, proportion of owner-occupied units built prior to 1940; dis, weighted distances to five Boston employment centres; tax, full-value property-tax rate; and b, a function of the proportion of blacks by town.

The ordinary regression gives this coefficient table:


# ordinary linear regression: glm with its default gaussian family
fit = glm(medv ~ crim + chas + nox + rm + age + dis + tax + b, data=BostonHousing)
summary(fit)

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.035298   4.193677  -2.631  0.00877  
crim         -0.110783   0.036748  -3.015  0.00270  
chas1         4.043991   1.010563   4.002 7.25e-05 
nox         -11.068707   4.145901  -2.670  0.00784  
rm            7.484931   0.381730  19.608  < 2e-16 
age          -0.068859   0.014473  -4.758 2.57e-06 
dis          -1.146715   0.207274  -5.532 5.12e-08 
tax          -0.006627   0.002289  -2.895  0.00395 
b             0.012806   0.003142   4.076 5.33e-05 

Look at all those excitingly wee p-values. Glorious, no? No. We'll soon see they lead to bitter disappointment.

Let's now fit the Bayesian version of the same regression, using defaults on the priors as we've been doing. We'll fit two models: one with nox, one without.


require(rstanarm)   # provides stan_glm and posterior_predict

fit.s   = stan_glm(medv ~ crim + chas + nox + rm + age + dis + tax + b, data=BostonHousing)
fit.s.n = stan_glm(medv ~ crim + chas + rm + age + dis + tax + b, data=BostonHousing)

We could look at the summaries of these models, and you should, but they would only give information about the posterior distributions of the parameters. Since we're using numerical approximations (MCMC methods) to get answers, we should check whether the algorithms are working. They are. We could do better by tuning the approximation (e.g. more iterations), but the defaults are close enough for us here.
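
If you want to run the check yourself, the model summaries include the usual MCMC diagnostics, and trace plots work too if you prefer pictures (a sketch):


summary(fit.s)         # table includes mcse, Rhat, and n_eff; Rhat near 1 is what we want
summary(fit.s.n)
plot(fit.s, "trace")   # trace plots of the chains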

So how do we check if the model works?

We have to ask a question that is pertinent to a decision we would make using it. The third quartile observed housing price was $35,000. What is the predicted probability prices would be higher than that given different levels of nox for data not yet observed? The answer for old data can be had by just looking. Why this question? Why not? If you don't like it, change it!
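
"Just looking" at the old data takes one line (a sketch):


mean(BostonHousing$medv > 35)   # observed fraction of neighborhoods with prices above $35,000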

In order to answer our question, we also have to specify values for crim, chas, and all the other measures we chose to put into the model. I picked median observed values for all. If you have other questions or other values you like, try them!

Let's look at the predictions of housing price for future varying levels of nox: I used a sequence from the minimum to maximum observed values, with all other measures at their medians.


nnox = seq(min(BostonHousing$nox),max(BostonHousing$nox),by=.01)
s = length(nnox)
newdata = data.frame(crim = rep(median(BostonHousing$crim),s) , 
                     chas = rep("0",s) , 
                     nox = nnox ,
                     rm = rep(median(BostonHousing$rm),s) ,
                     age = rep(median(BostonHousing$age),s) ,
                     dis = rep(median(BostonHousing$dis),s) ,
                     tax = rep(median(BostonHousing$tax),s) ,
                     b = rep(median(BostonHousing$b),s) )
p.s.n = posterior_predict(fit.s,newdata)
p.s.non = posterior_predict(fit.s.n,newdata)
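
If the shape of these prediction objects is hazy, a quick peek helps (a sketch; at this point p.s.n holds the with-nox predictions at the new nox values):


dim(p.s.n)                   # posterior draws (rows) by nox scenarios (columns)
head(colMeans(p.s.n > 35))   # per-scenario Pr(Price > $35,000 | XDM), approximated from the draws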

We built predictions many times before, so if you've forgotten the structure of this code, review! Now let's see what that looks like, in a relevance plot:


plot(nnox, colMeans(p.s.n > 35), type='l', lwd=3, xlab='nox (parts per 10 million)', ylab='Pr(Price > $35,000 | XDM)')
lines(nnox, colMeans(p.s.non > 35), lwd=2, col=4)
grid()
legend('topright', c('With nox','No nox'), col=c(1,4), lty=1, lwd=3, bty='n')

The predictive probability of high housing prices goes from about 4% at the lowest levels of nox to something near 0% at the maximum nox values. The predictive probability in the model without nox is about 1.8% on average. The original p-value for nox was 0.008, which nearly everybody would take as evidence of a strong effect. Yet for this question the changes in probability are quite small. Are these differences (a +/- 2% swing) in probability enough to make a difference to a decision maker? There is no single answer to that question. It depends on the decision maker. And there would still not be an answer until it was certain the other measures were making a difference. I'll leave that as homework.
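
You can pull those numbers out directly instead of eyeballing the plot (your values will wiggle a little with the MCMC draws):


range(colMeans(p.s.n > 35))    # with-nox model: lowest and highest predictive probabilities
mean(colMeans(p.s.non > 35))   # no-nox model: essentially flat across nox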

We need this helper function, a simple adaptation of R's built-in ecdf, which I don't love. If you'd rather use the original, have a ball.


ecdf <- function (x)   # note: this masks R's built-in stats::ecdf
{
    x <- sort(x)
    n <- length(x)
    vals <- unique(x)
    rval <- cumsum(tabulate(match(x, vals)))/n
    return(list(ecdf=rval, vals=vals))
}
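
A quick check of what it returns (a sketch):


tmp = ecdf(rnorm(10))
str(tmp)   # a list: $ecdf holds the cumulative probabilities, $vals the sorted unique values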

Now this is a fun little plot. It shows the probability prediction of Y for every old observed X, supposing that old X were new. This assumes my version of ecdf.


# the old data is reused as if it were new
p.s   = posterior_predict(fit.s)     # with-nox model, at the old X
p.s.n = posterior_predict(fit.s.n)   # no-nox model, at the old X (this overwrites the earlier p.s.n)

P = ecdf(p.s[,1])
plot(P$vals, P$ecdf, type='l',xlim=c(-20,60), xlab="Price = s", ylab="Pr(Price < s | XDM)")
for (j in 2:nrow(BostonHousing)){
  P = ecdf(p.s[,j])
  lines(P$vals, P$ecdf, type='l')
}
grid()
abline(v=0,lty=2,col=2,lwd=2)

Price (s) is on the x-axis, and the probability of future prices less than s, given the old data and M(odel), is on the y-axis. A dotted red line at $0 is shown. Now we know, from knowledge external to M, that prices cannot be less than $0. Yet the model far too often gives positive probabilities to impossible prices. The worst prediction gives about a 65% chance of prices less than $0. You remember we call this probability leakage.
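
The leakage is easy to quantify directly; something like this (the 65% figure is from my run; yours will differ a touch):


leak = colMeans(p.s < 0)   # predictive probability of an impossible negative price, for each old X
max(leak)                  # the worst offender
mean(leak > 0.05)          # fraction of old X with more than trivial leakage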

So right now we have good evidence this model has tremendous failure points. It is not a good model! And we never would have noticed had we not examined the model in its predictive form---and we also remember all models have a predictive form: even if you don't want to use that form, it's there. What we should do at this point is change the model to remove the leakage (and surely you recall we know how to do that: review!). But we won't: we'll keep it and see how the rest of the model verifies.

Let's look at four predictions, picking four old X, because the all-predictions plot is too busy. And on this we'll over-plot the ECDF of the observations, which are, of course, just step functions.


par(mfrow=c(2,2))
for (j in base::sample(1:nrow(BostonHousing), 4)){
  P = ecdf(p.s[,j])
  plot(P$vals, P$ecdf, type='l', xlim=c(-20,60), xlab="Price = s", ylab="Pr(Price < s | XDM)")
  lines(stepfun(BostonHousing$medv[j], c(0,1)), xval=P$vals, do.points=FALSE)
  grid()
}

From this plot we have the idea that the closer the ECDF of the prediction is to the ECDF of the observation, the better the model does. This is a good idea, indeed, a great one. It leads to the idea of a score that measures this distance.

One such, and very well investigated, score is the continuous ranked probability score, or CRPS. It is not the only one, just the one we'll use today.

Let F_i(s) = Pr( Y < s | X_i D_n M), i.e. a probabilistic prediction of our model for (past) measure X_i (which can be old or new; the D_n is the old data, as we have been writing). Here we let s vary, so that the forecast or prediction is a function of s, but s could be fixed, too. Let Y_i be the i-th observed value of Y. Then

    CRPS(F_i, Y_i) = ∫ ( F_i(s) - I( s ≥ Y_i ) )^2 ds,

where I() is the indicator function. If s is fixed and Y dichotomous, CRPS is called the Brier score. We can look at CRPS for each i, or averaged over the set. There are various ways to calculate CRPS, depending on the model assumed. Here we'll just use straight-up numerical approximation.
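
As a check on the notation, here is the fixed-s (Brier score) special case in code, using the event "Price no higher than $35,000" and the with-nox model's predictions at the old X (a sketch):


s0  = 35
F35 = colMeans(p.s <= s0)                    # F_i(35) = Pr(Price <= 35 | X_i D_n M)
o35 = as.numeric(BostonHousing$medv <= s0)   # I(35 >= Y_i)
mean((F35 - o35)^2)                          # Brier score at s = 35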

Why not use RMSE or MAD or R^2? Because those scores are not proper. I'll write more about this elsewhere, but the idea is that they first compress the probability (and all our predictions are probabilities) into a point, throwing away information in the prediction and leading to suboptimal scores. Proper scores have a large literature behind them---and you already know which award-eligible book writes about them!

Let's calculate, for both models with and without nox, the CRPS for each old observation.


crps   = numeric(nrow(BostonHousing))   # with-nox model
crps.n = numeric(nrow(BostonHousing))   # no-nox model
for (j in 1:nrow(BostonHousing)){
  # ECDF of the observation: a step from 0 to 1 at the observed price
  sf = stepfun(BostonHousing$medv[j], c(0,1))

  P = ecdf(p.s[,j])
  crps[j] = sum((P$ecdf - sf(P$vals))^2)/length(P$ecdf)

  P = ecdf(p.s.n[,j])
  crps.n[j] = sum((P$ecdf - sf(P$vals))^2)/length(P$ecdf)
}

plot(BostonHousing$medv,crps, xlab="Price ($1,000s)", ylab="CRPS (with nox)",ylim=c(0,.35))


The stepfun creates a function which we use to compute the value of the ECDF of the observed price, which we need in our approximation of CRPS. Note, too, it uses the already computed posterior predictions, p.s. The crps is for the nox model; and crps.n is for the non-nox model. The plot is of the individual CRPS at the values of prices.
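
The average scores, which feed the overall skill number below, are easy to get (your values will differ slightly from mine because of the MCMC draws):


mean(crps)     # average CRPS, with-nox model
mean(crps.n)   # average CRPS, no-nox model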

See that "floor" in the CRPS values? Looks like CRPS can go so low, but no lower. This turns out to be true. Given a model form, we can calculate an expected CRPS, and then bust up that expectation into pieces, each representing part of the score. One part is the inherent variability of the Y itself. Meaning, given a model, we have to accept the model will only be so good, that uncertainty will remain. There is much more to this, but we'll delay discussion for another day.

We could also do this:


plot(crps.n,crps, ylab="CRPS (with nox)",  xlab="CRPS (no nox)")
  abline(0,1)

I won't show it, because it's hard to see, by eye, whether the nox model is doing better. But you can look on your own.

Enter the idea of skill---an old invention of the weather forecasters. Again, you know where you can read about it.

Skill scores K have the form:

    K(F,G,Y) = ( S(G,Y) - S(F,Y) ) / S(G,Y),

for some score function S(.,Y) (like CRPS), where F is the prediction from what is thought to be the superior or more complex model and G the prediction from the inferior or simpler one. Skill is always relative. Since the best possible score is S(F,Y) = 0, and given the normalization, a perfect skill score has K = 1. Skill exists if and only if K > 0; otherwise it is absent. Skill, like proper scores, can be computed as an average or at individual points.

Models without skill should not be used!

Why not? Because the simpler model beats them! If you don't like CRPS, then you should, always should, use the score that reflects the cost-loss of the decisions you will make using the model. As always, a model may be useful to one man and useless to another.

Here is the skill plot:


plot(BostonHousing$nox,(crps.n-crps)/crps.n, ylab="Skill",  xlab="nox")
  abline(h=0,lty=2,col=2)

Every point below 0 marks a case where the non-nox model did better: negative skill means the with-nox model's CRPS was larger, and lower CRPS is better.

The overall average skill score was K = -0.011, indicating the more complicated model (with nox) does not have skill over the less complicated model. This means, as described above, that if the CRPS represents the actual cost-loss score of a decision maker using this model, the prediction is that in future data, the simpler model will outperform the more complex one.
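
An overall number can be computed either way shown below (a sketch; the -0.011 is from my run, and both versions agree on the sign, which is what matters):


(mean(crps.n) - mean(crps)) / mean(crps.n)   # skill of the averaged scores
mean((crps.n - crps) / crps.n)               # average of the pointwise skills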

We don't actually know how the model will perform on future data; we can only guess. So model scores on old data are themselves predictions of future performance. How good those predictions are is an open research question; i.e. nobody knows.

Whether this insufficiency in the model is due to the probability leakage, or to the CRPS not being the best score for this situation, remains to be seen.

We have thus moved from delightful results, as indicated by p-values, to more sobering results when testing the model against reality---recalling again that this is itself only a guess of how the model will actually perform on future data: nox is not useful. And since the possibility of over-fitting is always with us, future skill measures would likely be even worse than those seen on the old data.

Most of this post is extracted from an upcoming (already accepted) paper, which will be posted when it is published. This post is only a bare sketch.

2 Comments

  1. Bill Raynor

    Matt,

    Would you care to comment on Transductive Inferences? From Wikipedia:
    “In logic, statistical inference, and supervised learning, transduction or transductive inference is reasoning from observed, specific (training) cases to specific (test) cases. In contrast, induction is reasoning from observed training cases to general rules, which are then applied to the test cases. The distinction is most interesting in cases where the predictions of the transductive model are not achievable by any inductive model. Note that this is caused by transductive inference on different test sets producing mutually inconsistent predictions.

    Transduction was introduced by Vladimir Vapnik in the 1990s, motivated by his view that transduction is preferable to induction since, according to him, induction requires solving a more general problem (inferring a function) before solving a more specific problem ”

  2. Eric S

    Hi Matt,
    Little late posting on this, but in the final section regarding skill, wouldn’t everything below 0 indicate that the nox model did better? Looking at how it was determined, (crps.n-crps) would mean that anytime a number is negative, crps is larger, and since crps includes nox, that would mean it did better than the non-nox model.

    Or am I reading it incorrectly?
