
How To Do Predictive Statistics: Part IX Stan — Logistic & Beta Regression

Review!

We’re doing logistic and beta regression this time. These aren’t far apart, because the observable for both lives between 0 and 1: for logistic it is 0 or 1; for beta, it is any fraction or ratio on (0,1), but not a probability. We don’t model probability; we use probability to model.

That brings up another blunt point. In these demonstrations I do not care much about the models themselves. I’m skipping over all the nice adjustments, tweaks, careful considerations, and other in-depth details about modeling. Most of that stuff is certainly of use, if tainted by the false belief, shared by both frequentists and Bayesians, that probability exists.

I am not here to teach you how to create the best model for this or that kind of observable. I am not here to teach you best coding practices. I am here to teach you the philosophy of uncertainty. Everything takes an economy class nonrefundable ticket seat to that. Because that’s true, it’s more than likely that the code shortcuts and model crudities I’ve left in will be readily apparent to experts in these areas. You’re welcome to put corrective tips in the comments.

On with the show!

Logistic

Let’s use, as we did before, the MASS package dataset birthwt.


library(MCMCpack)
library(rstanarm)

x=MASS::birthwt # we always store in x for downstream ease
x$race = as.factor(x$race)

fit.m =  MCMClogit(low ~ smoke + age + race + ptl + ht + ftv, data=x)
fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)

Last time we didn’t put all those other measures in; this time we do. Notice that we specify the data in both methods: rstanarm needs this, and it’s good practice anyway.



# This is a copy-paste from the MCMClogit lesson; only changing to p.m.
# (MCMClogit.pred() was written in that lesson and must still be in memory.)
p.m = rep(NA, nrow(x))
for(i in 1:nrow(x)){
  p.m[i] = MCMClogit.pred(fit.m, x[i,])
}

plot(x$age,p.m,col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='MCMC')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')

Save that plot. Recall that the three races and two smoking states are given different colors and plotting characters. There is more to each scenario than just these measures, as the model statements show. But this is a start.



# This is new
p.s = posterior_predict(fit.s)
plot(x$age,colMeans(p.s),col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='Stan')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')


Notice in the plot we have to do colMeans(p.s) to get the probability estimates—this is the tweak I mentioned last time. That’s because p.s contains nothing but 189 columns (same as original data length) of 0s and 1s. Remember these are predictions! We take the average of the predictions, at each scenario, to get the probability estimate.
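
If you want to see that for yourself, here is a quick peek (a sketch I am adding, not part of the original recipe):

# A peek at what posterior_predict() hands back for a binomial model (my addition).
dim(p.s)         # number of posterior draws (4000 by default) x 189 scenarios
p.s[1:5, 1:5]    # nothing but 0s and 1s: simulated low-weight outcomes
mean(p.s[,1])    # fraction of 1s = probability estimate for the first scenario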

Stare and see the differences in the plots. We can’t put them all on one plot so easily here. While there are some differences, my guess is that no one would make any different decision based on them.
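
If you do want them on one picture, here is a quick sketch (my addition, with my axis labels): plot one set of probability estimates against the other and look for departures from the diagonal.

# MCMC vs Stan probability estimates for the same scenarios (a sketch).
# Points hugging the diagonal mean the two fits lead to the same decisions.
plot(p.m, colMeans(p.s), xlab='MCMClogit Pr(low wt)', ylab='Stan Pr(low wt)')
  abline(0,1)
grid()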

For homework, you can use rstanarm to check for measure relevancy and importance. Recalling, as you do so, that these are conditional concepts! As all probability is.

What’s that? You don’t remember how to do that? You did review, right? Sigh. Here’s one way, checking the measure ptl, the number of previous premature labours.



# The commented-out line below we already ran; fit.s is still in memory
#fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)
fit.s.2 = stan_glm (low ~ smoke + age + race +  ht + ftv, family=binomial, data=x)

p.s = posterior_predict(fit.s)
p.s.2 = posterior_predict(fit.s.2)

a.1 = colMeans(p.s) # this one we had not yet saved under a name, so run it
a.2 = colMeans(p.s.2)

plot(a.1,a.2, xlab='Full model', ylab='Sans ptl', main='Pr(low wt|old data, Ms)')
  abline(0,1)

Obviously ptl is relevant in the face of all these other measures. Would it be if we excluded the others? Or with new observations? You check. That’s an order: you check. Would you say, as a decision maker interested in predicting low birth weight, that the probabilities change enough such that different decisions would be made using the different models? If so, then ptl is important, and should be kept in this model; i.e. in this form of the model with all the other measures in, too. If not, then it should be chucked.
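
Here is one crude way to do that checking, a sketch of my own (the 0.1 cutoff is arbitrary; pick one that matters to your decision):

# How much do the predicted probabilities move when ptl is dropped? (my sketch)
summary(abs(a.1 - a.2))      # distribution of absolute changes in Pr(low wt)
sum(abs(a.1 - a.2) > 0.1)    # scenarios shifting by more than 0.1 (arbitrary cutoff)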

There is no longer any such thing as hypothesis testing, or model building without reference to the decisions to be made using the model. Beyond raw relevancy, which is trivial to see, model building cannot be independent of decision making.

Beta regression

We can’t do comparisons here, because only rstanarm has this kind of model. It’s for observables living on (0,1), things like ratios, fractions, and the like. The idea (which you can look up elsewhere) is that uncertainty in the observable y is characterized with a beta distribution. These have two parameters, as normal distributions do. Unlike with ordinary regressions on normals, here we can model linear functions of both (suitably transformed) parameters. Something similar is done with normals in things like GARCH time series, but it’s not usual for regular regressions.

For beta regression, one or both parameters is transformed (usually a logit for the mean and a log or identity for the precision), and each transformed parameter is equated to a linear function of measures. The linear function does not have to be the same for both parameters.

In the end, although there are all kinds of considerations about the kinds of transforms and linear functions, we are really interested in predictions of the observable, and its uncertainty. Meaning, I’m going to use the defaults on all these models, and will leave you to investigate how to make changes. Why? Because no matter what changes you make to the parameterizations, the predictions about observables remain the same. And that is really all that counts. We are predictivists!

We last time loaded the betareg package. We’re not going to use it except to steal some of its data (the rstanarm package doesn’t have any suitable).

library(betareg)
data('GasolineYield', package='betareg')
x = GasolineYield

attr(x$batch, "contrasts") <- NULL # odd line

?GasolineYield # to investigate the data

We're interested in yield: "proportion of crude oil converted to gasoline after distillation and fractionation" and its uncertainty relating to temperature and experimental batch. There are other measures, and you can play with these on your own.

The "odd line" removes the "contrasts" put there by the data collectors, and which are used to create contrasts in the parameters; say, checking whether the parameter for batch 2 was different than for batch 10, or whatever. We never care about these. If we want to know the difference in uncertainty in the observable for different batches, we just look. Be cautious in using canned examples, because these kinds of things hide. I didn't notice it at first and was flummoxed by some screwy results in some generated scenarios. People aren't coding these things with predictions in mind.
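
To see what was hiding, you can inspect the factor's attributes directly; a quick check I am adding here:

# Look at the factor before and after the "odd line" (my addition).
attributes(GasolineYield$batch)   # shows the "contrasts" attribute, if it is there
attributes(x$batch)               # after NULLing it, only levels and class should remain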


fit = stan_betareg(yield ~ batch + temp | temp, data=x)
p = predictive_interval(fit)


The first transformed parameter (the mean, on its link scale) is a linear function of batch and temp. The second---the precision, everything to the right of "|"---is a linear function of temperature. This second part was added because, the example makers say, the model with a lone precision parameter wasn't adequate.
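
If you do want to change the defaults, stan_betareg takes a link argument for the mean and a link.phi argument for the precision. A sketch (fit.links is my name): if, as I believe, the defaults here are a logit mean link and a log precision link, this refit should give essentially the same predictions as fit.

# Same model with the link functions spelled out (a sketch; fit.links is my name).
fit.links = stan_betareg(yield ~ batch + temp | temp, data=x,
                         link='logit', link.phi='log')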

How do we check adequacy? Right: only one way. Checking the model's predictions against new observables never before used in any way. Do we have those in this case? No, we do not. So can we adequately check this model? No, we cannot. Then how can we know the model we design will work well in practice? We do not know.

And does everything just said apply to every model everywhere for all time? Even by those models released by experts who are loaded to their uvulas with grant money? Yes: yes, it does. Is vast, vast, inconceivable over-certainty produced when people just fit models and release notes on the fit as if these notes (parameter estimates etc.) are adequate for judging model goodness? Yes, it is. Then why do people do this?

Because it is easy and hardly anybody knows better.

With those true, sobering, and probably to-be-ignored important words, let's continue.


plot(x$temp,p[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p)){
   lines(c(x$temp[i],x$temp[i]),c(p[i,1],p[i,2]))
   text(x$temp[i],mean(c(p[i,1],p[i,2])), as.character(x$batch[i]))
}
grid()


This picture heads up today's post.

We went right for the (90%) predictive intervals, because these are easy to see, and plotted each batch at each temperature. It depends on the batch, but it looks like, as temperature increases, we have some confidence (and I do not mean this word in its frequentist sense) that yield increases.
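
Ninety percent is only the default, by the way; predictive_interval takes a prob argument if your decision calls for a different width. A one-liner sketch (p.50 is my name):

p.50 = predictive_interval(fit, prob=0.5)   # 50% predictive intervals (my addition)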

Let's do our own scenario, batch 1 at increasing temperatures.


y = data.frame(batch="1",temp=seq(200,450,20))
p.y = predictive_interval(fit,newdata=y)
plot(y$temp,p.y[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p.y)){
   lines(c(y$temp[i],y$temp[i]),c(p.y[i,1],p.y[i,2]))
}


This is where I noticed the screwiness. With the contrasts I wasn't getting results that matched the original data, when, of course, if I make up a scenario that is identical to the original data, the predictions should be the same. This took me a good hour to track down, because I failed to even (at first) think about contrasts. Nobody bats a thousand.

Let's do our own contrast. Would a decision maker do anything different regarding batches 7 and 8?


y = data.frame(batch="7",temp=seq(200,450,20))
  p.y.7 = predictive_interval(fit,newdata=y)
y = data.frame(batch="8",temp=seq(200,450,20))
  p.y.8 = predictive_interval(fit,newdata=y)

plot(p.y.7[,1],p.y.8[,1],type='b')
  lines(p.y.7[,2],p.y.8[,2],type='b',col=2)
  abline(0,1)


This only checks the 90% interval. If the decision maker has different important points (say, yields greater than 0.4, or whatever), we'd use those. Different decision makers would do different things. A good model to one decision maker can be a lousy one to a second!

Keep repeating these things to yourself.
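
For instance, here is a sketch (mine, with my names y7, pp7, and so on) of how a decision maker who cares about yields greater than 0.4 might check, using the raw posterior_predict draws instead of the canned interval:

# Pr(yield > 0.4) for batches 7 and 8 at each temperature scenario (my sketch).
y7 = data.frame(batch="7", temp=seq(200,450,20))
y8 = data.frame(batch="8", temp=seq(200,450,20))
pp7 = posterior_predict(fit, newdata=y7)
pp8 = posterior_predict(fit, newdata=y8)
data.frame(temp=y7$temp,
           pr7=colMeans(pp7 > 0.4),   # fraction of draws exceeding the cutpoint
           pr8=colMeans(pp8 > 0.4))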

Batch 7 gives slightly higher bounds, both the 5% and the 95%, on the yields. How much? Mixing code and output:


> p.y.7/p.y.8
         5%      95%
1  1.131608 1.083384
2  1.034032 1.067870
3  1.062169 1.054514
4  1.129055 1.101601
5  1.081880 1.034637
6  1.062632 1.061596
7  1.068189 1.065227
8  1.063752 1.048760
9  1.052784 1.048021
10 1.036181 1.033333
11 1.054342 1.028127
12 1.026944 1.042885
13 1.062791 1.037957


Say 3%-8% higher. That difference enough to make a difference? Not to me. But what the heck do I know about yields like this. Answer: not much. I am a statistician and am incompetent to answer the question---as is each statistician who attempts to answer it with, God help us, a p-value.

At this point I'd ask the client: keep these batches separate? If he says "Nah; I need 20% or more of a difference to make a difference", then relevel the batch measure:


levels(x$batch) = c('1','2','3','4','5','6','7-8','7-8','9','10')

Then rerun the model and check everything again.
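
A sketch of that re-run (fit.2 and p.y.78 are my names), checking the pooled batch the same way as before:

# Refit with batches 7 and 8 pooled, then look at the pooled scenario (my sketch).
fit.2 = stan_betareg(yield ~ batch + temp | temp, data=x)
y = data.frame(batch="7-8", temp=seq(200,450,20))
p.y.78 = predictive_interval(fit.2, newdata=y)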
