Class - Applied Statistics

How To Do Predictive Statistics: Part IV New (Free) Software Logistic Regression

Previous post in the series (or click above on Class).

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.21! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: Uncertainty.

We’ve done ordinary regression to death. So it’s time we move to unordinary but still humble regression, i.e. logistic regression, which, like all models, aims for this:

     Pr(Y | new X, old X&Y, Model & assumptions)

where in this case Y is a yes/no, up/down, 0/1 proposition. Like whether Y = “A baby is born with (what doctors say is) low birth weight.”

R has a built-in dataset to play with, invoked with


data(birthwt)
x=birthwt
x$race = as.factor(x$race)

The first line brings it up, the second assigns it to x only for the sake of consistency of all the other code we have been writing, and the last turns the numbers 1, 2, 3 into a factor. You can read all about the dataset by typing ?birthwt.

As an ever-important reminder, the model we build will not be causal. It will be probabilistic. We are not saying, for instance, smoking causes low birth weight, but we do aim to check the conditional probability of low birth weight given a mother smokes or not. Conditional on, as the equation above says, the old data, new guesses of the data, the model itself, and all the other assumptions that went into the mix. Assuming all that is true, we get the probability.

Don’t forget to re-source the code, if you don’t have it in memory (with relevant path, if necessary).

source('mcmc.pred.R')

Fit the classic parameterized model:

fit = MCMClogit(low~age+race+smoke, data=x)

Why just these measures and not the others also included? No reason at all, except that the help for MCMClogit had just these. By all means, try out the others!

We next form a scenario, the “new X”, so that we can calculate the probability. A minor limitation is that the scenario has to be an R data.frame matching the input measures. A limitation is NOT that it has to have all input measures. You (in this case I) said all those measures must be specified to get the probability, so every time you want the probability, you have to specify values of all measures.

Perplexed which scenario to try, since few of us are low birth weight experts? The old data can be used as if it were new. For instance (mixing code and output):

w = x[1,]
w
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523

Ignore the “85”, it is a line number produced by whoever sorted the original data so that all low birth weights came first. It is assigned to w because of the limitation I mentioned of passing in the kind of data.frame, which includes knowledge of all factor levels of the relevant measures. We could also have passed in x[1,]. If we use this scenario we want to calculate

     Pr(low | age = 19, race=2, smoke=0, D, M & A)

where D = the old data. The answer is (drumroll please):

p = MCMClogit.pred(fit,w)
p

I get 0.34. Since this is an approximation to a complicated integral, your answer might differ slightly.

What if we wanted the model-old-data-conditional probability of low birth weight given a 19-year-old mother of race 2 but who smoked? This:

w$smoke = 1
p = MCMClogit.pred(fit,w)
p

I get 0.61, about double. Yours may differ slightly. So, given everything we punched into the model (the priors etc. included), the chance a 19-y-o race = 2 woman smokes doubles, in our estimation, of having a low birth weight baby. Any individual woman like that either will or won’t have a low birth weight baby, a state caused by various things. We are not measuring cause, only probability.

What about 20-y-o’s? 21-y-o’s? Higher? What about race = 1 and race = 3? The only way to find out is to do the calculations.

Again we can create lots of scenarios using the old data, which has the virtue of containing scenarios actually found in nature. But it has the disadvantage of only having scenarios actually measured. But we can use it to get some idea of what to expect. For instance:


# all the old x
p = NA
for(i in 1:nrow(x)){
  p[i] = MCMClogit.pred(fit,x[i,])
}

plot(x$age,p,col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1)
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')

The legend is a hack because of author laziness, but you get the idea. If we wanted pretty pictures we’d be using ggplot2, but then I’d have to spend too much time explaining that code, and we’d miss the important bits.

The idea is that each old measure was taken as a new scenario. The model-etc.-conditional probabilities were estimated, and the result plotted by the measures.

This is the model-estimated-probability! That’s why everything looks so smooth. It says, more or less, that race = 1 non-smokers have the lowest probability, race = 1 smokers and race = 2, 3 non-smokers are the same in probability, and race = 2,3 smokers have the highest probability. And that regardless of race or smoking older women have smaller chances.

For people who are like the gathered data.

About those people I know nothing except what is printed in the help file. Whether these women are like women all over the world in all countries in all time periods of all economic and spiritual backgrounds and so on and so forth, I have no clue. But I doubt it.

Given all those necessary and hard-to-over-emphasize caveats, it is well to recall these are the probabilities of low birth weight babies given whatever conditions we specify. For individual mothers. What if you were a hospital administrator making a guess of the number of low birth weight babies in the coming year, these (perhaps) costing more for whatever reasons? Then you’d have to have another layer of modeling on top of our simple one, which includes patient-type scenarios. How many race = 1, 2, 3 and how many smokers, what ages. And how many patients!

We could do that easily for single-measure patients. Like when w = x[1,]. Then the model is binomial-like. After specifying the number of patients. If you don’t know that, you have to model it, too, as just mentioned.

I’ll leave that for homework. Meanwhile, we are done with logistic regression. It’s as easy as it looks. Next time we start multinomial regression!

2 replies »

Leave a Reply

Your email address will not be published. Required fields are marked *