Class - Applied Statistics

How To Do Predictive Statistics: Part VI New (Free) Software Poisson Regression

Previous post in the series (or click above on Class). REVIEW!

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.22! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: Uncertainty.

We’re downloading data from another gent’s site, so you have to be on-line for today’s lesson.


# https://stats.idre.ucla.edu/r/dae/poisson-regression/
x <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
x <- within(x, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic", 
                                                     "Vocational"))
  id <- factor(id)
})
summary(x)

It's fictional data: "the outcome variable [num_awards] and indicates the number of awards earned by students at a high school in a year, math is a continuous predictor variable and represents students' scores on their math final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled."

Number of awards can be 0, 1, 2, ..., where that "ellipsis" means awards can go up to infinity, which is exactly the hope of Millennials. In reality, of course, awards cap out. How many are possible depends on the situation. About which here we now knowing, since it's all made up anyway.

Poisson distributions allow infinity. Thus they will always be approximations to reality, which does not have infinities. How much an approximation this is in any case, and how loose the approximation, again depends on the situation. Really, an ordered multinomial-type situation is what always pertains. Number of anything falls into progressive buckets from zero to some fixed number.

But here we shall pretend, as everybody does, that the approximation is better than reality. What's the Deadly Sin of Reification among friends?

We already know how to do this. Lucky for us, MCMCpack creators did the work for us as they did with ordinary regression, but not for multinomial regression.

fit = MCMCpoisson(num_awards ~ prog*math,data=x)

We now (and you reviewed so you already know this) move to finding probabilities of scenarios; i.e. of finding

     Pr (Y | new X, old X&Y, Model & Assumptions)

where here Y = "0 awards" or "1 award", etc., X is the program and math score, and the model, including its assumptions about priors and so forth, we already discussed.

As ever, we have built-in scenarios in the old data. Hence

# single new data
p = MCMCpoisson.pred(fit,x[1,])
plot(p,xlab='num_awards', ylab='Pr(num_awards|D,M)')

I won't here show the plot, but it gives a spike at 0, with estimated probability of about 0.87 , a probability of 0.12 for 1 award, about 0.01 for 2 awards, and close-enough-for-most-folks 0 for 3 and 4 awards. This is for an (as typing x[1,] in R shows) a person in a NEW person in a Vocational program and math score of 41.

It is for a NEW person because, as you remember from your reviewing, we do not need probability models to tell us what happened. We just look!

What if we wanted to compare the probability for a person with the same math score but in an Academic vocation? Like this, which is a real pain in the keister:

w = x[12,] # where I hunted for an x that had Academic
w$math = 41
p = MCMCpoisson.pred(fit,w)
plot(p,xlab='num_awards', ylab='Pr(num_awards|D,M)')

I get probability of 0.77 for 0 awards, 0.20 for 1, 0.03 for 2, and near-enough-0 for the rest.

But why can't we do this?

w = x[1,]
w$prog = 'Academic'
p = MCMCpoisson.pred(fit,w)

Because R strips the class of factor from w$prog and turns it into a character. Recall whatever we pass into the prediction functions has to be the same kind of data.frame that went in. The variable prog went in as a factor with 3 levels, and not a character. You might try this:

w = x[1,]
w$prog = 'Academic'
levels(w$prog) = levels(x$prog)

But that turns w$prog to General! I do not know why R strips the class from a factor variable when there is only one row upon substitutions, but it does. For instance, we can do this:

w = x[1:2,]
w$prog[1] = 'Academic'
w = w[-2,]

That works fine: summary(w) shows w$prog is still a three level factor.

So here is the solution:

w = x[1,]
w$prog[1] = 'Academic'
p = MCMCpoisson.pred(fit,w)

That [1] on the second line does the trick! Like I said, pain in the keister.

Okay, let's do our old trick of using all the old data as scenarios. First let's get the probabilities.


# more complicated, because y is open ended; a guess has to be made in 
# number of cols of q; if subscript out of bounds, increase size
q = matrix(0,nrow(x),16)
for(i in 1:nrow(x)){
  a = MCMCpoisson.pred(fit,x[i,])
  q[i,1:length(a)]=a
}

We do not know the limit of the number of awards, but we have to store them. See that "16"? Try putting 8. It will break because some scenarios generate positive (defined as above machine limit) probability for number of awards larger than 7 (recall 0 awards takes a column in q). Of course, you could always put some huge number in place of 16 and never have a problem. Until you want to make pretty plots. As you see next.


par(mfrow=c(2,2))
for (i in levels(x$prog)){
  j = which(x$prog == i)
  plot(0:(dim(q)[2]-1),q[1,],xlab='num_awards', ylab='Pr(num_awards|D,M)', main=i,ylim=c(min(q),max(q)), xlim=c(0,8), type='l')
  for(k in 2:nrow(x)){
    lines(0:(dim(q)[2]-1),q[j[k],])
  }  
}

You can see we limit here the number of awards probabilities to stop at 8 (in the xlim). That obviously has to be changed depending on the application.

Every scenario results in a distribution, probabilities for number of awards from 0 to 8 (max shown). In the scenarios everybody has a program, which are only three, hence three plots. But everybody also has a math score, which obviously varies. That variability does not do too much to change the probability of awards for General or Vocational programs NEW people. But it must for Academic.

The next trick would be to make a plot of only Academic NEW people, where perhaps you color the distributions so that higher math scores are a brighter and brighter red (or whatever color scheme you like).

Math scores goes from 33 to 75. Another fun thing would be to put these scores into discrete decision buckets (which you recall from your review we must always do), and then make the same plots for each of these decision buckets. How? Use the cut function. Do maybe


x$math.cut = cut(x$math, breaks = c(0,45, 52, 59, 80))

which chunks it into the observed quartiles, Do this only if these make sensible decision points. Then you need only change the code above to reflect the new plot, e.g.

...
for (i in levels(x$math.cut)){
j = which(x$math.cut == i)
...

I did it and it worked fine, and showed something interesting. That's your homework!

Next time tobit regression.

4 replies »

  1. No need to convert the numerical id into a factor object. It is not used in the model, and cannot be used in the model as a factor variable either. Try it to see what kind of results you would get.

    Not sure why you’d hunt for an x in the old data and then use it to make a new data point, the following is an cleaner, more efficient alternative.
    w$prog = as.factor(‘Academic’)
    If you have a new data point, since the model contains the variables prog and math, simply use

    newdata=data.frame(prog=’Academic’, math=41)
    MCMCpoisson.pred(fit,newdata)

    if the function MCMCpoisson.predit is written in the same way how predictions are programmed in other regression packages, be it for Bayesian or classical method. Yes, classical methods make predictions too.

  2. … math is a continuous predictor variable

    Since every decision, like every measure, is discrete and finite in act, so, why didn’t discretized/bucketized the math variable in your analyses here?

Leave a Reply

Your email address will not be published. Required fields are marked *