

October 3, 2018

How To Do Predictive Statistics: Part IX Stan — Logistic & Beta Regression

Review!

We’re doing logistic and beta regression this time. These aren’t far apart, because the observable for both lives between 0 and 1: for logistic it is 0 or 1; for beta, any fraction or ratio (but not a probability) on (0,1) works. We don’t model probability; we use probability to model.

That brings up another blunt point. In these demonstrations I do not care much about the models themselves. I’m skipping over all the nice adjustments, tweaks, careful considerations, and other in-depth details about modeling. Most of that stuff is certainly of use, if tainted by the false belief, shared by both frequentists and Bayesians, that probability exists.

I am not here to teach you how to create the best model for this or that kind of observable. I am not here to teach you best coding practices. I am here to teach you the philosophy of uncertainty. Everything takes an economy class nonrefundable ticket seat to that. Because that’s true, it’s more than likely missed code shortcuts and model crudities will be readily apparent to experts in these areas. You’re welcome to put corrective tips in the comments.

On with the show!

Logistic

Let’s use, as we did before, the MASS package dataset birthwt.


library(MCMCpack)
library(rstanarm)

x=MASS::birthwt # we always store in x for downstream ease
x$race = as.factor(x$race)

fit.m =  MCMClogit(low ~ smoke + age + race + ptl + ht + ftv, data=x)
fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)

Last time we didn’t put all those other measures in; this time we do. Notice that we specify the data in both methods: rstanarm needs this, and it’s good practice anyway.



# This is a copy-paste from the MCMClogit lesson; only changing to p.m
p.m = NA
for(i in 1:nrow(x)){
  p.m[i] = MCMClogit.pred(fit.m,x[i,])
}

plot(x$age,p.m,col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='MCMC')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')

Save that plot. Recall that the three races and two smoking states are given different colors and plotting characters. There is more to each scenario than just these measures, as the model statements show. But this is a start.



# This is new
p.s = posterior_predict(fit.s)
plot(x$age,colMeans(p.s),col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='Stan')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')


Notice in the plot we have to do colMeans(p.s) to get the probability estimates—this is the tweak I mentioned last time. That’s because p.s contains nothing but 0s and 1s: 189 columns (same as the original data length), one per scenario. Remember these are predictions! We take the average of the predictions, at each scenario, to get the probability estimate.
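
To convince yourself the averaging does what I claim, check one scenario by hand. A minimal sketch, assuming fit.s and p.s are still in memory:

dim(p.s)          # 4000 draws (the default) by 189 scenarios
mean(p.s[,1])     # the probability estimate for scenario 1...
colMeans(p.s)[1]  # ...which is what colMeans gives, column by column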

Stare and see the differences in the plots. We can’t put them all on one so easily here. While there are some differences, my guess is that no one would make any different decision based on them.

For homework, you can use rstanarm to check for measure relevancy and importance. Recalling, as you do so, that these are conditional concepts! As all probability is.

What’s that? You don’t remember how to do that? You did review, right? Sigh. Here’s one way of checking the measure ptl, the number of previous premature labours.



# The commented out lines we already ran; they're in memory
#fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)
fit.s.2 = stan_glm (low ~ smoke + age + race +  ht + ftv, family=binomial, data=x)

p.s = posterior_predict(fit.s)
p.s.2 = posterior_predict(fit.s.2)

a.1 = colMeans(p.s)
a.2 = colMeans(p.s.2)

plot(a.1,a.2, xlab='Full model', ylab='Sans ptl', main='Pr(low wt|old data, Ms)')
  abline(0,1)

Obviously ptl is relevant in the face of all these other measures. Would it still be, excluding the others? Or with new observations? You check. That’s an order: you check. Would you say, as a decision maker interested in predicting low birth weight, that the probabilities change enough such that different decisions would be made using the different models? If so, then ptl is important, and should be kept in this model; i.e. in this form of the model with all the other measures in, too. If not, then it should be chucked.
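
You can also run the same comparison at a made-up scenario instead of the old data. A hedged sketch; the ptl value of 3 is invented for illustration, and any decision-relevant value will do:

y = x[1,]        # borrow a row so the data.frame structure is right
y$ptl = 3        # an invented number of previous premature labours
p.full = posterior_predict(fit.s, newdata=y)
p.sans = posterior_predict(fit.s.2, newdata=y)
mean(p.full)     # Pr(low wt | this scenario, full model)
mean(p.sans)     # Pr(low wt | this scenario, model sans ptl)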

There is no longer any such thing as hypothesis testing, or model building without reference to the decisions to be made using the model. Beyond raw relevancy, which is trivial to see, model building cannot be independent of decision making.

Beta regression

We can’t do comparisons here, because only rstanarm has this kind of model. It’s for observables living on (0,1), things like ratios, fractions, and the like. The idea (which you can look up elsewhere) is that uncertainty in the observable y is characterized with a beta distribution. These have two parameters—normal distributions do, too. Unlike with normals, here we can model linear functions of both (suitably transformed) parameters. This is done with normals, too, in things like GARCH time series. But it’s not usual for regular regressions.

For beta regression, one or both parameters is transformed (by logging or identity, usually), and this is equated to a linear function of measures. The linear function does not have to be the same for both transformed parameters.
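
For concreteness, the parameterization most beta-regression software works with (and which, as far as I can tell, stan_betareg follows with its default links) writes the beta in terms of a mean mu, put through a logit link, and a precision phi, put through a log link. A quick made-up-numbers check that the mean really is mu:

# y ~ Beta(mu*phi, (1-mu)*phi): mean = mu, larger phi = less spread
mu = 0.3; phi = 10                      # invented values
draws = rbeta(1e5, mu*phi, (1-mu)*phi)
mean(draws)                             # close to 0.3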

In the end, although there are all kinds of considerations about the kinds of transforms and linear functions, we are really interested in predictions of the observable, and its uncertainty. Meaning, I’m going to use the defaults on all these models, and will leave you to investigate how to make changes. Why? Because no matter what changes you make to the parameterizations, the predictions about observables remain the same. And that is really all that counts. We are predictivists!

Last time we loaded the betareg package. We’re not going to use it except to steal some of its data (the rstanarm package doesn’t have any suitable data).

library(betareg)
data('GasolineYield', package='betareg')
x = GasolineYield

attr(x$batch, "contrasts") <- NULL # odd line

?GasolineYield # to investigate the data

We're interested in yield: "proportion of crude oil converted to gasoline after distillation and fractionation" and its uncertainty relating to temperature and experimental batch. There are other measures, and you can play with these on your own.

The "odd line" removes the "contrasts" put there by the data collectors, and which are used to create contrasts in the parameters; say, checking whether the parameter for batch 2 was different than for batch 10, or whatever. We never care about these. If we want to know the difference in uncertainty in the observable for different batches, we just look. Be cautious in using canned examples, because these kinds of things hide. I didn't notice it at first and was flummoxed by some screwy results in some generated scenarios. People aren't coding these things with predictions in mind.


fit = stan_betareg(yield ~ batch + temp | temp, data=x)
p = predictive_interval(fit)


The first transformed parameter is a linear function of batch and temp. The second---everything to the right of "|"---is a linear function of temperature. This was added because, the example makers say, the lone parameter model wasn't adequate.

How do we check adequacy? Right: only one way. Checking the model's predictions against new observables never before used in any way. Do we have those in this case? No, we do not. So can we adequately check this model? No, we cannot. Then how can we know the model we design will work well in practice? We do not know.

And does everything just said apply to every model everywhere for all time? Even by those models released by experts who are loaded to their uvulas with grant money? Yes: yes, it does. Is vast, vast, inconceivable over-certainty produced when people just fit models and release notes on the fit as if these notes (parameter estimates etc.) are adequate for judging model goodness? Yes, it is. Then why do people do this?

Because it is easy and hardly anybody knows better.

With those true, sobering, and probably to-be-ignored important words, let's continue.


plot(x$temp,p[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p)){
   lines(c(x$temp[i],x$temp[i]),c(p[i,1],p[i,2]))
   text(x$temp[i],mean(c(p[i,1],p[i,2])), as.character(x$batch[i]))
}
grid()


This picture heads up today's post.

We went right for the (90%) predictive intervals, because these are easy to see, and plotted up each batch at each temperature. It depends on the batch, but it looks like, as temperature increases, we have some confidence (and I do not mean this word in its frequentist sense) that yield increases.

Let's do our own scenario, batch 1 at increasing temperatures.


y = data.frame(batch="1",temp=seq(200,450,20))
p.y = predictive_interval(fit,newdata=y)
plot(y$temp,p.y[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p.y)){
   lines(c(y$temp[i],y$temp[i]),c(p.y[i,1],p.y[i,2]))
}


This is where I noticed the screwiness. With the contrasts I wasn't getting results that matched the original data, when, of course, if I make up a scenario that is identical to the original data, the predictions should be the same. This took me a good hour to track down, because I failed to even (at first) think about contrasts. Nobody bats a thousand.

Let's do our own contrast. Would a decision maker do anything different regarding batches 7 and 8?


y = data.frame(batch="7",temp=seq(200,450,20))
  p.y.7 = predictive_interval(fit,newdata=y)
y = data.frame(batch="8",temp=seq(200,450,20))
  p.y.8 = predictive_interval(fit,newdata=y)

plot(p.y.7[,1],p.y.8[,1],type='b')
  lines(p.y.7[,2],p.y.8[,2],type='b',col=2)
  abline(0,1)


This only checks the 90% interval. If the decision maker has different important points (say, yields greater than 0.4, or whatever), we'd use those. Different decision makers would do different things. A good model to one decision maker can be a lousy one to a second!
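
For example, if the decision point were yields greater than 0.4, work with the full posterior predictions rather than the canned interval. A hedged sketch; the 0.4 threshold is invented:

g = 0.4    # the decision maker's threshold, not mine
y7 = data.frame(batch="7", temp=seq(200,450,20))
y8 = data.frame(batch="8", temp=seq(200,450,20))
pp7 = posterior_predict(fit, newdata=y7)
pp8 = posterior_predict(fit, newdata=y8)
plot(y7$temp, colMeans(pp7 >= g), type='b', ylim=c(0,1),
   xlab='Temperature', ylab='Pr(yield >= g | new data, M)')
lines(y8$temp, colMeans(pp8 >= g), type='b', col=2)
legend('topleft', c('batch 7','batch 8'), col=1:2, lty=1, bty='n')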

Keep repeating these things to yourself.

Batch 7 gives slightly higher upper bounds on the yields. How much? Mixing code and output:


> p.y.7/p.y.8
         5%      95%
1  1.131608 1.083384
2  1.034032 1.067870
3  1.062169 1.054514
4  1.129055 1.101601
5  1.081880 1.034637
6  1.062632 1.061596
7  1.068189 1.065227
8  1.063752 1.048760
9  1.052784 1.048021
10 1.036181 1.033333
11 1.054342 1.028127
12 1.026944 1.042885
13 1.062791 1.037957


Say 3%-8% higher. That difference enough to make a difference? Not to me. But what the heck do I know about yields like this. Answer: not much. I am a statistician and am incompetent to answer the question---as is each statistician who attempts to answer it with, God help us, a p-value.

At this point I'd ask the client: keep these batches separate? If he says "Nah; I need 20% or more of a difference to make a difference", then relevel the batch measure:


levels(x$batch) = c('1','2','3','4','5','6','7-8','7-8','9','10')

Then rerun the model and check everything again.
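
Something like this, as a minimal sketch of the rerun:

fit.2 = stan_betareg(yield ~ batch + temp | temp, data=x)  # x now has the merged 7-8 level
p.2 = predictive_interval(fit.2)
# then redo the plots and the decision checks above with fit.2 and p.2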

September 19, 2018

How To Do Predictive Statistics: Part VIII — Starting Stan

Review! You must at least review the first lessons—all class material is on one page for ease. I’ll have more words about the mysticism of simulation, but I’ve said it all before and better in the previous links.

Last time we finished up with the MCMCpack extensions. I chose this for its ease and speed, and because it was mostly automated. Like I said, one reason for the lack of adoption of predictive methods is the lack of automatic, out-of-the-box software. Most people won’t or can’t code. It isn’t easy.

We’re now adding complexity. We’ll be able to handle more complicated models—with all the extra confusion that entails. Run these lines of code:


install.packages('rstanarm', dependencies=TRUE)
install.packages('betareg', dependencies=TRUE)

These are get-a-cup-of-coffee lines of code. If you’re on Linux (and who isn’t?), it will take a good while (run and go to lunch). You’ll see lots of warnings as the C++ code compiles. Be patient, even when it seems like it’s stuck. It isn’t.

Last time we talked about the possibility of JAGS, which is external software that R can hook to. It’s fine, but it comes with a heavy price. It takes forever to learn. Mistakes are easy. It can run any kind of model you can envision, which is a great benefit. But it can suck up inordinate time, which is a great detriment.

Instead of jumping into a canicular caldera of code, we’ll softly stroll over to Stan. Like JAGS, it uses MCMC type methods (but not everywhere). Meaning simulations. Meaning mystical thinking.

As I’ve written (many times!), simulations provide approximations to non-analytic integrals (we use integrals because of insisting on approximate models using infinitely valued, i.e., continuous parameters). As long as we keep in mind they are nothing but numerical methods to difficult problems, we’re fine. But as soon as we allow any metaphysical import to the simulation, like talking about “random” numbers, we are lost.

We simply will not progress as a field until we can discover closed-form analytical solutions to the problems we now give to simulations. As Jaynes said (and as I quote in my award-eligible book) “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.”

Amen amen amen and amen.

Even when we can—and it is difficult if you have been brought up in the frequentist or Bayesian faith—manage to think of simulations as nothing but numerical approximations, simulations suffer from another huge flaw. They are slow. Try the package brms, which is Bayesian Regression Models using Stan. Even the simplest regression takes minutes (because it exports the model to external C++ code, which first compiles then runs: ugh).

Very little attention (in statistics, anyway) has been paid to the problem of finding analytic approximations to non-analytic integrals. The applied math guys have done tons here, and it would be worth spending time to see the natural crossovers. I am only one guy with no students, no grants, no facilities (I am a reactionary and no Western university will hire me), no nothing except some spare time. So I can’t do this. It will be up to you, dear reader.

Enough of the necessary and, I pray to God, not futile rant. On with the show!

Stan

Has the code completed yet? First thing we’re going to do is run some comparisons of simple models, regression and logistic regression, using MCMCpack and rstanarm. This is only for the purposes of orientation, and to show how differences in simulation produce differences in answers. Next time we’ll try models not available in MCMCpack. I am so sick of the CGPA data I couldn’t use it again even if Cardinal Dolan promised to resign. So we’ll use some built-in datasets. Be sure to first download the latest versions of mcmc.pred.R, mcmc.pred.examples.R.

Regression

First do ?mtcars to learn all the exciting details about the MPGs of 32 cars. And recall the “.” after the tilde tells R to use all the measures in a dataset (except, of course, for the observable on the left-hand-side).


x = mtcars # only to make life smooth

fit.s = stan_glm(mpg ~ ., data = x, QR = TRUE)
fit.m = MCMCregress(mpg ~ ., data = x)

As ever, you can try summary(fit.s) etc. to look at the parameter summaries. Of which, as predictivists, we have zero interest. Strike that: make it epsilon interest. Because these are simulations, there are all kinds of niceties to attend to so that we know the approximation has “converged.” Things can be learned about this convergence by examining the posteriors of the parameters. However, we must always remind ourselves that the parameters are not real, they are not ontic. They are nothing but mathematical contrivances that make the model work. Where do parameters come from? All those who read Uncertainty know they are the result of taking finite discrete parameterless models to the limit—simply for the ease of approximation. Real finite discrete probability models have no parameters at all. They are natively predictive.

“Okay, Briggs. Suppose you’re right. Then why aren’t you using these super good native finite discrete models here, instead of these continuous-parameter based ones?”

Say, that’s a good question. Because I’ve only worked out one. Again, I have no students, etc. etc. As a mathematician, I make a great philosopher. If you are a student in need of a problem, boy howdy, have I got some good ones.

Anyway, we don’t care about the parameters, but the predictions. First MCMCpack, as we did before (if you can’t remember everything, review! you can lead a student to code, but you can’t make him learn):


# all the old x
q.m = NA
g = 25  # 25 mpg; why 25? why not?

for(i in 1:nrow(x)){
  p.m = MCMCregress.pred(fit.m,x[i,])
  q.m[i]= sum(p.m>=g)/length(p.m)
}

plot(x$disp,q.m,ylab='Pr(MPG>g|old data,M)',main='MCMCpack')


That plots the probability of MPG>25, for all future scenarios that—hey presto!—look just like the old data. These are predictions! Save this plot. Now let’s turn to rstanarm.



p.s = posterior_predict(fit.s)

q.s = NA
for(i in 1:ncol(p.s)){
  q.s[i]= sum(p.s[,i]>=g)/length(p.s[,i])
}

plot(x$disp,q.s,ylab='Pr(MPG>g|old data,M)',main='Stan')


The object p.s contains the simulated posterior predictions: 4,000 is the default, one column for each row in the newdata which, since it defaults to all the old data, is one column for each of the original 32 observations. A long-winded way to say dim(p.s) is 4,000 x 32. We have to work with those 4,000 posterior predictions for every scenario.

We calculate the probability of future scenarios having MPGs greater than 25 in exactly the same way in rstanarm as with MCMCregress.pred.

No need to be coy about the comparisons. We can put both predictions on one plot (and the plot I used to head the post).

plot(x$disp,q.m,ylab='Pr(MPG>g|old data,M)',main='MCMC vs. Stan')
  points(x$disp,q.s,pch=2,col=2)
  grid()
  legend('topright',c('MCMC','Stan'), pch=1:2, col=1:2, bty='n')

There are differences, but all are such that, I'm guessing, no decision would be changed using one model over the other, at least at this decision point, i.e. at MPG > 25.

You must always remember that it is the decision that counts in deciding between good and bad models, whether a measure (an 'x') goes into the model, etc.

The dataset has 9 (or whatever) other measures, like horsepower. You can play with those. rstanarm makes it easy to put in custom scenarios, which is done in pretty much the same way as before. For example:


y = x[1,] # take as scenario the first observation
y$hp = 400 # bump up the horsepower to some gigantic level
p.s = posterior_predict(fit.s, newdata=y)
hist(p.s,100,xlab='MPG')


That's the distribution of possible MPGs for the scenario y. Look at y so you know what it is. Change values to your taste and compare outputs. The histogram took advantage of the fact that the new data only had one row. If there is more than one row, you have to index p.s (i.e., p.s[,i]).

You'll see there is already probability leakage with this scenario, but not much; only about 1% (sum(p.s<0)/4000). Your homework is to discover whether, for reasonably configured cars, you can find scenarios where the leakage is substantial. If it exists. I don't know: I didn't check.
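
One way to start the homework is to loop the leakage calculation over every old-data scenario, then move on to your own invented cars. A minimal sketch:

p.s = posterior_predict(fit.s)          # back to the old-data scenarios
leak = colMeans(p.s < 0)                # proportion of predictions below 0 MPG, per scenario
round(sort(leak, decreasing=TRUE), 3)   # worst offenders first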

This means as it did before: the normal model is breaking down sometimes. Well, we always knew it was impossible that MPG was "normally distributed". Nothing in the world is "normally distributed". Nothing "has" a normal distribution, or any other kind of distribution. We only use these models to quantify our uncertainty in values of observables.

Before we quit, let's note a neat feature of rstanarm. Mixing code and output:


> p.s = predictive_interval(fit.s)
> p.s
                           5%      95%
Mazda RX4           17.029272 27.77522
Mazda RX4 Wag       16.849237 27.30586
Datsun 710          21.024821 31.37435
Hornet 4 Drive      15.768537 26.32141
Hornet Sportabout   12.460326 22.79026
Valiant             15.107833 25.50869
Duster 360           9.098461 19.97714
Merc 240D           17.095323 27.90923
Merc 230            18.379516 30.58885
Merc 280            13.046051 24.22801
Merc 280C           13.786114 24.79934
...


The predictive_interval() first calls posterior_predict() and then does a sort of quantile() on the columns of the output (of each scenario). The default is a 90% interval, as you can see. You can certainly write your own quantile() function if you want more than just the interval.
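
If, say, you wanted the 10%, 50%, and 90% points instead, a minimal sketch:

p.s = posterior_predict(fit.s)
q.custom = t(apply(p.s, 2, quantile, probs=c(.1,.5,.9)))
head(q.custom)   # one row per scenario, one column per requested quantile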

That's it! Next we quickly do logistic regression, which I'd skip, except that there is a small twist with rstanarm. Then we move to so-called beta regression. That's a model type MCMCpack doesn't have. On the other hand, rstanarm doesn't have my favorite multinomial regression. Proving only that no one package exists to do all you want.

August 24, 2018

How To Do Predictive Statistics: Part VII New (Free) Software Tobit Regression

Previous post in the series (or click above on Class). REVIEW!

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.3! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: Uncertainty.

Let’s return to the despised CGPA data. I know, I know, but it’s a perfect way to demonstrate tobit regression, which is exactly like ordinary regression except the Y can be censored above or below.

We saw with ordinary regression that there was substantial probability leakage in many scenarios because CGPA can only live on 0 to 4. We had large probabilities for CGPAs both greater than 4 and less than 0. Well, why not disallow it, censoring at these levels?

Start by reading the data back in:

x <- read.csv("http://wmbriggs.com/public/cgpa.csv")

As with multinomial, we have to track the model formula, a limitation of MCMCpack.


# cgpa can only exist between 0 and 4; the ordinary regression
# above exhibits strong probability leakage
form = formula('cgpa~hgpa*sat')
fit = MCMCtobit(form, below=0,above=4,data=x, mcmc=30000)

You can see the places for censoring, and also that I upped the MCMC samples a bit. Just to show how it's done.

Okay, now to scenarios. Why? Surely you reviewed! Because we want, as always:

     Pr (Y | new X, old X&Y, Model & Assumptions)

where here Y = "CGPA = c", X is the HGPA and SAT scores and the model, including its assumptions about priors and so forth, we already discussed.


p=MCMCtobit.pred(fit,form,x[6,],0,4)
hist(p,300)
quantile(p,c(0.05,.5,.95))

I'm not showing the histogram here. You do it.

If you saw an error about truncnorm you will need to install that package with its dependencies, like this:

install.packages('truncnorm', dependencies = TRUE)

Why x[6,] and not x[1,] like we always do? No reason whatsoever. I think I aimed for 1 and hit 6, and was too lazy to change it. Anyway, I got for the quantile function


       5%       50%       95% 
0.2269157 0.8828441 1.6033544 


So there's a 90% chance a NEW person like x[6,] will have a CGPA between 0.23 and 1.6. Your results will differ slightly, as usual. If they differ a lot, increase that mcmc = 30000 in the model fit function.

Person 6 had an SAT of 756 and an HGPA of 0.33. What if we wanted probabilities for a fixed scenario? As we've done before


w = x[6,]
w$sat = 1e6
p=MCMCtobit.pred(fit,form,w,0,4)
hist(p,300)
quantile(p,c(0.05,.5,.95))

Gives basically a 100% chance of CGPA = 4. Everybody who thinks this is the wrong answer, raise your hands. One, two, ... Hmm. You all fail.

This is the right answer. Even before, with probability leakage, the model gave the right answer. Unless you have a typo in your code, models always give the right answers! That's because all models are conditional on only the information you provide, which includes information on the model structure. Here there is NO information about limits of SAT scores, only limits on CGPA. So we can specify whatever we like for SAT, even a million.

Outside of this model we know it's BS. But that makes our model assumptions wrong, not our model.

Now that that's understood finally and for all time, let's do our trick of looking at all the scenarios in the data.


# Has to be a loop! array, marray, larray, etc. all convert
# the data.frame to a matrix or an unusable list! 
# R has no utility to step through a data.frame while maintaining class
# all the old x
q = matrix(0,nrow(x),3)
for(i in 1:nrow(x)){
  q[i,]=quantile(MCMCtobit.pred(fit,form,x[i,]),c(0.1,.5,.9))
}

We discussed before why the loop is necessary. You reviewed, from the beginning, so I know you already know. So there was no reason for me to mention it here.

I went with an 80% predictive interval and not 90% as above. Why not? 90% is harsh. Most people for many decisions will be happy with 80% (conditional) certainty. Feel free to change it---AS YOU MUST---to fit your decisions.

Look at the answers:


plot(x$sat,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$sat[i],x$sat[i]),c(q[i,1],q[i,3]),col=3)
}

# or

plot(x$hgpa,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$hgpa[i],x$hgpa[i]),c(q[i,1],q[i,3]),col=3)
}

Same as we did for ordinary regression---which you remember, since you reviewed. So I won't show those here. You do them.

How about scenarios for probabilities of CGPA greater than 3? Same as before:


# all the old x
q = NA
g = 3
for(i in 1:nrow(x)){
  p = MCMCtobit.pred(fit,form,x[i,])
  q[i]= sum(p>=g)/length(p)
}

# Then

plot(x$sat,q,ylab='Pr(CGPA>g|old data,M)')

# Or

plot(x$hgpa,q,ylab='Pr(CGPA>g|old data,M)')

You can color the dots using hgpa, or put them into decision buckets, or whatever. You can redo the 3D plots, too.
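
For instance, a sketch coloring the dots by hgpa decision buckets; the break points here are invented, so pick ones that match real decisions:

hgpa.cut = cut(x$hgpa, breaks=c(0,1,2,3,4), include.lowest=TRUE)
plot(x$sat, q, col=as.numeric(hgpa.cut), pch=19, ylab='Pr(CGPA>g|old data,M)')
legend('topleft', levels(hgpa.cut), col=1:nlevels(hgpa.cut), pch=19, bty='n')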

We've reached the end of the pre-packaged MCMCpack extensions! Next up, JAGS.

August 16, 2018

How To Do Predictive Statistics: Part VI New (Free) Software Poisson Regression

Previous post in the series (or click above on Class). REVIEW!

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.22! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: Uncertainty.

We’re downloading data from another gent’s site, so you have to be on-line for today’s lesson.


# https://stats.idre.ucla.edu/r/dae/poisson-regression/
x <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
x <- within(x, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic", 
                                                     "Vocational"))
  id <- factor(id)
})
summary(x)

It's fictional data: "the outcome variable [num_awards] indicates the number of awards earned by students at a high school in a year, math is a continuous predictor variable and represents students' scores on their math final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled."

Number of awards can be 0, 1, 2, ..., where that "ellipsis" means awards can go up to infinity, which is exactly the hope of Millennials. In reality, of course, awards cap out. How many are possible depends on the situation. About which here we know nothing, since it's all made up anyway.

Poisson distributions allow infinity. Thus they will always be approximations to reality, which does not have infinities. How much an approximation this is in any case, and how loose the approximation, again depends on the situation. Really, an ordered multinomial-type situation is what always pertains. Number of anything falls into progressive buckets from zero to some fixed number.

But here we shall pretend, as everybody does, that the approximation is better than reality. What's the Deadly Sin of Reification among friends?

We already know how to do this. Lucky for us, the MCMCpack creators did the work for us, as they did with ordinary regression but not with multinomial regression.

fit = MCMCpoisson(num_awards ~ prog*math,data=x)

We now (and you reviewed so you already know this) move to finding probabilities of scenarios; i.e. of finding

     Pr (Y | new X, old X&Y, Model & Assumptions)

where here Y = "0 awards" or "1 award", etc., X is the program and math score, and the model, including its assumptions about priors and so forth, we already discussed.

As ever, we have built-in scenarios in the old data. Hence

# single new data
p = MCMCpoisson.pred(fit,x[1,])
plot(p,xlab='num_awards', ylab='Pr(num_awards|D,M)')

I won't show the plot here, but it gives a spike at 0, with estimated probability of about 0.87, a probability of 0.12 for 1 award, about 0.01 for 2 awards, and close-enough-for-most-folks 0 for 3 and 4 awards. This is (as typing x[1,] in R shows) for a NEW person in a Vocational program with a math score of 41.

It is for a NEW person because, as you remember from your reviewing, we do not need probability models to tell us what happened. We just look!

What if we wanted to compare the probability for a person with the same math score but in an Academic program? Like this, which is a real pain in the keister:

w = x[12,] # where I hunted for an x that had Academic
w$math = 41
p = MCMCpoisson.pred(fit,w)
plot(p,xlab='num_awards', ylab='Pr(num_awards|D,M)')

I get probability of 0.77 for 0 awards, 0.20 for 1, 0.03 for 2, and near-enough-0 for the rest.

But why can't we do this?

w = x[1,]
w$prog = 'Academic'
p = MCMCpoisson.pred(fit,w)

Because R strips the class of factor from w$prog and turns it into a character. Recall whatever we pass into the prediction functions has to be the same kind of data.frame that went in. The variable prog went in as a factor with 3 levels, and not a character. You might try this:

w = x[1,]
w$prog = 'Academic'
levels(w$prog) = levels(x$prog)

But that turns w$prog to General! I do not know why R strips the class from a factor variable when there is only one row upon substitutions, but it does. For instance, we can do this:

w = x[1:2,]
w$prog[1] = 'Academic'
w = w[-2,]

That works fine: summary(w) shows w$prog is still a three level factor.

So here is the solution:

w = x[1,]
w$prog[1] = 'Academic'
p = MCMCpoisson.pred(fit,w)

That [1] on the second line does the trick! Like I said, pain in the keister.
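
An alternative that sidesteps the gotcha, if you prefer (a sketch; it assumes MCMCpoisson.pred only cares that the factor levels match the original):

w = x[1,]
w$prog = factor('Academic', levels=levels(x$prog))  # build the factor explicitly
p = MCMCpoisson.pred(fit,w)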

Okay, let's do our old trick of using all the old data as scenarios. First let's get the probabilities.


# more complicated, because y is open ended; a guess has to be made in 
# number of cols of q; if subscript out of bounds, increase size
q = matrix(0,nrow(x),16)
for(i in 1:nrow(x)){
  a = MCMCpoisson.pred(fit,x[i,])
  q[i,1:length(a)]=a
}

We do not know the limit of the number of awards, but we have to store them. See that "16"? Try putting 8. It will break because some scenarios generate positive (defined as above machine limit) probability for number of awards larger than 7 (recall 0 awards takes a column in q). Of course, you could always put some huge number in place of 16 and never have a problem. Until you want to make pretty plots. As you see next.
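
Or, a sketch that sizes q automatically by collecting the predictions first and padding afterward:

a = vector('list', nrow(x))
for(i in 1:nrow(x)) a[[i]] = MCMCpoisson.pred(fit,x[i,])
nc = max(sapply(a, length))                  # longest prediction vector wins
q = matrix(0, nrow(x), nc)
for(i in 1:nrow(x)) q[i,1:length(a[[i]])] = a[[i]]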


par(mfrow=c(2,2))
for (i in levels(x$prog)){
  j = which(x$prog == i)
  plot(0:(dim(q)[2]-1),q[j[1],],xlab='num_awards', ylab='Pr(num_awards|D,M)', main=i,ylim=c(min(q),max(q)), xlim=c(0,8), type='l')
  for(k in 2:length(j)){
    lines(0:(dim(q)[2]-1),q[j[k],])
  }  
}

You can see we limit here the number of awards probabilities to stop at 8 (in the xlim). That obviously has to be changed depending on the application.

Every scenario results in a distribution: probabilities for number of awards from 0 to 8 (max shown). In the scenarios everybody has a program, of which there are only three, hence three plots. But everybody also has a math score, which obviously varies. That variability does not do too much to change the probability of awards for NEW people in the General or Vocational programs. But it must for Academic.

The next trick would be to make a plot of only Academic NEW people, where perhaps you color the distributions so that higher math scores are a brighter and brighter red (or whatever color scheme you like).
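
A minimal sketch of that plot, reusing q from above and a red ramp keyed to math score:

j = which(x$prog == 'Academic')
shade = (x$math[j] - min(x$math)) / diff(range(x$math))   # 0 = lowest score, 1 = highest
plot(0, 0, type='n', xlim=c(0,8), ylim=c(min(q),max(q)),
   xlab='num_awards', ylab='Pr(num_awards|D,M)', main='Academic')
for(k in seq_along(j)){
  lines(0:(dim(q)[2]-1), q[j[k],], col=rgb(shade[k],0,0))
}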

Math scores go from 33 to 75. Another fun thing would be to put these scores into discrete decision buckets (which you recall from your review we must always do), and then make the same plots for each of these decision buckets. How? Use the cut function. Do maybe


x$math.cut = cut(x$math, breaks = c(0,45, 52, 59, 80))

which chunks it into the observed quartiles. Do this only if these make sensible decision points. Then you need only change the code above to reflect the new plot, e.g.

...
for (i in levels(x$math.cut)){
j = which(x$math.cut == i)
...

I did it and it worked fine, and showed something interesting. That's your homework!

Next time tobit regression.