Class - Applied Statistics

How To Do Predictive Statistics: Part VII New (Free) Software Tobit Regression

Previous post in the series (or click above on Class). REVIEW!

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.3! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: Uncertainty.

Let’s return to the despised CGPA data. I know, I know, but it’s a perfect way to demonstrate tobit regression, which is exactly like ordinary regression except the Y can be censored above or below.

We saw with ordinary regression that there was substantial probability leakage in many scenarios because CGPA can only live on 0 to 4. We had large probabilities for CGPAs both greater than 4 and less than 0. Well, why not disallow it, censoring at these levels?

Start by reading the data back in:

x <- read.csv("https://www.wmbriggs.com/public/cgpa.csv")

As with multinomial we have to track the model formula, a limitation of the MCMCpack.


# cgpa can only exist between 0 and 4; the ordinary regression
# above exhibits strong probability leakage
form = formula('cgpa~hgpa*sat')
fit = MCMCtobit(form, below=0,above=4,data=x, mcmc=30000)

You can see the places for censoring, and also that I upped the MCMC samples a bit. Just to show how it's done.

Okay, now to scenarios? why? Surely you reviewed! Because we want as always:

     Pr (Y | new X, old X&Y, Model & Assumptions)

where here Y = "CGPA = c", X is the HGPA and SAT scores and the model, including its assumptions about priors and so forth, we already discussed.


p=MCMCtobit.pred(fit,form,x[6,],0,4)
hist(p,300)
quantile(p,c(0.05,.5,.95))

I'm not showing the histogram here. You do it.

If you saw an error about truncnorm you will need to install that package with its dependencies, like this:

install.packages('truncnorm', dependencies = TRUE)

Why x[6,] and not x[1,] like we always do? No reason whatsoever. I think I aimed for 1 and hit 6, and was too lazy to change it. Anyway, I got for the quantile function


       5%       50%       95% 
0.2269157 0.8828441 1.6033544 


So there's a 90% chance a NEW person like x[6,] will have a CGPA between 0.23 and 1.6. Your results will differ slightly, as usual. If they differ a lot, increase that mcmc = 30000 in the model fit function.

The 6 person had a SAT of 756 and HGPA of 0.33. What if we wanted probabilities for a fixed scenario? As we've done before


w = x[6,]
w$sat = 1e6
p=MCMCtobit.pred(fit,form,w,0,4)
hist(p,300)
quantile(p,c(0.05,.5,.95))

Gives basically a 100% of CGPA = 4. Everybody who thinks this is the wrong answer, raise their your hands. One, two, ... Hmm. You all fail.

This is the right answer. Even before with probability leakage the model gave the right answer. Unless you have a typo in your code, models always give the right answers! That's because all models are conditional on only the information you provide, which includes information on the model structure. Here three is NO information about limits of SAT scores, only limits on CGPA. So we can specify whatever we like for SAT, even a million.

Outside of this model we know it's BS. But that makes our model assumptions wrong, not our model.

Now that that's understood finally and for all time, let's do our trick of looking at all the scenarios in the data.


# Has to be a loop! array, marray, larray, and etc. all convert
# the data.frame to a matrix or a unusable list! 
# R has no utility to step through a data.frame while mainting class
# all the old x
q = matrix(0,nrow(x),3)
for(i in 1:nrow(x)){
  q[i,]=quantile(MCMCtobit.pred(fit,form,x[i,]),c(0.1,.5,.9))
}

We discussed before why the loop is necessary. You reviewed, from the beginning, so I know you already know. So there was no reason for me to mention it here.

I went with a 80% predictive interval and not 90% as above. Why not? 90% is harsh. Most people for many decisions will be happy with 80% (conditional) certainty. Feel free to change to---AS YOU MUST---for your decisions.

Look at the answers:


plot(x$sat,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$sat[i],x$sat[i]),c(q[i,1],q[i,3]),col=3)
}

# or

plot(x$hgpa,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$hgpa[i],x$hgpa[i]),c(q[i,1],q[i,3]),col=3)
}

Same as we did for ordinary regression---which you remember, since you reviewed. So I won't show those here. You do them.

How about scenarios for probabilities of CGPA greater than 3? Same as before:


# all the old x
q = NA
g = 3
for(i in 1:nrow(x)){
  p = MCMCtobit.pred(fit,form,x[i,])
  q[i]= sum(p>=g)/length(p)
}

# Then

plot(x$sat,q,ylab='Pr(CGPA>g|old data,M)')

# Or

plot(x$hgpa,q,ylab='Pr(CGPA>g|old data,M)')

You can color the dots using hgpa, or put them into decision buckets, or whatever. You can redo the 3D plots, too.

We've reached the end of the pre-packaged MCMpack extensions! Next up, JAGS.

4 replies »

  1. I see Matt sailing through all these analytical tools, one by one, crafting (for free) nuts-and-bolts ways anybody can try his hand at genuine predictive statistics, not having to understand the intricacies of the code or the mathematics at all, pretty much just by hitting ‘Enter’, and sometimes my jaw just drops.

    On a silver platter, he’s handing the possibility of doing predictive statistics (if, by some miracle, they would ever want to) to all those people doing whatever-the-@#$% they think they’re doing in SPSS or whatever.

    And now, just for completeness, and because he can, he’s starting on things I’ve never even heard of (but are apparently used in some fields). JAGS? Wasn’t that a TV show?

    Once again, Matt: congratulations, applause, and thank you.

  2. JohnK,

    I think JAGS means Just Another Gibbs Sampler and, yup, it was a TV show.

    I keep seeing “Tobit” as “two-bit” at least a quarter of the time (* ahem *) which it is not.

  3. I feel I need to know what MCMC is doing and how it does it, at least in outline. Otherwise I’m just clicking the button on the stats package and believing the results, a much-criticized approach.

    Anybody have any good links for that? The package documentation assumes you know and Wikipedia makes no effort at clarity.

Leave a Reply

Your email address will not be published. Required fields are marked *