How To Do Predictive Statistics: Part II New (Free) Software Regression 1

How To Do Predictive Statistics: Part II New (Free) Software Regression 1

Mandatory! Read Part I. I will ignore all comments already answered in Part I.

Download the code: mcmc.pred.R, mcmc.pred.examples.R. Update If you downloaded before, download again. This is version 0.2!

You must have installed MCMpack. Get it like this:

install.packages('MCMCpack', dependencies=TRUE)

Once installed, every time you start R (if you didn’t save your session), you must load it:

library(MCMCpack)

We’re going to need some data, and though we’re sick of it (I am, anyway), we’ll use the college grade point data, which is the CGPA at the end of the first year for 100 or so kids, along with their SAT (old style) scores, high school GPA, and letter of recommendation strength. We want to peg the probability of certain CGPAs given this information and on the ad hoc assumption we can model uncertainty of CGPA with normal distributions. That will turn out to be, in many cases, a lousy approximation, because CGPA can only live on [0,4], but normals give probability to everything, resulting, as we’ll see, is massive probability leakage. We’ll later fix this problem using multinomial and tobit regression.

I expect everybody reading this will redo the examples on their own favorite datasets. Do so, and report back here.

Read the data in and start the model, using all defaults. You are responsible for figuring out how to use MCMCpack on your own. All we need know is that we’re after this:

     (1) Pr ( CGPA in s | new HGPA & SAT, old data, Model & assumptions),

where s is a set of values of CGPA of interest. Such as “greater than 3” or “equal to 4”, or whatever, and new or assumed values of HGPA and SAT. I don’t care immediately about letters of recommendation.

So

x <- read.csv("https://www.wmbriggs.com/public/cgpa.csv") fit = MCMCregress(cgpa~hgpa*sat,data=x)

Notice we went for the exciting interaction term! Next download and source this code MCMC.pred.R:

# PC
path = 'C:/MyDrive/MyTopFolder/MySubFolder/'
# Linux/MAC
path = '/home/MyName/MyTopFolder/MySubFolder/'

source(paste0(path,'mcmc.pred.R'))

You can use ls() to see all the new code. The bit we will use is:

# Version 0.1
MCMCregress.pred <- function(fit,x){
   # fit from MCMCregress
   # x data.frame in same form as data entered into MCMCregress
   n = length(fit[1,])
   a=formula(attr(fit,'call'))
   y=model.matrix(a,x)
   r = as.numeric(y %*% t(fit[,-n]))
   b = rnorm(length(r)*20,r,sqrt(fit[,n]))
   return(b)
}

This takes the model object fit from MCMCregress, and single new data in the form of a data.frame, and gives us eq. (1). We'll do multiple values later. The structure of the model, at least for MCMCregress, includes the formula, so we don't have to pass it. The row of fit consists in estimates of the model parameters, the last one of which is the spread (or "standard deviation") parameter of the normal distribution.

We grab the model form in the next two lines, which is why we (so far) pass in a data.frame, and then calculate the estimates of the central (or "mean") parameter of the normal distribution of CGPA. We then feed these "mu"s and "sigma"s into the last part of the simulation, which gives us estimates of (1). Estimates of the observable, that is, and not any parameter. We pass all these out.

Notice the "20" in the penultimate line, which is the number of repetitions: notice it is hard coded: notice that this is bad practice: notice how little you are paying. This will be fixed in future versions when the wrappers are written. What's that? What did I mean by wrappers? Ah, so you didn't read Part I. Ah, well.

Update: due to my boneheaded coding error, I used the variance and not standard deviation in the normal simulation. I fixed above and in on-line code, but the results below I did NOT fix yet. I am on the road and will get to it. The code below is right, but the results off.

Let's try:

# Replace x[6,] with any new data that has SAME variables as in model
# the outcome (here cgpa) can be ANY value and is ignored, but must be there. E.g.
# newdata = data.frame(hgpa = 1, sat = 600, recomm = 1, cgpa = 4)
p = MCMCregress.pred(fit,x[6,])

I picked the sixth observation of x to pass in because this is the first number my finger hit. The old data happens to have the proper structure, which the notes say is mandatory. A limitation is that you have to pass in a value of the observable, too, even though it's ignored. That's because the model matrix needs it. That can be fixed in time.

Turns out, for x[6,], hgpa = 0.33, sat = 756, so that p contains the approximation of

     (2) Pr ( CGPA in s | HGPA = 0.33 & SAT = 756, D,M),

where we haven't yet defined s.

Try

# single newdata analysis
hist(p,200)
quantile(p,c(0.05,.5,.95))
mean(p)

I got

> hist(p,200)
> quantile(p,c(0.05,.5,.95))
5% 50% 95%
0.1673601 0.8802050 1.5948456
> mean(p)
[1] 0.8807344

Your results will vary, because this is a simulation approximation. But they shouldn't vary much. If they do, we could boost that "20", or boost mcmc in MCMCregress, or both.

Notice the picture shows modest probability leakage, i.e. positive probability of impossible values, in this case less than 0. Vary hgpa and sat and see if you can find more.

The quantiles show that, for instance, the probability of seeing NEW CGPAs greater than 1.59 are 5%, given these assumed values of HGPA, SAT, and the old data, model and assumptions. So, in this case, s is "greater than 1.59".

Did I mention this was the (conditional) probability of seeing NEW values of CGPA? We do not need models to tell us what happened! If we want to know how many OLD values of CGPA were greater than 1.59, or whatever, we just look. Models are only for telling us about the unknown, not the known.

I cannot stress too highly that this 5% is not "the" probability. There is never, not ever, a case where there is "the" probability. This is the conditional probability as laid out in (2). This true understanding is the biggest departure between the old and predictive ways of modeling. For why it is true, see Uncertainty.

Change anything on the right hand side, and the probability changes. One of the things on the right hand side is the simulation assumptions (and prior and all other model assumptions). That's why, when you run a new simulation, you change the right hand side, and thus you change the probability!

The changes might not be very big---but they are always there (until the trump of doom sounds at Infinity); at least, as long as what is changing is probative to Y. If information on the right hand side is irrelevant to Y, then adding or removing it does nothing. This, after all, is the definition of irrelevancy.

We will use that notion of relevancy next time when we explore what SAT and HGPA, and even letters of recommendation strength do to the model.

4 Comments

  1. Tim Simmons

    In the calculation of b, should it be sqrt(fit[, n]), since MCMCregress gives sigma2 and rnorm expects SD?

  2. Briggs

    Tim,

    Yes! Boneheaded move on my part. Will fix. Thanks.

  3. Instead of P(Y) or P(Y|X) etc., I am now going to use P(Y|X,I’m wearing a red shirt, I ate lunch 5 minutes ago, my name is Justin, it just rained, my dog’s name is Frances, I’m typing these variables, today is Wednesday, …) just to be sure I get the correct scientific probability. I mean, if I change any of those things, clearly the probability will change.

    How come I don’t need to list or pay attention to many or any of those variables to have evidence for P(Heads) ~ .5 when I flip a coin a gazillion times?

    Justin

Leave a Reply

Your email address will not be published. Required fields are marked *