
Free Probability-Statistics Class: Predictive Case Study 1, Part IX

Review!

Last time we successfully computed our model (8):

    (8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,4.

The model we selected was an ad hoc multinomial probit regression, with default prior selected by the MNP package. If we’re happy with that model, we’re done except for exploring other scenarios of SAT and HGPA. We can easily generate all possible combinations of decisionable values.

w1 = seq(400,1500,100) # sat 400 - 1500, every 100
w2 = seq(0,4,1) # hgpa 0 - 4, every 1
y = as.data.frame(expand.grid(w1,w2)) # R loves data frame
y$cgpa = "4" # this automatically creates the right number of cgpas
names(y) = c('sat','hgpa','cgpa') # these have to match spelling as in x!

a=predict(fit, newdata = y, type='prob')$p # takes a couple of seconds to run; assumes fit is in memory!
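A quick look at the scenario grid never hurts (my aside, assuming the code above has just been run): it should have length(w1)*length(w2) = 60 rows, one row per combination.

dim(y)  # 60 rows (12 SAT values x 5 HGPA values), 3 columns
head(y) # sat varies fastest, hgpa changes in blocks of 12, cgpa is fixed at "4"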

I’ve decided the decisionable values of SAT and HGPA would be every 100 points and every 1 point, but you could easily change these in the seq() function. There’s no easy way to visualize all of this at once, since we have 2 measurements taking 60 combinations (at the values I chose), each with probabilities in 5 dimensions (CGPA = 0-4). But we can do things like this:

contour(w2,w1,matrix(a[,3],length(w2),length(w1),byrow=TRUE), xlab='HGPA',ylab='SAT', main='Pr(CGPA=2|E)') # w2 (HGPA) is the x-axis, w1 (SAT) the y-axis

# remember a[,1] Pr(CGPA=0|evidence), a[,2] Pr(CGPA=1|evidence), etc.

# Or for something fancier and colorful, try this
#install.packages('plotly', dependencies=TRUE) # if you don't have it; downloads many packages!
library(plotly)

plot_ly(x=w2,y=w1,z=matrix(a[,3],length(w1),length(w2)),type='contour') %>%
  layout(title = 'Pr(CGPA=2|E)',
         xaxis = list(title = 'HGPA'),
         yaxis = list(title = 'SAT'))

Note: Cut and paste everything from plot_ly on down.
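If you want all five CGPA levels at once instead of one at a time, here is a rough base-R sketch (not part of the original recipe; it assumes a, w1, w2 and x are still in memory):

par(mfrow = c(2,3)) # five small panels on one device
for (i in 1:length(levels(x$cgpa))){
  contour(w2, w1, matrix(a[,i], length(w2), length(w1), byrow=TRUE),
    xlab='HGPA', ylab='SAT',
    main=paste0('Pr(CGPA=', levels(x$cgpa)[i], '|E)'))
}
par(mfrow = c(1,1)) # reset the plotting device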

These plot the predicted probabilities of CGPA as contours by the values of SAT and HGPA. As with everything, whether these plots are of any value depends on the decisions you will make. They are probably way too specific, especially given the evidence that the numerical approximations are only “close enough.” Our knowledge of that numerical approximation is part of our model; I mean, part of the right hand side of (8). Keep that in mind!

One decision is whether SAT or HGPA are relevant, in the presence of the other. If you cannot recall the definition of relevant — review! Let’s check!

fit.null <- mnp(cgpa ~ 1, data=x, burnin = 1000, n.draws=50000)
fit.a <- mnp(cgpa ~ sat, data=x, burnin = 2000, n.draws=50000)
fit.b <- mnp(cgpa ~ hgpa, data=x, burnin = 1000, n.draws=50000)

The fit.null is the model (5) we did many lessons ago, or the form of it as a probit model with all the numerical approximation baggage. It is simply the predictive model not taking into account anything but past observations, grading rules, and the model and approximations. SAT and HGPA should be relevant, or else fit.null is good enough.
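As a rough sanity check (my addition; it assumes x is the original data frame the models were fit to), the predictions from fit.null should land near the observed relative frequencies of CGPA in the old observations:

table(x$cgpa)/dim(x)[1] # observed relative frequency of each CGPA level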

# we'll reuse the same y as above; measurements not used in the model aren't used in the prediction
a.null=predict(fit.null, newdata = y, type='prob')$p
a.a=predict(fit.a, newdata = y, type='prob')$p
a.b=predict(fit.b, newdata = y, type='prob')$p

Recalling y has 60 rows, here are the first 10 results for the null model:

> a.null
               0         1         2         3          4
 [1,] 0.05869388 0.1568980 0.5764082 0.1632245 0.04477551
 [2,] 0.06053061 0.1583878 0.5749388 0.1609796 0.04516327
 [3,] 0.05902041 0.1564694 0.5805918 0.1601020 0.04381633
 [4,] 0.06016327 0.1583469 0.5750408 0.1605510 0.04589796
 [5,] 0.06004082 0.1562857 0.5766327 0.1621020 0.04493878
 [6,] 0.05904082 0.1574694 0.5773265 0.1610612 0.04510204
 [7,] 0.05961224 0.1603469 0.5763673 0.1597347 0.04393878
 [8,] 0.05844898 0.1585918 0.5771224 0.1618367 0.04400000
 [9,] 0.06110204 0.1580816 0.5772041 0.1587347 0.04487755
[10,] 0.06006122 0.1548571 0.5791224 0.1607755 0.04518367
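In theory every row of a.null should be identical, since nothing in the null model distinguishes the 60 scenarios. To put a number on how far off we are (my addition, using the full 60-row a.null before any averaging):

apply(a.null, 2, sd)    # within-column standard deviation; would be 0 if the approximation were exact
apply(a.null, 2, range) # smallest and largest value in each column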

This shows you the limits of the numerical approximation technique. All of the numbers in each column should be the same, since this is the predicted probability of CGPA unconditioned on SAT and HGPA. They're not wildly variable in the context of most decisions I can think of, but the variability is there. We could average each column and reduce some of the approximation error (why this is so I won't prove, but it should be intuitive enough). Mixing code and output:

a.null = colMeans(a.null)

> a.null
         0          1          2          3          4 
0.05942313 0.15791837 0.57627857 0.16167925 0.04470068 

So, according to (8a), the probability of CGPA = 2 is highest at 58%. Column 1 of y repeats the sequence w1 length(w2) times. It does this because we were lazy and didn't want to create a new y for every model, content to use the original y with all possible (decisionable) combinations of SAT and HGPA. If you look at the predictions from fit.a in each of the blocks of SAT from 400 to 1500, you'll again see variability where none should ideally be, since that model only uses SAT. We can average the blocks to reduce that variability if we want to be clever. This is getting tedious because we could have created a new y for each model, but on the other hand, we would not have seen the variability of the numerical approximation. Thus our laziness has given us (theoretically, anyway) superior guesses.

# hack hack hack
# hack, but it should work whatever levels cgpa has, and whatever breakdowns of w1 and w2
# hack hack hack

b.1 = matrix(1:dim(y)[1],length(w1),length(levels(x$cgpa)))
b.2 = b.1 *0
for(i in 1:length(w2)){
  b.2 = b.2 + a.a[((i-1)*length(w1)+1):((i-1)*length(w1)+length(w1)), ]
}
b.2 = b.2/length(w2)
a.a = b.2 # replacement happens here!

b.1 = matrix(1:dim(y)[1],length(w2),length(levels(x$cgpa)))
b.2 = b.1 *0
# careful: sat varies fastest in y, so the rows sharing one value of hgpa are every length(w1)-th row
for(i in 1:length(w1)){
  b.2 = b.2 + a.b[seq(i, dim(y)[1], by = length(w1)), ]
}
b.2 = b.2/length(w1)
a.b = b.2 # replacement happens here!
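A quick sanity check on the hack (my addition): after the averaging, a.a should have one row per SAT value and a.b one row per HGPA value, each with one column per level of CGPA.

dim(a.a) # length(w1) by length(levels(x$cgpa)), i.e. 12 x 5
dim(a.b) # length(w2) by length(levels(x$cgpa)), i.e. 5 x 5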

We use length(levels(x$cgpa)) so often that it would be better to create a variable for it and use it throughout the code. I didn't. You do it as homework. As I said at the outset, this code is not optimized. It is written to be plain and simple to follow. Now let's check the relevance of SAT!

plot(w1,a.a[,1],type='b',col = 1, xlab='SAT', ylab='Pr')
for (i in 2:length(levels(x$cgpa))){
 lines(w1,a.a[,i],type='b',col = i)
}
points(rep(w1[1],length(levels(x$cgpa))), a.null,pch=15, col=1:length(levels(x$cgpa)),cex=2 )
legend('topright',levels(x$cgpa),lty=1,col=1:length(levels(x$cgpa)), bty='n')

This shows the predicted probabilities of each level of CGPA for the decisionable levels of SAT we picked. It also shows, at the far left in colored boxes corresponding to the colored lines, the predicted probabilities of the null model. The box probabilities differ from the line probabilities, so SAT is relevant. It is also in the causal chain, as we decided earlier. So this model, call it (8b), is different from (8a), the null model. (Note I do not use "null" in quite the same way as classical statistics.) The last check is whether the probabilities are different enough that our decisions would change. I say they are, because I am the decision maker. You might say, with your decisions in mind, the differences are not big enough. There is no universal magic number!
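If you want a number to go with the picture (my addition; it is only a summary, not part of the definition of relevance), here is the largest change in each CGPA probability across the decisionable SAT values, relative to the null model:

round(apply(abs(sweep(a.a, 2, a.null)), 2, max), 3) # max |Pr(CGPA=i|SAT scenario,E) - Pr(CGPA=i|E)| for each level i

The same line works for HGPA by swapping in a.b.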

Now HGPA:

plot(w2,a.b[,1],type='b',col = 1, xlab='HGPA', ylab='Pr',ylim=c(0,.6))
for (i in 2:length(levels(x$cgpa))){
  lines(w2,a.b[,i],type='b',col = i)
}
points(rep(w2[1],length(levels(x$cgpa))), a.null,pch=15, col=1:length(levels(x$cgpa)),cex=2 )
legend('topright',levels(x$cgpa),lty=1,col=1:length(levels(x$cgpa)), bty='n')

Notice I altered the default y-axis limits. The line probabilities are different from the boxed ones. So HGPA is relevant. But they aren't terribly different. Regardless of the value of HGPA the probabilities for CGPA are mostly flat. The only "big" discrepancy (where "big" relates to a decision) is the probability of CGPA = 2. That difference alone is decision-worthy, to coin a phrase. But it makes me wonder how useful HGPA is with SAT in the model. Back to the contour plots!

Which we'll do next time.

Homework: See if you can figure out how to make the 2-D relevance plots. Remember the one above is only at one level of CGPA. Is HGPA still relevant? Yes. But would you, putting yourself in the mind of a Dean, find it useful for decisions?

4 replies

  1. I own your book, follow you on Twitter, check this site daily and have but one wish: that you’d make your R code easier to implement. Use the Force Briggs. Use the Force and slay the cut ‘n paste Death Star.

  2. EO,

    Thanks! And there’s so little code at this point it scarcely matters. And don’t forget the class code file will be continuously updated. This code is for pedantic purposes, not application development. I’ll be making it downloadable.

  3. Again, according to your definition of relevance, a definition based on observations not theory, there is no need to run any statistical analysis to see whether a variable is relevant. Shoe size or hair color or whatever variable would always be relevant. It is a redundant question.

    Think of the difference between theory and reality, statistics-wise.

    Forget the fact that you have tampered with the data, please read the following for the difference between multinomial probit (MNP) model and ordinal MNP model, and conclude whether the model of YOUR likeness is optimal.
    https://cran.r-project.org/web/packages/MNP/vignettes/MNP.pdf

    The last check is whether the probabilities are different enough that our decisions would change. I say they are, because I am the decision maker. You might say, with your decisions in mind, the differences are not big enough. There is no universal magic number!

    Is this good or bad? For your discretized data, one can easily use a contingency table (aka, a cross tabulation or crosstab) to obtain similar results. One would ask why bother hiring a statistician.

  4. “to your liking” not “of your likeness.” ???

    English is a wonderful language. Prepositions and suffixes.
