# Category: Class – Applied Statistics

February 7, 2018 | 1 Comment

## Free Probability-Statistics Class: Predictive Case Study 1, Part XI

Review!

We left off with comparing the standard, out-of-the-box linear regression with our multinomial predictive observable model. The great weaknesses of the regression were probability leakage (giving positive probability to impossible values) and that the normal gives densities and not probabilities. We can’t fix the leakage with this model (it’s a built-in shortcoming of this model), but we can generate probabilities.

Now the densities are predictions from the normal regression model, but they are not in a form that can be used. In order to create probabilities from densities, we need to make a decision. The densities are of course easily transformed into cumulative distributions, which are probabilities, but they will give positive probabilities to an infinity of results (all numbers along the continuum). We only care about our decision points, which for our fictional Dean are the five values 0, 1, 2, 3, 4.

The decision we need to make is in how to cut the infinity into 5 blocks. There is of course no unique way to do this. But it may be reasonable to cut the values at the midpoints of the five values. For example, given the regression, the decision probability of having a CGPA = 0 would be the regression probability of being between 0 and 0.05. For 1, it would be 0.5 to 1.5, and so on. That’s easy to do.

Mixing code and output in the obvious way:

```d=as.numeric(levels(x\$cgpa))
caps = d[-length(d)] + diff(d)/2
caps = c(min(d),caps,max(d))

# output
> caps
[1] 0.0 0.5 1.5 2.5 3.5 4.0
```

I used `caps` because R has a native `cut` function inconvenient here (it outputs factors, and we want numbers). Let’s next recreate the picture of the predictions for good and bad students. Then we’ll add on top of it the regression probabilities.

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')\$p #our mulitnomial moels

plot(levels(x\$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
lines(levels(x\$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
grid()
legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2)

s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])

m = s1\$central-s2\$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
s = sqrt(s1\$scale^2+s2\$scale^2)
w = seq(min(s1\$central,s2\$central)-3*s, max(s1\$central,s2\$central)+3*s, length.out = 100)
lines(w,.7*dt((w-s1\$central)/s1\$scale,s1\$df)/s1\$scale,lty=1,lwd=2)
lines(w,.7*dt((w-s2\$central)/s2\$scale,s2\$df)/s2\$scale,lty=2,col=2,lwd=2)

# regression probabilities at the caps
p = matrix(0,2,length(levels(x\$cgpa))) # storage
for (i in 1:2){
b=obs.glm(fit.n,y[i,])
d=pnorm(caps,b\$central,b\$scale) # the predictions
p[i,] = diff(d) # d[1] prob < 0; 1-(sum(diff(d))+d[1]) prob >4
lines(as.numeric(levels(x\$cgpa))+.05,p[i,],col=i+2,type='h',lty=i,lwd=3)
}

```

Green solid is for the bad student, regression model; blue dashed is for the good student, regression model. You can see the spikes follow, more or less, the densities. The black solid is our original multinomial model for the bad student, and the red dashed the good student.

Which model is better? There is no way to tell, not with this plot. They are simply two different predictions from two different models. The only way to gauge goodness is to—all together now—wait for new data which has never been seen or used in any way, and then compare both models’ predictions against what happened.

Nevertheless, we can gather clues and build another, different predictive model, which predicts how well our original models will perform. Be sure you understand the distinction! We have an original predictive model. Then we create a model on how well this original predictive model will perform. These are not the same models! This right here is another major departure of the predictive method over classical counterparts.

Let’s first look at the regression predictions a little more closely (mixing code and output):

```> p
[,1]        [,2]       [,3]        [,4]         [,5]
[1,] 2.152882e-01 0.568404758 0.12265228 0.002511169 4.024610e-06
[2,] 1.109019e-06 0.001036698 0.07554629 0.511387677 2.646423e-01

> rowSums(p)
[1] 0.9088604 0.8526141
```

The first row is the bad student (low SAT and HGPA) at each of the five CGPAs (here labeled by their matrix notation), and the second the good student (high SAT and HGPA). The `rowSums(p)` gives the total probability of all possibilities. This should be 1 (right?). It isn’t; it is less, and that is because of probability leakage.

You can see that leakage is conditional on the assumptions, just like probability. Leakage isn’t constant. (It also depends on how we define our caps/cut points.) We earlier guessed, looking only at the densities, that it wasn’t that bad. But it turns out to be pretty large. We’re missing about 10% probability from the first prediction, and about 15% from the second. These are the probabilities for impossible events.

The leakage does not mean the model is useless, but it is evidence the model is sub optimal. It also does not mean our multinomial model is better, but at least our multinomial model does not have leakage. The multinomial model, for instance, can give large probabilities to events that did not happen, while the leakage regression model gives decent probabilities to what happened. We’ll have to see.

And that’s what we’ll do next time. It’s too much to try and begin formal verification this week.

Homework Assume the Dean wants to do CGPAs at 0, 0.5, …, 3.5, and 4. Rerun the code from the beginning, and end with the plot seen above, complete with the regression model using the obvious cut points. Notice anything different?

January 31, 2018 | 4 Comments

## Free Probability-Statistics Class: Predictive Case Study 1, Part X

Review!

Last time we created four models of CGPA. Which is correct? They all are. Why? I should ask as a homework question, but I’ll remind us here. Since all probability is conditional, and these are all ad hoc models, all are correct, given the assumptions. Which one is best? Depends on the decisions you want to make and on what you mean by “best.” Let’s see.

We discovered last time that by itself HGPA was relevant to CGPA, but not by much. SAT was also by itself relevant. The contour plots (which I have decided not to redo, since we did them last week) showed that SAT and HGPA when considered together are also relevant. We also created the “null” model (remember our terminology does not match the classical usage) which only used past data (and grading rules, etc.). We now have to see how useful each of these models are. (If you can’t remember what all these terms mean — review!)

In one sense, we cannot do too much more because we have the models and have made predictions using them. That was our goal, remember? Now we have to sit back and wait for new values of HGPA/SAT and CGPA come in. Then we can see how each of these models’ predictions match reality. This is the True Test.

Here, for instance, is an excerpt of predictions of the full model (HGPA/SAT) (I’m assuming we left off just where we were last time, so that `y` and `a` are still in R’s memory; if not, go over last week’s code):

```> cbind(y[,1:2],round(a,2))
sat hgpa    0    1    2    3    4
1   400    0 0.55 0.26 0.19 0.00 0.00
2   500    0 0.52 0.27 0.21 0.01 0.00
3   600    0 0.47 0.28 0.24 0.01 0.00
4   700    0 0.42 0.30 0.28 0.01 0.00
5   800    0 0.35 0.32 0.32 0.01 0.00
6   900    0 0.28 0.33 0.37 0.01 0.00
7  1000    0 0.22 0.35 0.41 0.02 0.00
8  1100    0 0.16 0.36 0.45 0.03 0.01
9  1200    0 0.11 0.36 0.48 0.04 0.01
10 1300    0 0.08 0.34 0.51 0.06 0.02
11 1400    0 0.05 0.31 0.52 0.08 0.03
12 1500    0 0.04 0.28 0.53 0.11 0.04
13  400    1 0.52 0.24 0.23 0.01 0.00
14  500    1 0.46 0.25 0.27 0.01 0.00
15  600    1 0.40 0.27 0.32 0.01 0.00
16  700    1 0.32 0.29 0.37 0.02 0.00
...
57 1200    4 0.00 0.04 0.22 0.58 0.15
58 1300    4 0.00 0.04 0.18 0.63 0.15
59 1400    4 0.00 0.04 0.14 0.66 0.15
60 1500    4 0.00 0.05 0.12 0.69 0.14
```

We could publish this, or the whole table, and we’d be done! Anybody could take these predictions and implement them. They wouldn’t have to know the details of your model, or of your original data. There is your bold theory, exposed for the world to see! That, after all, is how science should work.

Of course, all the shortcomings of your model will be obvious to anybody who tries to use it, too. Which, again, is just how it should be.

Make sure you understand what we’ve done so far. If you are the Dean and want to classify students into one of five CGPA buckets, we have a predictive model accounting for HGPA and SAT. But suppose you didn’t want to account for HGPA. Well, we have a model of just SAT: use that. And so on. The breakdowns of SAT (every 100) and HGPA (every 1) we used were also geared to the decision. Change the decision, change the breakdown, change the model.

In any case, this is it! This is what we wanted. We wanted to know, given the grading rules and old obs, and values of SAT and HPGA, what is the probability of having CGPA in one of the buckets. And this is what we did! We are done. All those people who wanted practical examples of the predictive way, well, here you go. In the end, it’s pretty simple, isn’t it?

But we can do two more things. (1) We can compare our predictive model (perhaps varying it here and there) with old-fashioned NHST/parameter-estimating models, and (2) we can create a new model that predicts the performance our current model. Number (2) is the really important step, but we won’t get to it today. Let’s do (1).

What model would most use in this situation? A linear regression. Here it is (mixing code and output again). The `cgpa.o` was the original CPGA, not classified into buckets. It is the raw score (the “o” is for original).

```fit.n= glm(cgpa.o ~ sat + hgpa , data=x) # note the cgpa.o! which is the original data
summary(fit.n)

Call:
glm(formula = cgpa.o ~ sat + hgpa, data = x)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.12153  -0.44120   0.00954   0.38198   1.80356

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0881312  0.2866638  -0.307 0.759169
sat          0.0012167  0.0003011   4.041 0.000107 ***
hgpa         0.4071133  0.0905946   4.494 1.94e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Wee p-values galore! The asterisks give the glory. Feel it. So, the classical statistician would say, SAT and HPGA are “linked to” CGPA. Researchers will speak of “statistical significance” and say “SAT drives high college grade point” and so on. How much does SAT influence (they’ll say “impact”) CGPA? Well, they might say for every increase in SAT by 1, CGPA goes up on average 0.0012 points. And so on.

Not much more would be done with this model, especially since everything is “significant”. Maybe the modeler throws in an interaction. Whatever. No matter what, this model exaggerates the evidence we have, and is in substantial error, even though it doesn’t look like it. Here’s why.

This model implies a prediction: all models imply predictions, even though they are not routinely made. It’s written in classical form, but the prediction is there, hidden away. Let’s look at it. We do so by integrating out the parameters, picking a “flat” prior, which, for this model anyway, gives us the exact same results for the parameters as the frequentist estimates.

Recall `fit` is our mutlinomial (like) model, including both SAT and HGPA. Let’s pick two hypothetical students, a good one and a bad one, and compare our predictive model with the prediction based on the ordinary regression. Before running this, first re-download the class code, which has been updated to include the code which calculates probability predictions from normal regression models. This is the `obs.glm`, which outputs a central, spread, and degrees-of-freedom parameter for the predictive distribution of the regression (this turns out to be a non-central T).

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')\$p #our mulitnomial moels

plot(levels(x\$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
lines(levels(x\$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
grid()
legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2)

s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])

m = s1\$central-s2\$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
s = sqrt(s1\$scale^2+s2\$scale^2)
w = seq(min(s1\$central,s2\$central)-3*s, max(s1\$central,s2\$central)+3*s, length.out = 100)
lines(w,.7*dt((w-s1\$central)/s1\$scale,s1\$df)/s1\$scale,lty=1,lwd=2)
lines(w,.7*dt((w-s2\$central)/s2\$scale,s2\$df)/s2\$scale,lty=2,col=2,lwd=2)
```

Our multinomial-like model gives the spikes; the regression densities, which are not probabilities, are the curves. We’ll fix the densities into probabilities later. But it’s densities, because the normal lives on the continuum, and so does the predictive distribution from the normal (the t). Notice anything odd? The regression gives probabilities to impossible values, i.e. CGPA less than 0 and greater than 4. I call this probability leakage.

It’s not terrible here, but it does exist. It means the model is predicting impossibilities. The standard regression model is at the least inefficient. Interestingly, this leakage model cannot be falsified. It gives positive probability to events we’ll never see, but it never gives 0 probability anywhere! Falsifiability isn’t that interesting.

That’s enough for this time. Next time, we turn the densities into probabilities, and make modifications to our multinomial model.

Homework Try the normal predictions for the singular models with SAT and HGPA alone, and for the “null” model, which is had by `glm(cgpa.o ~ 1 , data=x)`. Which has the most leakage? Are there huge differences?

Because you asked You can now download `briggs.homework.R`, which contains all the code used in the lectures. Note that this is different than `briggs.class.R`, which can be treated like black-box helper code.

January 23, 2018 | 4 Comments

## Free Probability-Statistics Class: Predictive Case Study 1, Part IX

Review!

Last time we successfully computed our model (8):

(8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,5.

The model we selected was an ad hoc multinomial probit regression, with default prior selected by the MNP package. If we’re happy with that model, we’re done except for exploring other scenarios of SAT and HGPA. We can easily generate all possible combinations of decisionable values.

```w1 = seq(400,1500,100) # sat 400 - 1500, every 100
w2 = seq(0,4,1) # hgpa 0 - 4, every 1
y = as.data.frame(expand.grid(w1,w2)) # R loves data frame
y\$cgpa = "4" # this automatically creates the right number of cgpas
names(y) = c('sat','hgpa','cgpa') # these have to match spelling as in x!

a=predict(fit, newdata = y, type='prob')\$p # takes a couple of seconds to run; assumed fit is in memory!
```

I’ve decided the decisionable values of SAT and HGPA would be every 100 and 1, but you could easily change these in the `seq()` function. There’s no easy way to visualize this, since we have 2 measurements taking 60 combinations (at values I chose) in 5 dimensions (CGPA = 0-4). But we can do things like this:

```contour(w2,w1,matrix(a[,3],length(w2),length(w1),byrow=TRUE), xlab='SAT',ylab='HGPA', main='Pr(CGPA=2|E)')

# remember a[,1] Pr(CGPA=0|evidence), a[,2] Pr(CGPA=1|evidence), etc.

# Or for something fancier and colorful, try this
library(plotly)

plot_ly(x=w2,y=w1,z=matrix(a[,3],length(w1),length(w2)),type='contour') %>%
layout(title = 'Pr(CGPA=2|E)',
xaxis = list(title = 'HGPA'),
yaxis = list(title = 'Brain Weight (g)'))
```

Note: Cut and paste everything from `plot_ly` on down.

These plot the predicted probabilities of CGPA as contours by the values of SAT and HGPA. As with everything, whether these plots are of any value depends on the decisions you will make. They are probably way too specific, especially given the evidence the numerical approximations are only “close enough.” Our knowledge of that numerical approximation is part of our model; I mean, part of the right hand side of (8). Keep that in mind!

One decision is whether SAT or HGPA are relevant, in the presence of the other. If you cannot recall the definition of relevant — review! Let’s check!

```fit.null <- mnp(cgpa ~ 1, data=x, burnin = 1000, n.draws=50000)
fit.a <- mnp(cgpa ~ sat , data=x, burnin = 2000,n.draws=50000)
fit.b <- mnp(cgpa ~ hgpa, data=x, burnin = 1000, n.draws=50000)
```

The `fit.null` is the model (5) we did many lessons ago, or the form of it as a probit model with all the numerical approximation baggage. It is simply the predictive model not taking account anything but past observations, grading rules, and the model and approximations. SAT and HGPA should be relevant, or else `fit.null` is good enough.

```# we'll reuse the same y as above; measurements not used in the model aren't used in the prediction
a.null=predict(fit.null, newdata = y, type='prob')\$p
a.a=predict(fit.a, newdata = y, type='prob')\$p
a.b=predict(fit.b, newdata = y, type='prob')\$p
```

Recalling `y` is of length 60, here are the first 10 results:

```> a.null
0         1         2         3          4
[1,] 0.05869388 0.1568980 0.5764082 0.1632245 0.04477551
[2,] 0.06053061 0.1583878 0.5749388 0.1609796 0.04516327
[3,] 0.05902041 0.1564694 0.5805918 0.1601020 0.04381633
[4,] 0.06016327 0.1583469 0.5750408 0.1605510 0.04589796
[5,] 0.06004082 0.1562857 0.5766327 0.1621020 0.04493878
[6,] 0.05904082 0.1574694 0.5773265 0.1610612 0.04510204
[7,] 0.05961224 0.1603469 0.5763673 0.1597347 0.04393878
[8,] 0.05844898 0.1585918 0.5771224 0.1618367 0.04400000
[9,] 0.06110204 0.1580816 0.5772041 0.1587347 0.04487755
[10,] 0.06006122 0.1548571 0.5791224 0.1607755 0.04518367
```

This shows you the limits of the numerical approximation technique. All of the numbers in each column should be the same, since this is the predicted probability of CGPA unconditioned on SAT and HGPA. They're not wildly variable in the context of most decisions I can think of, but the variability is there. We could average each column and reduce some of the approximation error (why this is so I won't prove, but it should be intuitive enough). Mixing code and output:

```a.null = colMeans(a.null)

> a.null
0          1          2          3          4
0.05942313 0.15791837 0.57627857 0.16167925 0.04470068
```

So, according to (8a), the probability of CGPA = 2 is highest at 58%. Column 1 of `y` repeats the sequence `w1` length(w2) times. It does this because we were lazy and didn't want to create a new `y` for every model, content to use the original y with all possible (decisionable) combinations of SAT and HGPA. If you look at each of the blocks of SAT from 400 to 1500, you'll again see variability where none should ideally be. We can average the blocks to reduce variability if we want to be clever. This is getting tedious because we could have created new `y`, but on the other hand, we would not have seen the variability of the numerical approximation. Thus our laziness has given us (theoretically, anyway) superior guesses.

```# hack hack hack
# hack, but it should work whatever levels cgpa has, and whatever breakdowns of w1 and w2
# hack hack hack

b.1 = matrix(1:dim(y)[1],length(w1),length(levels(x\$cgpa)))
b.2 = b.1 *0
for(i in 1:length(w2)){
b.2 = b.2 + a.a[((i-1)*length(w1)+1):((i-1)*length(w1)+length(w1)), ]
}
b.2 = b.2/length(w2)
a.a = b.2 # replacement happens here!

b.1 = matrix(1:dim(y)[1],length(w2),length(levels(x\$cgpa)))
b.2 = b.1 *0
for(i in 1:length(w1)){
b.2 = b.2 + a.b[((i-1)*length(w2)+1):((i-1)*length(w2)+length(w2)), ]
}
b.2 = b.2/length(w1)
a.b = b.2 # replacement happens here!
```

Now we use `length(levels(x\$cgpa))` so often, it would be better to create a variable for it, and use it in all the code. I didn't. You do it as homework. As I said at the outset, this code is not optimized. It is written to be plain and simple to follow. Now let's check relevance of SAT!

```plot(w1,a.a[,1],type='b',col = 1, xlab='SAT', ylab='Pr')
for (i in 2:length(levels(x\$cgpa))){
lines(w1,a.a[,i],type='b',col = i)
}
points(rep(w1[1],length(levels(x\$cgpa))), a.null,pch=15, col=1:length(levels(x\$cgpa)),cex=2 )
legend('topright',levels(x\$cgpa),lty=1,col=1:length(levels(x\$cgpa)), bty='n')
```

This shows the predicted probabilities of each level of CGPA for the decisionable levels of SAT we picked. It also shows, at the far left in colored boxes corresponding to the colored lines, the predicted probabilities of the null model. The box probabilities differ from the line probabilities, so SAT is relevant. It is also in the causal chain, as we decided earlier. So this model, call it (8b) is different from (8a), the null model. (Note I do not use "null" in quite the same way as classical statistics.) The last check is whether the probabilities are different enough that our decisions would change. I say they are, because I am the decision maker. You might say, with your decisions in mind, the differences are not big enough. There is no universal magic number!

Now HGPA:

```plot(w2,a.b[,1],type='b',col = 1, xlab='HGPA', ylab='Pr',ylim=c(0,.6))
for (i in 2:length(levels(x\$cgpa))){
lines(w2,a.b[,i],type='b',col = i)
}
points(rep(w2[1],length(levels(x\$cgpa))), a.null,pch=15, col=1:length(levels(x\$cgpa)),cex=2 )
legend('topright',levels(x\$cgpa),lty=1,col=1:length(levels(x\$cgpa)), bty='n')
```

Notice I altered the default y-axis limits. The line probabilities are different from the boxed ones. So HGPA is relevant. But they aren't terribly different. Regardless of the value of HGPA the probabilities for CGPA are mostly flat. The only "big" discrepancy (where "big" relates to a decision) is the probability of CGPA = 2. That difference alone is decision-worthy, to coin a phrase. But it makes me wonder how useful HGPA is with SAT in the model. Back to the contour plots!

Which we'll do next time.

Homework See if you can figure out how to make the 2-D relevance plots. Remember the one above is only at one level of CGPA. Is HGPA still relevant? Yes. But would you, putting yourself in the mind of a Dean, find it useful to decisions?

January 16, 2018 | 4 Comments

## Free Data Science Class: Predictive Case Study 1, Part VIII

Review!

We’re continuing with the CGPA example. The data is on line, and of unknown origin, but good enough to use as an example.

We will build a correlational model, keeping ever in mind this model’s limitations. It can say nothing about cause, for instance.

As we discussed in earlier lessons, the model we will build is in reference to the decisions we will make. Our goal in this model is to make decisions regarding future students’ CGPAs given we have guesses or know their HGPA, SAT, and possibly hours spend studying. We judge at least the first two in the causal path of CGPA. Our initial decision cares about getting CGPA to the nearest point (if you can’t recall why this is most crucially important — review!).

It would be best if we extended our earlier measurement-deduced model, so that we have the predictive model from the get go (if you do not remember what this means — review!). But that’s hard, and we’re lazy. So we’ll do what everybody does and use an ad hoc parameterized model, recognizing that all parameterized models are always approximations to the measurement reality.

Because this is an ad hoc parameterized model, we have several choices. Every choice is in response to a premise we have formed. Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises. All probability follows on our assumptions.

Now the multinomial logistic regression forms a parameter for every category—here we have 5, for CGPA = 0-5—and says those parameters are functions of parameterized measurements in a linear way. The math of all this is busy, but not too hard. Here is one source to examine the model in detail.

For instance, the parameter for CGPA = 0 is itself said to be a linear function of parameterized HGPA and SAT.

These parameters do not exist, give no causal information, and are of no practical interest (no matter how interesting they are mathematically). For instance, they do not appear in what we really want, which is this:

(8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,5.

We do not care about the parameters, which are only mathematical entities needed to get the model to work. But because we do not know the value of the parameters, the uncertainty in them, as it were, has to be specified. That is, a “prior” for them must be given. If we choose one prior, (8) will given one answer; if we choose a different prior, (8) will (likely) give a different answer. Same thing if we choose a different parameterized model: (8) will give different answers. This does not worry us because we remember all probability is conditional on the assumptions we make. CGPA does not “have” a probability! Indeed, the answers (8) gives using different models are usually much more varied than the answers given using the same model but different priors.

What prior should we use? Well, we’re lazy again. We’ll use whatever the software suggests, remembering other choices are possible.

Why not use the MNP R Package for “Fitting the Multinomial Probit Model”? But, wait. Probit is not the same as Logit. That’s true, so let’s update our ad hoc premise to say we really had in mind a multinomial probit model. If you do not have MNP installed, use this command, and follow the subsequent instructions about choosing a mirror.

```install.packages('MNP', dependencies = TRUE)
```

There are other choices beside MNP, but unfortunately the software for multinomial regressions is not nearly as developed and as bullet proof as for ordinary regressions. MNP gives the predictive probabilities we want. But we’ll see that it can break. Beside that, our purpose is to understand the predictive philosophy and method, not to tout for a particular ad hoc model. What happens below goes for any model that can be put in the form of (8). This includes all machine learning, AI, etc.

The first thing is to ensure you have downloaded the data file `cgpa.csv`, and also the helper file `briggs.class.R`, which contains code we’ll use in this class. Warning: this file is updated frequently! For all the lawyers, I make no guarantee about this code. It might even destroy your computer, cause your wife to leave you, and encourage your children to become lawyers. Use at your own risk. Ensure Windows did not change name of `cgpa.csv` to `cgpa.csv.txt`.

Save the files in a directory you create for the class. We’ll store that directory in the variable `path`. Remember, `#` comments out the rest of what follows on a line.

```path = 'C:/Users/yourname/yourplace/' # for Windows
#path = '/home/yourname/yourplace/' # for Apple, Linux
# find the path to your file by looking at its properties
# everything in this class is in the same directory

source(paste(path,'briggs.class.R',sep='')) # runs the class code

x\$cgpa.o = x\$cgpa # keeps an original copy of CGPA
x\$cgpa = as.factor(roundTo(x\$cgpa,1)) # rounds to nearest 1

summary(x)
table(x\$cgpa)
```

You should see this:

```>  summary(x)
cgpa        hgpa            sat           recomm          cgpa.o
0: 4   Min.   :0.330   Min.   : 400   Min.   : 2.00   Min.   :0.050
1:17   1st Qu.:1.640   1st Qu.: 852   1st Qu.: 4.00   1st Qu.:1.562
2:59   Median :1.930   Median :1036   Median : 5.00   Median :1.985
3:16   Mean   :2.049   Mean   :1015   Mean   : 5.19   Mean   :1.980
4: 4   3rd Qu.:2.535   3rd Qu.:1168   3rd Qu.: 6.00   3rd Qu.:2.410
Max.   :4.250   Max.   :1500   Max.   :10.00   Max.   :4.010
> table(x\$cgpa)

0  1  2  3  4
4 17 59 16  4
```

The measurement `recomm` we’ll deal with later. Next, the model.

```require(MNP) # loads the package

fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=2000)
#fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=10000)
```

The model call is obvious enough, even if `burnin = 2000, n.draws=2000` is opaque.

Depending on your system, the model fit might break. You might get an odd error message ("TruncNorm: lower bound is greater than upper bound") about inverting a matrix which you can investigate if you are inclined (the problem is in a handful of values in `sat`, and how the model starts up). This algorithm uses MCMC methods, and therefore cycles through a loop of size `n.draws`. All we need to know about this (for now) is that because this is a numerical approximation, larger numbers give less sloppy answers. Try `n.draws=10000`, or even five times that, if your system allows you to get away with it. The more you put, the longer it takes.

We can look at the output of the model like this (this is only a partial output):

```> summary(fit)

Call:
mnp(formula = cgpa ~ sat + hgpa, data = x, n.draws = 50000, burnin = 2000)

Coefficients:
mean   std.dev.       2.5%  97.5%
(Intercept):1 -1.189e+00  2.143e+00 -7.918e+00  0.810
(Intercept):2 -1.003e+00  1.709e+00 -5.911e+00  0.664
(Intercept):3 -8.270e+00  3.903e+00 -1.630e+01 -1.038
(Intercept):4 -2.297e+00  3.369e+00 -1.203e+01 -0.003
sat:1          9.548e-04  1.597e-03 -3.958e-04  0.006
sat:2          1.065e-03  1.488e-03 -7.126e-06  0.005
sat:3          4.223e-03  2.655e-03  2.239e-05  0.010
sat:4          1.469e-03  2.202e-03  1.704e-06  0.008
hgpa:1         9.052e-02  3.722e-01 -5.079e-01  0.953
hgpa:2         1.768e-01  3.518e-01 -2.332e-01  1.188
hgpa:3         1.213e+00  6.610e-01  1.064e-01  2.609
hgpa:4         3.403e-01  5.242e-01 -7.266e-04  1.893
```

The `Coefficients` are the parameters spoken of above. The `mean` etc. are the estimates of these unobservable, not-very-interesting entities. Just keep in mind that because a coefficient is large, does not mean its effect on the probability of CGPA = i is itself large.

We do care about the predictions. We want (8), so let's get it. Stare at (8). On the right hand side we need to guess values of SAT and HGPA for a future student. Let's do that for two students, one with a low SAT and HGPA, and another with high values. You shouldn't have to specify values of CGPA, since these are what we are predicting, but that's a limitation of this software.

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4))
a=predict(fit, newdata = y, type='prob')\$p
```

The syntax is decided by the creators of the MNP package. Anyway, here's what I got. You will NOT see the exact same numbers, since the answers are helter-skelter numerical approximations, but you'll be close.

```> a
0          1         2      3            4
[1,] 0.519000 0.24008333 0.2286875 0.0115 0.0007291667
[2,] 0.000125 0.04489583 0.1222917 0.6900 0.1426875000
```

There are two students, so two rows of predictions for each of the five categories. This says, for student `(sat=400, hgpa=1)`, he'll most like see a CGPA = 0. And for `(sat=1500, hgpa=4)`, the most likely is a CGPA = 3. You can easily play with other scenarios. But, and this should be obvious, if (8) was our goal, we are done!

Next time we'll build on the scenarios, explore this model in more depth, and compare our model with classical ones.

Homework Play with other scenarios. Advanced students can track down the objectionable values of `sat` that cause grief in the model fit (I wrote a script to do this, and known which ones they are). Or they can change the premises, by changing the starting values of the parameters. We didn't do that above, because most users will never do so, relying on the software to work "automatically".

The biggest homework is to think about the coefficients with respect to the prediction probabilities. Answer below!