## Free Data Science Class: Predictive Case Study 1, Part VIII

We’re continuing with the CGPA example. The data is on line, and of unknown origin, but good enough to use as an example.

We will build a *correlational* model, keeping ever in mind this model’s limitations. It can say nothing about cause, for instance.

As we discussed in earlier lessons, the model we will build is in reference to the decisions we will make. Our goal in this model is to make decisions regarding future students’ CGPAs given we have guesses or know their HGPA, SAT, and possibly hours spend studying. We judge at least the first two in the causal path of CGPA. Our initial decision cares about getting CGPA to the nearest point (if you can’t recall why this is most crucially important — review!).

It would be best if we extended our earlier measurement-deduced model, so that we have the predictive model from the get go (if you do not remember what this means — review!). But that’s hard, and we’re lazy. So we’ll do what *everybody* does and use an *ad hoc* parameterized model, recognizing that all parameterized models are *always* approximations to the measurement reality.

Because this is an *ad hoc* parameterized model, we have several choices. Every choice is in response to a premise we have formed. Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises. *All probability follows on our assumptions*.

Now the multinomial logistic regression forms a parameter for every category—here we have 5, for CGPA = 0-5—and says those parameters are functions of parameterized measurements in a linear way. The math of all this is busy, but not too hard. Here is one source to examine the model in detail.

For instance, the parameter for CGPA = 0 is itself said to be a linear function of parameterized HGPA and SAT.

These parameters do not exist, give no causal information, and are of no *practical* interest (no matter how interesting they are mathematically). For instance, they do not appear in what we really want, which is this:

(8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,5.

We do not care about the parameters, which are only mathematical entities needed to get the model to work. But because we do not know the value of the parameters, the uncertainty in them, as it were, has to be specified. That is, a “prior” for them must be given. If we choose one prior, (8) will given one answer; if we choose a different prior, (8) will (likely) give a different answer. Same thing if we choose a different parameterized model: (8) will give different answers. This does not worry us because we remember *all probability is conditional on the assumptions we make*. CGPA does not “have” a probability! Indeed, the answers (8) gives using different models are usually much more varied than the answers given using the same model but different priors.

What prior should we use? Well, we’re lazy again. We’ll use whatever the software suggests, remembering other choices are possible.

Why not use the MNP R Package for “Fitting the Multinomial Probit Model”? But, wait. Probit is not the same as Logit. That’s true, so let’s update our *ad hoc* premise to say we really had in mind a multinomial probit model. If you do not have MNP installed, use this command, and follow the subsequent instructions about choosing a mirror.

install.packages('MNP', dependencies = TRUE)

There are other choices beside MNP, but unfortunately the software for multinomial regressions is not nearly as developed and as bullet proof as for ordinary regressions. MNP gives the predictive probabilities we want. But we’ll see that it can break. Beside that, our purpose is to understand the predictive philosophy and method, not to tout for a particular *ad hoc* model. What happens below goes for *any* model that can be put in the form of (8). This includes all machine learning, AI, etc.

The first thing is to ensure you have downloaded the data file `cgpa.csv`

, and also the helper file `briggs.class.R`

, which contains code we’ll use in this class. *Warning*: this file is updated frequently! For all the lawyers, I make no guarantee about this code. It might even destroy your computer, cause your wife to leave you, and encourage your children to become lawyers. Use at your own risk. Ensure Windows did *not* change name of `cgpa.csv`

to `cgpa.csv.txt`

.

Save the files in a directory you create for the class. We’ll store that directory in the variable `path`

. Remember, `#`

comments out the rest of what follows on a line.

path = 'C:/Users/yourname/yourplace/' # for Windows #path = '/home/yourname/yourplace/' # for Apple, Linux # find the path to your file by looking at its properties # everything in this class is in the same directory source(paste(path,'briggs.class.R',sep='')) # runs the class code x = read.csv(paste(path,'cgpa.csv',sep='')) x$cgpa.o = x$cgpa # keeps an original copy of CGPA x$cgpa = as.factor(roundTo(x$cgpa,1)) # rounds to nearest 1 summary(x) table(x$cgpa)

You should see this:

> summary(x) cgpa hgpa sat recomm cgpa.o 0: 4 Min. :0.330 Min. : 400 Min. : 2.00 Min. :0.050 1:17 1st Qu.:1.640 1st Qu.: 852 1st Qu.: 4.00 1st Qu.:1.562 2:59 Median :1.930 Median :1036 Median : 5.00 Median :1.985 3:16 Mean :2.049 Mean :1015 Mean : 5.19 Mean :1.980 4: 4 3rd Qu.:2.535 3rd Qu.:1168 3rd Qu.: 6.00 3rd Qu.:2.410 Max. :4.250 Max. :1500 Max. :10.00 Max. :4.010 > table(x$cgpa) 0 1 2 3 4 4 17 59 16 4

The measurement `recomm`

we’ll deal with later. Next, the model.

require(MNP) # loads the package fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=2000) #fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=10000)

The model call is obvious enough, even if `burnin = 2000, n.draws=2000`

is opaque.

Depending on your system, the model fit might break. You might get an odd error message ("TruncNorm: lower bound is greater than upper bound") about inverting a matrix which you can investigate if you are inclined (the problem is in a handful of values in `sat`

, and how the model starts up). This algorithm uses MCMC methods, and therefore cycles through a loop of size `n.draws`

. All we need to know about this (for now) is that because this is a numerical approximation, larger numbers give less sloppy answers. Try `n.draws=10000`

, or even five times that, if your system allows you to get away with it. The more you put, the longer it takes.

We can look at the output of the model like this (this is only a partial output):

> summary(fit) Call: mnp(formula = cgpa ~ sat + hgpa, data = x, n.draws = 50000, burnin = 2000) Coefficients: mean std.dev. 2.5% 97.5% (Intercept):1 -1.189e+00 2.143e+00 -7.918e+00 0.810 (Intercept):2 -1.003e+00 1.709e+00 -5.911e+00 0.664 (Intercept):3 -8.270e+00 3.903e+00 -1.630e+01 -1.038 (Intercept):4 -2.297e+00 3.369e+00 -1.203e+01 -0.003 sat:1 9.548e-04 1.597e-03 -3.958e-04 0.006 sat:2 1.065e-03 1.488e-03 -7.126e-06 0.005 sat:3 4.223e-03 2.655e-03 2.239e-05 0.010 sat:4 1.469e-03 2.202e-03 1.704e-06 0.008 hgpa:1 9.052e-02 3.722e-01 -5.079e-01 0.953 hgpa:2 1.768e-01 3.518e-01 -2.332e-01 1.188 hgpa:3 1.213e+00 6.610e-01 1.064e-01 2.609 hgpa:4 3.403e-01 5.242e-01 -7.266e-04 1.893

The `Coefficients`

are the parameters spoken of above. The `mean`

etc. are the estimates of these unobservable, not-very-interesting entities. Just keep in mind that because a coefficient is large, does not mean its effect on the probability of CGPA = i is itself large.

We do care about the predictions. We want (8), so let's get it. Stare at (8). On the right hand side we need to guess values of SAT and HGPA for a future student. Let's do that for two students, one with a low SAT and HGPA, and another with high values. You shouldn't have to specify values of CGPA, since these are what we are predicting, but that's a limitation of this software.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) a=predict(fit, newdata = y, type='prob')$p

The syntax is decided by the creators of the MNP package. Anyway, here's what I got. You will *NOT* see the exact same numbers, since the answers are helter-skelter numerical approximations, but you'll be close.

> a 0 1 2 3 4 [1,] 0.519000 0.24008333 0.2286875 0.0115 0.0007291667 [2,] 0.000125 0.04489583 0.1222917 0.6900 0.1426875000

There are two students, so two rows of predictions for each of the five categories. This says, for student `(sat=400, hgpa=1)`

, he'll most like see a CGPA = 0. And for `(sat=1500, hgpa=4)`

, the most likely is a CGPA = 3. You can easily play with other scenarios. But, and this should be obvious, if (8) was our goal, we are done!

Next time we'll build on the scenarios, explore this model in more depth, and compare our model with classical ones.

**Homework** Play with other scenarios. Advanced students can track down the objectionable values of `sat`

that cause grief in the model fit (I wrote a script to do this, and known which ones they are). Or they can change the premises, by changing the starting values of the parameters. We didn't do that above, because most users will never do so, relying on the software to work "automatically".

The biggest homework is to think about the coefficients with respect to the prediction probabilities. Answer below!