Class - Applied Statistics

# Free Probability-Statistics Class: Predictive Case Study 1, Part X

Review!

Last time we created four models of CGPA. Which is correct? They all are. Why? I should ask as a homework question, but I’ll remind us here. Since all probability is conditional, and these are all ad hoc models, all are correct, given the assumptions. Which one is best? Depends on the decisions you want to make and on what you mean by “best.” Let’s see.

We discovered last time that by itself HGPA was relevant to CGPA, but not by much. SAT was also by itself relevant. The contour plots (which I have decided not to redo, since we did them last week) showed that SAT and HGPA when considered together are also relevant. We also created the “null” model (remember our terminology does not match the classical usage) which only used past data (and grading rules, etc.). We now have to see how useful each of these models are. (If you can’t remember what all these terms mean — review!)

In one sense, we cannot do too much more because we have the models and have made predictions using them. That was our goal, remember? Now we have to sit back and wait for new values of HGPA/SAT and CGPA come in. Then we can see how each of these models’ predictions match reality. This is the True Test.

Here, for instance, is an excerpt of predictions of the full model (HGPA/SAT) (I’m assuming we left off just where we were last time, so that `y` and `a` are still in R’s memory; if not, go over last week’s code):

```> cbind(y[,1:2],round(a,2))
sat hgpa    0    1    2    3    4
1   400    0 0.55 0.26 0.19 0.00 0.00
2   500    0 0.52 0.27 0.21 0.01 0.00
3   600    0 0.47 0.28 0.24 0.01 0.00
4   700    0 0.42 0.30 0.28 0.01 0.00
5   800    0 0.35 0.32 0.32 0.01 0.00
6   900    0 0.28 0.33 0.37 0.01 0.00
7  1000    0 0.22 0.35 0.41 0.02 0.00
8  1100    0 0.16 0.36 0.45 0.03 0.01
9  1200    0 0.11 0.36 0.48 0.04 0.01
10 1300    0 0.08 0.34 0.51 0.06 0.02
11 1400    0 0.05 0.31 0.52 0.08 0.03
12 1500    0 0.04 0.28 0.53 0.11 0.04
13  400    1 0.52 0.24 0.23 0.01 0.00
14  500    1 0.46 0.25 0.27 0.01 0.00
15  600    1 0.40 0.27 0.32 0.01 0.00
16  700    1 0.32 0.29 0.37 0.02 0.00
...
57 1200    4 0.00 0.04 0.22 0.58 0.15
58 1300    4 0.00 0.04 0.18 0.63 0.15
59 1400    4 0.00 0.04 0.14 0.66 0.15
60 1500    4 0.00 0.05 0.12 0.69 0.14
```

We could publish this, or the whole table, and we’d be done! Anybody could take these predictions and implement them. They wouldn’t have to know the details of your model, or of your original data. There is your bold theory, exposed for the world to see! That, after all, is how science should work.

Of course, all the shortcomings of your model will be obvious to anybody who tries to use it, too. Which, again, is just how it should be.

Make sure you understand what we’ve done so far. If you are the Dean and want to classify students into one of five CGPA buckets, we have a predictive model accounting for HGPA and SAT. But suppose you didn’t want to account for HGPA. Well, we have a model of just SAT: use that. And so on. The breakdowns of SAT (every 100) and HGPA (every 1) we used were also geared to the decision. Change the decision, change the breakdown, change the model.

In any case, this is it! This is what we wanted. We wanted to know, given the grading rules and old obs, and values of SAT and HPGA, what is the probability of having CGPA in one of the buckets. And this is what we did! We are done. All those people who wanted practical examples of the predictive way, well, here you go. In the end, it’s pretty simple, isn’t it?

But we can do two more things. (1) We can compare our predictive model (perhaps varying it here and there) with old-fashioned NHST/parameter-estimating models, and (2) we can create a new model that predicts the performance our current model. Number (2) is the really important step, but we won’t get to it today. Let’s do (1).

What model would most use in this situation? A linear regression. Here it is (mixing code and output again). The `cgpa.o` was the original CPGA, not classified into buckets. It is the raw score (the “o” is for original).

```fit.n= glm(cgpa.o ~ sat + hgpa , data=x) # note the cgpa.o! which is the original data
summary(fit.n)

Call:
glm(formula = cgpa.o ~ sat + hgpa, data = x)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.12153  -0.44120   0.00954   0.38198   1.80356

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0881312  0.2866638  -0.307 0.759169
sat          0.0012167  0.0003011   4.041 0.000107 ***
hgpa         0.4071133  0.0905946   4.494 1.94e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Wee p-values galore! The asterisks give the glory. Feel it. So, the classical statistician would say, SAT and HPGA are “linked to” CGPA. Researchers will speak of “statistical significance” and say “SAT drives high college grade point” and so on. How much does SAT influence (they’ll say “impact”) CGPA? Well, they might say for every increase in SAT by 1, CGPA goes up on average 0.0012 points. And so on.

Not much more would be done with this model, especially since everything is “significant”. Maybe the modeler throws in an interaction. Whatever. No matter what, this model exaggerates the evidence we have, and is in substantial error, even though it doesn’t look like it. Here’s why.

This model implies a prediction: all models imply predictions, even though they are not routinely made. It’s written in classical form, but the prediction is there, hidden away. Let’s look at it. We do so by integrating out the parameters, picking a “flat” prior, which, for this model anyway, gives us the exact same results for the parameters as the frequentist estimates.

Recall `fit` is our mutlinomial (like) model, including both SAT and HGPA. Let’s pick two hypothetical students, a good one and a bad one, and compare our predictive model with the prediction based on the ordinary regression. Before running this, first re-download the class code, which has been updated to include the code which calculates probability predictions from normal regression models. This is the `obs.glm`, which outputs a central, spread, and degrees-of-freedom parameter for the predictive distribution of the regression (this turns out to be a non-central T).

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')\$p #our mulitnomial moels

plot(levels(x\$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
lines(levels(x\$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
grid()
legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2)

s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])

m = s1\$central-s2\$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
s = sqrt(s1\$scale^2+s2\$scale^2)
w = seq(min(s1\$central,s2\$central)-3*s, max(s1\$central,s2\$central)+3*s, length.out = 100)
lines(w,.7*dt((w-s1\$central)/s1\$scale,s1\$df)/s1\$scale,lty=1,lwd=2)
lines(w,.7*dt((w-s2\$central)/s2\$scale,s2\$df)/s2\$scale,lty=2,col=2,lwd=2)
``` Our multinomial-like model gives the spikes; the regression densities, which are not probabilities, are the curves. We’ll fix the densities into probabilities later. But it’s densities, because the normal lives on the continuum, and so does the predictive distribution from the normal (the t). Notice anything odd? The regression gives probabilities to impossible values, i.e. CGPA less than 0 and greater than 4. I call this probability leakage.

It’s not terrible here, but it does exist. It means the model is predicting impossibilities. The standard regression model is at the least inefficient. Interestingly, this leakage model cannot be falsified. It gives positive probability to events we’ll never see, but it never gives 0 probability anywhere! Falsifiability isn’t that interesting.

That’s enough for this time. Next time, we turn the densities into probabilities, and make modifications to our multinomial model.

Homework Try the normal predictions for the singular models with SAT and HGPA alone, and for the “null” model, which is had by `glm(cgpa.o ~ 1 , data=x)`. Which has the most leakage? Are there huge differences?

Because you asked You can now download `briggs.homework.R`, which contains all the code used in the lectures. Note that this is different than `briggs.class.R`, which can be treated like black-box helper code.

### 4 replies »

1. JH says:

We discovered last time that by itself HGPA was relevant to CGPA, but not by much.

How much is much?

What model would most use in this situation? A linear regression.

“Most used” doesn’t imply “appropriately or correctly used”.  A linear model is not correct for the tempered data.  How do we know that it is incorrect?  The discrete, tempered data structure of CGPA simply does meet the assumptions of a linear model. And evidently, an incorrect model can assign probabilities to impossible values.

An ordinal multinomial probability (MNP) model is still appropriate.  A MNP model wastes the information that, e.g., that a CGPA of 3 is greater and better than 2.  The preference is to extract as much information as possible from the data collected.  Just like in  real life, we often strive to use the information as fully and efficiently as possible when making decisions and inferences.

A fair comparison is to fit a MNP model (though not optimal) to the same tempered data using the two different METHODS of estimating the probabilities. Please also note that in your previous post, you ran a model with different coefficients for each of the five buckets. Such assumption is not imposed in this post. (See http://data.princeton.edu/wws509/notes/c6.pdf
and  https://cran.r-project.org/web/packages/mnlogit/vignettes/mnlogit.pdf.)

2. Briggs says:

JH,

How much is how much can be examined in the plot, or in the objects comprising the plot. Remember: relevance is a probability criterion. Usefulness is a decision criterion.

The package we’re using for the mulitnomial model is far from perfect, as the lesson which introduced it detailed. However, it is adequate for our purposes here, which are to introduce the predictive technique. It is an excellent point on the ordering, which is information we are not using in full. As such, the package is (yet another) approximation.

3. JH says:

An important correction-
The discrete, tempered data structure of CGPA simply does NOT meet the assumptions of a linear model.

So a subjective examination of the plot of allows you to conclude that “that by itself HGPA was relevant to CGPA, but not by much.”

I don’t know… as I get older, it takes longer for my one-farsighted-one-nearsighted eyes to focus, and the numbers in all plots seem too small to be clear. Also, I remember there was a section in an introductory textbook on how to fool others by plots. I thought the answer might be that it would be up to the decision maker based on what you try to advocate.

There are various methods for estimating the multinomial probabilities, e.g. MLE and Bayesian estimations. If you run an MLE estimation of the multinomial probabilities (MNP) as described in the links provided above, dropping a predictor that is not USEFUL in predicting the MNP, will probably give you a conclusion that
that by itself the predictor was relevant to Y, but not by much.”

Changing raw data can be seen as losing information and reducing data accuracy. If one postulates that the usefulness of a predictor in predicting CGPAs depends on CGPA values, say on whether CGPA >= 3.5 or 3.0<= CGPA < 3.5, or 2.5<= CGPA < 3.0, or 2<= CGPA < 2.5, or CGPA<2. Instead of modifing the CGPA into a classification variable, one would use indicator variables to model such postulations, not by changing the y variable CGPA.

4. JH says:

And the indicator variables are used as a way of incooperating the postulation or information quantitatively into a statistical model. Statistical modelling is fun.