Feb 09 2010
Predictive Statistics: GPA Case Study, Part I
This is a follow-up to the Quirk’s article.
Predictive statistics differs from classical (frequentist and Bayesian) practices because it focuses on observables and not metaphysical entities. Observables are the data we can see, touch, smell, feel; things that we can measure.
But what are “metaphysical entities”? Things you cannot see, touch, smell, or feel; things that cannot be measured. “Null” and alternate hypotheses and parameters are among them. Too much attention is paid to these, which limits our ability to make decisions about real data and causes us to be more certain than is warranted.
Here is a simple example, geared to those who have had experience with linear regression.
We want to predict what a student’s college grade point average (CGPA) would be given we know his high school GPA (HGPA), SAT score, and total score (0-10; higher is better) assigned to his letters of recommendation (LTRS). Before looking at the data, conditional on our experience, we would guess that higher HGPAs, SATs, and LTRSs are associated with higher CGPAs.
Always begin by looking at your data:
[Figure: scatterplot matrix of CGPA, HGPA, SAT, and LTRS, with histograms on the diagonal and pairwise scatterplots with red trend lines off the diagonal.]

The diagonal plots are the variables’ histograms. We already know that LTRS is restricted to 0-10. But CGPA is restricted to 0-4, and HGPA is restricted to 0-4.25 (for this data set), and SAT is restricted to 400-1600 (these are pre-written-component SAT scores). Remember these restrictions.
The off-diagonal plots work like this: the variable in the row is on the y-axis, and the variable in the column is on the x-axis. The important row is the top row. The first plot to the right of the histogram in that row is CGPA (y-axis) by HGPA (x-axis). As we guessed, higher HGPAs are associated with higher CGPAs. You can figure out the rest easily.
Those red lines are guesses at the associations, and since these are close to straight lines we are comfortable with using linear regression to model the data. That’s the usual way to do this: but we should not be comfortable, as we shall see.
This isn’t the place to teach regression—I’m assuming you know it—but if you don’t, you might guess how it works from the way we write it mathematically:
CGPA = b0 + b1 HGPA + b2 SAT + b3 LTRS + noise
This says that a person’s CGPA is a linear function: start with the number b0 and add to it b1 times his HGPA and add to that b2 times his SAT and so forth. A little “noise” is added to make the whole thing probabilistic. That noise contains all the stuff we did not observe—but could have—and makes up for the difference between the observed CGPA and the first part of the equation.
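The mechanics of that equation can be sketched in a few lines of Python (used here only for illustration; it is not the software used for the analysis). The coefficient values and the noise spread below are invented to show how the pieces fit together, and drawing the noise from a normal distribution is the customary regression assumption, not something the equation itself dictates.

```python
import random

# Hypothetical coefficients, invented purely to illustrate the equation.
B0, B1, B2, B3 = 0.5, 0.4, 0.001, 0.02
NOISE_SD = 0.5  # assumed spread of the noise term (also invented)


def linear_part(hgpa, sat, ltrs):
    """The non-random part: b0 + b1*HGPA + b2*SAT + b3*LTRS."""
    return B0 + B1 * hgpa + B2 * sat + B3 * ltrs


def simulate_cgpa(hgpa, sat, ltrs, rng):
    """One simulated CGPA: the linear part plus normally distributed noise."""
    return linear_part(hgpa, sat, ltrs) + rng.gauss(0.0, NOISE_SD)


rng = random.Random(1)  # fixed seed so reruns give the same draws
student = dict(hgpa=3.0, sat=1000, ltrs=5)  # a made-up student

print("linear part:", round(linear_part(**student), 3))
print("simulated CGPAs:",
      [round(simulate_cgpa(**student, rng=rng), 2) for _ in range(3)])
```

Two students with identical HGPA, SAT, and LTRS share the same linear part; only the noise draw separates their simulated CGPAs.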
Anyway, as the sociologists say, “Let’s submit this to SPSS and see what we get.” I’ll use R, but the answers will be the same. What pops out is a table that looks something like this (R, like all software, shows too much precision):
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.1532639 | 0.3229381 | -0.475 | 0.636156 |
| HGPA | 0.3763511 | 0.1142615 | 3.294 | 0.001385 |
| SAT | 0.0012269 | 0.0003032 | 4.046 | 0.000105 |
| LTRS | 0.0226843 | 0.0509817 | 0.445 | 0.657358 |
By “(Intercept)” R means the b0 in our model; and by HGPA it means b1, and so forth. The estimate is just that: a guess at the value of each bi. The “Std. Error” we can ignore, because it’s incorporated into the “t value”, which is used to test the “null hypothesis” that each bi = 0: and I mean equals, precisely zero. The “Pr(>|t|)” is the infamous p-value of this test.
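To see how the standard error is folded into the t value, here is a small Python check (Python is used only for illustration). Each t value is the estimate divided by its standard error. The numbers are copied from the table; because the table shows rounded figures, the recomputed values should agree with the printed t values only to within about 0.001.

```python
# Estimates and standard errors copied from the regression table.
table = {
    "(Intercept)": (-0.1532639, 0.3229381),
    "HGPA": (0.3763511, 0.1142615),
    "SAT": (0.0012269, 0.0003032),
    "LTRS": (0.0226843, 0.0509817),
}


def t_value(estimate, std_error):
    """t statistic for the test that the coefficient equals zero."""
    return estimate / std_error


t_values = {name: t_value(est, se) for name, (est, se) in table.items()}

for name, t in t_values.items():
    print(f"{name:12s} t = {t:.4f}")
```

Turning a t value into the p-value also requires the degrees of freedom, which depend on the sample size; the table alone does not supply that, so the sketch stops at the t values.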
Popular mysticism insists that the p-value should be strictly less than 0.05 to be publishable, so that we can authoritatively state the variable is associated with CGPA. And we’re in luck: The p-values for HGPA and SAT are nice and small. We could write our paper and say there was a “statistically significant” association between HGPA, SAT and CGPA. But not for poor LTRS, which has a depressingly large p-value.
Most would stop the analysis here. Some might push just a bit further and conclude that SAT is a “better” predictor because it has a smaller p-value. The more conscientious would glance at the diagnostics (residual plots, R², AIC, etc.). I won’t show them, but I assure you all these are fine.
Are we happy? Handshakes all around? Break out the organic cigars? Not yet.
Predictive statistics begins with the idea that the data we have in hand is not unknown—which should be trivially obvious. What we do not know is what future data will look like (by “future” I only mean data that we haven’t yet seen; it could have been collected in the past).
Thus, we are not interested in whether or not, say, b2 = 0, or any other value. We want to know answers to questions like this: Given a person has a high HGPA, decent SAT scores, and an average letter-of-recommendation score, what is the probability he will have a high CGPA?
That question is entirely observable: the data that is given—the conditions which form our questions—and the observed outcome are all measurable, real things. We’ll never know—we can never know—whether b2 is zero, or any other value. But we can know whether our prediction of a person’s CGPA is good.
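To make that kind of question concrete, here is a rough Python sketch (for illustration only). It plugs the table's estimates into the regression equation and treats the noise as normal with a residual standard deviation of 0.5. That 0.5 is an assumption made up for this example, since the table does not report it; the student's values and the 3.0 cutoff for a “high” CGPA are likewise hypothetical. Plugging in the estimates as if they were known ignores their uncertainty, so treat this as a crude stand-in and not a full predictive calculation.

```python
from statistics import NormalDist

# Estimates copied from the regression table earlier in the post.
B0, B_HGPA, B_SAT, B_LTRS = -0.1532639, 0.3763511, 0.0012269, 0.0226843

ASSUMED_RESID_SD = 0.5  # NOT from the post: an invented residual spread
HIGH_CGPA = 3.0         # hypothetical cutoff for a "high" CGPA


def predicted_mean(hgpa, sat, ltrs):
    """Plug-in central guess for CGPA given the three observables."""
    return B0 + B_HGPA * hgpa + B_SAT * sat + B_LTRS * ltrs


def prob_cgpa_above(cutoff, hgpa, sat, ltrs, resid_sd=ASSUMED_RESID_SD):
    """Plug-in probability that CGPA exceeds the cutoff under normal noise."""
    dist = NormalDist(mu=predicted_mean(hgpa, sat, ltrs), sigma=resid_sd)
    return 1.0 - dist.cdf(cutoff)


# A made-up student: high HGPA, decent SAT, average letters.
mean = predicted_mean(hgpa=3.8, sat=1200, ltrs=5)
p = prob_cgpa_above(HIGH_CGPA, hgpa=3.8, sat=1200, ltrs=5)

print(f"central guess for CGPA: {mean:.2f}")
print(f"plug-in Pr(CGPA > {HIGH_CGPA}): {p:.2f}")
```

The output is a statement about something measurable: once this student's CGPA is observed, the stated probability can be checked against what actually happened.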
Remember when I said we should not be comfortable with the regression assumptions? Tomorrow we see why.



