*Read Part I : Download the Quirk’s article.*

To understand why ordinary regression assumptions are bad, we need to look at some (new data) scenarios.

Suppose that Bob had a HGPA of 4, a SAT score of 1168, and a LTRS of 6; and suppose Alice had the same SAT and LTRS but a HGPA of 3. Those values of the SAT and LTRS are the third quartiles in the data. Now, given all the old data and the regression model we used, what is the probability that Bob has a CGPA > 4, and what is the probability that Alice does?

Wait, a CGPA greater than 4? We already said it was impossible! CGPAs *can’t* go higher than 4.

But they can if we use normal distributions in regressions. Here’s the probability (density) of Bob (HGPA = 4) and Alice (HGPA = 3) having any CGPA (these densities are not normal, although we did start with normals; *these are for the observable CGPA and not the “mean”*).

The shaded blue and lovely burnt orange show how much area under the curve (which measures the total amount of probability) is given to values larger than 4. A pretty good chunk, no?
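For readers who want to see how such an area might be computed, here is a minimal sketch in R. It is not the code behind the picture. It assumes the usual flat-prior normal regression, under which a new observable follows a Student-t centred at the fitted value, and it runs on synthetic stand-in data rather than the real file. The helper name `prob_above` and the simulated coefficients are invented for illustration.

```r
# Synthetic stand-in data: NOT the real values, just something to run against.
set.seed(1)
n <- 100
d <- data.frame(hgpa = runif(n, 0.5, 4),
                sat  = rnorm(n, 1000, 150),
                ltrs = runif(n, 2, 9))
d$cgpa <- 0.2 + 0.4 * d$hgpa + 0.001 * d$sat + 0.05 * d$ltrs + rnorm(n, 0, 0.6)

fit <- lm(cgpa ~ hgpa + sat + ltrs, data = d)

# prob_above() is a hypothetical helper, not part of any package.
# Assumption: flat-prior normal model, so a new observable is Student-t with
# the residual degrees of freedom, centred at the fitted value, with
# squared scale = (standard error of the fit)^2 + (residual sd)^2.
prob_above <- function(fit, newdata, cutoff) {
  p <- predict(fit, newdata, se.fit = TRUE)
  scale <- sqrt(p$se.fit^2 + p$residual.scale^2)
  pt((cutoff - p$fit) / scale, df = p$df, lower.tail = FALSE)
}

bob   <- data.frame(hgpa = 4, sat = 1168, ltrs = 6)
alice <- data.frame(hgpa = 3, sat = 1168, ltrs = 6)

# Probability the normal model hands to the impossible region CGPA > 4
prob_above(fit, bob, 4)
prob_above(fit, alice, 4)
```

The numbers this prints depend entirely on the simulated data, so they will not match the picture; the point is only that the model happily assigns positive probability above 4.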

The short warning (we’ll save the longer for another time): if your data is restricted (as CGPA is), then you will be making an error using normality in regression. Now, since all data in real life is restricted, you will be making this error a lot.

But ignore this error for now: we’ll turn a blind eye to it like everybody else. Let’s ask this question: what is the probability that Bob has a higher CGPA than Alice, given all the information noted before (and given the screwy normality assumption)?

This is a pretty cool question: it is direct and interesting. And it is useful. Predictive statistics can answer it, but classical procedures do not. The answer is 67%.

There is nothing to interpret here, as there is with p-values or confidence intervals. The number means just what it says: there’s a 67% chance that Bob beats Alice.
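Here is one way a probability like this could be computed, as a hedged sketch rather than the method used for the 67%. It assumes the flat-prior normal regression, under which the difference between two new observables is again a Student-t: centred at the difference in fitted values, with squared scale equal to the parameter uncertainty in that difference plus twice the residual variance. The data are synthetic and the helper `prob_beats` is a made-up name.

```r
# Synthetic stand-in data: NOT the real values.
set.seed(1)
n <- 100
d <- data.frame(hgpa = runif(n, 0.5, 4),
                sat  = rnorm(n, 1000, 150),
                ltrs = runif(n, 2, 9))
d$cgpa <- 0.2 + 0.4 * d$hgpa + 0.001 * d$sat + 0.05 * d$ltrs + rnorm(n, 0, 0.6)

fit <- lm(cgpa ~ hgpa + sat + ltrs, data = d)

# prob_beats() is a hypothetical helper: P(observable for b > observable for a).
# Assumption: flat-prior normal model, so the difference of the two new
# observables is Student-t with the residual degrees of freedom.
prob_beats <- function(fit, a, b) {
  tt <- delete.response(terms(fit))
  dx <- model.matrix(tt, b) - model.matrix(tt, a)  # difference in covariates
  loc <- drop(dx %*% coef(fit))                    # difference in fitted values
  s2  <- summary(fit)$sigma^2                      # residual variance estimate
  # vcov(fit) is s^2 (X'X)^-1; each person adds their own noise, hence 2 * s2
  scale <- sqrt(drop(dx %*% vcov(fit) %*% t(dx)) + 2 * s2)
  pt(loc / scale, df = fit$df.residual)
}

bob   <- data.frame(hgpa = 4, sat = 1168, ltrs = 6)
alice <- data.frame(hgpa = 3, sat = 1168, ltrs = 6)

prob_beats(fit, alice, bob)    # chance Bob beats Alice (synthetic data)
prob_beats(fit, alice, alice)  # identical scores
```

With identical scores the difference in fitted values is zero, so the sketch returns exactly one half regardless of the data.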

Is that important? Not to me, but maybe to you. Here’s what I mean.

Suppose Alice and Bob had identical scores (same HGPA, etc.). Given this, and the regression assumption, what is the probability Bob has a better score than Alice? 50%, right? We have two equal sides of a coin, and it’s 50-50 that Bob wins.

But give Bob a higher HGPA than Alice, hold their other information constant, and he now has a slight edge. This means that the difference between the 67% and 50% gives some idea of the importance of HGPA in predicting CGPA. The further the probability is from 50%, the more important HGPA is. Look at some more examples to see why.

Suppose Bob had a HGPA of 2.54 and Alice a HGPA of 1.64, and let their SAT equal 1015 and LTRS equal 5.2. Then the probability Bob beats Alice is 66%. These are the third and first quartiles of HGPA; the other scores are at their medians.

Now let both Bob and Alice have a HGPA of 1.93 (the data median), same LTRS score, but let Bob have an SAT of 1168 and Alice an SAT of 852 (the third and first quartiles). Then the probability Bob beats Alice is 68%. Slightly higher chance than when Bob had a higher HGPA.

Finally, let Bob and Alice have the same (median) HGPA and SAT scores, but let Bob have a LTRS of 6 and Alice one of 4 (third and first quartiles). Then the probability of Bob beating Alice is only 52%. LTRS is not that important in predicting differences in CGPA.

P-values etc. cannot give this kind of information, and are in any case about “null” hypotheses. Predictive probabilities directly answer how strongly each variable affects the outcome (here, CGPA).

Are these scenarios ideal in judging which variable is best, in the sense of the most predictive? Maybe, but I have no idea. The scenarios to use are the ones that make the most sense to the guidance counselors and admissions officers, to the people who will make actual decisions using our model.

Many scenarios present themselves fully wrapped. That is, each would-be college entrant has his high school grades, SAT, and letters. We can use these to directly compute the probability each person has a CGPA of at least, say, 1.0—or whatever other number is deemed to be the minimum acceptable.

Do you see the real magic? The prediction is of an *observable*. *And we can use those observations (from testing our predictions of new students) to check whether our initial regression model is good or poor*.

This should have you whoopee!-ing because classical procedures *never* do this. Results from statistical models litter sociological and medical journals, but they are always presented metaphysically. P-values, confidence intervals, and the “rejecting” of hypotheses are spoken of; internal model diagnostics might be presented.

But *never* does anybody ever check to see whether their model makes accurate and useful predictions on *new data*! This should shock you. A host of unverified, hopeful models are being used to make decisions.
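What might such a check look like? A rough sketch, under loud assumptions: the data are simulated, the “new” students are just a held-out half of that simulation, the cutoff of 1.0 is the example minimum mentioned above, and the Brier score is only one of many ways to score probability forecasts. None of the names below come from the original code.

```r
# Synthetic stand-in data: NOT the real values.
set.seed(2)
n <- 400
d <- data.frame(hgpa = runif(n, 0.5, 4),
                sat  = rnorm(n, 1000, 150),
                ltrs = runif(n, 2, 9))
d$cgpa <- -0.8 + 0.4 * d$hgpa + 0.001 * d$sat + 0.05 * d$ltrs + rnorm(n, 0, 0.6)

old_students <- d[1:200, ]    # data used to build the model
new_students <- d[201:400, ]  # stands in for students observed later

fit <- lm(cgpa ~ hgpa + sat + ltrs, data = old_students)

# Predicted probability that each new student reaches the cutoff,
# assuming the flat-prior normal model (Student-t predictive).
cutoff <- 1.0
p <- predict(fit, new_students, se.fit = TRUE)
scale <- sqrt(p$se.fit^2 + p$residual.scale^2)
prob <- pt((cutoff - p$fit) / scale, df = p$df, lower.tail = FALSE)

# What actually happened, and how far off the stated probabilities were.
happened <- as.numeric(new_students$cgpa >= cutoff)
brier_model <- mean((prob - happened)^2)

# A naive competitor: give everybody the old base rate.
base_rate <- mean(old_students$cgpa >= cutoff)
brier_naive <- mean((base_rate - happened)^2)

c(model = brier_model, naive = brier_naive)  # lower is better
```

If the model cannot beat the naive base rate on students it has never seen, that is worth knowing before anybody makes decisions with it.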

Predictive techniques can elevate the field of statistics to the same level occupied by physics, chemistry, and the other verifiable sciences. We need to bring statistics back to reality.

*Obviously, we have not nearly exhausted this topic. Plus, I need to explain it better. Stay tuned for more.*

“Turn a blind eye”, you say, but that was pretty much what I was told as a freshman in 1964! When are my eyes to be opened?

I would like to learn the actual math, if that won’t bore other readers to tears.

Are your density graphs made with real numbers or just made up for this exercise? Because I find them tough to swallow. I mean, are more than half of HGPA 4.0 students expected to fall under 3.0 in CGPA?

1. Ditto to Doug M, with references (beyond Geisser, please).

2. Where can we download the data, so we can follow along?

3. If I were doing this, I’d be thinking of a 4×4 table, with 3 summary outcomes (Bob>Alice, Alice>Bob, Bob~Alice). With a four point scale, there will be a pot-load of ties.

Could you be a little more specific?

john,

Real numbers all the way.

Doug M,

I’ll give hints to the math. If you want the whole thing, the place to look is in Bernardo and Smith. Look at posterior predictive distributions for regressions with flat priors.

bill r, All,

The data is at

http://wmbriggs.com/book/sat.txt

To read it in R:

x <- read.table('sat.txt', header=FALSE)

names(x) <- c('cgpa','hgpa','sat','ltrs')

attach(x)

We'll be doing regression in the R lectures later. But if you already know how to get posterior predictive distributions, you'll be able to duplicate my results. See the code myRcode.R in the same books directory for the worked examples.

All,

Oops, I thought this was the R Lecture thread. Here’s the full code:

# my code; also in the book directory

source('Rcode.R')

# regression

fit = glm(cgpa ~ hgpa + sat + ltrs)

# scenario 1

newdata = data.frame(hgpa = 3, sat = 1168, ltrs = 6)

s1 = obs.glm(fit, newdata)

# scenario 2

newdata = data.frame(hgpa = 4, sat = 1168, ltrs = 6)

s2 = obs.glm(fit, newdata)

# the picture; you have to change the legend notation (the leg1 ="blah") etc.

obs.glm.prob(s1,s2,labels = list(xlab="CGPA", leg1 = "HGPA=3" , leg2 = "HGPA=4", main=""))