*Read Part II : Download the Quirk’s article.*

Predictive statistics differs from classical (frequentist and Bayesian) practices because it focuses on *observables* and not metaphysical entities. Observables are the data we can see, touch, smell, feel; things that we can measure.

But what are “metaphysical entities”? Things you cannot see, touch, smell, or feel; things that cannot be measured. “Null” and alternate hypotheses and parameters are among them. Too much attention is paid to these, which limits our ability to make decisions about real data and causes us to be more certain than is warranted.

Here is a simple example, geared to those who have had experience with linear regression.

We want to predict what a student’s college grade point average (CGPA) would be, given that we know his high school GPA (HGPA), his SAT score, and the total score (0-10; higher is better) assigned to his letters of recommendation (LTRS). Before looking at the data, conditional on our experience, we would guess that higher HGPAs, SATs, and LTRSs are associated with higher CGPAs.

*Always* begin by looking at your data:

The diagonal plots are the variables’ histograms. We already know that LTRS is restricted to 0-10. But CGPA is restricted to 0-4, and HGPA is restricted to 0-4.25 (for this data set), and SAT is restricted to 400-1600 (these are pre-written-component SAT scores). Remember these restrictions.

The off-diagonal plots work like this: the variable in the *row* is on the y-axis, and the variable in the *column* is on the x-axis. The important row is the top row. The first plot to the right of the histogram in that row is CGPA (y-axis) by HGPA (x-axis). As we guessed, higher HGPAs are associated with higher CGPAs. You can figure out the rest easily.

Those red lines are guesses at the associations, and since these are close to straight lines we are comfortable with using linear regression to model the data. *That’s the usual way to do this: but we should not be comfortable, as we shall see.*

This isn’t the place to teach regression—I’m assuming you know it—but if you don’t, you might guess how it works from the way we write it mathematically:

CGPA = b_{0} + b_{1} HGPA + b_{2} SAT + b_{3} LTRS + noise

This says that a person’s CGPA is a linear function: start with the number b_{0} and add to it b_{1} times his HGPA and add to *that* b_{2} times his SAT and so forth. A little “noise” is added to make the whole thing probabilistic. That noise contains all the stuff we did not observe—but could have—and makes up for the difference between the observed CGPA and the first part of the equation.

Anyway, as the sociologists say, “Let’s submit this to SPSS and see what we get.” I’ll use R, but the answers will be the same. What pops out is a table that looks something like this (R, like all software, shows too much precision):

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.1532639 | 0.3229381 | -0.475 | 0.636156 |
| HGPA | 0.3763511 | 0.1142615 | 3.294 | 0.001385 |
| SAT | 0.0012269 | 0.0003032 | 4.046 | 0.000105 |
| LTRS | 0.0226843 | 0.0509817 | 0.445 | 0.657358 |

By “(Intercept)” R means the b_{0} in our model; and by HGPA it means b_{1}, and so forth. The estimate is just that: a guess of the value of each b_{i}. The “Std. Error” we can ignore, because it’s incorporated into the “t value”, which is the estimate divided by its standard error and is used to test the “null hypothesis” that each b_{i} = 0: and I mean *equals*, precisely zero. The “Pr(>|t|)” is the infamous p-value of this test.
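To see how the standard error is folded into the t value, here is a minimal Python sketch (the post itself uses R) that divides each estimate in the table by its standard error. The dictionary and function names are my own, chosen only for illustration. Because the table shows rounded numbers, the last digit can differ slightly from what R prints.

```python
# Estimates and standard errors copied from the regression table in this post.
coefficients = {
    "(Intercept)": (-0.1532639, 0.3229381),
    "HGPA": (0.3763511, 0.1142615),
    "SAT": (0.0012269, 0.0003032),
    "LTRS": (0.0226843, 0.0509817),
}


def t_values(table):
    """Return estimate / standard error for each term in the table."""
    return {name: est / se for name, (est, se) in table.items()}


t = t_values(coefficients)
for name, value in t.items():
    print(f"{name:12s} t = {value:6.3f}")
```

Turning a t value into the p-value also requires the residual degrees of freedom, which depend on the number of students in the data set. The post does not report that number, so the sketch stops at the t values.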

Popular mysticism insists that the p-value should be strictly less than 0.05 to be publishable, so that we can authoritatively state the variable is associated with CGPA. And we’re in luck: The p-values for HGPA and SAT are nice and small. We could write our paper and say there was a “statistically significant” association between HGPA, SAT and CGPA. But not for poor LTRS, which has a depressingly large p-value.

Most would stop the analysis here. Some might push just a bit further and conclude that SAT is a “better” predictor because it has a smaller p-value. The more conscientious would glance at the diagnostics (residual plots, R^{2}, AIC, etc.). I won’t show them, but I assure you all these are fine.

Are we happy? Handshakes all around? Break out the organic cigars? Not yet.

Predictive statistics begins with the idea that the data we have in hand is not unknown—which should be trivially obvious. What we do not know is what *future* data will look like (by “future” I only mean data that we haven’t yet seen; it could have been collected in the past).

Thus, we are *not* interested in whether or not, say, b_{2} = 0, or any other value. We want to know answers to questions like this: Given a person has a high HGPA, decent SAT scores, and an average letter-of-recommendation score, what is the probability he will have a high CGPA?

That question is entirely observable: the data that is given—the conditions which form our questions—and the observed outcome are all measurable, real things. We’ll never know—we can never know—whether b_{2} is zero, or any other value. But we can know whether our prediction of a person’s CGPA is good.
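To make the predictive question concrete, here is a rough Python sketch of a "plug-in" answer under the normal regression model. Everything beyond the four estimates is an assumption: the post does not report the residual standard deviation, so `ASSUMED_SIGMA` is a placeholder, and the student's inputs and the cutoff for a "high" CGPA are hypothetical. The sketch also ignores the uncertainty in the estimates themselves, so it will tend to be more confident than a full predictive analysis.

```python
from statistics import NormalDist

# Point estimates copied from the regression table in this post.
B0, B_HGPA, B_SAT, B_LTRS = -0.1532639, 0.3763511, 0.0012269, 0.0226843

# ASSUMPTION: the residual standard deviation is not given in the post;
# this value is a placeholder used only for illustration.
ASSUMED_SIGMA = 0.5


def central_guess(hgpa, sat, ltrs):
    """Linear part of the model: b0 + b1*HGPA + b2*SAT + b3*LTRS."""
    return B0 + B_HGPA * hgpa + B_SAT * sat + B_LTRS * ltrs


def prob_cgpa_above(threshold, hgpa, sat, ltrs, sigma=ASSUMED_SIGMA):
    """Plug-in P(CGPA > threshold | inputs) with normally distributed noise.

    Treats the estimates as if they were known exactly, which overstates
    our certainty.
    """
    predictive = NormalDist(mu=central_guess(hgpa, sat, ltrs), sigma=sigma)
    return 1.0 - predictive.cdf(threshold)


# Hypothetical student: high HGPA, decent SAT, average letters.
guess = central_guess(hgpa=3.8, sat=1200, ltrs=5)
p_high = prob_cgpa_above(3.0, hgpa=3.8, sat=1200, ltrs=5)
print(f"central guess: {guess:.2f}, P(CGPA > 3.0): {p_high:.2f}")
```

The point of the sketch is only the shape of the statement: a probability about an observable CGPA, given observable inputs, rather than a claim about whether some b_{i} equals zero.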

Remember when I said we should not be comfortable with the regression assumptions? Tomorrow we see why.


February 9, 2010 at 1:18 pm

And I always thought *p-value* was a urine test result. Learn something new everyday! OTOH, if you pee on enough data maybe they will yield to your version of truth?

February 9, 2010 at 7:54 pm

Shouldn’t those red lines be mirrored about an x=y axis for charts on corresponding sides of the histogram diagonal plots? For example, shouldn’t the CGPA vs HGPA guess be identical with the HGPA vs CGPA guess if you take into account the changes to the axes?

What am I missing?

February 9, 2010 at 8:22 pm

DM,

Good eyes. But, no, they shouldn’t. The red lines are loess smoothers, functions of the type y = f(x) + noise, where the f(x) is allowed to be adaptive to the data. Swap the x and y and you get entirely different functions. We shouldn’t view these as any more than semi-quantitative guesses as to the functional relationship between any y and x.

February 10, 2010 at 7:32 am

Faced with a bunch of numbers representing unfamiliar variables, my first inclination is to normalise them all to the range 0 – 1. At least it has the advantage that I don’t need to memorise that one variable happens to lie between lo and hi and the others between 0 and their own, different, hi.

February 10, 2010 at 10:02 am

dearieme,

Certainly nothing mathematically wrong with doing that, but as you’ll see by the follow-up post, it’s not necessary.

February 10, 2010 at 2:03 pm

Not necessary, but awfully convenient for anyone with a poor memory!

February 10, 2010 at 5:37 pm

Any good books on this to recommend?

I have a very interesting challenge at work which I’ve been very hesitant to attempt to resolve using regression for similar reasons to the ones in your post: given lousy performance of correlated business metrics, what is the chance of regulatory noncompliance if performance isn’t corrected? Process capability-based approaches give me the willies because I have no reason to believe that the normality assumption is valid and we don’t have enough historical data free from special cause variation to map a probability distribution to it without a large amount of error.

I could really use some good rules of thumb to help guide decision makers along the lines of: if you restrict fee refunds beyond point X, there is a 67% chance our wealthiest customers will leave.

Thus my new appetite for predictive statistics.

February 10, 2010 at 6:12 pm

Matt, I appreciate demonstration of the wrong way and the right way to analyze this data, but I am a little perturbed at your use of the strawman: “classical practices”.

Statistics originated with probability, in particular the probabilities at gambling games of chance. Probabilistic analyses are classical. If anything, computer software like R is super modern.

There is no doubt that R makes it easy for people to use the entirely wrong method of analysis. That is not a knock on R, or on any other strawman. It is a knock on using the wrong method for a given problem.

February 10, 2010 at 6:51 pm

Uncle Mike,

I can’t say “frequentist practices”, because what I’m talking about holds for most Bayesian analyses, too. Nor can I say “parametric practices” because we reserve “nonparametric” for a specific set of methods.

However, I’ll agree with you that I don’t love the term “classical practices” either. I just haven’t come across a better phrase yet. Maybe “usual practices” would be better. I’ll keep working on it.

Statistics didn’t just start with probability, it *is* probability. But we’ve gotten away from that in, uh, “classical” practices. There doesn’t need to be a separate field of statistics.

So, if you like (and maybe I like better, too), this is a *return* to classical practices. True classical.

(And, of course, the particular software platform you use is not essential to these posts.)

February 10, 2010 at 6:54 pm

Teflon93,

You can get my class notes (linked over on the left). Jaynes’s *Probability Theory* is indispensable, but very mathematical.

You can always email me for some ideas, too.

February 10, 2010 at 9:50 pm

Neo-Classical Probabilistic? Or how about Briggsian?

When I first learned the Newton-Gauss-Euler method, the professor gushed, “I wish my name was added to that troika.” But who needs those old dead guys? I’m going with Briggsian, and let the chips fall where they may.