What Do You Really Want: Part II

Class is, thank God, rapidly coming to an end. I am sure we are all grateful. Here’s an addendum to yesterday’s post, though only briefly explained. Because of the crush of the end of the class, I have no idea what’s going on in the world. All emails will be answered starting over the weekend.

We talked earlier of the widespread misuse of normal distributions. They are used with wild abandon in instances where they have no place. Normals should only be used to represent the uncertainty in numbers, the range and diversity of which warrant a reasonable approximation. Here’s what I mean.

Modeling College Grade Point Average (CGPA) makes an excellent illustration: we want to predict an incoming freshman’s CGPA at the end of the term. Since we do not know, in advance, what that CGPA will be, we can express our uncertainty in it using some probability model. Like everybody else, we’ll use a normal.

To help us predict CGPA, we also have beginning freshmen SAT scores. We might expect that students coming in with higher SATs will have higher CGPA at the end of the first year. Make sense? (In this example, we can, as I often do, also use high school GPA and other pertinent measures; however, the point is easily made with just one variable.)

We start by collecting data on historical freshmen. That is, we look at a batch of CGPAs from students in prior years; we also collect their SATs. Like all good statisticians, we display the data in a plot. Doing so, we discover that there is a rough trend in the expected direction: higher SATs are associated with higher CGPA. The relationship is, of course, not perfect. Not every kid with a high SAT has a high CGPA; and not every kid with a low SAT has a low CGPA.

The usual next step is to use a regression model, in which CGPA is predicted by a straight-line function of SAT plus some “random” error. An out-of-the-box regression model uses normal distributions to characterize that “random” error.
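Here is a minimal sketch of that out-of-the-box model, using only the Python standard library. The data are made up for illustration (they are not the data discussed in this post), and the intercept, slope, and error spread in `simulate_freshmen` are assumed values chosen only so the numbers look like grades.

```python
import random

def simulate_freshmen(n=200, seed=1):
    """Made-up (SAT, CGPA) pairs for illustration; NOT real student data."""
    rng = random.Random(seed)
    sats, cgpas = [], []
    for _ in range(n):
        sat = rng.uniform(600, 1600)
        # Straight line in SAT plus normal "random" error (assumed values),
        # clipped to [0, 4] because real grades cannot leave that range.
        cgpa = 0.5 + 0.002 * sat + rng.gauss(0.0, 0.6)
        sats.append(sat)
        cgpas.append(min(4.0, max(0.0, cgpa)))
    return sats, cgpas

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

sats, cgpas = simulate_freshmen()
b0, b1 = fit_line(sats, cgpas)
print(f"intercept = {b0:.3f}, slope per SAT point = {b1:.5f}")
```

The two numbers printed are the fitted coefficients. Note that nothing in the fit knows that a CGPA has to live between 0 and 4.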

Now, pretty much every implementation of software will fit this regression model and then spit out the coefficients of the model: these coefficients are also called parameters, or “betas”. It’s all one. Classically, the thing to do is to form the “null” hypothesis that the parameter associated with SAT in the regression model is precisely equal to 0. The p-value is glanced at, and if—we hope!—it is less than the magic value of 0.05, the classical statistician will announce, “Higher SATs associated with higher CGPAs.” They are not entitled to say that, but never mind. Close enough.

In data that I have, I ran this exact test and received a very publishable p-value of about 0.001. “Highly significant!” it would be said. But let’s be Bayesian and examine the posterior distribution of the parameter. We can compute the probability that the parameter associated with SAT is larger than zero (given the information in our sample, and given the truth of the normal-regression model we used).

OK; that posterior probability is over 99.9%, which means we can be pretty darn sure that the parameter associated with SAT is larger than zero. We are now on firmer ground when we say that “Higher SATs associated with higher CGPAs.”
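For the curious, here is one way such a posterior probability could be computed, sketched with the Python standard library. Two loud assumptions: the data are made up (not the data I used), and the prior is the usual flat reference prior for normal regression, under which the error variance can be drawn as SSE divided by a chi-square with n - 2 degrees of freedom and the slope is then normal around its least-squares estimate. Other priors would need other code.

```python
import math
import random

rng = random.Random(7)

# Made-up data (NOT real student data): CGPA rises with SAT, clipped to [0, 4].
n = 50
sat = [rng.uniform(600, 1600) for _ in range(n)]
cgpa = [min(4.0, max(0.0, 0.5 + 0.002 * s + rng.gauss(0, 0.6))) for s in sat]

# Least-squares summaries of the sample.
xbar = sum(sat) / n
ybar = sum(cgpa) / n
sxx = sum((x - xbar) ** 2 for x in sat)
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(sat, cgpa)) / sxx
b0_hat = ybar - b1_hat * xbar
sse = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(sat, cgpa))

def draw_slope():
    """One posterior draw of the SAT slope under the assumed flat prior."""
    chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)  # chi-square, n - 2 df
    sigma2 = sse / chi2                          # draw of the error variance
    return rng.gauss(b1_hat, math.sqrt(sigma2 / sxx))

n_draws = 20000
p_positive = sum(draw_slope() > 0 for _ in range(n_draws)) / n_draws
print(f"Pr(slope > 0 | data, normal model) is about {p_positive:.4f}")
```

This is a statement about the parameter only, and only given the normal model is true.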

But just think: we already knew that higher SATs were associated with higher CGPAs. That’s what our old data told us! In our old data, we were 100% sure that, roughly, higher SATs were associated with higher CGPAs. What we really want to know, however, is the CGPAs of future freshmen. We already know all about the freshmen in our dataset.

Now suppose we know that an incoming freshman next year will have an SAT score of 1000. Modern predictive/objective Bayesian analysis allows us to compute the uncertainty we have in the CGPA of this freshman (and of every other freshman who has an SAT of 1000). That is, we can compute the probability that the observable, actual CGPA will take any particular value. This is not the same thing as saying the parameter in the model takes any value. This tells us about what we can actually see and measure.

Here’s the problem. Doing this (for this data) shows us the probability of future CGPAs all right; but because we used normal distributions we have a significant, real probability of seeing CGPAs larger than 4.0. And we also see a significant probability of seeing CGPAs smaller than 0. Both situations are, of course, impossibilities. But because we used a normal distribution, we have about a 10% probability of the impossible!

Which merely means the normal model stinks. But we never—not ever—would have had a clue of its rottenness if we just examined the p-value or the posterior of the parameter. Parameters are not observable!
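A rough sketch of the predictive check, again in standard-library Python. Everything specific here is an assumption for illustration: the data are made up, the prior is the flat reference prior, and `x0 = 1000.0` is just the SAT score from the example. Because the data are invented, the probability printed will not match the figure from my data; the point is only that it is not zero.

```python
import math
import random

rng = random.Random(11)

# Made-up data (NOT real student data), clipped to the possible range [0, 4].
n = 100
sat = [rng.uniform(600, 1600) for _ in range(n)]
cgpa = [min(4.0, max(0.0, 0.5 + 0.002 * s + rng.gauss(0, 0.9))) for s in sat]

# Least-squares summaries of the sample.
xbar = sum(sat) / n
ybar = sum(cgpa) / n
sxx = sum((x - xbar) ** 2 for x in sat)
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(sat, cgpa)) / sxx
b0_hat = ybar - b1_hat * xbar
sse = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(sat, cgpa))

# Predictive distribution of a NEW freshman's CGPA at SAT = 1000.
x0 = 1000.0
mean_new = b0_hat + b1_hat * x0
# Variance inflation: new-observation noise plus uncertainty in the line.
spread = 1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx

def draw_future_cgpa():
    """One draw of the observable CGPA under the normal model (flat prior)."""
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    return rng.gauss(mean_new, math.sqrt(sigma2 * spread))

draws = [draw_future_cgpa() for _ in range(50000)]
p_above = sum(y > 4.0 for y in draws) / len(draws)
p_below = sum(y < 0.0 for y in draws) / len(draws)
p_impossible = p_above + p_below
print(f"Pr(CGPA > 4) = {p_above:.4f}, Pr(CGPA < 0) = {p_below:.4f}")
print(f"Pr(impossible CGPA) = {p_impossible:.4f}")
```

Every CGPA in the sample sits between 0 and 4, yet the normal model still hands out probability to values outside that range. You only see that by looking at the observable.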

This also shows you that even predictive/objective Bayesian analysis fails when you start with a bad model. A bad model stinks in any philosophy. I hope you realize that using the old ways will never give you a hint that your model is bad: you will never know you are making impossible predictions just by looking at parameters.

4 Comments

Katie

June 24, 2010, 10:18 am

Here’s a wrinkle–the institution grants an automatic, and seemingly retroactive, boost in all grades, rendering all GPAs higher than they would have been in order to be more attractive to the job market:

http://www.nytimes.com/2010/06/22/business/22law.html?partner=rss&emc=rss&pagewanted=all

Yet again, all the children are above average.

John Galt

June 24, 2010, 8:33 pm

That’s interesting. I don’t call it normal, I call it the Maxwell-Boltzmann distribution. I have noticed that I frequently encounter processes that are what I call uniformly distributed. An example is a computer that employs two data buses. Each operates at a cycle rate of 20 Hz. Computer A controls one bus, computer B controls the other. It is a command-response bus; the controller initiates all data transfer. The two are not synced. Computer A reads data in from a remote terminal. Computer B then commands computer A to transmit this data. The data can be anywhere from zero to fifty ms old when computer B receives it. The latency is random, but uniformly distributed over the range.

Richard

June 25, 2010, 3:48 am

In the world of climate statistics, however, wherever I look I see non-normal (skewed) distributions, some quite significantly so.

How does that alter the conclusions that you can derive from either old or new statistics?

Rich

June 25, 2010, 5:02 am

Right. The balls in the bag are future observables. In expressing our uncertainty about their actual proportions in a mathematical model we therefore reference observables not unobservable parameters.

Mind you, ‘parameter’ does seem to be a natural word to use. Perhaps I should try not to get hung up on words ….
