# What Regression Really Is: Part II

*Read Part I*

Let’s continue our example. Suppose our regression shows that the probability of a Hate score greater than 5 is 60% for men and 80% for women—for people we have *not yet seen*: we already know *all* about the people we have measured. It would be tempting to say, given the data observed and given we’re talking about new people who “resemble” the type of people questioned, that “Women have higher Hate scores than men.” But this common way of reporting is highly misleading; firstly because *not* all (future) women will have higher Hate scores than all (future) men, which is what the headline implies. How many more women than men will have high scores?

To clarify, suppose we learn that the probability of a Hate score greater than 5 was 60% for men and 60.1% for women. In that case, the same headline could also be given (and frequently is), but here it is even more misleading. How much greater is 60.1% than 60%? Well, just 0.1 percentage points, a true but almost certainly trivial difference (to know it was trivial would require knowing what decisions will be made with these predictions). This is why it is *always* important to accompany any blanket “more” statement with actual numbers. When you don’t see them, be wary.

Let me stress two things. First, the demarcation at 5 is arbitrary, and the results will change for other levels. It would be far better to show the distributions of scores (via pictures, say), the probabilities for each possible answer for men and women. I’m only using the level 5 as an example. Second, regression is (or should be) a prediction. If we wanted to know how many more women than men had high scores in the observed data, *all we have to do is count*. We might say, “In our sample, 40% of the women had higher Hate scores” and be on safe ground, even if we were considering other x’s in the model (which in this part of the example we are not). As we’ll see, this “counting analysis” is often the fairest statistical method.

What we’ve really learned (using these numbers) is that if we have two new people before us, a man and a woman, there is a 32% chance (100% x 0.8 x 0.4) the woman will have a high Hate score and the man won’t. There’s also a 12% chance (100% x 0.2 x 0.6) the man will have a high Hate score and the woman won’t. We could do more: there is a 48% chance (100% x 0.8 x 0.6) both the man and woman will have a high score, and an 8% chance (100% x 0.2 x 0.4) they will both have a low score.
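The multiplication above treats the new man and the new woman as independent, an assumption the text leaves implicit. A minimal Python sketch under that assumption, using the 60% and 80% from the example (the function and variable names are made up for illustration):

```python
# Probabilities of a Hate score greater than 5, taken from the example above.
p_man_high = 0.6
p_woman_high = 0.8


def joint_outcomes(p_man, p_woman):
    """Return the four joint probabilities for one new man and one new woman.

    Assumes the two scores are independent (an assumption of this sketch,
    implied by the multiplication in the text).
    """
    return {
        "woman high, man low": p_woman * (1 - p_man),
        "man high, woman low": p_man * (1 - p_woman),
        "both high": p_man * p_woman,
        "both low": (1 - p_man) * (1 - p_woman),
    }


outcomes = joint_outcomes(p_man_high, p_woman_high)
for label, prob in outcomes.items():
    print(f"{label}: {prob:.0%}")
```

The four outcomes are exhaustive, so their probabilities sum to 100%. If the two people were not independent (say, a married couple), the products would no longer apply.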

So which of these numbers should be reported? Depends on what information you want to convey. The best solution is, as said above, to show probabilities for all possible outcomes. The simplest solution is to show the 60%/80%, but if your audience is interested in calculating (new) numbers of folks who possess high or low scores then the more complete breakdown just given is useful.

These levels of reporting are somewhat less than rarely done, by which I mean I have *never* seen them, *except* where the context is explicitly forecasting. Yet every regression is a forecast in the sense that every regression is a prediction of new data that “resembles” the data used to “build the model.” Before tackling *resembles*, a hint on how reporting usually goes: badly and by overstatement and with an emphasis on unobservable “parameters.” More on these creatures later.

Now *resembles*. Every statistical method, assuming no calculation or interpretation errors, which are indeed legion, is valid for the *type* or *kind* of data used to build that method. Thus *every* poll—and polls are surely statistical methods; and predictions, too—is a valid, “scientific” poll for the *type of people* who answered the poll. Mistakes arise when extrapolating, perhaps only mentally or by indirection or ignorance, the model to that which does not resemble its origins.

But what does *resembles* mean? How can you tell if your new data is “close” to the old? Very tricky, as we’ll see. First consider that teal regressions typically include more than one x and when they do each of them must be present in any results statement. If not, something fishy is going on. For example, suppose we measured sex, age, and education dichotomized at “some college or greater” or “no college”. Then we *cannot* just say that men have a 60% chance of a Hate scale greater than 5. We must also say men of *what type*, i.e. we must give their age and education—*unless* that 60% is constant for every combination of age and education, which is unlikely. If it were the case that no age, for instance, changed the probability of y (taking some value) then there is no need for age in the “model”. So we *must* issue statements like “For educated 40-year-old men, the probability of a Hate score greater than 5 was 65%, while for 25-year-old educated men it was 72%.”

Now if our sole interest was to learn about the difference between the sexes, we could say something like this: “Controlling for education and age, women have more Hate than men.” This means little, however. The phrase “controlling for” is often put in results statements, but all we can take from it is that the x’s it lists were in the regression. We still have to say what *kind* of women compared to what *kind* of men. For example, “Educated 40-year-old women have more Hate than non-educated 30-year-old men.” But we’re still missing by how much, so the statement is incomplete. The difference doesn’t have to be stated in absolute terms (say “75% versus 60%”); it could be put relatively (“25% more Hate”), but relative terms lose the anchor. I.e., we don’t know if we’re going from 60% to 75% or from 10% to 12.5%. And, of course, it could be that for some other combination of education and age men have a higher probability of Hate. If so, this should be pointed out. Better still, as above, are pictures of distributions of answers. But these grow complex in more than one or two dimensions. Regression is not as easy as commonly thought.
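To see how a relative statement loses its anchor, here is a small Python sketch comparing the two pairs mentioned above. The function name is hypothetical; the numbers are the ones in the text.

```python
def describe_difference(p_base, p_other):
    """Return the absolute difference and the relative difference (vs. p_base)."""
    absolute = p_other - p_base
    relative = absolute / p_base
    return absolute, relative


# Both pairs come out as "25% more", yet the absolute gaps differ sixfold.
for p_base, p_other in [(0.60, 0.75), (0.10, 0.125)]:
    absolute, relative = describe_difference(p_base, p_other)
    print(
        f"{p_base:.1%} -> {p_other:.1%}: "
        f"{absolute * 100:.1f} points absolute, {relative:.0%} relative"
    )
```

Reporting both figures, or simply the two underlying probabilities, keeps the anchor in view.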

*Next time: infinity wrecks resembles.*

“knowing what decisions with be made will these predictions”

I grewed up in a backwater town and wents tp skewl in Pixsburgh so I knows what you means.

DAV,

My enemies are out in force.

So far, so good. I am grateful for this and hope that you are going to draw the distinction between regression and correlation later.

But another point I would like to see discussed is how statistics can demonstrate causative links between parameters, for example this little gem on a subject which I think is close to your heart: http://annals.org/article.aspx?articleid=1814426

Briggs,

S’OK. Only True Anglers are comfortable with Anglish.

I find I keep asking, “Yes, but what if I want to know whether the difference between men and women is real or just an artifact of sampling.”

Then I realise that that means I want to say something about the whole population under consideration. Which is a prediction of what would happen if I asked them all.

But somehow it never feels like that first time round.

“suppose we measured sex, age, and education dichotomized at ‘some college or greater’ or ‘no college’. Then we cannot just say that men have a 60% chance of a Hate scale greater than 5. We must also say men of what type, i.e. we must give their age and education—unless that 60% is constant for every combination of age and education, which is unlikely.”

You have to do this simply because you measured it? Why? And why stop there? What if I noticed that some men had more acne than others. Should that be carried over as well?

—

BTW: I’m not in favor of teal regressions. I favor the azure ones. If I were a sociologist, I would lean toward the crimson ones as they would seem more in line with my politics.

It would be helpful to elaborate on precisely what people mean when they use this phrase.

A bit off topic, but I came across this today.

http://daily.sightline.org/2013/12/23/traffic-forecast-follies/

You use a regression to model a forecast of demand, and when demand doesn’t meet the forecast, you stick to your guns!

Of course, what are the institutional rewards for erring on the side of excessive demand vs. insufficient demand?

By the way, could you consider tackling the GDP vs. Unemployment chart that you have clipped into today’s post?

Doug, that was my experience with Regression Analysis, a course in using historical data and regression analysis for business forecasting.

“if we have two new people before us, a man and a woman, there is a 32% chance (100% x 0.8 x 0.4) the woman will have a high Hate score and the man won’t”

What does the word “chance” mean here?

Essentially, the thing is we assume that the correlations present in the data will remain in any future data. But future correlations require a data set. Can you really speak of “a man and a woman”? You really need to speak of N men and N women with N>>1.

David Eyles,

“how statistics can demonstrate causative links between parameters”

Do you mean that pure number-crunching without any theory or model of the mechanism can establish causation?

Doubtful. I think that causation is always a posited thing. We observe correlations and to account for these correlations, a model or a theory is built up. The theory or model is basically what causation is.

Q: Why are the errors squared in “standard” regression? Versus say simply averaged? Is there really a numeric basis for this?

This has always seemed somewhat arbitrary to me. When performing standardized accuracy measurements, the std. dev. can be adversely affected by one or two “bad” points, and a slope can be more a function of the outliars than of the “in-liars”. Watching a regression change in real time as you pop out certain data points can be a revealing experience.

Now I know you should toss out the “bad” points first, but this is not always possible with standardized testing. The outliars in engineering tend to be errors in measurement. Quite frankly I would think a “least square roots” would be a better tool for a lot of my work.

Q: Why are the errors squared in “standard” regression? Versus say simply averaged? Is there really a numeric basis for this?

First I am going to give the stupid answer.

If you “average the errors” there are a multitude of lines that fit any data set with zero average error. There will be positive errors and negative errors and the errors will average to zero.
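A quick way to check this claim: any line that passes through the point of means has an average signed error of zero, whatever its slope. The Python sketch below uses made-up data purely for illustration; the names are hypothetical.

```python
# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)


def mean_error(slope):
    """Average signed error of the line with this slope through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(residuals) / len(residuals)


# Wildly different slopes, yet each line has (numerically) zero average error.
for slope in [-3.0, 0.0, 2.0, 10.0]:
    print(f"slope {slope:+.1f}: mean error {mean_error(slope):+.2e}")
```

The printed values differ from zero only by floating-point noise, which is why the plain average of errors cannot pick out one line.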

So, now you say… No! minimize the absolute error.

Slightly less stupid answer: the absolute value function is not a “nice” function. Not nice in the sense that the graph of the function has a pointy bit, and calculus doesn’t like pointy bits.

The squared errors are then the “second moment.” I think the mathematicians found the 1st, 2nd, 3rd, 4th moments and then decided what they represented.

1st moment is the mean

2nd moment is the variance (spread)

3rd moment is the skew

4th moment is the “kurtosis” (fat tails)

More on squared error.

Squared error puts more emphasis on the outliers than absolute deviation. Depending on what you are doing this is a good thing or a bad thing.
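The simplest case shows the effect. If the “model” is a single constant, the constant that minimizes squared error is the mean, while the constant that minimizes absolute error is the median. The Python sketch below uses invented measurements, with one value standing in for a bad reading.

```python
import statistics

# Hypothetical measurements; the appended 25.0 stands in for a bad reading.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = clean + [25.0]

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    mean = statistics.mean(data)      # minimizes the sum of squared errors
    median = statistics.median(data)  # minimizes the sum of absolute errors
    print(f"{label}: mean = {mean:.2f}, median = {median:.2f}")
```

One bad point drags the mean from 10.00 to 12.50, while the median barely moves (10.00 to 10.05). The same pull applies to slopes fitted by least squares.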

You could say, An outlier is a breakdown of normal behavior and I only care to model normal behavior because there is too much I don’t know about what happens when things are abnormal.

But, if you are modeling financial data, you would say that the one day when everything goes to hell makes or breaks my year (or career), and the 1/100 events are what I care most about. The problem is that you have a shortage of 1/100 events to calibrate from, and if you look too far back in history you say, “things were fundamentally different then.”

Doug M, Thanks for the response. It clears some things up.

Yeah, I found the “because it makes them all positive” answer a bit unsatisfying. I heard that one before. I obviously meant averaging their absolute values.

I looked on Wikipedia and they had some references to orbital mechanics where I guess it makes sense.

And the “it’s application specific” answer, that you need to examine your application to determine whether it is appropriate, is a little more satisfying. The financial example is a very good explanation.

But then, how do you know when it’s appropriate to use a different error measurement “summation technique”, and which one should you use?

Now in my world of engineering, a common use is to take a bunch of measurements, put it through a regression, and then use the equation for prediction. The thing is the most common source of bad points here are errors in measurement. I don’t want to calibrate something by maximizing the errors in measurement.

I can tell you that I’ve met approx. 0 people who understand this. The bulk of tech people think it’s magic and it just accounts for all errors and out pops the perfect equation. Do not question the innards.

I’ve seen times when the regression equation woefully fails the eyeball test producing an equation barely passing through the data points (there were 0’s in the data set for missing points, not even graphed). Guy accepted the results as “true”.

I think if you are looking for “the leastest error, the mostest of the time”, a common need for stuff I do, it would be more appropriate to minimize the outliars. Especially noting that in many cases the outliars are not “true” data points.

Hey Matt, wouldn’t this particular problem be better treated by some sort of categorical analysis? That is to say, isn’t regression sort of an artificial technique to treat categorical variables? Or am I being naive and non-sociologically insightful?