*Read Part I*

Let’s continue our example. Suppose our regression shows that the probability of a Hate score greater than 5 is 60% for men and 80% for women—for people we have *not yet seen*: we already know *all* about the people we have measured. It would be tempting to say, given the data observed and given we’re talking about new people who “resemble” the type of people questioned, that “Women have higher Hate scores than men.” But this common way of reporting is highly misleading; firstly because *not* all (future) women will have higher Hate scores than all (future) men, which is what the headline implies. How many more women than men will have high scores?

To clarify, suppose we learn that the probability of a Hate score greater than 5 was 60% for men and 60.1% for women. The same headline could also be given (and frequently is), but here it is even more misleading. How much greater is 60.1% than 60%? Well, just 0.1 percentage points, a true but almost certainly trivial difference (to know it was trivial would require knowing what decisions will be made with these predictions). This is why it is *always* important to accompany any blanket “more” statement with actual numbers. When you don’t see them, be wary.

Let me stress two things. First, the demarcation at 5 is arbitrary, and the results will change for other levels. It would be far better to show the distributions of scores (via pictures, say), the probabilities for each possible answer for men and women. I’m only using the level 5 as an example. Second, regression is (or should be) a prediction. If we wanted to know how many more women than men had high scores in the observed data, *all we have to do is count*. We might say, “In our sample, 40% of the women had higher Hate scores” and be on safe ground, even if we were considering other x’s in the model (which in this part of the example we are not). As we’ll see, this “counting analysis” is often the fairest statistical method.
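To make the “counting analysis” concrete, here is a minimal sketch. The scores are made up purely for illustration (they are not data from this example's survey), and the helper name `fraction_above` is hypothetical. It counts, for each sex, the fraction of observed scores above a cutoff, and repeats the count at other cutoffs to show how the answer moves when the demarcation changes.

```python
# Made-up (sex, score) pairs purely for illustration; NOT real survey data.
sample = [
    ("man", 3), ("man", 7), ("man", 6), ("man", 4), ("man", 8),
    ("woman", 9), ("woman", 6), ("woman", 2), ("woman", 7), ("woman", 8),
]


def fraction_above(records, group, cutoff=5):
    """Fraction of observed scores in `group` strictly greater than `cutoff`."""
    scores = [score for sex, score in records if sex == group]
    return sum(score > cutoff for score in scores) / len(scores)


# The cutoff is arbitrary, so report the counts at several levels.
for cutoff in (3, 5, 7):
    men = fraction_above(sample, "man", cutoff)
    women = fraction_above(sample, "woman", cutoff)
    print(f"score > {cutoff}: men {men:.0%}, women {women:.0%}")
```

No model is involved here: these are statements about the people already measured, nothing more.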

What we’ve really learned (using these numbers) is that if we have two new people before us, a man and a woman, there is a 32% chance (100% x 0.8 x 0.4) the woman will have a high Hate score and the man won’t. There’s also a 12% chance (100% x 0.2 x 0.6) the man will have a high Hate score and the woman won’t. We could do more: there is a 48% chance (100% x 0.8 x 0.6) both the man and woman will have a high score, and an 8% chance (100% x 0.2 x 0.4) they will both have a low score.
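The arithmetic above can be checked with a short sketch. It takes the 60% and 80% from the example and multiplies them out for one new man and one new woman. The multiplication treats the two people's scores as independent given the model, which is an assumption the calculation relies on; the function name `joint_outcomes` is hypothetical.

```python
from itertools import product

# Predictive probabilities of a Hate score > 5, taken from the running example.
P_HIGH_MAN = 0.60
P_HIGH_WOMAN = 0.80


def joint_outcomes(p_man, p_woman):
    """Probabilities of the four high/low combinations for one new man
    and one new woman, ASSUMING their scores are independent."""
    outcomes = {}
    for man_high, woman_high in product([True, False], repeat=2):
        pm = p_man if man_high else 1 - p_man
        pw = p_woman if woman_high else 1 - p_woman
        outcomes[(man_high, woman_high)] = pm * pw
    return outcomes


outcomes = joint_outcomes(P_HIGH_MAN, P_HIGH_WOMAN)
for (man_high, woman_high), p in outcomes.items():
    man = "high" if man_high else "low"
    woman = "high" if woman_high else "low"
    print(f"man {man}, woman {woman}: {p:.0%}")
```

The four probabilities sum to 100%, which is a quick sanity check that no outcome has been left out.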

So which of these numbers should be reported? Depends on what information you want to convey. The best solution is, as said above, to show probabilities for all possible outcomes. The simplest solution is to show the 60%/80%, but if your audience is interested in calculating (new) numbers of folks who possess high or low scores then the more complete breakdown just given is useful.

These levels of reporting are somewhat less than rarely done, by which I mean I have *never* seen them, *except* where the context is explicitly forecasting. Yet every regression is a forecast in the sense that every regression is a prediction of new data that “resembles” the data used to “build the model.” Before tackling *resembles*, a hint on how reporting usually goes: badly and by overstatement and with an emphasis on unobservable “parameters.” More on these creatures later.

Now *resembles*. Every statistical method, assuming no calculation or interpretation errors, which are indeed legion, is valid for the *type* or *kind* of data used to build that method. Thus *every* poll—and polls are surely statistical methods; and predictions, too—is a valid, “scientific” poll for the *type of people* who answered the poll. Mistakes arise when extrapolating, perhaps only mentally or by indirection or ignorance, the model to that which does not resemble its origins.

But what does *resembles* mean? How can you tell if your new data is “close” to the old? Very tricky, as we’ll see. First consider that real regressions typically include more than one x, and when they do, each of them must be present in any results statement. If not, something fishy is going on. For example, suppose we measured sex, age, and education dichotomized at “some college or greater” or “no college”. Then we *cannot* just say that men have a 60% chance of a Hate score greater than 5. We must also say men of *what type*, i.e. we must give their age and education, *unless* that 60% is constant for every combination of age and education, which is unlikely. If it were the case that no age, for instance, changed the probability of y (taking some value), then there would be no need for age in the “model”. So we *must* issue statements like “For educated 40-year-old men, the probability of a Hate score greater than 5 was 65%, while for 25-year-old educated men it was 72%.”

Now if our sole interest was to learn about the difference between the sexes, we could say something like this: “Controlling for education and age, women have more Hate than men.” This means little, however. The phrase “controlling for” is often put in results statements, but all we can take from it is that the x’s it lists were in the regression. We still have to say what *kind* of women compared to what *kind* of men. For example, “Educated 40-year-old women have more Hate than non-educated 30-year-old men.” But we’re still missing by how much, so the statement is incomplete. The difference doesn’t have to be stated in absolute terms (say “75% versus 60%”); it could be put relatively (“25% more Hate”). But relative terms lose the anchor, i.e. we don’t know if we’re going from 60% to 75% or from 10% to 12.5%. And, of course, it could be that for some other combination of education and age men have the higher probability of Hate. If so, this should be pointed out. Better still, as above, are pictures of distributions of answers. But these grow complex in more than one or two dimensions. Regression is not as easy as commonly thought.
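A small sketch shows why the relative figure loses its anchor. It uses the two pairs of probabilities mentioned above (60% to 75%, and 10% to 12.5%); the helper name `describe_change` is hypothetical.

```python
def describe_change(p_from, p_to):
    """Return (absolute difference, relative difference) between two probabilities."""
    absolute = p_to - p_from
    relative = absolute / p_from
    return absolute, relative


# Two very different situations that share one relative statement.
for p_from, p_to in [(0.60, 0.75), (0.10, 0.125)]:
    absolute, relative = describe_change(p_from, p_to)
    print(
        f"{p_from:.1%} -> {p_to:.1%}: "
        f"{absolute * 100:+.1f} percentage points, {relative:+.0%} relative"
    )
```

Both cases come out as “25% more”, yet the absolute gaps are 15 and 2.5 percentage points. Reporting both numbers, or the two probabilities themselves, keeps the anchor.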

*Next time: infinity wrecks resembles.*