William M. Briggs

Statistician to the Stars!

What Regression Really Is: Part II

Yet another misleading, unhelpful but common picture.

Read Part I

Let’s continue our example. Suppose our regression shows that the probability of a Hate score greater than 5 is 60% for men and 80% for women—for people we have not yet seen: we already know all about the people we have measured. It would be tempting to say, given the data observed and given we’re talking about new people who “resemble” the type of people questioned, that “Women have higher Hate scores than men.” But this common way of reporting is highly misleading; firstly because not all (future) women will have higher Hate scores than all (future) men, which is what the headline implies. How many more women than men will have high scores?

To clarify, suppose we learn that the probability of a Hate score greater than 5 was 60% for men and 60.1% for women. In that case, the same headline could also be given (and frequently is), but here it is even more misleading. How much greater is 60.1% than 60%? Well, just 0.1 percentage points, a true but almost certainly trivial difference (to know it was trivial would require knowing what decisions will be made with these predictions). This is why it is always important to accompany any blanket “more” statement with actual numbers. When you don’t see them, be wary.

Let me stress two things. First, the demarcation at 5 is arbitrary, and the results will change for other levels. It would be far better to show the distributions of scores (via pictures, say), the probabilities for each possible answer for men and women. I’m only using the level 5 as an example. Second, regression is (or should be) a prediction. If we wanted to know how many more women than men had high scores in the observed data, all we have to do is count. We might say, “In our sample, 40% of the women had higher Hate scores” and be on safe ground, even if we were considering other x’s in the model (which in this part of the example we are not). As we’ll see, this “counting analysis” is often the fairest statistical method.
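To see how much the choice of level matters, here is a minimal sketch in Python. The two sets of probabilities are invented purely for illustration (they are not from any real model); they are chosen so that the chance of a score above 5 is 60% for men and 80% for women, matching the running example. The name `prob_above` is likewise hypothetical.

```python
# Hypothetical predictive probabilities for scores 1..10 (invented for
# illustration; chosen so P(score > 5) is 60% for men and 80% for women).
men = [0.05, 0.05, 0.10, 0.10, 0.10, 0.15, 0.15, 0.10, 0.10, 0.10]
women = [0.02, 0.03, 0.05, 0.05, 0.05, 0.15, 0.20, 0.20, 0.15, 0.10]


def prob_above(probs, level):
    """P(score > level), where probs[i] is P(score == i + 1)."""
    return sum(probs[level:])


# Sweep several demarcation levels instead of fixing on 5.
for level in (3, 5, 7, 9):
    print(
        f"P(score > {level}): men {prob_above(men, level):.0%}, "
        f"women {prob_above(women, level):.0%}"
    )
```

With these made-up numbers the gap between the sexes is 20 points at level 5 but disappears at level 9, which is why a single cutoff can tell a very partial story.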

What we’ve really learned (using these numbers) is that if we have two new people before us, a man and a woman, there is a 32% chance (100% x 0.8 x 0.4) the woman will have a high Hate score and the man won’t. There’s also a 12% chance (100% x 0.2 x 0.6) the man will have a high Hate score and the woman won’t. We could do more: there is a 48% chance (100% x 0.8 x 0.6) both the man and woman will have a high score, and an 8% chance (100% x 0.2 x 0.4) they will both have a low score.
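The arithmetic above can be reproduced in a few lines of Python. This is only a sketch, and it assumes what the multiplication already assumes: that the man's score and the woman's score are independent. The function name `pair_outcomes` is made up for this illustration; only the 60% and 80% come from the example.

```python
def pair_outcomes(p_man_high, p_woman_high):
    """Joint probabilities for one new man and one new woman.

    Assumes the two scores are independent, which is what multiplying
    the individual probabilities implies.
    """
    return {
        "woman high, man low": p_woman_high * (1 - p_man_high),
        "man high, woman low": p_man_high * (1 - p_woman_high),
        "both high": p_woman_high * p_man_high,
        "both low": (1 - p_woman_high) * (1 - p_man_high),
    }


outcomes = pair_outcomes(p_man_high=0.6, p_woman_high=0.8)
for name, prob in outcomes.items():
    print(f"{name}: {prob:.0%}")
```

The four outcomes cover every possibility, so their probabilities must sum to 100%, a handy check on any breakdown of this kind.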

So which of these numbers should be reported? Depends on what information you want to convey. The best solution is, as said above, to show probabilities for all possible outcomes. The simplest solution is to show the 60%/80%, but if your audience is interested in calculating (new) numbers of folks who possess high or low scores then the more complete breakdown just given is useful.

These levels of reporting are somewhat less than rarely done, by which I mean I have never seen them, except where the context is explicitly forecasting. Yet every regression is a forecast in the sense that every regression is a prediction of new data that “resembles” the data used to “build the model.” Before tackling resembles, a hint on how reporting usually goes: badly and by overstatement and with an emphasis on unobservable “parameters.” More on these creatures later.

Now resembles. Every statistical method, assuming no calculation or interpretation errors, which are indeed legion, is valid for the type or kind of data used to build that method. Thus every poll—and polls are surely statistical methods; and predictions, too—is a valid, “scientific” poll for the type of people who answered the poll. Mistakes arise when extrapolating, perhaps only mentally or by indirection or ignorance, the model to that which does not resemble its origins.

But what does resembles mean? How can you tell if your new data is “close” to the old? Very tricky, as we’ll see. First consider that real regressions typically include more than one x, and when they do each of them must be present in any results statement. If not, something fishy is going on. For example, suppose we measured sex, age, and education dichotomized at “some college or greater” or “no college”. Then we cannot just say that men have a 60% chance of a Hate score greater than 5. We must also say men of what type, i.e. we must give their age and education, unless that 60% is constant for every combination of age and education, which is unlikely. If it were the case that no age, for instance, changed the probability of y (taking some value) then there is no need for age in the “model”. So we must issue statements like “For educated 40-year-old men, the probability of a Hate score greater than 5 was 65%, while for 25-year-old educated men it was 72%.”

Now if our sole interest was to learn about the difference between the sexes, we could say something like this: “Controlling for education and age, women have more Hate than men.” This means little, however. The phrase “controlling for” is often put in results statements, but all we can take from it is that the x’s it lists were in the regression. We still have to say what kind of women compared to what kind of men. For example, “Educated 40-year-old women have more Hate than non-educated 30-year-old men.” But we’re still missing by how much, so the statement is incomplete. The difference doesn’t have to be stated in absolute terms (say “75% versus 60%”), but could be put relatively (“25% more Hate”), but relative terms lose the anchor. I.e., we don’t know if we’re going from 60% to 75% or from 10% to 12.5%. And, of course, it could be for some other combination of education and age men have a higher probability of Hate. If so, this should be pointed out. Better still, as above, are pictures of distributions of answers. But these grow complex in more than one or two dimensions. Regression is not as easy as commonly thought.
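The lost anchor is easy to show with the two pairs of numbers just mentioned. A minimal sketch in Python; the helper name `relative_increase` is hypothetical.

```python
def relative_increase(p_from, p_to):
    """Relative change of p_to over p_from, as a fraction of p_from."""
    return (p_to - p_from) / p_from


# The two cases from the text: same relative change, different anchors.
pairs = [(0.60, 0.75), (0.10, 0.125)]
for p_from, p_to in pairs:
    rel = relative_increase(p_from, p_to)
    points = (p_to - p_from) * 100  # absolute change in percentage points
    print(f"{p_from:.1%} -> {p_to:.1%}: {rel:.0%} more, {points:.1f} points")
```

Both lines report “25% more”, yet one is a change of 15 percentage points and the other of 2.5. Reporting the absolute probabilities alongside the relative figure keeps the reader from guessing which it is.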

Next time: infinity wrecks resembles.

What Regression Really Is: Part I

Just the sort of picture which misleads people into thinking regression is causal.

This is the start of a series of reference articles which explain analytical techniques, with a focus on the philosophy, understanding, common mistakes, and not mathematics.


Regression, a.k.a. “linear regression”, is the most-used analysis technique, responsible for nearly all headlines which begin, “New research shows…” Its misuse is also the biggest reason scientists, a.k.a. “researchers”, believe and promulgate nonsense.

The technique is fairly easy to grasp, at least in the sense that its implementation is trivial. And that is the problem. It’s too easy; or, rather, cheap software combined with a bit of magical thinking (decision by wee p-values) makes using regression painless. “Results” are free for the asking, and you know the saying about getting what you pay for. Which is why there is a flood of papers gushing out of academia “proving” anything researchers want to believe.

So what is regression? Let’s first discuss what it should be in broad outline form, eschewing all technical details, which we’ll come to later. Let’s not worry about how it works—no distributions, parameters, or p-values this time—but what it means. All along I’ll give tips on how it is misused and misinterpreted.

Start simple. You have some thing, which is represented by a number, and you want to express the uncertainty that the thing takes certain values. It is customary and a great convenience to call this thing “y”. Y might be a grade point average, an amount of money, tomorrow’s high temperature, an answer to an arbitrary question (“On a scale of 1 to 5…”), and so on endlessly.

Sociologists, who form the largest group of abusers of statistical methods, are great ones for inventing questions and imbuing them with terrible meaning. Typically, they create a questionnaire the answers of which are coded numerically. From these a “scale” is derived, i.e. a number which is a function of the answers. This scale usually ranges from 1 to 5, or from 1 to 9, or something like that. It is always given a hopeful name, like “The Conscientious Index”, “Openness to Change,” or “General Health”. (More on this in another post.)

To fix ideas, use the fictional “Hate Scale” for our y which is comprised of the single question, “On a scale of 1 to 10, how much do you despise those who disagree with you politically?” Apt for our current political milieu. We want to understand the uncertainty of a person answering this question. With what probability will he answer 1? 2? and so on. This is what regression is meant to tell us. Never mind now how regression assigns probabilities, just keep in mind that it does.

Now we might also measure a person’s biological sex speculating males and females will answer the question differently. Or perhaps older or younger people answer differently, so we measure age. Education might play a role. And so forth. There is no limit to the number of things which might cause a person to choose his answer, and indeed something (or things) causes each person to pick his answer. But regression is not (or not usually) a causal discovery model. It is merely correlative. The idea is to measure just those characteristics—call them x’s—which change our minds about y. That is, if the probability y takes a certain value changes knowing a person has this rather than that value of a characteristic, then that characteristic is important to understanding y. If there is no change in the probability of y varying the characteristic, then the characteristic isn’t important.

Regression is supposed to be this: given a particular value of each of the x’s in our “regression model”, regression gives us the probability y takes the values it can take. That’s it; that’s all regression is, or that’s all it should be. Statements of results should concentrate on how much, if at all, each of the x’s change the uncertainty in the y. Causative language should be minimal and cautious.

Before (next time) we get to our main example, here is what regression is meant to be. Suppose all we had in our model was sex, male or female. Y can take the values 1, 2, …, 10. Given a set of observed data and knowing a person is a male, the regression should tell us the probability this male answers y = 1, y = 2, … and y = 10. Then knowing a person is a female, the regression should again tell us the probability this female answers y = 1, y = 2, … and y = 10. If these two sets of probabilities are exactly the same, then knowing a person’s sex is irrelevant to knowing their Hate score. If the two sets differ for any of the levels of y, then something about sex, or something associated with it, is relevant to understanding the uncertainty in the score.
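The relevance check just described amounts to comparing two lists of probabilities level by level. Here is a minimal sketch in Python. The probabilities are invented for illustration and come from no fitted model, and the function name `levels_that_differ` is hypothetical.

```python
# Hypothetical probabilities for y = 1..10 (invented for illustration only).
male = [0.10] * 10
female = [0.05] * 5 + [0.15] * 5


def levels_that_differ(probs_a, probs_b, tol=1e-12):
    """Return the values of y at which the two probabilities differ."""
    return [
        y
        for y, (a, b) in enumerate(zip(probs_a, probs_b), start=1)
        if abs(a - b) > tol
    ]


changed = levels_that_differ(male, female)
print("sex is relevant" if changed else "sex is irrelevant", changed)
```

An empty list would mean sex tells us nothing about the score. Any non-empty list means it does, though how much it matters still depends on the size of the differences and the decisions riding on them.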

Regression is thus a prediction. It tells us the probability of “events” not yet seen. Consider we do not need regression, or any statistical method, to tell us about the data we have observed, because—can you guess?—we have observed that data. Except in those instances where data is measured with error, we know everything there is to know about that data. If we want to know how many men (of any kind) had Hate scores greater than 5, all we have to do is count. We don’t need to “estimate” anything—except for that which is still hidden from us, like the future.
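For observed data, the “analysis” really is just counting. A minimal sketch in Python, using a tiny sample invented here for illustration; the name `count_high` is hypothetical.

```python
# Hypothetical observed sample of (sex, Hate score) pairs, invented here.
observed = [
    ("M", 7), ("M", 3), ("M", 6), ("M", 5),
    ("F", 8), ("F", 9), ("F", 4), ("F", 6),
]


def count_high(data, sex, level=5):
    """Count people of the given sex whose score exceeds `level`."""
    return sum(1 for s, score in data if s == sex and score > level)


for sex in ("M", "F"):
    total = sum(1 for s, _ in observed if s == sex)
    print(f"{sex}: {count_high(observed, sex)} of {total} scored above 5")
```

No model, parameter, or p-value appears anywhere, because nothing about the observed sample is uncertain.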

Simple as that. So why are regression results never stated in this form?

Next time: examples.

Does 1+2+3+… Really Equal -1/12?

Artistic interpretation of infinity.

Infinity is not a number, it is a place. It lies just to the left of the Undiscovered Country. Nobody knows what it is like there, because nobody has seen it. All we can ever do is approach it and say what it isn’t.

Funny thing about Infinity is that the way you get there matters. If you were to head out toward it on a straight line taking one step at a time, and I were to follow taking two steps forward and one step back, we would not necessarily arrive at the same neighborhood. That would depend on how fast we were walking.

None of us would ever get there anyway, not while we’re stuck in time. Infinity isn’t in time. It sure isn’t in our intuitions. Which explains the latest flap.

There’s a video going round (below) which shows that we can assign the value of -1/12 to the infinite sum 1 + 2 + 3 + 4 + … Those dots don’t end until Infinity, always a clue that something weird is going on. I’ll assume you’ve watched the video.

The path the gentlemen in the video took was not a straight line, which is how they arrived at -1/12 (See also these videos, particularly the third, for more on the path.). But if you were to go straight—the simple sum, i.e. “the limit”—you’d end up right where intuition suggests, at a whopping big number, unimaginably big. Why the difference? Mathematical truths, like all truths, are conditional on the premises assumed and those premises include the paths.

Well, bizarre, right? Yet why shouldn’t Infinity be bizarre? Why a “flap”? Turns out P.Z. Myers, self-proclaimed “rationalist”, saw the video but could not understand it, and since he could not understand it he concluded it therefore could not be true (a line of argument which he frequently employs). So he put up the post “The sum of all natural numbers is not -1/12.” “I saw [the video] and said to myself that it’s obviously wrong”. All the proof he needed.

Switch on the Wayback Machine and slide back to 1990 when Marilyn Vos Savant explained it’s better to switch doors in the Monty Hall Probability Problem (see this for an explanation), a highly non-intuitive result. Thousands of genuine PhD mathematicians reacted like Myers and said “No way this result can be true because I don’t understand it!” Which proves probability is notoriously difficult—and that academic certification is far from a guarantee of infallibility.

Myers also enlisted the support of his own PhD mathematician, Mark Chu-Carroll, who explained carefully but failed to appreciate the difference between limits (a technique which the proofs in the videos do not use) and Cesaro sums (which they do). Chu-Carroll also forgot the -1/12 result was first given to us by Leonhard Euler, perhaps the fattest mathematical brain ever.

The gentleman in video number three (above) was also careful about explaining how Grandi’s series, 1 − 1 + 1 − 1 + 1 − 1 + …, could, out at Infinity, be +1 or 0, on or off, spin up or down, and that if we consider the series in a sort of probabilistic sense, it can be given the value 1/2. Sounds a bit like quantum mechanics, no? It is this assignment that makes the magic happen and is why the infinite sum can be -1/12. Anyway, Myers didn’t bother to investigate any of this before going off. Which is what makes him a rationalist.
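The value 1/2 for Grandi’s series is what a Cesaro sum gives: instead of asking where the partial sums end up (they never settle), average them as you go and ask where the averages end up. A minimal sketch in Python; the function name `cesaro_means` is hypothetical, and the sketch covers only the 1/2 assignment, not the further steps that lead to -1/12.

```python
from itertools import accumulate, cycle, islice


def cesaro_means(terms):
    """Running averages of the partial sums of `terms`."""
    partial_sums = list(accumulate(terms))      # 1, 0, 1, 0, ...
    running_totals = accumulate(partial_sums)   # sums of the partial sums
    return [total / n for n, total in enumerate(running_totals, start=1)]


grandi = list(islice(cycle([1, -1]), 1000))  # first 1000 terms: 1, -1, 1, ...
means = cesaro_means(grandi)
print(means[0], means[9], means[999])
```

The partial sums keep flipping between 1 and 0, but their running average closes in on 1/2. A different path to Infinity, a different answer.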

Enter our friend Lubos Motl, arch defender and knight-errant of string theory (he will not see her virtue impugned), who has the habit of writing even simple numbers in Latex, who took Myers to task in the post “Sum of integers and oversold common sense.” Motl also takes pains at showing there is more than one way to sum a series.

Phil “Bad Astronomer” Plait joined in the fray at Slate, which might not have been the wisest move. If there is any place on the Internet where the people already know all they need to know, this is it. And they already knew the infinite sum could not be -1/12. Poor Phil had to issue multiple corrections for being too glib with his language.

Most civilians and rationalists don’t know there is a (let us call it) tension between the kind of math physicists do and the types mathematicians themselves use. Physicists are a little bolder, even playful. Mathematicians are more staid. Full disclosure: I learned my math from physicists. And you may be surprised to learn that there are even warring camps inside each field about the very fundamentals of mathematics. Too much for us here today, except to note that this latest incident is part of the never-ending war of ideas.

Oh, read the history of the Heaviside function for a fun example (I don’t have a link).

Update: now with even more Infinity!

Mathematics isn’t the only place where we meet Infinity. Take the idea of Omniscience, which is knowing everything, and everything includes Infinity. Suppose one knew 10^{100} facts, a googol of facts. That’s a lot of knowledge, but still far from infinite knowledge. How about knowing 10^{10^{100}}, a googolplex, of facts? Some estimate there are only about 10^{82} particles in the universe. If you knew a googolplex of facts you’d be able to name each particle. You’d know where everything is located, including my Kindle which went missing a week ago, so I’d appreciate an email.

A googolplex of facts is already unimaginable, but 10^{10^{100}} is just as far from Infinity as 10^{100} is. We still have a long way to go before reaching Omniscience. And what’s it like when we get there? Boggles the mind to think about, just like, in a much smaller way, the sum above does. The lesson is intuition, particularly knee-jerk reaction, only takes you so far. And usually to the wrong place.

Update The HTML superscripts weren’t rendering properly, so I switched to Latex for the googolplex.


Thanks to our friend Luis Dias for alerting us to this topic.

He Did What?

An example of just the kind of picture which misleads.

Coming Soon!

I’ve started a series called “What This-or-That Technique Really Is”. First up is regression. It would have shown today, but—can you guess?—it isn’t finished. I made the mistake of also starting “What Decision Analysis Really Is”. And time series, too. Plus a couple of other things.

I need these for another, more long-term purpose, but also because every time I pick apart some paper I don’t want to also have to re-explain what each technique is; that is, what it should be and how people mistakenly used it. It will be nice to have a source to link to.

The math is limited in these, as it is in all blog posts. Philosophy and understanding are emphasized. This is a result of a decision I made years ago. There isn’t anything wrong with math, of course, but as soon as you start putting down equations the concepts too easily become the equations. Reification creeps up unless you’re extraordinarily careful, which most of us are not. Plus, there are a million sites detailing the mathematics of probability and statistics. One more wouldn’t have helped. The big disadvantage is math is vastly easier to write. Words take time.

An Instant Classic

The Classic Post page is revamped. It was a tangle before, but now it is only a mess. I’m not satisfied with the CSS, which is quirky, but then I don’t have the time to search (again) for another site theme.

What do you think?

It’s still about a year or two behind, meaning I haven’t updated the content, which I’ll begin to do this week. I hope it’s useful to you. Helps me, too, to gather all like subjects together.

More On The Duck Dynasty

David Theroux of The Independent Institute sent along a couple of articles on Duck Dynasty and the Secular Theocracy. Part I, Part II.

This comes on the heels of another TV guy (I’ve never heard of) pulling a Phil. Juan Pablo Galavis (a big name?) said of people who engage in same-sex activities, “They’re more pervert in a sense.” Will he back down or hold fast?

Articles were rushed to assure nervous elites that all was well. The one linked to above felt it had to end with the quotation, “Study after study shows that young people raised by gay parents are as happy and healthy as other young people.” This is false. In fact, the opposite is true. Just look up Mark Regnerus and see why.

Hire Me

If you’ve been too confident of late, hire me to tell you why you shouldn’t be. Corporate demotivational lectures are my specialty. Expensive rates. Ask about my two for two special. See my Contact Page today.

No, really: do it today before you forget. Thinking that you’ll remember to do it is one of the things that make you more confident than you should be.

Guest Post

Have something to say too big for the comments? How about a 400-800 word guest post?

© 2015 William M. Briggs
