# What Regression Really Is

Bookmark this one, will you, folks? If there’s one thing we get more questions about, or that is more abused, than regression, I don’t know what it is. So here is the world’s briefest (and most accurate) primer. There are hundreds of variants, twists and turns, and tweaks galore, but here is the version most use unthinkingly.

1. Take some thing in which you want to quantify the uncertainty. Call it y: y can be somebody’s income, their rating on some HR form, a GPA, their blood pressure, anything. It’s a number you don’t know but want to.
2. Next write y ~ N(m, s), which means *this and nothing else*: “Our uncertainty in the value y takes is quantified by a normal distribution with central parameter m and spread parameter s.” It means you don’t know what value y will take in any instance, but if you had to bet, it would take one of the values quantified by the probabilities specified by the mathematical equation N(m, s).
3. We never, absolutely never, say “y is normally distributed.” Nothing in the universe is “normally distributed.” We use the normal to quantify our uncertainty. The normal has no power over y. It is not real.
4. The probability y takes *any* value, even the values you actually did see, given *any* normal distribution, is 0. Normal distributions are bizarre and really shouldn’t be used, but always are. Why if they are so weird are they ubiquitous? Some say insanity, others laziness, and still more ignorance. I say it’s because it’s automatic in the software.
5. Collect probative data (call it x) which you hope adds information about y. X can be anything: sex, age, GDP, race, anything. Just to fix an example, let x_{1} be sex, either male or female, and let y be GPA. We want to say how sex informs our uncertainty of a person’s GPA.
6. Regression is this: y ~ N(b_{0} + b_{1} * I(sex=Male), s).
7. This says that our uncertainty in y is quantified by a normal distribution with central parameter b_{0} + b_{1} * I(sex=Male) *and* spread parameter s. The funny “I(sex=Male)” is an indicator function and takes the value 1 when its argument is true, else it equals 0. Thus, for males, the central parameter is b_{0} + b_{1} and for females it is just b_{0}. Pause here until you get this.
8. This could be expanded indefinitely. We could write y ~ N(b_{0} + b_{1} * I(sex=Male) + b_{2} * Age + b_{3} * Number of video games owned, s), and on and on. It means we draw a different normal distribution for GPA uncertainty for every combination of sex, age, and number of video games. Notice the equation for the central parameter is linear. Our choice!
9. Regression is *not* an “equation for y”. Regression does *not* “model y”. Regression only quantifies our uncertainty in y conditioned on *knowing* the value of some x’s.
10. The b’s are also called parameters, or coefficients, or betas, etc. If we knew what the values of the b’s were, we could draw separate normal distributions, here one for men and one for women. Both would have the *same* spread, but different central points.
11. We do not ordinarily know the values of the parameters. Classically we guess using some math which isn’t of the slightest interest to us in understanding what regression is. We call the guesses “b-hats” or “beta-hats”, to indicate that we don’t know what b is and that the value is *just a guess*. The guesses are given the fancy title of “estimates” which makes it sound like science.
12. Ninety-nine-point-nine-nine percent of people stop here. If b_{1} is not equal to 0 (judged by a magical p-value), they say incorrectly “Men and women are different.” Whether or not this is true, that is not what regression proves. Instead, if it were true that b_{1} was not equal to 0, then *all* we could say was that “Our uncertainty in the GPAs of females is quantified by a normal distribution with central parameter b_{0} and spread s, and our uncertainty in the GPAs of males is quantified by a normal distribution with central parameter b_{0} + b_{1} and spread s.”
13. Some people wrongly say “Males have higher GPAs” if b_{1} is positive or “Males have lower GPAs” if b_{1} is negative. This is false, false, false, false, and false some more. It is wrong, misleading, incorrect, and wrong some more, too. It gives the errant impression that (if b_{1} is positive) males have higher GPAs, when all we can say is that the probability that any given male has a higher GPA than any given female is greater than 50%. If we knew the values of the b’s *and* s, we could quantify this exactly.
14. We do not know the values of the b’s and s. And there’s no reason in the world we should be interested, though the subject does seem to fascinate. The b’s are not real, they are fictional parameters we made up in the interest of the problem. This is why when you hear somebody talk about “The true value of b” you should be as suspicious as when a politician says he’s there to help you.
15. What should then happen, but almost never does, is to account for the uncertainty we have in the b’s. We could, even not knowing the b’s, make statements like, “Given the data we observed and accepting we’re using a normal distribution to quantify our uncertainty in GPA, the probability that any given male has a higher GPA than any given female is W%.” If W% was equal to 50%, we could say that knowing a person’s sex tells us nothing about that person’s GPA. If W% was not exactly 50% but close to it (where “close” is up to each individual to decide: what’s close for one wouldn’t be for another), we could ignore sex in our regression and concentrate on each student’s age and video game number.
16. This last and necessary but ignored step was the point of regression; thus that it’s skipped is an argument for depression. It is not done for three reasons. (1) Nobody thinks of it. (2) The p-values which say whether each b_{i} should be judged 0 or not mesmerize. (3) Even if we judge the probability, given the data, that b_{i} is greater than 0 is very high (or very low), this does *not* translate into a discernible or useful difference in our understanding of y and people prefer false certainty over true uncertainty.
17. In our example it could be that the p-value for b_{1} is wee, and its posterior shows the probability it is greater than 0 is close to 1, but it still could be that, given the data and assuming the normal, the probability any given male has a GPA larger than any given female is (say) 50.01%. Knowing a person’s sex tells us almost nothing about this person’s GPA.
18. But it could *also* be that the p-value for b_{1} is greater than the magic number, and the posterior also sad, but that (given the etc.) the probability a male has a higher GPA than a female is (say) 70%, which says something interesting.
19. In short, the b’s do not tell us directly what we want to know. We should instead solve the equation we set up!
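
The claim that a nonzero b_{1} can still leave the probability of "male higher than female" close to 50% can be checked numerically. The following is a minimal Python sketch, not part of the original primer. It treats b_{0}, b_{1} and s as if they were known (the values are made up for illustration), reads s as a standard deviation, and treats the two people as independent. Under those assumptions the difference of the two GPAs is itself normal, so the probability has a closed form.

```python
from statistics import NormalDist


def prob_male_higher(b0: float, b1: float, s: float) -> float:
    """P(GPA of a given male > GPA of a given female), parameters assumed known.

    Male uncertainty:   N(b0 + b1, s)
    Female uncertainty: N(b0, s)
    s is read as a standard deviation (an assumption of this sketch).
    """
    male = NormalDist(mu=b0 + b1, sigma=s)
    female = NormalDist(mu=b0, sigma=s)
    # Difference of two independent normals: N(b1, s * sqrt(2)).
    diff = male - female
    return 1.0 - diff.cdf(0.0)


# Hypothetical parameter values, chosen only for illustration.
for b1 in (0.0, 0.0002, 0.1, 0.4):
    p = prob_male_higher(b0=3.0, b1=b1, s=0.5)
    print(f"b1 = {b1:<7} P(male > female) = {p:.4%}")
```

With b1 = 0.0002 the probability sits barely above 50%, however small a p-value a very large sample might attach to b1. The sketch stops short of the step the post says is usually skipped: it does not account for the uncertainty in the b's themselves.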

Obviously, I have ignored much. Entire textbooks are written on this subject. Come to think of it, I’m writing one, too.

You obviously have a different concept of regression than I do. To me it is curve fitting and can be linear or non-linear and involves calculations of least squares. Experimental uncertainty can be added but is not necessary. Error matrices calculated are indications of goodness of fit or of correlations between parameters. I’m not sure that there is any assumption of a Gaussian form to this uncertainty but could be wrong about that. In statistical mechanics the uncertainty of measurement of say pressure often has this form. The kind of statistics that you specialize in seems worlds apart and mostly pointless to judge by your exposures of endless abuses. But this may just be discipline snobbery.

Scotian,

Oh, I’m sure it’s the same. You fit a curve, but how do you account for all those points lying off the curve? That’s all the rest of what I said.

All the points lie off the curve. We are just looking for the best fit. I understand most of what you are saying but am a little unclear as to why you call it the real regression.

“Why if they [normal distributions, which are continuous, and go to infinity in both directions] are so weird are they ubiquitous?”

Let me extend a possible answer to that to the more general question of why, e.g., “What exactly does take place in ‘curve-fitting’?” isn’t a hot topic of conversation.

Even beyond the inertia of use, reputation, grants disappearing, etc., there is the inertia of healthy custom, or of common sense, meaning that we don’t, and shouldn’t, ordinarily make time to wonder if there is something deeply wrong with a procedure or tool that everybody uses and which seems to work fine.

It was difficult indeed for me to realize that a regression line or curve is merely a representation of the most likely guess for the value of the central parameter of the particular probability distribution that we have selected (or allowed software or custom to select for us), given the data and assumptions brought to bear — and I have no skin in the game. I don’t typically use, or need, regression professionally. It doesn’t actually matter to me if a regression line is that, or something else much more satisfyingly real-world precise.

How much more difficult might it be to convince an engineer, for example, that his certainty is in his careful data, checked and cross-checked by friend and foe alike over years, if not generations, and not in his regressions? Or rather, that his curve-fitting works if and only if his data works, and that it cannot ever be that fitting a curve improves data?

Let alone bio-medical, sociological, psychological, etc. fields, which are far less given over to careful checking and cross-checking.

We don’t, and shouldn’t, stop what we are doing every 5 minutes to take the square root of the universe. It is healthy and good that we not do that. But just occasionally, something is in fact deeply wrong (‘weird’) about a procedure or tool that we customarily use, and we can prove it.

Then, we are lonely, at least for a time.

If you discover that some theory or process is “wrong” or “nonsensical” or “weird”, and yet despite this it turns out that it works in practice, you should at least consider the possibility that what’s “wrong” is your understanding, not the process.

“The probability y takes any value, even the values you actually did see, given any normal distribution, is 0. Normal distributions are bizarre and really shouldn’t be used, but always are.”

I answered this one on a previous post, but evidently it got buried and nobody read it.

The error is in treating your observation of y as a real number of infinite precision, with no uncertainty, and then treating the probability distribution as purely a function on such infinitely precise points. It isn’t. Although as the limit of a sequence of ever more precise finite observations, it happens to work fine and the details of what is really going on usually don’t matter.

A probability distribution of a Real number variable y is *not* a function on the Real line. It’s a function of a mathematical structure called a Sigma Algebra, which is a set of selected *subsets* of the line. For example, the Borel sigma algebra is the set of all *intervals* on the line and all complements and countable unions of them.

When we make an observation, we never actually see a Real number. We say the observation was 4.13796, but what we really mean is that it was in the interval 4.137955 to 4.137965. We can only make observations with a finite precision. And thus, the space of possible observations we can make are *not* Real numbers, they are *intervals*!

Isn’t it fortunate then that our Normal distribution is a function on such intervals, and gives for each interval a non-zero probability of it occurring? Everything works out fine.

And if we consider the limit of ever more precise observations, the probability of each possible outcome interval must shrink as the interval lengths shrink, or the numbers won’t add up properly. The zero probability is an artifact of assuming physically impossible infinite precision measurements, and not anything wrong with the probability distribution itself. It is correctly answering the question we asked. If we don’t understand our own question, the answer may seem nonsensical. But this is not the Normal distribution’s fault.
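
The interval argument above can be illustrated with a few lines of Python. This is a sketch under assumed values: the normal distribution with centre 4 and spread 1 is hypothetical, and the helper `interval_probability` is a name introduced here. It computes the probability of the rounding interval implied by a reading of 4.13796 at several precisions.

```python
from statistics import NormalDist

# Hypothetical distribution quantifying uncertainty in y (values are illustrative).
dist = NormalDist(mu=4.0, sigma=1.0)


def interval_probability(reading: float, decimals: int) -> float:
    """Probability of the interval implied by rounding `reading` to `decimals` places."""
    centre = round(reading, decimals)
    half_width = 0.5 * 10.0 ** (-decimals)
    return dist.cdf(centre + half_width) - dist.cdf(centre - half_width)


for decimals in range(1, 6):
    p = interval_probability(4.13796, decimals)
    width = 10.0 ** (-decimals)
    # p is positive at every finite precision; p / width approaches the density.
    print(f"{decimals} decimals: P = {p:.3e}, P / width = {p / width:.5f}")

print(f"density at 4.13796: {dist.pdf(4.13796):.5f}")
```

Each interval has a strictly positive probability that shrinks roughly tenfold per extra digit, while the probability divided by the interval width settles toward the density at that point.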

There are some other bits of the post above that strike me as an odd interpretation as well, but I’m not going to try right now to pick them apart. No doubt the book will have a fuller explanation.

“I’m not sure that there is any assumption of a Gaussian form to this uncertainty but could be wrong about that.”

The argument for a least squares fit being optimal is based on a Gaussian assumption. Of course, if you don’t require optimality, you don’t need to make any assumptions.

The basic idea is that if the probability of an individual measurement error is proportional to exp(-x^2), then the joint probability of a lot of individual **and statistically independent** measurements is the product of those probabilities exp(-x1^2)exp(-x2^2)…exp(-xn^2) = exp(-(x1^2+x2^2+…+xn^2)). So minimising the sum of squared errors maximises the exponential of its negative, or the probability. (I’ve simplified the argument a bit, but you should be able to fill in the details.)

Thus, the least-squares best fit gives the parameters of the distribution for which the observed outcome is most likely out of all the distributions with the specified form. At least, when that form is a deterministic function of the inputs plus independent zero-mean, fixed variance Gaussian errors.
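
The equivalence described above can be checked on synthetic data. The sketch below is illustrative only: the "true" coefficients, the noise level and the straight-line form are assumptions made up for the example. It fits a line by the closed-form least-squares formulas, then confirms that moving the coefficients away from that fit lowers the Gaussian log-likelihood.

```python
import math
import random

random.seed(0)  # reproducible synthetic data

# Hypothetical "true" values, used only to generate illustrative data.
TRUE_B0, TRUE_B1, SIGMA = 1.0, 2.0, 0.5
xs = [random.uniform(0.0, 10.0) for _ in range(200)]
ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0.0, SIGMA) for x in xs]


def sum_squared_errors(b0: float, b1: float) -> float:
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def gaussian_log_likelihood(b0: float, b1: float, sigma: float = SIGMA) -> float:
    """Log-likelihood under independent N(0, sigma) errors around b0 + b1*x."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum_squared_errors(b0, b1) / (2.0 * sigma ** 2))


# Closed-form least-squares fit of a straight line.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
b0_hat = y_bar - b1_hat * x_bar

best = gaussian_log_likelihood(b0_hat, b1_hat)
for db0, db1 in [(0.05, 0.0), (0.0, 0.05), (-0.05, 0.02)]:
    nearby = gaussian_log_likelihood(b0_hat + db0, b1_hat + db1)
    print(f"shift ({db0:+.2f}, {db1:+.2f}): log-likelihood drops by {best - nearby:.3f}")
```

The log-likelihood is a constant minus the sum of squared errors divided by 2·sigma², so the same coefficients minimise one and maximise the other. That link holds only for the independent, fixed-variance Gaussian error form assumed here.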

If the specified form is something else, then you can either do some mathematics to figure out what the corresponding maximum likelihood estimator is, or you can accept that the result you get won’t be accurate and just use least-squares as a smoothing function.

The argument now is best seen in terms of the frequency spectrum. You are assuming the observation is some signal with a limited bandwidth, plus noise with a much broader bandwidth that goes to zero at the origin (i.e. has zero mean). Above a certain frequency, the noise amplitude is higher than the signal amplitude, so if you cut off frequencies higher than this point, you lose more noise than signal and the accuracy improves. Even if you don’t know what the cut-off point is, low-pass filters generally improve accuracy, and are easier to understand. But again, it’s dependent on commonly unstated assumptions about the signal you’re looking for that can lead to confusion if the assumptions are violated.

I’m guessing when you said “curve-fitting” you meant “smoothing”.

Nullius,

No, I’m fairly certain that I mean curve fitting. Smoothing is usually something else entirely, although the techniques can overlap.

Briggs,

Incidentally I have never liked the term normal distribution, although not for the reason you might first think. It is that you never know whether the person means binomial or Gaussian. It doesn’t matter what the correct definition is; the problem is that you never know.

@Scotian: “All the points lie off the curve. We are just looking for the best fit.” I believe most of what Briggs is saying has to do with the “best fit” part. That’s where the normal part comes in: it’s a line that minimizes the sum of the squares of the Y distance of the points to the line, and if you’ve done it right, these distances will have a reasonably normal distribution.

Nullius,

“The probability y takes any value, … is 0”

“The error is in treating your observation of y as a real number of infinite precision, with no uncertainty, and then treating the probability distribution as purely a function on such infinitely precise points.”

Any given value is infinitely precise. X=2 means 2 and not kinda close to 2. Kinda close should be expressed X=2 +/- e or similar.

DAV,

Are you talking about discrete or continuous variables?

@Scotian: I wish we could edit comments here. In my previous comment, “it’s a line…” would be clearer if I expanded it to “the best fit is a line…”.

In terms of your question, “I understand most of what you are saying but am a little unclear as to why you call it the real regression.” I believe Briggs is referring to how people use regression with the wrong emphasis. That is, people often focus on the “significance” and accuracy of their coefficients and ignore the overall meaning, assumptions, and actual accuracy of their regression. Using his example, they say, “b1 > 0, p=0.002, which means that men have significantly higher GPAs than women.”

Nullius,

I’m talking about what “any given value” means (as in “The probability y takes any [given] value … is 0.”) vs. “any given range of values” (as in “The probability y takes any [given] range (range size > 1) … is 0.”).

Yes, but are you talking about a discrete variable taking any given value or a continuous variable taking any given value?

I don’t know what point you’re trying to make. You appear to be saying exactly what I just said, with the air of one correcting an error. But I’m not sure if I’m misunderstanding or you are, or what statement is being misunderstood.

@Scotian, the least-squares procedure is based on the assumption that your points {x,y} can be described by yi = f(xi; b0, b1,…) + N(0,si). You may not think of it that way. If the “errors” in your observations aren’t distributed normally, then the least-squares procedure is bogus.

“Yes, but are you talking about a discrete variable taking any given value or a continuous variable taking any given value?”

Well, the quote you made (from #4 above) was referring to properties of the Normal distribution which is continuous. What kind of values would we be using?

At any given time y (from #4) can only take on one value. That value may lie within a specified range but once you state the value, it becomes infinitely precise and the probability of y having that value is exactly equal to zero.

To all,

Thanks for the clarifications with respect to the Gaussian nature of the least squares technique. I suppose this makes sense given that both were discovered by the same mathematician – a feeble attempt at humour. Where we might differ, however, is in the assumption that the deviations from the fitted curve represent experimental error that follows the Gaussian or indeed any distribution. They may simply be, and in most cases are, accurate physical measurements of the underlying reality with the experimental error being much smaller than the deviation. We fit the data to a curve justified by a theoretical model that is good enough for our purposes. Our fitted formula may also be entirely empirical and is chosen for ease of future calculations and whose predictive power is good enough. This sort of thing occurs all the time in electronic circuit design and weather forecasting.

“At any given time y (from #4) can only take on one value.”

Ah, so are you thinking y is an *observation* or *objective reality*?

Observations of continuous variables are necessarily approximate. (If you had been talking about a discrete variable, like the number of apples, that could be exactly and observably 2, which is why I was asking.) “Stating” the value to be an exact real number like “2” doesn’t make it so, and (with probability 1) is simply untrue.

So do you mean the actual “true value”? Even if we suppose reality is faithfully represented by a Real continuum, we don’t have any physical access to that information by any experiment, instrument, or theory. It cannot be the observable outcome of any experiment, which is surely what the probability is supposed to be describing, yes? Again, the issue arises from a physically unrealistic assumption that we can work with an infinite amount of information.

And reality is quantum, and probably not a continuum below 10^-35 metres anyway. (And physical variables can indeed take on more than one value at once. Assuming zero-size point particles leads to inconsistencies and infinities in physics, too.) The Real numbers are a mathematical model, no more.

@Scotian: “Thanks for the clarifications with respect to the Gaussian nature of the least squares technique.”

“Where we might differ, however, is in the assumption that the deviations from the fitted curve represent experimental error that follows the Gaussian or indeed any distribution.”

These two statements are contradictory, if you’re doing your fitting via OLS. The deviations from the curve in fact represent error. It may be the error of your measurements, or it may be the error of your assumption that your curve is a good approximation of the real world, or it may be the effects of omitted measurements that you should have acquired. It’s all lumped together as “error”, and assuming it’s random (i.i.d.), you should end up with a normal distribution for your errors.

Any other distribution means your assumptions are wrong. You fit a linear regression to something non-linear, or you used the wrong variables, or you didn’t use the right variables, or you didn’t pay your laboratory staff enough for them to care about their measurements, or…

Whatever the case, the fact that you like to put pretty lines through things and ignore the mechanics of what you’re actually doing means you need to memorize the original posting.

“Ah, so are you thinking y is an *observation* or *objective reality*?”

Not at all.

What Briggs was referring to in #4 is an oddity encountered with all continuous distributions and not just confined to the Normal distribution. Suppose, using our continuous distribution, we determine the value of y lies between two limits with some probability. However when we individually examine the infinite series of values between those limits (y1, y2, … *und so weiter*) we discover that the probability of y=y1 is zero and the probability of y=y2 is zero and, in fact, the probability of y being any of them is zero.

This leads to the conundrum that, while y might lie between the limits with probability=X (given our distribution), the same distribution tells us y can’t take on any of the values between the limits. Yet we get a probability greater than zero that y is within the limits.

A pair of ducks to be sure.

I think I agree with Scotian. Assumptions of normality are not necessary for least squares techniques to be effective. Even when the residuals are not normal, and even when they do not have mean zero, least squares can provide accurate estimates of parameters for large N. Consider the Matlab/Octave code:

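```matlab
N = 10000000;          % number of points
x1 = rand(N,1);        % random data x1, pseudo-uniform iid on [0,1]
x2 = rand(N,1);        % random data x2, pseudo-uniform iid on [0,1]
a1 = rand(1);          % random coefficient a1
a2 = rand(1);          % random coefficient a2
y = a1*x1 + a2*x2;     % create data y

% least squares estimate of a1 based on x1 alone
% (slope of y on x1; centering both variables handles the intercept)
a1_est = sum((x1-mean(x1)).*(y-mean(y))) / norm(x1-mean(x1))^2;

% compare the true coefficient with its estimate
fprintf('a1 = %.4f, a1_est = %.4f\n', a1, a1_est);
```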

Here x1 and x2 are meant to be independent uniform distributions of [0,1], while y is a linear combination y=a1*x1+a2*x2. What if we didn’t (or couldn’t) collect x2, and ran the regression using just y and x1? Running the code, we see we get an accurate estimate of a1. Yet we have not included all the true underlying variables and the residual error follows a uniform distribution with mean a2/2.

Yes, I know what Briggs was referring to. The problem with the argument is that it assumes you can have infinite precision in one part, but denies the infinite consequences of the assumption in a different part.

An event having probability zero doesn’t tell us that it can’t happen. It’s like arguing that a square cannot have any area because a square is made up of points and points have no area. Or that a square cannot be made up of points because it has area and points don’t. It doesn’t follow. Mathematically, events with zero probability can happen, like a collection of points each with zero area can have non-zero area. Having zero area doesn’t mean points can’t exist. It’s the same mathematics behind both.

There are people, I suppose, who would deny that lengths and areas can be continuous, too. (And they’re right to the extent that they’re a physically impossible mathematical idealization.) Zeno’s arguments were very persuasive for a lot of people.

“An event having probability zero doesn’t tell us that it can’t happen.”

Oh? What do you suppose it means?

—

The truth is, no one really uses continuous distributions, not even people who think they do. Instead, they chop it up (sometimes so finely that it may possibly resemble the original when viewed with blurred vision) and use the resulting discrete distribution.

@Scotian in discussion with some other posters

Am I correct to interpret Scotian’s comments as a question about why “real regression” is the type that is used as a tool in statistical significance testing for scientific inference in the social sciences? I take regression to be any attempt to evaluate the credibility of a posited functional relation between (vectors of) variables (usually representing measurement outcomes).

The confusion may arise from the following:

Physical theories predict point-values, usually with great accuracy. To evidence the Higgs boson a 5-sigma rule was used, which is rather uncommon (or so I’m told), because they do not normally need a “statistical significance” criterion. Evidencing Higgs means that a theory (standard model) predicted that a phenomenon (Higgs Boson) should be observable in a measurement context (LHC) of which the measurement outcomes were calculated as a part of the prediction. The difference between predicted value and outcome of the measurement procedure (residual) was less than 5 significant digits. Such residuals are studied and not necessarily called error. For instance the difference in the QED prediction and measurement of an electric dipole moment anomaly (values differ only beyond the 7th decimal) is something that can be understood in QCD (or so I am told).

So, the fact that the causes listed by Wayne above are accepted as unknown sources of the residual between prediction and measurement is due to the empirical inaccuracy of theories produced by the social sciences. A scientific theory should at least predict which functional relation to expect, including the distribution of the variables involved.

Related:

The Gauss-Markov theorem and Laplace’s proof are about repeated measurements of the property of one object of measurement (the position of celestial bodies in Gauss’s case). This sample of measurement outcomes is very different from what is used in many of the examples given here: they usually concern a single property measured once, but in several different objects of measurement.

The two types of samples of the property are the same only if the ergodic condition applies, but then the irony is that the measured property only exists at the level of the sample/population and can’t be attributed to individuals constituting the sample/population.

Therefore this is an incorrect statement:

“We want to say how sex informs our uncertainty of a person’s GPA.”

It may be what you want to say, but you can’t! You can only say: “how the average distribution of gender in a population informs our uncertainty of the average person’s GPA.”

Now, I’m pretty sure the systems social science studies are not ergodic systems. I think that should have been number 1 of the primer: Find out if you can defend the conjecture that you are studying an ergodic system, if you cannot, find other tools for scientific inference. That will stop abuse of regression right there.

That being said, it’s an excellent idea to study the points of the original posting if you’re in an inaccurate science like I am.

An event having zero probability means that it has smaller probability than any event with non-zero probability.

The empty set is the impossible event. The argument being used here is saying that the impossible event has zero probability, and a point event has zero probability, therefore a point event is impossible.

It’s exactly the same argument as saying the empty set has zero area, a point has zero area, therefore a point is the empty set.

It doesn’t follow. Just because an impossible event has zero probability does not mean that an event with zero probability is impossible. Probability is not the whole story, there’s a lot going on in the fine detail that it blurs.

“An event having probability zero doesn’t tell us that it can’t happen.”

“An event having zero probability means that it has smaller probability than any event with non-zero probability.”

So then, are you saying an event with zero probability really has a non-zero probability of occurrence?

“The argument being used here …”

I believe the argument here is that continuous distributions are only useful as guides in constructing the discrete distribution you will actually employ.

No, an event having zero probability has zero probability. But that doesn’t mean it’s impossible.

Like a point with zero area has zero area, but it still exists.

I think the point is that probabilities, Real numbers, sets, points, areas, and so on are all mathematical idealizations used because they are far easier to work with than reality. Considered as a mathematical model, continuous distributions work fine. Many people have incorrect intuitions about them, but there’s no actual inconsistency.

Continuous distributions are functions on intervals, which are what we will actually use. (All observations of continuous variables being intervals as I noted above.) A discrete distribution is something different. The difference doesn’t always matter, but it’s an added and unnecessary complication, and quite often gives you the wrong answer, or doesn’t allow you to answer at all.

Wayne,

Quote “It’s all lumped together as “error”, and assuming it’s random (i.i.d.), you should end up with a normal distribution for your errors.” It will not be random nor independent. Why should it be? An example would be using regression to calculate a straight line limit to otherwise non-linear measurements.

Quote “You fit a linear regression to something non-linear…..” This is always the case. Why do you think that it could be otherwise? See my comment about theoretical modelling.

As to your final sentence I can only say: tsk, tsk, behave yourself.

Nullius in Verba,

I hear you. For students with calculus background, below is how I explain the two common questions: (1) Why is the probability of a point (for a continuous variable) zero? (2) What is the nonzero *probability density function (pdf)* about?

The probability of an interval of a normal distribution (or any other continuous distribution) is the *area* of the corresponding segment contained between the x-axis and the normal pdf curve. (Integration.) The probability of a point therefore can be seen as the area over a single point, which is zero. Why? See http://math.stackexchange.com/questions/410541/integration-and-area-why-is-integrating-over-a-single-point-zero. No practical meaning, really.

What is a pdf ( f(x) )? Why is it not zero at a single point? One can see the pdf, f(x), as the derivative of its cumulative distribution function. Click http://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus and go to the section of Geometric Intuition. Basically, by definition of derivative, it says that P[ X in the interval (x-h, x+h)] is approximately equal to f(x)*2h (the interval has width 2h), with the ratio P[ X in (x-h, x+h)] / (2h) converging to f(x) as h approaches 0. That is, the pdf can be interpreted in terms of observing values falling near a point x.

Why do we use continuous pdf to quantify the uncertainty due to measurement error, ignorance, and random error, and so on? It could be that theoretically the error can assume any values in an interval. It could also be that the probability distribution of the error can be approximated by a normal distribution well. After all, in regression, we approximate!

Mr. Briggs,

Here are some corrections you might want to think about when you write your book.

6. Correct notation: y | I(sex=Male), b0, b1, s ~ N(b0 + b1*I(sex=Male), s).

Note that I(sex=Male) is an observable indicator variable. Hence the correction below.

7. b0+ b1*I(sex=Male) is not a central parameter. It is a function of the observable indicator and parameters, more appropriately, the mean function of y | I(sex=Male),b0, b1 in the normal distribution case.

Let me jump to 17.

Below is an example why a probability of near 50% doesn’t imply that gender tells us nothing about the response variable.

Suppose the response variable for males is Y|male ~ N(2,3), and for females Y|female ~ N(2.5,4). Theoretically, P( Y|male < Y|female) = 0.54. Can we conclude that gender is not useful in predicting the response variable? No, since gender has a clear influence on the response. Google search for Bayesian variable selection methods.
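
The 0.54 figure in the example above can be reproduced in a few lines. The sketch below assumes the two responses are independent; the comment does not say whether the second argument of N(·,·) is a standard deviation or a variance, so both readings are computed.

```python
import math
from statistics import NormalDist

# Reading 1: second argument is a standard deviation.
male_sd = NormalDist(mu=2.0, sigma=3.0)
female_sd = NormalDist(mu=2.5, sigma=4.0)
diff_sd = female_sd - male_sd            # independent normals: N(0.5, 5)
p_sd = 1.0 - diff_sd.cdf(0.0)            # P(Y_male < Y_female)

# Reading 2: second argument is a variance.
male_var = NormalDist(mu=2.0, sigma=math.sqrt(3.0))
female_var = NormalDist(mu=2.5, sigma=math.sqrt(4.0))
diff_var = female_var - male_var         # N(0.5, sqrt(7))
p_var = 1.0 - diff_var.cdf(0.0)

print(f"standard-deviation reading: {p_sd:.4f}")
print(f"variance reading:           {p_var:.4f}")
```

Only the standard-deviation reading rounds to the quoted 0.54, which suggests that is the convention the example uses.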

I think that if y = sum_i a_i x_i and the x_i have pairwise zero covariances, then a_k = E( (x_k-mu_k) (y-mu_y) ) / E( (x_k-mu_k)^2 ). So it seems to me that the existence of the variances and covariances is enough for the least squares formula to yield an estimate of a_k for a linear model. We don’t need normality or the inclusion of all independent variables to use least squares.
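
The identity in the comment above follows in one step from linearity of expectation. As a sketch, assume only that the relevant variances and covariances exist, and write $\mu_i = E[x_i]$, so that $\mu_y = \sum_i a_i \mu_i$:

$$
E\big[(x_k-\mu_k)(y-\mu_y)\big]
= \sum_i a_i\,E\big[(x_k-\mu_k)(x_i-\mu_i)\big]
= a_k\,E\big[(x_k-\mu_k)^2\big],
$$

because every term with $i \neq k$ is a covariance assumed to be zero. Dividing by $E[(x_k-\mu_k)^2]$, assumed positive, gives the stated expression for $a_k$. Normality does not enter this step.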

Briggs:

You state that y ~ N(m, s) does not imply that y is normally distributed. You don’t say what it does imply, if anything. Others argue that y is “approximately” normally distributed. In this argument, “approximately” plays the role of a polysemic term, that is, a term with more than one meaning. Thus, the above stated argument is an example of an “equivocation”: an argument in which a term changes meaning in the middle of the argument. By logical rule, a proper conclusion cannot be drawn from an equivocation. To draw such a conclusion is an “equivocation fallacy.”

y ~ N(m, s) is an untestable claim, leaving regression analysis unsuitable for use in a scientific study. Unfortunately, this claim is ubiquitous in studies claimed to be scientific.

“7. b0+ b1*I(sex=Male) is not a central parameter. It is a function of the observable indicator and parameters, more appropriately, the mean function of y | I(sex=Male),b0, b1 in the normal distribution case.”

It’s not a parameter? I thought the OLS assumption was that Y = XB + e, where e ~ Normal(0, sigma^2). If you have some normal distribution Z and you transform it via Z + 2, you still have a normal distribution but its first (mu) parameter changes. OLS thus implies that you are adding XB to a normal distribution, which further implies that Y ~ Normal(XB, sigma^2). The linear predictor is a parameter of a distribution.
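
The shift argument in this comment can be seen directly with Python's `statistics.NormalDist`, which supports adding a constant to a normal distribution. The numbers below (the spread and the two linear-predictor values) are hypothetical and serve only to illustrate the point.

```python
from statistics import NormalDist

# Z + 2 is still normal; only the first parameter moves.
z = NormalDist(mu=0.0, sigma=1.0)
shifted = z + 2.0
print(shifted.mean, shifted.stdev)

# Same idea for Y = XB + e with e ~ Normal(0, sigma).
sigma = 0.5                      # hypothetical spread
e = NormalDist(mu=0.0, sigma=sigma)
for xb in (2.8, 3.1):            # hypothetical values of the linear predictor XB
    y = e + xb                   # distribution of Y at this value of XB
    print(f"XB = {xb}: Y ~ Normal({y.mean}, {y.stdev})")
```

Whether one calls XB a parameter or a mean function of observables and parameters, the computation is the same: each value of XB selects a different normal distribution with the same spread.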