Regression Is Not What You Think. Climate And Other Examples. Ithaca Teaching Journal, Day 8

Be sure to first read “Statistical Significance Does Not Mean What You Think. Climate Temperature Trends”, as this post is an extension of it.

As yesterday: If y is some temperature and t time, a simplified model might be

    y_t = β0 + β1 t + ε

where we make the usual assumption that our uncertainty in ε is characterized by a normal distribution with parameters 0 and σ. If β1 is positive we announce that there is a “trend” in the data. Once more, this is to speak improperly. No model in the world can tell you if there was a trend with certainty. But a simple glance at the data can. Just look!

Please understand that it makes no difference how many complications you add to this model (e.g., “correlated residuals”, etc.), everything said below still holds.

Another example: If y is income and x the number of years of education, a standard model is

    y = β0 + β1x + ε

where once more we make the usual assumptions. (How to pick the “best” x’s, I’ll answer another day.)

Incidentally, all normal distributions[1] are characterized by two parameters, a central parameter which tells us where the peak of the density is, and a spread parameter which controls the width of the distribution. I am not being pedantic when I insist that these are not to be called the “mean” and “standard deviation.” Those two objects are functions of the observable data which are often used as classical estimates of the parameters. But they are not the parameters.
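A small simulation can make the distinction concrete. This is only an illustrative sketch: in a simulation we get to pick the central and spread parameters (the values 10 and 2 below are arbitrary assumptions), which we never can with real data. The mean and standard deviation are then computed from the simulated observations, and they do not equal the parameters.

```python
import random
import statistics

random.seed(1)

# Assumed for this simulation only: the two parameters of the normal.
central = 10.0   # where the density peaks
spread = 2.0     # controls the width

# Twenty simulated "observations".
y = [random.gauss(central, spread) for _ in range(20)]

# These are functions of the data, not the parameters themselves.
sample_mean = statistics.mean(y)
sample_sd = statistics.stdev(y)

print(f"central parameter = {central}, mean of the data = {sample_mean:.3f}")
print(f"spread parameter  = {spread}, sd of the data   = {sample_sd:.3f}")
```

Change the seed or the number of observations and the mean and standard deviation change with them; the two numbers typed in at the top do not.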

The problem is that since the parameters do not exist, we can never know whether our guesses are accurate. I repeat: parameters are not observable. They are entirely metaphysical. Their placement in equations is the logical consequence of the premise that a normal (or other) distribution is assumed to characterize our uncertainty in some observable (like y).

The observable y itself is not “normally distributed.” No thing in the world is “normally distributed”, nor “binomially”, nor any-other-distribution-ally. It is true that we often say, perhaps after looking at graphical evidence of y, that y is or is not “normally distributed.” But this is to speak incorrectly. The observable takes whatever values it does because of some causal process. Each, and yes, every instance of y is caused. Probability distributions cannot cause.

Our slogan is: end the slavery of reification! (I’ll speak more on this another day.)

What we should say in regression (or any case in which we use probability) is that our uncertainty in y, for a given value of t or x, is characterized by a normal distribution with parameters

    y ~ N(β0 + β1t, σ)

or

    y ~ N(β0 + β1x, σ).

Regression, then, is a model not of the observable y but of the central parameter of the normal distribution which characterizes our uncertainty in y. That line we usually draw over scatterplots to indicate regression is a line of the central parameter given varying values of t and x. To emphasize: this line says nothing directly about the observable y; it is itself unobservable, metaphysical, a fiction.
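Here is a minimal sketch of what that line is a line of. The data are hypothetical (made-up years of education and incomes in thousands), and `fit_line` is just an illustrative helper that computes the classical least-squares guesses of β0 and β1. The printed column is the guessed central parameter at each x, which is not the observed y.

```python
def fit_line(x, y):
    """Classical least-squares guesses for beta0 and beta1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: years of education (x) and income in $1000s (y).
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [28, 35, 31, 40, 52, 45, 60, 58]

b0, b1 = fit_line(x, y)
for xi, yi in zip(x, y):
    centre = b0 + b1 * xi  # guessed central parameter at this x
    print(f"x={xi:2d}  observed y={yi:2d}  guessed central parameter={centre:6.2f}")
```

Not one of the observed incomes sits on the line, and two people with the same x have different y. The line describes the guessed central parameter and nothing else.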

Even if we assume we know the values of the parameters β0, β1, and σ, we still know nothing directly about y. Thus, when we speak of “residuals”, which are had by plugging in classical guesses for the parameters and “solving” for y, we speak incorrectly. What we are solving for are the values of the central parameters, for given values of t and x.

We cannot solve for y in any probability model. Plus, there is no reason to. If we want to know what the values of y were, all we have to do is look!

What we can do is to assume our model is true and use it to quantify our uncertainty in values of y we have not yet seen. We do not need a model to quantify uncertainty in data we already know! We properly speak of our uncertainty in new values by “integrating out” the parameters of the model and (as yesterday) producing statements like this:

    Pr (ynew > a | tnew, old observed data, model true)

or

    Pr (ynew > a | xnew, old observed data, model true)

Where a can be any number we like, even an interval, and we assume that our model is perfectly true, flawlessly true, just plain true. There is no information in these probabilities about whether the model is true. Notice that there are no parameters in these equations.
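For the curious, here is one way such a probability could be computed. It is a sketch under assumptions that are mine and not stated in the post: the normal regression above, a flat reference prior proportional to 1/σ², and the same hypothetical education and income data. Under those assumptions the parameters integrate out to a Student-t predictive distribution with n - 2 degrees of freedom, which the hypothetical helper `predictive_prob` samples from using only the standard library.

```python
import math
import random

def predictive_prob(x, y, x_new, a, draws=100_000, seed=0):
    """Monte Carlo estimate of Pr(y_new > a | x_new, old data, model true).

    Assumes a normal regression with the flat reference prior
    p(b0, b1, sigma^2) proportional to 1/sigma^2, under which the
    parameters integrate out to a Student-t predictive distribution.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    df = n - 2
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / df

    centre = b0 + b1 * x_new
    scale = math.sqrt(s2 * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx))

    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # A Student-t draw: standard normal over sqrt(chi-square / df).
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2, 2.0)
        y_new = centre + scale * z / math.sqrt(chi2 / df)
        hits += y_new > a
    return hits / draws

# Hypothetical data: years of education (x) and income in $1000s (y).
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [28, 35, 31, 40, 52, 45, 60, 58]

p = predictive_prob(x, y, x_new=15, a=50)
print(f"Pr(y_new > 50 | x_new = 15, old data, model true) is about {p:.2f}")
```

The number returned is a statement about an observable. No β and no σ appears in it, and it says nothing about whether the model is true.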

It is also improper to say that our model “makes predictions of y.” It does not. It gives us the probability that y takes certain values. This probability can be turned into a prediction of a unique y, but only after we marry the predictive probability distribution with a measure of how important prediction mistakes are to us. That is, a prediction of a unique y, since it is a distillation of the probability that y takes any value, is a kind of decision, and to understand those we need to enter into the subject of “decision analysis”, which I won’t do here today.
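A rough sketch of that marriage, with everything in it assumed for illustration: the stand-in predictive draws (a normal centred at 44.5), the two loss functions, and the helper `best_guess`, which simply picks the guess with the smallest average loss over the draws.

```python
import random

def best_guess(draws, loss, candidates):
    """Return the candidate that minimises average loss over the draws."""
    return min(candidates, key=lambda g: sum(loss(g, y) for y in draws) / len(draws))

def symmetric(guess, y):
    # Missing high and missing low cost the same.
    return abs(guess - y)

def lopsided(guess, y):
    # Guessing too low is taken to cost four times as much as too high.
    return 4 * (y - guess) if y > guess else (guess - y)

rng = random.Random(0)
draws = [rng.gauss(44.5, 3.8) for _ in range(2000)]  # stand-in predictive draws
candidates = [30 + 0.1 * k for k in range(301)]      # guesses from 30 to 60

g_sym = best_guess(draws, symmetric, candidates)
g_lop = best_guess(draws, lopsided, candidates)
print(f"symmetric loss -> guess {g_sym:.1f}")
print(f"lopsided loss  -> guess {g_lop:.1f}")
```

Same probabilities, different guesses. The "prediction" moved only because the cost of a mistake changed, which is why a single predicted y is a decision and not something the model hands over by itself.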

Suffice to summarize: We start by observing pairs of (y, t) or (y, x). We need never use probability to talk about these values: they are there, lying open; whatever knowledge about them we want can be had just for asking. But we do not know what values y will take when we observe new values of t or x. Thus we must characterize our uncertainty in new y given new t and x, and assuming our model is true.

The parameters of our model are only of tertiary importance; they are dull things; they cannot be seen; they do not exist. Best to “integrate them out” and speak directly of our uncertainty in actual observables, like y, t, and x.

Technical Facts For Geeks

We might infer the posterior values of β0, β1, and σ to sufficiently high certainty. But the certainty in these parameters does not imply that we have sufficiently high certainty in new values of y (given new values of t or x and assuming our model is true).

Researchers often issue statements about their models, but talk only of their uncertainty in the parameter values. This certainty is “transferred” to certainty that the model is true, or that the new values of y can be predicted just as certainly. This is false. As in not true. As in not so.

Thus, climate temperature modelers might go on about how Pr(β1 > 0 | old data, model true) is high, and say therefore it is also highly probable that y itself will increase substantially in the future. Again, this is not so. We can know (by assumption, for instance) the value of β1 precisely, but this does not mean we know future values of y with precision.
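A simulation sketch shows the gap. All of it is assumed for illustration: the "true" line 1 + 0.5x, a spread of 1, x values scattered between 0 and 10, and the helper `uncertainty_summary`. It reports the scale of the uncertainty in the slope (its classical standard error, which under a flat prior is close to the posterior scale) next to the scale of the predictive distribution for a new y at x = 5.

```python
import math
import random

def uncertainty_summary(n, seed=0):
    """Simulate n points from an assumed line and return
    (scale of uncertainty in the slope, predictive scale at x = 5)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [1.0 + 0.5 * xi + rng.gauss(0, 1.0) for xi in x]  # assumed: spread = 1

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

    slope_scale = s / math.sqrt(sxx)
    pred_scale = s * math.sqrt(1 + 1 / n + (5 - x_bar) ** 2 / sxx)
    return slope_scale, pred_scale

for n in (10, 100, 10_000):
    slope_scale, pred_scale = uncertainty_summary(n)
    print(f"n={n:6d}  slope uncertainty={slope_scale:.4f}  predictive scale={pred_scale:.3f}")
```

Pile on the data and the uncertainty in the slope shrinks toward nothing, while the predictive scale stays near the spread of 1. Knowing β1 ever more precisely does not buy the same precision in new values of y.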

Even stronger, all these probabilities assume the model itself is true. Once more, there is no information in the old data which can prove a model is true or false. It is always an assumption—or an inference (of the kind frequentists are forbidden to make!).

What we can do is use the model to characterize our uncertainty in new values of y; then, after incorporating these probabilities into a decision process and producing guesses of new y, we can wait until we see new values of y and then see how useful the guesses were. And that is it.

We can, of course, compare the usefulness of other models. But—and here is the subtlety—to pick the “best” model from a suite of competitors is itself an inferential process, just like our original regression.

But enough! I can’t possibly re-create an entire theory of probability in one post.

———————————————————————

[1] We’ll also never mind that normal distributions are always an approximation for our uncertainty: we always make a mistake of some kind when using a normal. Let he that readeth understand.

15 Comments

  1. Will have to read this one a few times.
    Initial impression is Briggs went off the deep end.
    Thermal noise is the most reliable signal source I have in the electronic equipment
    I deal with and is normally distributed.
    Then there is the problem of believing quantum mechanics is a correct treatment
    of reality whether I like it or not.
    Yup. Will have to read this one a few times.

  2. So in a nutshell, if I understand correctly, the point is that the uncertainty of future values of an observable is due to the uncertainty of the observation noise plus the uncertainty due to the parameters. And then also that normal people don’t understand this simple point?

  3. I think my college physics professor summarized your post when he said, “The model is not reality; the data is reality.” Or, as George Box famously said, “all models are wrong, but some are useful.” The quantities we compute in statistics are called “estimates” for a reason! A goal of statistics is to try to determine what we may validly infer about those unknown, unobservable, parameters based on the data that we have observed and the statistics (estimates) that we can compute.

  4. We plug our data into our software package, and the software gives us back its estimates to build our linear model: B0, B1, and S. The software also gives us secondary statistics: t-stats, F-stats, R², sum of squared residuals, etc.

    The classical statistician would use the secondary statistics to answer the question, “Are these parameters for my model a sufficiently ‘better’ description of the data than the ‘null hypothesis’?”

    I gather your point is:

    None of these statistics can tell me if any observed pattern in the data will be repeated going forward. And this question of whether to reject the null hypothesis is a silly one.

  5. The basic concepts of statistical modeling explained in this post are the same for Bayesian and classical approaches, though there is a difference in the way of modeling the parameters.

    Both are trying to make inferences about unknown quantities using available information assuming there is an underlying process, e.g., a linear equation, y(t) = β_0 + β_1 x(t) + ε(t).

    Both start with a likelihood function (though there are non-model based classical analyses), e.g.,
    y | x, βs ~ N(β_0 + β_1 x, σ) , instead of y ~ N(β_0 + β_1 x, σ) . ^_^

    It would also be helpful to explain more about the term ε(t), a key component.

    I think the best way to compare the two approaches is to first introduce their statistical procedures and how inferences are made, say, using the global temperatures from about 1980 to present.

    After all, Bayesian mechanics (without involving the computational aspects) is quite simple.

  6. First, let me say I’m enjoying these posts, you’re helping a lot in telling colleagues about this stuff.

    Something strikes me with this one though. I agree that all we can say is Pr (y_new > a | t_new, old observed data, model true), but what I cannot get is what you say toward the end:
    “We might infer the posterior values of β0, β1, and σ to sufficiently high certainty. But the certainty in these parameters does not imply that we have sufficiently high certainty in new values of y (given new values of t or x and assuming our model is true)”

    Isn’t “Pr (y_new > a | t_new, old observed data, model true)” just a measure of “certainty in new values of y (given new values of t or x and assuming our model is true)”??

    Thanks

  7. Bill S,

    What I said is in no way inconsistent with quantum mechanics. Many QMers adopt a Bayesian position (search arxiv). Thermal noise isn’t “normal”, but the uncertainty you have in certain measurements can be quantified with a normal distribution.

    Travis,

    Once again, no “noise” appears. The regression equation is an equation of the central parameter of the normal representing our uncertainty in the y’s.

    Rick,

    Box was wrong. To say “all models are wrong” means you have proved deductively that all models are false—and you make this statement with the same rigor as any mathematical theorem. We can’t look inside the model to see if it’s wrong. Click my Stats/Climate tab and look for the similar topic.

    Doug M,

    Amen, brother.

    JH,

    In parametric Bayesian analysis, what you say about inference is (partly) true. But there are two problems: (1) Classical inference is completely different philosophically, and (2) I am not advocating analyses based on parameters. That is, I view statements about parameters as either false, misleading, or apt to produce over-certainty. I am advocating looking only at observables, at quantifying uncertainty in observables we have not yet seen.

    I say classical statistics is based on inconsistent, and at times provably false philosophy. It should be abandoned. For example, how about telling the probability of the conclusion, “Joe wears a hat” given the premises, “Half of all Martians wear a hat and Joe is a Martian.” Frequentist statistics fails here: it just can’t do it.

    And then substitute “New Yorkers” for Martians (if you feel Martians are cheating). You still fail.

    Julian,

    Good question. What I meant was that we can know the values of the parameters to whatever precision you like (conditional on some information), but this precision in our knowledge of the parameters does not translate to an equal amount of certainty in (1) future values of the observable, or (2) the truth of the model, or (3) some observable hypothesis that is a consequence of assuming the model is true.

    All,

    Stay tuned for more examples and clarifications.

  8. What you are saying seems straightforward enough. Often in physics we fit measurements to a theoretical equation in order to determine parameters of interest. Any predictive power is in the theory, not the data. The data can tell you if your theoretical analysis is way off, but is often of marginal use if you have too many adjustable parameters. At the end of the day, however, one is reminded that physics is an experimental science. Since most measurements in physics are a lot more reproducible than the examples that you tend to discuss, you should not universally dismiss the predictive power of an empirical fit to experimental data. Electronic circuit design works almost entirely on this basis.

  9. William Sears,

    I in no way, and by no stretch, dismiss “empirical fits.” I only emphasize the proper interpretation and use.

  10. “That is, I view statements about parameters are false, misleading, or apt to produce over certainty. I am advocating looking only at observables, at quantifying uncertainty in observables we have not yet seen.”

    You’ll have to explain why statements about parameters are false and … Consider a line y=a+bx: the slope b obviously has its meaning and usage. An example will help, if you have time, that is.

    Anyway, find the right venue and start selling!!! Maybe you can use the data sets you have amassed through your consulting. Define a question of great interest and a decision/inference, of some importance, to be made. And then demonstrate how and why the quantified uncertainty in observable (quio) can help answer the question and make the decision.

  11. To Bill S.
    Once a system is clean, highly controlled, and all major sources of variation are removed, you often end up with a bell-shaped curve whose central section resembles a normal curve. See Julian Simon’s “what does the normal curve mean”. If you are interested in the tails, the normal distribution fit is not so hot. That is why the term asymptotic pops up so much in statistical proofs.

  12. Bill R
    Will not argue about tails of Gaussian.
    But if you aim an antenna at the stars the noise you see in your real life
    signal processor has a mean that can be modelled by a normal distribution
    and a standard deviation that is close enough to a Gaussian that only a
    pedant will complain about.
    FYI.
    The reason I keep up with this blog is because I cannot teach a computer
    how to tell the difference between an outlier and corrupt data!
    Waiting for Briggs to give me free education!
