Be sure to first read Statistical Significance Does Not Mean What You Think: Climate Temperature Trends and then Regression Is Not What You Think: Climate And Other Examples, as this post is an extension of them.
This post will only be a sketch, and a rough one, of how to pick explanatory variables in regression. The tools used work for any kind of statistical model, however, and not just regression.
First remember that a regression is not a function of observables, like y, but of the central parameter of the normal distribution which represents our uncertainty in y. This language is tangled, but it is also faithful. Regression is not “empirical fits” or “fitting lines to the data” or any other similar phrase. We must always keep firmly in mind that we are fitting a model to parameters of a probability distribution which itself represents our uncertainty in some observable. Deviations from the language I use are part of what is responsible for the rampant over-certainty we see (a standard topic of this blog).
Here is a regression:
y_t = β_0 + β_1x_1 + β_2x_2 + … + β_px_p + ε
where each of the x_i is potentially probative of y. It is up to you to collect these x’s. How many x’s are there for you to consider in any problem? A whoppingly large number. A number so large that it is incomprehensible.
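To make the notation concrete, here is a minimal sketch in Python, fit to simulated data of my own invention (the sample size, the number of x’s, and the coefficient values are assumptions for illustration, not anything from the post). Keep the point above firmly in mind: the fitted betas describe the central parameter of the normal distribution representing our uncertainty in y, not y itself.

```python
import numpy as np

# Simulated data: purely an assumption for illustration.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))               # the x's we happened to collect
beta = np.array([2.0, 0.5, -1.0, 0.0])    # hypothetical coefficients (intercept first)
y = beta[0] + X @ beta[1:] + rng.normal(scale=1.0, size=n)

# Ordinary least squares estimates of the betas in the equation above.
A = np.column_stack([np.ones(n), X])      # design matrix with an intercept column
betahat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(betahat)  # these describe the normal's central parameter, not y itself
```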
Anything can be an x. The color of the socks of the people who generated or gathered the data can be an x. The temperature in Quebec might just be correlated with your y. The number of monkeys in Suriname could be predictive of your y. The number of nose hairs of Taipei bus drivers could be relevant to saying something about your y.
You might think this silly, but how can you know these x’s are not correlated with your y if you don’t try? You cannot. That is, you cannot know with certainty whether any x is uncorrelated with your y unless you can prove logical independence between x and y. And that isn’t easy: it really can’t be done except in mathematical and logical proofs using highly defined objects. For real-world (contingent) data, logical independence is hard to come by (see the footnote).
A sociologist (whose name I cannot look up because I am too pressed for time) said words to the effect that, in his field, everything is correlated with everything else. This is only a slight exaggeration. In any case, it remains true that for any contingent x and y, logical independence1 is denied us and so we must instead look to irrelevance.
Irrelevance is when, for some propositions x and y (data are observation statements, or propositions),
Pr(y|x & E) = Pr(y|E).
That is, the probability of y given some evidence E remains unchanged when we also take x as known. This tells us that to say whether an x should be in our regression equation, we should examine whether x is relevant or irrelevant to knowing (future) values of y.
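As a toy numerical check of this definition (my own sketch; the joint probabilities below are made up for the example), take the four conjunctions built from x and y and compute both sides directly:

```python
# Made-up joint probabilities, given E, for the four conjunctions of x and y.
pr = {("x", "y"): 0.12, ("x", "not y"): 0.18,
      ("not x", "y"): 0.28, ("not x", "not y"): 0.42}

pr_y = pr[("x", "y")] + pr[("not x", "y")]                              # Pr(y | E)
pr_y_given_x = pr[("x", "y")] / (pr[("x", "y")] + pr[("x", "not y")])   # Pr(y | x & E)

print(pr_y, pr_y_given_x)   # both 0.4 here: knowing x changes nothing, so x is irrelevant to y
```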
Recall that our goal for any probability model is to make statements like this:
Pr(y_new > a | x_new, old observed data, model true)
where we pick interesting a’s, or pose other questions about y_new which are interesting to us. If this probability is the same when we do not condition on x, then x is irrelevant to y and should not be included in our model. In notation (for our math readers): if
Pr(y_new > a | x_i, other x’s, old data, model true) = Pr(y_new > a | other x’s, old data, model true)
then x_i is irrelevant to y and so should not appear in our model.
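One way to operationalize this check is sketched below, under simplifying assumptions that are mine and not the post’s: simulated data, normal errors, and a plug-in approximation that ignores parameter uncertainty. Fit the model with and without the candidate x_i, compute Pr(y_new > a | x_new, old data, model true) under each, and compare. In a fuller treatment the probabilities would come from the posterior predictive distribution, but the comparison is the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                                  # candidate variable: relevant or not?
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)       # x2 plays no role in generating y

def predictive_prob(y, xs, xnew, a):
    """Approximate Pr(y_new > a | x_new, old data, model true).

    Plug-in approximation: normal errors, parameter uncertainty ignored.
    """
    A = np.column_stack([np.ones(len(y))] + list(xs))
    betahat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ betahat
    sigma = resid.std(ddof=A.shape[1])
    mu_new = betahat @ np.concatenate([[1.0], xnew])
    return 1.0 - stats.norm.cdf(a, loc=mu_new, scale=sigma)

a = 3.0
with_x2    = predictive_prob(y, [x1, x2], np.array([1.0, 0.5]), a)
without_x2 = predictive_prob(y, [x1],     np.array([1.0]),      a)
print(with_x2, without_x2)   # nearly equal here, so x2 looks irrelevant to y for this question
```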
Classical statistics tells us you cannot say which x’s you should include unless you first do a hypothesis test. This is a highly artificial construct which often leads to error. That is, some x’s can be relevant to knowing y even though the p-values of those x’s are larger than the magic number—oh, 0.05! how I love thee!—and some x’s can be irrelevant to y even though their p-values are less than the magic number.
And this holds equally for Bayesian posterior distributions of the parameters. Some x’s can be relevant to y even though their posteriors put a large probability on the corresponding parameter equaling (or being near) 0, and other x’s can be irrelevant to y even though their posteriors put a large probability on the parameter not equaling (or not being near) 0.
In other words, relevance as a measure of model inclusion does away with all discussions of “clinical” versus “statistical significance.” It also removes all tricks, like the one where, if you increase your sample size enough, you guarantee a publishable p-value (one which is less than the magic number).
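Here is a hedged sketch of that sample-size trick (my own simulation; the slope and sample sizes are invented for illustration): a slope so tiny that it barely moves any predictive probability you would care about nevertheless earns an ever-smaller p-value as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tiny_slope = 0.02    # practically negligible for most questions about y

for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = 1.0 + tiny_slope * x + rng.normal(size=n)
    r, p_value = stats.pearsonr(x, y)    # classical test of the (linear) association
    print(f"n = {n:>9}   p-value = {p_value:.3g}")

# The p-value shrinks toward zero as n grows, eventually beating the magic number,
# while Pr(y_new > a | x_new, ...) barely changes for any a you care about.
```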
Relevance is the fairest measure because it puts the decision directly in terms of observables—and not in terms of unobservable parameters. We ask questions of the y that are meaningful to us—and these questions will change from problem to problem. We create the questions, not some software package. We need not rely on a one-size-fits-all approach like hypothesis testing or posterior examinations. We can adapt each analysis to the problem at hand.
Why isn’t everybody jumping on the relevance bandwagon? Ah, this is it: ease. The relevance way puts the burden of decision making squarely on you. It is (as we shall see when I do examples later) more work. Not computationally; not really. But it doubles the amount of mental effort an analyst must put into a project. It makes you really think about what probabilities mean in terms of observables. It also removes the incredible ease of glancing at p-values (or posteriors), of having the software make the decisions for you.
But this is the least of it. Far, far worse is that relevance absolutely destroys the goosed-up certainty found in classical (hypothesis testing and posterior examination) methods. Whereas before relevance you might find dozens of x’s that are “highly significant!” for explaining y, with relevance you’ll be lucky to find one or two, and those won’t be nearly as exciting in explaining y as you had thought (or hoped).
And that is bad news for your prospects of publishing papers or developing new “findings.”
—————————————————————————————-
1Logical independence exists if and only if each of the conjunctions “x & y”, “x & not y”, “not x & y”, and “not x & not y” is not necessarily false. It is also the case that, for some logically independent x and y, the x (or y) might be relevant to knowing y (or x).
You are looking at a linear regression with independent variables. A nonlinear regression and/or interdependent variables would be even worse.
(Wish the discussion were in a seminar setting where I could get instant feedback from the expert and audience.)
Based on my statistical intuition, and without having put any serious thought into it yet, I have three questions.
1) The choice of a.
2) The two probabilities are unlikely to be the same in real-world applications, so I would think that one needs to establish a cutoff point instead.
3) I would be worried if the relevance of xi is to be decided using the probability concerning a single new observation… anyway, need to stare at the above statement more.
Were you referring to the psychologist Paul Meehl?
So far, we are talking about linear models where we want to determine P(y > a | x).
What do we do if we don’t know the values of x?
How many x’s is it reasonable to throw into a model based on a data set of limited size?
My company pays for a model of security returns that is based on about 500 factors. However, this model is built on only a couple hundred observations of each x and y. It seems like it should be impossible to build a model with more factors than observations.
bill r,
The very man. Thanks.
JH,
(1) I do not mean to imply we can or should only ask questions about the probability that “y_new > a”, but that we ask questions about y_new which are relevant to the situation at hand (which might ask about y_new being in an interval, or exceeding some value, or whatever).
(2) God help us if we do. We’ll have another “0.05” situation on our hands. The difference in the probabilities that is important to one person or to one decision about y might be utterly unimportant to another person or another decision about y.
(3) You misunderstand. And this is Doug M’s question, too. The x_new is not necessarily a newly observed value of x (where x may be just one x or a whole slew of them). We calculate the probabilities assuming we might see these new x. It is we who decide what range of values the new x will take. Our goal is not to learn about the x, but the not-yet-observed values of y.
After all, we do not need to calculate the probability y takes any value given the observed values of x for the previously observed data. Because all we have to do is look.
Also do not forget that all probabilities are conditional on whatever assumptions, premises, evidence that we specify. And only on these.
hmm… Rarely does one get P(y|x & E) = P(y | E) — or more commonly, zero when subtracting the two — but instead gets a value around zero as JH has said. So the question becomes: how small is small enough? It’s the question “statistical significance” was supposed to answer. Calling P(y|x & E) = P(y | E) “relevance” (or “independence”) doesn’t really solve the problem.
Doug M,
More variables tend to muddy the answer a bit, not to mention making computation harder. If the variables aren’t independent you can end up with singularities when attempting regression. OTOH, “irrelevant” variables tend to wash out in Bayes, P(H|E) = P(E|H)P(H)/P(E), though even then multiply correlated variables can bias the result by adding weight to a particular answer.
Note how I slid from a regression model, Y(x), into a probability model, P(y|x). That’s what a regression formula tacitly does as Briggs pointed out. Think of the x’s as voters of y. How many pieces of evidence are needed to determine P(OJ=murderer|E) with only a single observed murder? **
** On a side note: trials based on “preponderance of evidence” are often enormously unfair. It seems the largest factor in determining guilt is that P(defendant guilty | defendant is on trial) approaches 1 for a lot of jurors. They forget that the police chose the defendant out of a possible pool of thousands based on the other evidence, which they are supposed to be evaluating. P(defendant guilty | defendant is on trial) is P(H) in the above Bayes formula.
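A tiny worked example of that last point, with numbers that are entirely made up for illustration: an honest prior P(H) keeps P(H|E) far from 1 even when the presented evidence is fairly strong.

```python
# Made-up numbers purely to illustrate the role of the prior P(H).
p_H = 1 / 1000           # prior: this defendant guilty, before weighing the trial evidence
p_E_given_H = 0.95       # probability of the presented evidence if guilty
p_E_given_notH = 0.05    # probability of the same evidence if not guilty

p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)
p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)       # about 0.02: far from 1 when the prior is honest

# A juror who instead starts from P(guilty | on trial) near 1 has already
# decided; the same evidence can never talk them out of it.
```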
DAV,
Well, that’s the secret. No, this brand of “realism statistics” doesn’t solve the problem. Nothing does. There are no easy answers. Uncertainty will always be with us.
Why do you have to use words like “probative” Mr. Brigs. It is a legal term as in tending to prove.
I can now see your dificulty with the world. Your ideas are interesting. Your language is somewhat turgid. It sows misunderstanding. Try to write like Fyneman. Simply and dirrectly. You may become successfull. But the probability is small.
Someone’s slip is showing and for some reason I’m suddenly humming Blue Oyster Cult’s Godzilla to myself. “Oh NO! There goes TCO! Go, go, Godzilla.”
Even though probative means to test, try, or prove, and is thus entirely appropriate in this context, I’ll stick to using easier words for you, George Steiner.
Bill S,
Your language is confused. What you might have meant was that your uncertainty in the values of some signal was well characterized by a normal distribution. You also thought that the parameters of this normal could be well approximated by substituting in the mean and standard deviation; and that you were pleased to say that these approximations suffered no uncertainty. Thus, you are happy with your approximation. Well, who am I to take away your joy?
But read the 23 July 2011 post.
Mr. Briggs,
(1) Whether the x_i (the ith variable) is relevant depends on the choice of “a,” and the choice of “a” depends on the situation at hand (decided by whom)?! Hmmm… I think that SUBJECTIVITY has hues of bias and overconfidence.
I would consider using something like the Kolmogorov–Smirnov statistic, or computing the non-overlapping area between the two curves (see the sketch after this comment).
(2) I would run simulations to verify whatever cutoff point you have in mind first. I rather like 0.3. When the weather forecast reports there is more than a 30% chance of rain, I bring an umbrella. (Just kidding.)
(3) I realized I had thought wrong, but not the way you described. I thought why not consider P [ y(T+1) > a, y(T+2) > b, … | all necessary other stuff ]. At first glance, I was thinking about time series data (t = 1,…, T)… you see, you have used the subscript to represent different things.
(Note that y_t = β_0 + β_1x_1 + β_2x_2 + … + β_px_p + ε and y(t) = β_0 + β_1x_1 (t)+ β_2x_2(t) + … + β_px_p(t) + ε(t) represent two different models.)
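For the Kolmogorov–Smirnov suggestion in (1) above, here is a rough sketch (mine; the normal draws below merely stand in for the two predictive distributions, which in practice would come from the fitted models with and without x_i):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Stand-ins for draws from Pr(y_new | x_i, other x's, ...) and Pr(y_new | other x's, ...).
draws_with_xi    = rng.normal(loc=2.1, scale=1.0, size=5000)
draws_without_xi = rng.normal(loc=2.0, scale=1.0, size=5000)

ks_stat, _ = stats.ks_2samp(draws_with_xi, draws_without_xi)
print(ks_stat)   # maximum gap between the two predictive CDFs: small gap, small relevance
```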
JH,
Subjectivity, eh? Tell me: how do you pick your models? Objectively?
Rich,
The statement I made is clearer than yours. Those distributions, in fact, contain everything we know about the observables given our assumption that the model we used is true. Recall that all statements of knowledge and probability are conditional on specified premises—and only on the specified premises.
Jeff,
Not so. The results I presented are so common as to be ubiquitous. Look inside any journal of political science, sociology, psychology, etc. and you will find dozens of instances.