Since the subject has come up so often, today a note on the words correlated and correlation. They have technical definitions and plain English meanings. The two definitions overlap but they are not equivalent.
Suppose you have these two propositions: X = “Jack has an IQ of 107” and Y = “Jack makes $72,000 a year.” And you wonder, does Jack’s IQ have a bearing on his salary? Or does Jack’s salary have a bearing on his IQ? Higher incomes might imply softer lives, more leisure time and perhaps more bodily ease for the little gray cells to flourish. So the latter question might be answered “yes.”
Problem is, we can’t answer either of these two questions without making recourse to other evidence. And if we want to quantify the answers, we also have to fix our meaning of “has a bearing.” This part is simple. If we knew X or assumed X was true for the sake of argument, then given X the probability of Y being true changes if we knew or assumed X was false. This “has a bearing” captures what we mean when we say X causes Y or if X is merely related to Y but is perhaps not in the “causal path” of Y.
For instance, there might be some W that causes both X and Y simultaneously; in this case knowledge of X “has a bearing” on knowledge of Y. Or it might be that X caused A which causes B which causes C and so on right up to Y. Or this path might be reversed. But once again, knowledge of X has a bearing on our knowledge of Y, even if we know nothing directly of A, B, C, etc.
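The common-cause case can be made concrete with a toy calculation. This is only a sketch: the three probabilities below are invented for illustration, and W, X and Y are treated as plain true/false propositions. X never acts on Y here, yet learning X still shifts the probability of Y.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration (none come from the post).
P_W = Fraction(1, 2)                                           # Pr(W)
P_X_GIVEN_W = {True: Fraction(4, 5), False: Fraction(1, 5)}    # Pr(X | W), Pr(X | not W)
P_Y_GIVEN_W = {True: Fraction(9, 10), False: Fraction(3, 10)}  # Pr(Y | W), Pr(Y | not W)


def joint(w, x, y):
    """Pr(W=w, X=x, Y=y) when X and Y each depend on W alone."""
    pw = P_W if w else 1 - P_W
    px = P_X_GIVEN_W[w] if x else 1 - P_X_GIVEN_W[w]
    py = P_Y_GIVEN_W[w] if y else 1 - P_Y_GIVEN_W[w]
    return pw * px * py


def prob_y_given_x(x):
    """Pr(Y true | X=x), with the unseen W summed out."""
    numerator = sum(joint(w, x, True) for w in (True, False))
    denominator = sum(
        joint(w, x, y) for w in (True, False) for y in (True, False)
    )
    return numerator / denominator


p_if_x = prob_y_given_x(True)       # Pr(Y | X)
p_if_not_x = prob_y_given_x(False)  # Pr(Y | not X)
print(p_if_x, p_if_not_x)           # the two values differ
```

The two numbers differ only because W sits behind both propositions; make the two entries of P_Y_GIVEN_W equal and they coincide.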
A classical statistician wondering whether Jack’s IQ has a bearing on his salary would probably venture forth and collect data on Jill’s IQ and salary, and likewise data from Bill and from Alice, and from Will and Wilma, and so on. This maneuver adds the additional information or evidence we required. Why do we require this? Well, what is the answer to this:
Pr (Y | X) = ?
This is “What is the probability Y is true given (or assuming) X is true?” It has no answer in this form. If you find yourself supplying an answer, it is because you are implicitly adding extra evidence not stated in the formula. That is, you are doing something like this:
Pr (Y | X & A) = some number between 0 and 1,
where A was mentally supplied by you. Just as it was supplied by the statistician who collected the other pairs of IQ and salaries, which also implies (this is part of the statistician’s “A”) that these pairs are relevant to Jack; it also assumes that the causal path (and our certainty in it) from X to Y is the same for all these pairs. (This sameness can be changed, as in regression say, but sameness is the first belief.)
Now imagine we make a plot of our pairs: at each observation “X = Jill has an IQ of 108” and Y = “Jill has a salary of $74,500” we make a dot at (108, 74500), and so forth. To the extent that a straight line drawn through the midst of these scattered points approximates the points themselves, the higher we say the correlation is. If all the points lined up exactly on this straight line, the correlation is “1” or exact. If the points are spread from near to far and do not look at all friendly to the line, the correlation is “0” or nearly.
This is the technical definition: if our gathering of Xs and Ys can be approximated by a straight line, they are said to be “correlated” or that the two variables have “non-zero correlation.”
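For the curious, here is a minimal sketch of the usual (Pearson) formula behind the technical definition. Only the pairs (107, 72000) and (108, 74500) come from the text; every other IQ and salary below is made up for illustration.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical IQs; only 107 (Jack) and 108 (Jill) appear in the post.
iqs = [95, 100, 107, 108, 115]

# Salaries placed exactly on the line through Jack's and Jill's points.
on_line = [2500 * iq - 195500 for iq in iqs]

# Hypothetical salaries with no obvious straight-line pattern;
# Jack's and Jill's real values are kept in place.
scattered = [80000, 55000, 72000, 74500, 66000]

r_line = pearson(iqs, on_line)
r_scattered = pearson(iqs, scattered)
print(r_line, r_scattered)
```

The first coefficient comes out at 1 (up to floating-point rounding) because every dot sits on the line; the second is much closer to 0.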
Now imagine a sine wave. Here we have statements like X1 = “We are at time point 1” or X2 = “We are at time point 1.01” or whatever, with Y = “The sine at time point 1 is 0.84” and Y = “The sine at time point 1.01 is 0.85” and so forth. In this case, given the additional information on the formula of the sine, we can say that X directly causes Y to take the values it does. That is (ignoring rounding error),
Pr (“The sine at time point 1 is 0.84” | “We are at time point 1” & S) = 1,
where S is the knowledge we have of the sine (see any trig or intro calculus book for this). But if we plotted[1] a bunch of these Xs and Ys we would find the (technical) correlation between these Xs and Ys was somewhere in the vicinity of 0. This strange happenstance is because the extra evidence here purposely ignores S, the knowledge of the sine wave. It replaces S with some M, which assumes that, given X, our knowledge of Y is quantified by a normal distribution. Why ignore S? Well, just so we can replace it with M. If this seems odd, then know that in many statistical models relevant information like S is often ignored.
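A rough check of the sine claim, assuming time points spread evenly over 50 full cycles (CYCLES and N are my choices for this sketch, not anything fixed by the argument). The span matters: over many cycles the coefficient sits near 0, but over a short stretch, a single cycle say, it can land well away from 0.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


CYCLES = 50   # assumed number of full sine cycles covered
N = 10_000    # assumed number of evenly spaced time points

step = 2 * math.pi * CYCLES / N
times = [i * step for i in range(N)]
sines = [math.sin(t) for t in times]   # given S, each Y is certain

# The two values quoted in the text, rounded to two places.
print(round(math.sin(1.0), 2), round(math.sin(1.01), 2))

r = pearson(times, sines)
print(r)   # small in magnitude, though Y is fully determined by X
```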
Anyway, we finally arrive at the most succinct definitions. Technical correlation is when a straight line approximates pairs of Xs and Ys. Plain English correlation is when knowledge of X changes the certainty we have in Y. Plain English correlation thus encapsulates technical correlation. Plain English correlation can also be called relevance, which is similar (but not identical to) technical “dependence.” About that, another day.
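To see the two definitions come apart, here is a small exact example of my own (not from the plots mentioned in the footnote): X takes the values -1, 0, 1 with equal probability and Y is the square of X. The covariance, and so the technical correlation, is exactly 0, yet knowing X changes our certainty in Y completely.

```python
from fractions import Fraction

# Assumed toy setup: X is -1, 0 or 1, each with probability 1/3, and Y = X**2.
values = [-1, 0, 1]
p = Fraction(1, 3)

mean_x = sum(p * x for x in values)
mean_y = sum(p * x**2 for x in values)

# Covariance of X and Y; the technical correlation is this divided by
# the two standard deviations, so it is zero exactly when this is zero.
covariance = sum(p * (x - mean_x) * (x**2 - mean_y) for x in values)

# Plain English correlation: does learning X change our certainty in Y?
pr_y_is_1 = sum(p for x in values if x**2 == 1)   # Pr(Y = 1)
pr_y_is_1_given_x_is_1 = Fraction(1)              # Y = X**2 makes this certain

print(covariance, pr_y_is_1, pr_y_is_1_given_x_is_1)
```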
——————————————————————————————
[1] For once, Wikipedia has some good plots of functions like the sine where we know there is causality but where the correlation is 0 or near 0; they also have the formula for technical correlation.
If X is observed, how do we assume X was false? Typo? (I have an excuse of English being my third language…, and I am often interrupted by work and kids. What’s yours? ^_^) Or, perhaps, I simply don’t get it.
Statistical modeling uses observed data/information to postulate an appropriate model for the probability of Y given the information/data at hand X. Since we have no way of knowing whether a probability model is true, all the estimations and conclusions are based on the assumption of a given model, not of data.
So would a Bayesian statistician!
The information S, the knowledge we have of the sine, is not ignored. This is what statistical modeling is about. In this case, the information, which can be detected by a simple graph, would suggest a non-linear regression model.
Classical frequentist thinking, which is NOT cookbook all the way. (I shall see to it that this is not to be repeated. Hahah!)
JH: I am a native English speaker. The statement, “If we knew X or assumed X was true for the sake of argument, then given X the probability of Y being true changes if we knew or assumed X was false.” is correct as it stands.
Hint: crudely insinuating that someone is not thinking straight, particularly when in fact it is you who are not thinking straight in the instance, might be considered poor argumentative strategy among native English speakers.
“The information S, the knowledge we have of the sine, is not ignored” [in classical statistics]. Among native English speakers, this is called refutation by denial of the given. This sometimes works, but in any case, argumentatively, the statement is categorical in nature, such that a single counter-example would suffice to refute it.
Hint: use the word ‘many’, such as in the statement “… in many statistical models relevant information like S is often ignored.” Then refutation would require only proof that NOT (that) many ignore it.
Final hint: among native English speakers, concluding a poorly-formed argument with Hahah! is considered predictable, but tedious.
It ought to be mentioned that the four scatterplots in the illustration, due to Brian Joiner and used in the training classes provided by Oriel/Stat-A-Matrix, have the same correlation coefficient, as well as the same means and standard deviations for X and Y. Dr. Joiner used them to demonstrate the pitfalls of blind numerical calculations. Like his mentor, W. Edwards Deming, he understood that statistics without subject-matter knowledge risked becoming sterile number-crunching.
John K,
I am saying that it doesn’t make sense to say a variable X is assumed to be true or false. If X is not a variable, then Mr. Briggs shouldn’t use the notation X in the sin function example.
So please explain to me what it means to say a variable X is assumed true or false.
The typo joke was about a paper in which Mr. Briggs expressed the convergence of a term (as n approaches infinity) in an incorrect way, which I had to point out several times, and in the end he concluded it was a typo… I didn’t think you would get the joke.
I’ll not employ a sin wave model when data (the relevant information) appear to have a strong linear relationship. Nor will I employ a linear model for data showing a quadratic pattern. All statistical models are postulated based on the data structures & relevant information. Both classical and Bayesian methods are to sort through all information available the best they can. No relevant information, including S that can be easily detected, is to be ignored. NO statistical models ignore any information; it’s the person who ignores the information and therefore employs an inappropriate model.
Still, classical frequentist thinking is NOT cookbook all the way.
JohnK,
You don’t seem to have read that I wrote “Or, perhaps, I simply don’t get it”? Why would you think that I am attacking Mr. Briggs? Why are you so protective of Mr. Briggs that you totally ignore my points on what statistical modelling is?
JohnK, I too speak English natively, but I think JH is right to be puzzled by “If we knew X or assumed X was true for the sake of argument, then given X the probability of Y being true changes if we knew or assumed X was false.”
A clearer version of what Briggs intended may, I think, be achieved by replacing “changes” with “would be different from what it would have been”. And if this is right then it might also have been preferable not to include the words “given X” since that just amounts to repeating the already stated condition “If we knew X or assumed X was true”, and in fact the alternative condition “if we knew or assumed X was false” amounts to replacing “given X” with “given ~X”
(Strictly speaking, if “given X the probability of Y being true” is intended to mean the same as “the probability of Y being true given X”, then it represents, in colloquial terms, the value which we would assign to the probability of Y *if* we knew or assumed X was true – which is completely independent of whether we actually do know or assume X is true.)
What is X? Is X the specific event of Jack having an IQ of 107? Is Y the specific event of Jack making $72,000 a year? Let’s say the answers to both questions are yes. Then, P(Y|X) can be anyone’s guess since there is not enough information to postulate an adequate model. You can define P(Y|X) = P(Y|X is false), i.e., no change, whatever it means to say that “X is false” to you. In a way, I am saying that you can’t construct a meaningful model based on a sample size of one in this case.
Also, the sample Pearson correlation coefficient is not defined for events or string variables, and such Xs and Ys can’t be plotted (as shown in this post) either.
Suppose instead X is a variable defined to be IQ score, not Jack’s IQ score. There is no point saying that Jack’s IQ is a variable since it’s a fixed value. Whether Jack’s IQ takes on the value of 107 is something to be observed, not to be assumed true or false. If Jack’s IQ is not observed, you simply can’t use it to postulate a statistical model.
The notation is unclear, and it doesn’t make sense to say “we assumed X was false”
By asking Mr. Briggs questions, I am suggesting that he think about them. I don’t really need his answers on any of my statistics questions because I usually have answers already. I have said this before. Yes, I am that arrogant, and I do understand that mistakes are unavoidable when you are a productive and possibly busy person.
Again, classical frequentist thinking is NOT cookbook all the way.
—
Thanks, Mr. Cooper.
Briggs, although the choice and use of notation may appear to some readers as platitudes, they play an extremely vital role in making mathematics and statistics meaningful and deserve serious attention.