Resolved: Statisticians To Cease Using “Independence”, Change To “Irrelevance”

What’s the difference between “independence” and “irrelevance”, and why does that difference matter? This typical passage from A First Course in Probability by Sheldon Ross (p. 87) is lovely because it commits several major misunderstandings, all of which prove “independence” a poor term. (This is a book I recommend. For readability, I changed Ross’s notation slightly, e.g. from “P(E)” to “Pr(E)”.)

The previous examples of this chapter show that Pr(E|F), the conditional probability of E given F, is not generally equal to Pr(E), the unconditional probability of E. In other words, knowing that F has occurred generally changes the chances of E’s occurrence. In the special cases where Pr(E|F) does in fact equal Pr(E), we say that E is independent of F. That is, E is independent of F if knowledge that F occurred does not change the probability that E occurs.

Since Pr(E|F) = Pr(EF)/Pr(F), we see that E is independent of F if Pr(EF) = Pr(E)Pr(F).

The first misunderstanding is “Pr(E), the unconditional probability of E”. There is no such thing. No unconditional probability exists. All, each, every probability must be conditioned on something, some premise, some evidence, some belief. Writing probabilities like “Pr(E)” is always, every time, an error, not only of notation but of thinking. It encourages and amplifies the misbelief that probability is a physical, tangible, measurable thing. It also heightens the second misunderstanding. We must always write (say) Pr(E|X), where X is whatever evidence one has in mind.

The second misunderstanding, albeit minor, is this: “knowing that F has occurred generally changes the chances of E’s occurrence.” Note the bias towards empiricism. In other places Ross writes “An infinite sequence of independent trials is to be performed” (p. 90, an impossibility) and “Independent trials, consisting of rolling a pair of fair dice, are performed” (p. 92; “fair” dice are impossible in practice). In this language “events” or “trials” “occur”; that is, they are propositions that can be measured in reality, or are mistakenly thought to be measurable. Probability is much richer than that.

Non-empirical propositions, as in logic, easily have probabilities. Example: the probability of E = “A winged horse is picked” given X = “One of a winged horse or a one-eyed one-horned flying purple eater must be picked” is 1/2, despite that “events” E and X will never occur. So maybe the misunderstanding isn’t so minor at that. The bias towards empiricism is what partly accounts for the frequentist fallacy. Our example E and X have no limiting relative frequency. Instead, we should say of any Pr(E|F), “The probability of E (being true) accepting F (is true).”

The third and granddaddy of all misunderstandings is this: “E is independent of F if knowledge that F occurred does not change the probability that E occurs.” The misunderstanding comes in two parts: (1) use of “independence”, and (2) a mistaken calculation.

Number (2) first. Because it is a mistake to write “Pr(EF) = Pr(E)Pr(F)”, there are times, given the same E and F, when this equation holds and times when it doesn’t. A simple example. Let E = “The score of the game is greater than or equal to 4” and F = “Device one shows 2”. What is Pr(E|F)? Impossible to say: we have no evidence tying the device to the game. Similarly, Pr(E) does not exist, nor does Pr(F).

Let X = “The game is scored by adding the total on devices one and two, where each device can show the numbers 1 through 6.” Then Pr(E|X) = 33/36, Pr(F|X) = 1/6, and Pr(E|FX) = 5/6; thus Pr(E|X)Pr(F|X) (~0.153) does not equal Pr(E|FX)Pr(F|X) (~0.139). Knowledge of F in the face of X is relevant to the probability E is true. (Recall these do not have to be real devices; they can be entirely imaginary.)

Now let W = “The game is scored by the number shown on device two, where devices one and two can each show the numbers 1 through 6.” Then Pr(E|W) = 1/2, Pr(F|W) = 1/6, and Pr(E|FW) = 1/2 because knowledge of F in the face of W is irrelevant to knowledge of E. In this case Pr(EF|W) = Pr(E|W)Pr(F|W).
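The arithmetic in the two cases above can be checked by brute-force counting. This is only a sketch: it assumes X and W are read as giving equal weight to each of the 36 (device one, device two) pairs, and the helper names `pr` and `summary` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 (device one, device two) pairs. Equal weighting is an assumption
# of this sketch, read into the premises X and W.
STATES = list(product(range(1, 7), repeat=2))

def pr(event, given=lambda s: True):
    """Pr(event | given) by counting states; both arguments are predicates."""
    allowed = [s for s in STATES if given(s)]
    return Fraction(sum(1 for s in allowed if event(s)), len(allowed))

F = lambda s: s[0] == 2             # "Device one shows 2"
E_X = lambda s: s[0] + s[1] >= 4    # E under X: the score is the sum
E_W = lambda s: s[1] >= 4           # E under W: the score is device two alone

def summary(E):
    """Return (Pr(E|.), Pr(F|.), Pr(E|F.), Pr(EF|.)) for one scoring rule."""
    return pr(E), pr(F), pr(E, F), pr(lambda s: E(s) and F(s))

under_X = summary(E_X)
under_W = summary(E_W)

for name, (p_e, p_f, p_e_given_f, p_ef) in (("X", under_X), ("W", under_W)):
    # F is irrelevant to E exactly when the product equals the joint.
    print(name, "product:", p_e * p_f, "joint:", p_ef, "irrelevant:", p_e * p_f == p_ef)
```

The same E and F give a different verdict under X than under W, which is the point: the answer belongs to the premises, not to E and F alone.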

The key, as might have always been obvious, is that relevance depends on the specific information one supposes.

Number (1). Use of “independent” conjures up images of causation, as if, somehow, F is causing, or causing something which is causing, E. This error often happens in discussions of time series, as if previous time points caused current ones. We have heard times without number people say things like, “You can’t use that model because the events aren’t independent.” You can use any model; it’s only that some models make better use of information because, usually, knowing what came before is relevant to predictions of what will come. Probability is a measure of information, not a quantification of cause.

Here is another example from Ross showing this misunderstanding (p. 88, where the author manages two digs at his political enemies):

If we let E denote the event that the next president is a Republican and F the event that there will be a major earthquake within the next year, then most people would probably be willing to assume E and F are independent. However, there would probably be some controversy over whether it is reasonable to assume that E is independent of G, where G is the event that there will be a recession within two years after the election.

To understand the second example, recall that Ross was writing at a time when it was still possible to distinguish between Republicans and Democrats.

The idea that F or G are the full or partial efficient cause of E is everywhere here, a mistake reinforced by using the word “independence”. If instead we say that knowledge of the president’s party is irrelevant to predicting whether an earthquake will soon occur, we make more sense. The same is true if we say that knowledge of this president’s policies is relevant for guessing whether a recession will occur.

This classic example is a cliché, but it is apt. Ice cream sales are positively correlated with drownings. The two events, a statistician might say, are not “independent”. Yet it’s not the ice cream that is causing the drownings. Still, knowledge that more ice cream is being sold is relevant to fixing a probability that more drownings will be seen. The model is still good even though it is silent on cause.
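A toy calculation shows how knowledge of ice cream sales can be relevant without being a cause. Every number below is hypothetical, invented only for this sketch: the premises say a day is hot with probability 1/2, and that sales and drownings each depend on the heat alone.

```python
from fractions import Fraction

# Hypothetical premises, invented for illustration only.
PR_HOT = Fraction(1, 2)
PR_SALES_UP = {True: Fraction(4, 5), False: Fraction(1, 5)}    # keyed by "hot?"
PR_DROWN_UP = {True: Fraction(3, 5), False: Fraction(1, 10)}   # keyed by "hot?"

def joint(hot, sales_up, drown_up):
    """Pr(hot, sales_up, drown_up | premises); sales and drownings hang on heat only."""
    p = PR_HOT if hot else 1 - PR_HOT
    p *= PR_SALES_UP[hot] if sales_up else 1 - PR_SALES_UP[hot]
    p *= PR_DROWN_UP[hot] if drown_up else 1 - PR_DROWN_UP[hot]
    return p

STATES = [(h, s, d) for h in (True, False) for s in (True, False) for d in (True, False)]

def pr(event, given=lambda st: True):
    """Pr(event | given, premises) by summing the joint over matching states."""
    den = sum(joint(*st) for st in STATES if given(st))
    num = sum(joint(*st) for st in STATES if given(st) and event(st))
    return num / den

drown = lambda st: st[2]
sales = lambda st: st[1]
hot = lambda st: st[0]

p_drown = pr(drown)                                       # premises alone
p_drown_sales = pr(drown, sales)                          # sales known to be up
p_drown_hot = pr(drown, hot)                              # heat known
p_drown_hot_sales = pr(drown, lambda st: hot(st) and sales(st))

print(p_drown, p_drown_sales, p_drown_hot, p_drown_hot_sales)
```

On these made-up premises, learning that sales are up moves the probability of more drownings, so sales are relevant. Once the heat is known, learning about sales changes nothing, so sales are then irrelevant. Cause never enters the calculation.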


This section contains more technical material.

The distinction between “independence” and “irrelevance” was first made by Keynes in his unjustly neglected A Treatise on Probability (pp. 59–61). Keynes argued for the latter, correctly asserting, first, that no probabilities are unconditional. Keynes gives two definitions of irrelevance. In our notation, “F is irrelevant to E on evidence X, if the probability of E on evidence FX is the same as its probability on evidence X; i.e. F is irrelevant to E|X if Pr(E|FX) = Pr(E|X)”, as above.

Keynes tightens this to a second definition. “F is irrelevant to E on evidence X, if there is no proposition, inferrible from FX but not from X, such that its addition to evidence X affects the probability of E.” In our notation, “F is irrelevant to E|X, if there is no proposition $F'$ such that $Pr(F'|FX) = 1$, $Pr(F'|X) \ne 1$, and $Pr(E|F'X) \ne Pr(E|X)$.” Note that Keynes has kept the logical distinction throughout (“inferrible from”).
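To see why the second definition is tighter, here is a small example of my own devising (it is not Keynes’s). It again assumes two devices showing 1 through 6 with equal weight on each of the 36 pairs; the propositions E, F and F' below are hypothetical choices made so that the two definitions disagree.

```python
from fractions import Fraction
from itertools import product

# (device one, device two) pairs, equal weights assumed for this sketch.
STATES = list(product(range(1, 7), repeat=2))

def pr(event, given=lambda s: True):
    """Pr(event | given) by counting states; both arguments are predicates."""
    allowed = [s for s in STATES if given(s)]
    return Fraction(sum(1 for s in allowed if event(s)), len(allowed))

E = lambda s: s[0] + s[1] >= 7     # hypothetical E: "the summed score is at least 7"
F = lambda s: s[0] in (3, 4)       # hypothetical F: "device one shows 3 or 4"
F_prime = lambda s: s[0] <= 4      # hypothetical F': "device one shows at most 4"

# First definition: F is irrelevant if Pr(E|FX) = Pr(E|X).
irrelevant_by_first = pr(E, F) == pr(E)

# Second definition fails if some F' follows from FX, does not follow
# from X alone, and changes the probability of E when added to X.
f_prime_is_witness = (
    pr(F_prime, F) == 1
    and pr(F_prime) != 1
    and pr(E, F_prime) != pr(E)
)

print(pr(E), pr(E, F), pr(E, F_prime), irrelevant_by_first, f_prime_is_witness)
```

Here F passes the first definition because its two halves pull in opposite directions and cancel, yet the weaker proposition F', which F entails, does change the probability of E. So F is relevant on the second definition.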

Lastly, Keynes introduces another distinction (p. 60; pardon the LaTeX):

$h_1$ and $h_2$ are independent and complementary parts of the evidence, if between them they make up $h$ and neither can be inferred from the other. If $x$ is the conclusion, and $h_1$ and $h_2$ are independent and complementary parts of the evidence, then $h_1$ is relevant if the addition of it to $h_2$ affects the probability of $x$.

A passage which has the footnote (in our modified notation): “I.e. (in symbolism) $h_1$ and $h_2$ are independent and complementary parts of $h$ if $h_1 h_2 = h$, $Pr(h_1|h_2) \ne 1$, and $Pr(h_2|h_1) \ne 1$. Also $h_1$ is relevant if $Pr(x|h) \ne Pr(x|h_2)$.”

Two (or however many) observed data points, say, $x_1$ and $x_2$ are independent and complementary parts of the evidence because neither can be deduced—not mathematically derived per se—from each other. Observations are thus no different than any other proposition.


  1. I now have a new word to use. When people ask me why I’m using ‘irrelevant’ rather than ‘independent’, then I can bother them about empiricism and logical probabilities! Yeah!

    Evidence and context matter in these things, and using ‘independent’ can lead to some absurd statements.

    For example, let’s say that I am curious as to the probability of a structure failing given the choice of material used. The structure is a coffee table, and I’m choosing between wood and steel. I run all kinds of tests (kids jumping up and down on it, chubby uncle Herbert sitting on it at a super bowl party, etc.) and discover that the table never failed for either construction.

    I have some Pr(Fail), which equals zero. I also have Pr(Fail|Wood) and Pr(Fail|Steel) that also equal zero. Since the marginal equals the conditionals, we are told that Wood and Steel are ‘independent’ of failure. Of course, this is physically absurd and wrong. If I were to switch to airplane construction I’d start noticing some ‘dependence’.

  2. This book by Keynes looks very interesting. I started reading the first chapter and it feels very accessible! Is it commonly read as a primer for the study of statistics?

  3. Nate,

    It isn’t read at all, except by the rare student interested in philosophy. Most formal classwork in statistics is mathematical, and while Keynes has the odd formula, you couldn’t class his book as mathematical.

    Lack of mathematical understanding is not what is plaguing statistics. It’s the failure to comprehend what the math means.

  4. I’m a registered independent (well, technically “unaffiliated”) in a frozen blue state, making me irrelevant in every election. 😉

    > Lack of mathematical understanding is not what is plaguing statistics. It’s the failure to comprehend what the math means.

    Well, actually neither understanding nor comprehension is irrelevant, but your point is crucial and overlooked in classrooms. It’s so much easier to plunge into the equations than to discuss philosophy when the 15-week semester clock starts ticking.

  5. Saying there is no difference between Republicans and Democrats is like saying there is no difference between Coke and Pepsi. Clearly Coke is better.

  6. Doug—To a Coke drinker. If you drink little soda, you likely cannot tell the difference between the two. However, your analogy stands—most people can’t tell the difference between Republicans and Democrats unless maybe if they are highly involved in politics. (Is there a difference? I don’t know—I’m not highly involved enough in politics. 🙂 )

  7. Sheri,

    As someone who follows politics, the primary difference between Republicans and Democrats is that Republicans are more likely (p << 0.01) to self-identify as Republicans. The flip is also true.

  8. Independence has its uses. It is difficult to imagine talking about irrelevant variables in math and physics without causing a great deal of confusion. This is extended to the use of statistical independence in statistical mechanics. For example, that the mean of xy is equal to the mean of x times the mean of y when x & y are statistically independent, but that the mean of the square is not equal to the square of the mean of x. Converting to the language of irrelevance would not seem to be worth the candle.

  9. I’ve been using Chrome lately and just experienced a redirect when clicking on the “refresh” button but a second refresh brought me back. Chrome is one of the fastest browsers I’ve found. I think it is because it relies on caching more than the others. It is devilishly difficult to avoid them as they don’t appear to be local nor singularly placed.

  10. I dunno about this independence/irrelevant argument. It sounds so much like the PC replacement of one euphemism with another such as “challenged” for “retarded”. These replacements never really fix or change anything. The underlying concepts and associations remain.

    Not that either “independence” or “irrelevance” are euphemisms but I don’t really see the difference between them. Yes, there are those that think of probability in causal terms but changing the words used won’t stop them thinking that way. What’s the next thing? Should we go for “florbity” (or any other silliness) as a replacement for “probability”?

  11. Dr. Briggs,

    Do De Finetti and the concept of exchangeability/partial exchangeability coincide with your statement of “irrelevance/relevance”? I can’t figure it out!

  12. > Sheri – To a Coke drinker. If you drink little soda,
    > you likely cannot tell the difference between the two.


    A Coke drinker can tell from the first sip.

    When “New” Coke came out, I was in Hell for the four or five months before “Classic” Coke.

    My best bet was RC Cola or some of the generic Shasta-like colas.

    Coffee wasn’t my thing in those days. I rarely have Coke, but when I do, I still can tell from the first sip.

  13. John B: Doug said “Clearly Coke is better”. I said “To a Coke drinker”. Otherwise one probably cannot tell them apart. That would mean that a Coke drinker could tell them apart, would it not? And other flavors of soda. One’s palate learns the flavor and tends to grow attached to said flavor. Variety in soda drinking leads to an unrefined palate.
