What’s the difference between “independence” and “irrelevance”, and why does that difference matter? This typical passage from *A First Course in Probability* by Sheldon Ross (p. 87) is lovely because it commits several major misunderstandings, all of which prove “independence” a poor term. (This is a book I recommend; for readability, I changed Ross’s notation slightly, e.g. from “P(E)” to “Pr(E)”.)

The previous examples of this chapter show that Pr(E|F), the conditional probability of E given F, is not generally equal to Pr(E), the unconditional probability of E. In other words, knowing that F has occurred generally changes the chances of E’s occurrence. In the special cases where Pr(E|F) does in fact equal Pr(E), we say that E is independent of F. That is, E is independent of F if knowledge that F occurred does not change the probability that E occurs.

Since Pr(E|F) = Pr(EF)/Pr(F), we see that E is independent of F if Pr(EF) = Pr(E)Pr(F).

The first misunderstanding is “Pr(E), the unconditional probability of E”. There is no such thing. No unconditional probability exists. All, each, every probability must be conditioned on something, some premise, some evidence, some belief. Writing probabilities like “Pr(E)” is always, every time, an error, not only of notation but of thinking. It encourages and amplifies the misbelief that probability is a physical, tangible, measurable thing. It also heightens the second misunderstanding. We must always write (say) Pr(E|X), where X is whatever evidence one has in mind.

The second misunderstanding, albeit minor, is this: “knowing that F has occurred generally changes the chances of E’s occurrence.” Note the bias towards empiricism. In other places Ross writes “An infinite sequence of independent trials is to be performed” (p. 90, an impossibility) and “Independent trials, consisting of rolling a pair of fair dice, are performed” (p. 92; “fair” dice are impossible in practice). For Ross, “events” or “trials” “occur”; that is, they are propositions that can be measured in reality, or are mistakenly thought to be measurable. Probability is much richer than that.

Non-empirical propositions, as in logic, easily have probabilities. Example: the probability of E = “A winged horse is picked” given X = “One of a winged horse or a one-eyed one-horned flying purple eater must be picked” is 1/2, even though the “events” E and X will never occur. So maybe the misunderstanding isn’t so minor at that. The bias towards empiricism is what partly accounts for the frequentist fallacy. Our example E and X have *no* limiting relative frequency. Instead, we should say of any Pr(E|F), “The probability of E (being true) accepting F (is true).”

The third and granddaddy of all misunderstandings is this: “E is independent of F if knowledge that F occurred does not change the probability that E occurs.” The misunderstanding comes in two parts: (1) use of “independence”, and (2) a mistaken calculation.

Number (2) first. Because it is a mistake to write “Pr(EF) = Pr(E)Pr(F)”, there are times, given the same E and F, when this equation holds and times when it doesn’t. A simple example. Let E = “The score of the game is greater than or equal to 4” and F = “Device one shows 2”. What is Pr(E|F)? Impossible to say: we have no evidence tying the device to the game. Similarly, Pr(E) does not exist, nor does Pr(F).

Let X = “The game is scored by adding the total on devices one and two, where each device can show the numbers 1 through 6.” Then Pr(E|X) = 33/36, Pr(F|X) = 1/6, and Pr(E|FX) = 5/6; thus Pr(E|X)Pr(F|X) (~0.153) does not equal Pr(EF|X) = Pr(E|FX)Pr(F|X) (~0.139). Knowledge of F *in the face of X* is *relevant* to the probability E is true. (Recall these do not have to be real devices; *they can be entirely imaginary*.)

Now let W = “The game is scored by the number shown on device two, where devices one and two can each show the numbers 1 through 6.” Then Pr(E|W) = 1/2, Pr(F|W) = 1/6, and Pr(E|FW) = 1/2 because knowledge of F *in the face of W* is *irrelevant* to knowledge of E. In this case Pr(EF|W) = Pr(E|W)Pr(F|W).

The key, as might have always been obvious, is that relevance depends on the specific information one supposes.
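The two calculations can be checked by brute-force counting. What follows is only a sketch: it assumes each of the 36 joint states of the two devices is equally probable on X and on W (neither premise says so in as many words), and the names `pr`, `summarize`, and `results` are my own. The same E and F are used both times; only the premise, which enters through the scoring rule, changes.

```python
from fractions import Fraction
from itertools import product

# Every joint state (device one, device two). Treating the 36 states as
# equally probable is an assumption read into premises X and W.
STATES = list(product(range(1, 7), repeat=2))

def pr(proposition, evidence=lambda state: True):
    """Probability of `proposition` accepting `evidence`, by counting states."""
    kept = [s for s in STATES if evidence(s)]
    return Fraction(sum(1 for s in kept if proposition(s)), len(kept))

def F(state):
    """F = "Device one shows 2"."""
    return state[0] == 2

def summarize(score):
    """Return the four probabilities for E = "score >= 4" under one scoring rule."""
    E = lambda state: score(state) >= 4
    return {
        "Pr(E|.)": pr(E),
        "Pr(F|.)": pr(F),
        "Pr(E|F.)": pr(E, F),
        "Pr(EF|.)": pr(lambda state: E(state) and F(state)),
    }

# The premise enters only through the scoring rule.
results = {
    "X": summarize(lambda state: state[0] + state[1]),  # sum of both devices
    "W": summarize(lambda state: state[1]),             # device two alone
}

for premise, p in results.items():
    product_rule_holds = p["Pr(EF|.)"] == p["Pr(E|.)"] * p["Pr(F|.)"]
    print(premise, {k: str(v) for k, v in p.items()}, product_rule_holds)
```

Under X the product rule fails and under W it holds, though E and F are word for word the same propositions in both cases.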

Number (1). Use of “independent” conjures up images of causation, as if, somehow, F is causing, or causing something which is causing, E. This error often happens in discussions of time series, as if previous time points caused current ones. We have heard times without number people say things like, “You can’t use that model because the events aren’t independent.” You can use any model; it’s only that some models make better use of information because, usually, knowing what came before is *relevant* to predictions of what will come. Probability is a measure of information, not a quantification of cause.

Here is another example from Ross showing this misunderstanding (p. 88, where the author manages two digs at his political enemies):

If we let E denote the event that the next president is a Republican and F the event that there will be a major earthquake within the next year, then most people would probably be willing to assume E and F are independent. However, there would probably be some controversy over whether it is reasonable to assume that E is independent of G, where G is the event that there will be a recession within two years after the election.

To understand the second example, recall that Ross was writing at a time when it was still possible to distinguish between Republicans and Democrats.

The idea that F or G are the full or partial efficient cause of E is everywhere here, a mistake reinforced by using the word “independence”. If instead we say that knowledge of the president’s party is *irrelevant* to predicting whether an earthquake will soon occur, we make more sense. The same is true if we say that knowledge of this president’s policies is *relevant* for guessing whether a recession will occur.

This classic example is a cliché, but is apt. Ice cream sales are positively correlated with drownings. The two events, a statistician might say, are not “independent”. Yet it’s not the ice cream that is *causing* the drownings. Still, *knowledge* that more ice cream is being sold is *relevant* to fixing a probability that more drownings will be seen. The *model* is still good even though it is silent on cause.
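A toy version makes the point concrete. Everything in this sketch is hypothetical: the premise (call it X) supposes a day is hot with probability 1/2, and that high ice cream sales and more drownings each depend only on whether the day is hot. The numbers are invented for illustration and the names `weight` and `pr` are my own. Nowhere does the premise say ice cream causes anything.

```python
from fractions import Fraction
from itertools import product

# Hypothetical premise X; every number below is invented for illustration.
PR_HOT = Fraction(1, 2)
PR_ICE = {True: Fraction(4, 5), False: Fraction(1, 5)}      # Pr(high sales | hot or not)
PR_DROWN = {True: Fraction(3, 10), False: Fraction(1, 10)}  # Pr(more drownings | hot or not)

# Each world is a triple of truth values: (hot, ice, drown).
WORLDS = list(product([True, False], repeat=3))

def weight(hot, ice, drown):
    """Probability of one (hot, ice, drown) combination accepting X."""
    p = PR_HOT if hot else 1 - PR_HOT
    p *= PR_ICE[hot] if ice else 1 - PR_ICE[hot]
    p *= PR_DROWN[hot] if drown else 1 - PR_DROWN[hot]
    return p

def pr(proposition, evidence=lambda w: True):
    """Probability of `proposition` accepting `evidence` and X."""
    kept = [w for w in WORLDS if evidence(w)]
    total = sum(weight(*w) for w in kept)
    return sum(weight(*w) for w in kept if proposition(w)) / total

hot = lambda w: w[0]
ice = lambda w: w[1]
drown = lambda w: w[2]

p_drown = pr(drown)                                      # X alone
p_drown_ice = pr(drown, ice)                             # X plus high sales
p_drown_hot = pr(drown, hot)                             # X plus a hot day
p_drown_hot_ice = pr(drown, lambda w: hot(w) and ice(w)) # X plus both

print(p_drown, p_drown_ice, p_drown_hot, p_drown_hot_ice)
```

On X alone, learning that sales are high moves the probability of more drownings from 1/5 to 13/50, so sales are relevant. Once we also accept that the day is hot, sales become irrelevant: the probability is 3/10 with or without them. Relevance changed because the evidence changed, and cause never entered into it.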

**Keynes**

*This section contains more technical material.*

The distinction between “independence” and “irrelevance” was first made by Keynes in his unjustly neglected *A Treatise on Probability* (pp. 59–61). Keynes argued for the latter, correctly asserting, first, that no probabilities are unconditional. Keynes gives two definitions of irrelevance. In our notation, “F is irrelevant to E on evidence X, if the probability of E on evidence FX is the same as its probability on evidence X; i.e. F is irrelevant to E|X if Pr(E|FX) = Pr(E|X)”, as above.

Keynes tightens this to a second definition. “F is irrelevant to E on evidence X, if there is no proposition, inferrible from FX but not from X, such that its addition to evidence X affects the probability of E.” In our notation, “F is irrelevant to E|X, if there is no proposition F’ such that Pr(F’|FX) = 1, Pr(F’|X) ≠ 1, and Pr(E|F’X) ≠ Pr(E|X).” Note that Keynes has kept the logical distinction throughout (“inferrible from”).
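To see how the second definition is tighter than the first, here is a small sketch in a deliberately tiny world. The assumptions are mine, not Keynes’s: X says a single device shows exactly one of the numbers 1 through 6, each taken as equally probable on X, and a proposition is modeled as the set of outcomes on which it is true. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Assumed premise X: one device shows exactly one of 1..6,
# each outcome taken as equally probable on X.
OUTCOMES = frozenset(range(1, 7))

def pr(a, given=OUTCOMES):
    """Pr(a | given, X), with propositions modeled as sets of outcomes."""
    return Fraction(len(a & given), len(given))

def irrelevant_first(f, e):
    """First definition: Pr(E|FX) = Pr(E|X)."""
    return pr(e, f) == pr(e)

def witnesses_second(f, e):
    """All F' with Pr(F'|FX) = 1, Pr(F'|X) != 1, and Pr(E|F'X) != Pr(E|X).

    F is irrelevant on the second definition only if this list is empty.
    """
    found = []
    # Sizes 1..5 skip the empty set and the full set of outcomes.
    for size in range(1, len(OUTCOMES)):
        for combo in combinations(sorted(OUTCOMES), size):
            f_prime = frozenset(combo)
            if pr(f_prime, f) == 1 and pr(f_prime) != 1 and pr(e, f_prime) != pr(e):
                found.append(f_prime)
    return found

E = frozenset({2, 4, 6})  # "the device shows an even number"
F = frozenset({1, 2})     # "the device shows 1 or 2"

first_ok = irrelevant_first(F, E)
witnesses = witnesses_second(F, E)
print(first_ok, [sorted(w) for w in witnesses])
```

In this toy world F passes the first definition, since Pr(E|FX) = Pr(E|X) = 1/2. It fails the second, because a proposition such as F’ = “the device shows 1, 2, or 4” follows from F, does not follow from X, and gives Pr(E|F’X) = 2/3.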

Lastly, Keynes introduces another distinction (p. 60; pardon the LaTeX):

$h_1$ and $h_2$ are independent and complementary parts of the evidence, if between them they make up $h$ and neither can be inferred from the other. If $x$ is the conclusion, and $h_1$ and $h_2$ are independent and complementary parts of the evidence, then $h_1$ is relevant if the addition of it to $h_2$ affects the probability of $x$.

A passage which has the footnote (in our modified notation): “I.e. (in symbolism) $h_1$ and $h_2$ are independent and complementary parts of $h$ if $h_1 h_2 = h$, $Pr(h_1|h_2) \ne 1$, and $Pr(h_2|h_1) \ne 1$. Also $h_1$ is relevant if $Pr(x|h) \ne Pr(x|h_2)$.”

Two (or however many) observed data points, say, $x_1$ and $x_2$, are independent and complementary parts of the evidence because neither can be deduced (not mathematically derived *per se*) from the other. Observations are thus no different than any other proposition.