This class is a must watch for all; or at least a must read. I’ve been showing you how badly cause is misidentified in sciences which use statistics. Today a logic puzzle for you.
You don’t need to have watched or read any of the previous material to watch this.
Uncertainty & Probability Theory: The Logic of Science
HOMEWORK: Given below; see end of lecture.
Lecture
This is an excerpt from Chapter 7 of Uncertainty.
This section is relevant for all statistical and probability models which form the conceit that they have identified the cause of some data; the material is based on [an earlier paper of mine]. Suppose we learned that 1,000 people were “exposed” to PM2.5—which is to say, particulate matter 2.5 microns or smaller—at some zero or trace level, and that another group of the same size was exposed to high amounts. Call these two groups “low” and “high PM2.5”. Suppose, too, it turns out 5 people in the low group developed cancer of the albondigas, and that 15 folks in the high group contracted the same dread disease. (If you don’t love this example, substitute placebo versus drug or some other on-and-off, yes-or-no dichotomous state.)
What caused the observed difference in cancer rates? Some thing or things caused each unfortunate person in our experiment to develop cancer. What could this cause or these causes be? Notice I emphasize that there may be more than one cause present. It needn’t be the same thing operating on each individual. Each of the 20 people may have had a different cause of their cancer; or each of the 20 may have had the same cause. And this is so even though it may be that cancer of the albondigas is caused in the human body in only one way. Suppose some particular bit of DNA needs to “break” for the cancer to develop, and that this DNA can only break because of the presence of some compound in just those individuals with a certain genetic structure. Then the cause or causes of the presence of this compound become our main question: how did it come to be in each of these people? That cause may be the same or different.
There is no proof in the data that high levels of PM2.5 cause cancer of the albondigas. If high levels did cause cancer, then why didn’t every one of the 1,000 folks in the high group develop it? If high PM2.5 really is a cause—and recall we’re supposing every individual in the high group had the same exposure—then it should have made each person sick. Unless it was prevented from doing so by some other thing or things; e.g. perhaps a counter-balancing cause operates that acts “oppositely” of PM2.5. High PM2.5 cannot be a complete cause: it may be necessary, but it cannot be sufficient. And it needn’t be a cause at all. The data we have is perfectly consistent with some other thing or things, unmeasured by us, causing every case of cancer. And this is so even if all 1,000 individuals in the high group had cancer.
This always-or-nothing is true for every hypothesis; that is, every set of data. The proposed mechanism is either always an efficient cause, though it sometimes may be blocked or missing some “key” (other secondary causes or catalysts) or be counterposed by some other cause, or it is never a cause. There is no in-between. Always-or-never a cause is tautological, meaning there is no information added to the problem by saying the proposed mechanism might be a cause. From that we deduce that a proposed cause which, absent knowledge of essence, is said or believed to be a cause based on some function of the data is always a prejudice, conceit, or guess. Because our only knowledge is that the proposed cause is either always (albeit possibly sometimes blocked) or never an efficient cause, and this is tautological, we cannot find a probability that the proposed cause is a cause (conditioned only on that tautology, that is).
Consider also that the cause of the cancer could not have been high PM2.5 in the low group, because, of course, the 5 people there who developed cancer were not exposed to high PM2.5 as a possible cause. Therefore, their cause or causes must have been different if high PM2.5 is a cause. And even if PM2.5 is a cause, it is not necessarily the only cause. The same cause that operated in the low group, or some other cause entirely, might have struck some or all of the afflicted in the high group. In other words, since we don’t know if high PM2.5 is a cause, we cannot know whether whatever caused the cancers in the low group didn’t also cause the cancers in the high group. Recall that there may have been as many as 20 different causes. We conclude that nothing in the plain observations is of any help in deciding what is or isn’t a cause. That statement has tremendous importance when considering standard statistical procedures.
Given the multitude of possible measures we can make on actual people—everything from whatever they’ve eaten over the course of their life to the environments to which they have been exposed, and on and on almost (but never in reality) endlessly—it is more than reasonable to suppose that we can discover some thing which is also different between the two groups besides exposure levels. Suppose it turns out—and something like this almost surely will—every person in the high group ate at least one more banana than did folks in the low group. That means whatever conclusions we reach via some statistical analysis, we could have equally well put down to having eaten more bananas. This is because the label “low PM2.5” and “high PM2.5” can be swapped for “low banana” and “high banana”, a set of measurements just as true and valid. Call this the banana test.
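The banana test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names "pm25" and "banana", and the helper functions are all hypothetical, and only the counts (5 of 1,000 and 15 of 1,000) come from the example above. It supposes the banana measure splits people into exactly the same two groups as the exposure measure does.

```python
from fractions import Fraction

def make_people():
    """Build hypothetical records: 1,000 low and 1,000 high, with 5 and 15 cancers."""
    people = []
    for group, n, cases in (("low", 1000, 5), ("high", 1000, 15)):
        for i in range(n):
            # Assumption: the banana label coincides with the PM2.5 label.
            people.append({"pm25": group, "banana": group, "cancer": i < cases})
    return people

def rates(people, label):
    """Cancer rate in the low and high levels of whichever label is chosen."""
    out = {}
    for level in ("low", "high"):
        members = [p for p in people if p[label] == level]
        out[level] = Fraction(sum(p["cancer"] for p in members), len(members))
    return out

people = make_people()
pm25_rates = rates(people, "pm25")
banana_rates = rates(people, "banana")

# Any statistic computed from these rates is the same under either label.
print(pm25_rates == banana_rates)
print(pm25_rates["high"] / pm25_rates["low"], banana_rates["high"] / banana_rates["low"])
```

Whatever function of the data is computed next (a difference, a ratio, a test statistic), it cannot tell the two labels apart, because the numbers fed into it are identical.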
Clearly, there was some thing or some things different between the two groups. There must have been, because the number of people who got cancer was different, and the difference was caused, as must be true. But there is absolutely nothing in the observations alone that tells us what this cause was or what these causes were. We are not just discussing PM2.5. The criticisms here apply to every classical statistical analysis ever done.
Yet there is plausible suspicion that PM2.5 and not bananas might cause disease. We know this because we suspect it is in the nature of fine particulate matter to interact with, and possibly interfere with, the functioning of the lungs, the nature of which we also have some grasp. We do not know just based on the raw data (and never forget that we can only know what is true, though we can believe anything) that PM2.5 causes cancer. A reasonable condition, given what we have learned from other dose-response relationships, is that greater exposure to PM2.5 will give more opportunity for whatever it is in PM2.5 that causes cancer to operate. But we don’t have that in this experiment. So we can only assume PM2.5 is a cause and make verifiable predictions to test this assumption.
Notice that in this approach we must assume that (high) PM2.5 is always a cause but that sometimes it is stopped from operating because of some lack: say, a person has to have a specific genetic code, or must inhale the dust only when breathing is labored, or some chemical must be present, or whatever—the exact conditions may be exceedingly complex. As we saw above, the only other assumption is that PM2.5 is not a cause, and if it is not, then we must not use a probability model supposing PM2.5 is a cause.
This implies the following curious result. Probability models aren’t what you might have thought. If we assume PM2.5 is a cause, then we must conclude that it is sometimes blocked, else all 1,000 in the high group would have become ill. And recall that if we assume PM2.5 is a cause, it necessarily implies there is at least one other cause, a cause which must exist to account for the illnesses in the low group. Saying PM2.5 is a cause thus creates a mystery: what is this other cause (or causes)? But it also means that the probability model in the high group is not a model of cause: it is a model of blocking. The probability model doesn’t say, not really, “This person has a this-or-that chance of developing illness if exposed to PM2.5”, but rather, “The chance the causal effect of PM2.5 is blocked is this-and-such.” And even that pronouncement is still conditional on believing the other cause or causes besides PM2.5 don’t operate in the presence of PM2.5, and where is the evidence for that? There is none. Probability models always belie uncertainty. They are never proof of cause, which is why automated attempts to “prove” cause in large collections of data must fail. Uncertainty always lingers unless there is knowledge of power and essence. Probability models themselves are explored in depth in the next chapter.
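The arithmetic behind the "model of blocking" reading can be sketched as follows. This is a rough illustration, not a probability model from the book: the function name and variables are hypothetical, only the counts (15 cancers among 1,000 in the high group) come from the example, and "blocked" here simply means that the assumed cause did not produce cancer in that person.

```python
from fractions import Fraction

HIGH_N, HIGH_CASES = 1000, 15  # counts from the example in the text

def blocked_fraction(cases_due_to_pm25, n=HIGH_N):
    """Fraction of the high group in whom the assumed cause did not operate."""
    return Fraction(n - cases_due_to_pm25, n)

# Assumption 1: no other cause operated in the high group, so all 15 cases
# are put down to PM2.5 and the remaining 985 people were "blocked".
all_pm25 = blocked_fraction(HIGH_CASES)

# Assumption 2: other causes produced every one of the 15 cases, so PM2.5
# was blocked (or is not a cause at all) in all 1,000 people.
none_pm25 = blocked_fraction(0)

# The data alone cannot choose among the 16 possibilities in between.
possible = [blocked_fraction(k) for k in range(HIGH_CASES + 1)]

print(float(all_pm25), float(none_pm25), len(possible))
```

The observed frequency fixes a number only after an assumption about the other causes is supplied; the counts themselves are consistent with every entry in the list.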
At your insistence I watched the whole thing (usually I’m reading). The subject matter was not only interesting, useful, and clearly presented but was conveyed in style worthy of the Shakespearean stage. Maybe not quite Edmund Kean as Macbeth, but yet full of passion, conviction, wit, and vital force. Bravo.
Hagfish,
I’d submit it to the Oscars, but I don’t meet the Diversity quota.
You are a good teacher. Positive, confident, deeply knowledgeable, and entertaining. It’s obvious you love what you do and that spirit inspires a student to overcome challenges and enjoy learning. It’s also funny and charming you’re doing this on an obscure channel for no pay. The humility is beautiful. A true work of art, and as such, serving God. And who knows where the seeds may fall. In a saner world you would be a beloved and honored professor, but your glory would be no greater.
And you’re not alone. All around the web I find men and women working to clear the funky sludge of satanic drivel for no pay while enduring insult and injury as well. And yet finding deep satisfaction in work well done. The work is the pay. Adversity, like lifting weights, is salutary. So in spite of, or because of, everything going to shit — it’s a great time to be alive. If you have the right frame of mind.
I went to a data science talk that had this methodology:
-Take a time series of two quantities
-Treat this as the result of a system of ordinary differential equations.
-Use software to find functions from a huge library of potential functions which fit the system and which explode outside of the observed data.
-Claim that this makes good “short-term predictions” because if we put a set of observed values and times back into the functions we found, we get something close to the next observation.
-Claim that it makes good “long-term predictions” by analyzing the phase plane and saying that there are “believable” reasons for the behavior that we observed.
The word “believable” was used something like 20 times in the talk, as if it were supposed to be very impressive. For example, there was data comparing COVID cases to attendance at public events. The resulting solution had an attractor point with a low number of cases and low attendance. This was defended by saying “it’s believable that if we had reduced the number of cases, then COVID could have been managed without everyone getting sick.” (Even though, since we have a stationary point at the attractor, that means that we would have had a low number of people being sick forever, and also that attendance at public events would have remained low forever; remember this was described as a “long-term prediction.”) The presenter was also extremely proud of the fact that in the model the trajectories in the area of high cases and high attendance at public events tended towards a line that increased cases and decreased attendance since “it is believable that when lots of people are sick, more people will get sick and less people will go to events because they are sick.” This despite:
1.) You could easily make that claim without a model, so the model did nothing for us.
2.) The model was specifically constructed so that all trajectories outside of the observed data would explode, meaning that when you are near the boundary of the observed data you are GUARANTEED to have behavior like this.
3.) If we actually followed the trajectory then more people would get sick than the population of the observed region, and eventually there would be 0 (or even negative) attendance to public events. Obviously this didn’t happen.
Even when you know things are bad, it’s always amazing to see how they can still end up being worse than you thought.
What Causes Cancer Of The Albondigas?
Mexican food?
When I was in college and was taking my first statistics course we were taught that you can’t prove causality with statistics. All you can do is show a correlation and they are a dime a dozen.
Hummmm . . .
Substitute Russian Roulette for PM2.5s. Five in a thousand die in the control group; definitely not caused by Russian Roulette. And 169 die in the exposed group. Even though the exposed group might have eaten more bananas we would think proper. If Russian Roulette is a cause of DEATH then why is everyone not dead?
BTW: My Mexican Granny got cancer of the Albondigas and she was a victim of Climate Change. QED!