False is not True
We spoke earlier of falsification and why I didn’t think it was an especially useful criterion. My tweets about it inspired Deborah Mayo, who advocates for the new way of statistics (whereas I vouch for the old), to respond: “It’s not low prob but a strong arg from coincidence that warrants falsifying in sci. Essence -weasel word.” She links to her article on falsification, which I’ll assume you’ll read. Essence we’ll do later, though I responded “How can you tell a weasel from a statistician without essence? Answer: you cannot.”
Today a brief explanation why falsification isn’t exciting. For details, see my book Uncertainty: The Soul of Modeling, Probability & Statistics.
Falsified means something very specific. Words matter. We are not free to redefine them willy-nilly.
(Please forgive the use of notation; though it often leads to the Deadly Sin of Reification, it can scoot things along.)
If a model X (a list of premises, data, etc.) for observable Y says Pr(Y | X) = p > 0, and Y is subsequently observed, then X has not been falsified. And this is so even if p is as small as you like, as long as it is greater than 0. Falsified means proved false. In mathematics and logic, prove has a definite, rigorous, inflexible, intransigent meaning. And thank the Lord for that. I propose keeping the word prove as it is and not stretching it to make some sociologist’s job easier. (Most researchers are anxious to say X did or did not cause Y.)
If Pr(Y | X) = 0, and Y is observed—and all (as in all) conditions in X having been met—then X has been proved false, i.e. it has been falsified. It is as simple as that. (What part of X that is the problem is another matter; see the book.)
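Here, as a sketch only, is that rule in Python; the model and its probabilities are invented for illustration. Falsification happens only at p = 0, never at p merely small.

```python
# A minimal sketch: falsified means Pr(Y | X) = 0 and Y observed.
# The model X and its probabilities are invented for illustration.

def falsified(model_probs: dict, observed: str) -> bool:
    """X is falsified only if it said the observed Y was impossible."""
    return model_probs.get(observed, 0.0) == 0.0

# Model X: what it says about tomorrow's weather, given its premises.
X = {"sun": 0.899, "rain": 0.1, "snow": 0.001, "frogs": 0.0}

print(falsified(X, "snow"))   # False: p = 0.001 is tiny, but not zero
print(falsified(X, "frogs"))  # True: X said Pr(frogs | X) = 0
```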
So why do people say, when Pr(Y | X) = p and p is small and Y is observed, that X has been “practically (or statistically) falsified”? Because they are mixing up probability and decision. They want a decision about X, and since that requires hard work, and there is a tantalizingly small p offering to take the pain away, they say “X can’t be true because p is small.” Well, brother, X can be true if p is any number but 0. That’s not Briggs’s requirement: that’s the Universal Law of Logic.
Decision is Not Probability
That small p may be negligible to you, and so you are comfortable quashing X and therefore are willing to accept the consequences of that act of belief. But to another man, that p may not be small at all. The pain caused by incorrectly rejecting X may be too great for this fellow. How can you know? Bets, and believing or disbelieving X is a sort of bet, are individualistic not universal.
Incidentally, p is not the “p-value”. The p-value says nothing about what anybody wants to hear (it speaks only of conditional statements about functions of observables, given that model parameters equal certain values). I’ve defined and debunked the p-value so often that I won’t repeat those arguments here (see the book). Pr(Y | X) makes sense to everybody.
Now if you have to act, if you have to act as if X is true or false, then you will need to figure out how p fits in with your consequences should X turn out true or false. That’s not easy! It has been made way, way too easy in classical statistical methods; so easy that consequences are not even an afterthought. They are a non-thought. P-values and Bayes factors act as magic, and say, “Yes, my son, you may believe (or not) X”, but they don’t tell you why, or give you p, or tell you what will happen if the hypothesis test is wrong.
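A sketch of that point in Python. The losses are invented, and p stands loosely for the probability that X is true, itself a simplification: the same small number leads two people to opposite acts once consequences differ.

```python
# Same p, different consequences, different decisions.
# All numbers are invented for illustration.

def act(p_x_true: float, loss_wrongly_reject: float,
        loss_wrongly_accept: float) -> str:
    """Choose the act with the smaller expected loss."""
    exp_loss_reject = p_x_true * loss_wrongly_reject
    exp_loss_accept = (1 - p_x_true) * loss_wrongly_accept
    return "reject X" if exp_loss_reject < exp_loss_accept else "accept X"

p = 0.03  # small, but not zero
print(act(p, loss_wrongly_reject=10, loss_wrongly_accept=1))    # reject X
print(act(p, loss_wrongly_reject=1000, loss_wrongly_accept=1))  # accept X
```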
X is X
X is X, as Aristotle would agree. Make even the tiniest, weest, most microscopic change to X, such that the change cannot be logically deduced from the original X, and you no longer have X.
“Is Briggs saying that if you change X into something else, X is no longer X?”
“He is.”
“Mirabile dictu!”
“Since not-X is not X, that means Pr(Y | X) probably won’t equal Pr(Y | not-X), and so the decision one would make whether to believe X is completely different from the decision one would make whether to believe not-X.”
“Sacré bleu! It is not the same choice!”
“Hang on. ‘Not-X’ isn’t some blanket ‘X is false’ statement. If I read my logic right, and I do, this abuse of notation ‘not-X’ means some very definite, well specified model itself, say, W.”
“Ouf! You are right! There is nothing special about that p. It is wholly dependent on X. That must mean all probability is conditional, which further weakens the utility of falsifiability.”
There is also the point that you might dislike X at Pr(Y1 | X) = p1, but love X at Pr(Y2 | X) = p2. The number of observable propositions Yi that are dependent on X may be many, and the choices you make could depend on what X says about these different propositions and which Yi are observed, which not (in these equations X is fixed). But did you notice you have to wait until Y is observed before you know how the model works? Decision is not easy!
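To sketch that in Python (all probabilities and observations invented): one fixed X may score wonderfully on one observable and dreadfully on another, and you cannot tally any of it until the Yi come in.

```python
# One fixed model X, several observables Y_i. Probabilities invented.
X_says = {"Y1": 0.01, "Y2": 0.60, "Y3": 0.95}

# Only after observation do we learn how X fared.
observed = {"Y1": True, "Y2": True, "Y3": False}

for y, p in X_says.items():
    verdict = "happened" if observed[y] else "did not happen"
    print(f"Pr({y} | X) = {p:.2f}; {y} {verdict}")
# You might dislike X for Y1 (it gave 0.01 to something that happened)
# yet love it for Y2; the decision is yours, not the model's.
```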
Sloppiness Abounds
There exist classes of “machine learning” models, some of which say things like Pr(Y1 | X) = 0 for certain outcomes, i.e. they make hard predictions, like the weatherman who says, “The high will be 75 F.” If the temperature is 75.1 F, the weatherman’s model has been falsified, because he implied that any number but 75 F was impossible. Some machine “learning” models are like that. But few or none would reject X if the model was just a little off, like the weatherman was.
In other words, even though the model has been falsified, few would act on that falsification. People add “fuzz” to the predictions, which is to say, the model might insist Pr(Y1 | X) = 0, but people make a mental substitution and say either Pr(Y1 | X) = 0 + p (a false deduction) or they will agree Pr(Y1 | X) = 0 but say Pr(Y1 | W) = p, where W is not X but which is similar enough to X that “It might as well be X”. That does not follow; it is an illegal maneuver.
Of course, with the weatherman, everybody understands him to mean not “The high will be 75 F” but “The high will be about 75 F”. Then Pr(High something far from 75 | Weatherman’s model) = p where p does not equal 0. The words “something far” and “about” are indefinite. This is fine, acceptable, and desirable. The last thing in the world we need is more quantification.
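In code, the difference between the hard model X and the fuzzed model W looks like this (a sketch; the numbers are invented):

```python
# Hard prediction vs. "about" prediction. Numbers invented.

def falsified_hard(prediction: float, observed: float) -> bool:
    """Model X: any value but the prediction was declared impossible."""
    return observed != prediction

def falsified_about(prediction: float, observed: float) -> bool:
    """Model W ("about 75"): gives p > 0 to every value, so no single
    observation can falsify it."""
    return False

print(falsified_hard(75.0, 75.1))   # True: X implied 75.1 was impossible
print(falsified_about(75.0, 40.0))  # False: W survives even a silly miss
```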
Is X true?
X is an assumption, a model. We accept for the sake of argument that X is true in judgments like Pr(Y | X). All of science is like this! If we want separate information on whether X is true, we need statements like this: Pr(X | outside evidence). One version is this: outside evidence = X said Y couldn’t happen, but Y did. Then Pr(X | outside evidence) = 0, i.e. X has been falsified. But another example is this: outside evidence = X said Y might happen, and Y did. Then Pr(X | outside evidence) = q, where q > 0.
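As a sketch of that arithmetic (the prior and the rival model W are invented; nothing hangs on the particular numbers):

```python
# Pr(X | outside evidence) by Bayes' theorem. The prior and the
# rival model W are invented for illustration.
prior_X = 0.5
prior_W = 0.5

def pr_X_given_Y(p_y_given_x: float, p_y_given_w: float) -> float:
    """Posterior of X after Y is observed, W being the only rival."""
    num = p_y_given_x * prior_X
    return num / (num + p_y_given_w * prior_W)

print(pr_X_given_Y(0.0, 0.3))  # 0.0: X said Y couldn't happen; falsified
print(pr_X_given_Y(0.2, 0.3))  # 0.4: X said Y might happen; q > 0
```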
There will be lots of other outside evidence, because inside most X are lots of mathematics, which we know is true based on various theorems; inside X will be data, which are assumed true based on measurement (usually); and other things. So really we should have been writing all along Pr(Y | X & E), where E = “background” evidence. All arguments are chains of reasoning back to indubitable propositions. Why?
No why here. That’s enough for this post. I’m abusing notation terribly here, but only because a full-blown treatment requires the sort of room you have in a book.
Did somebody say book? Uncertainty: The Soul of Modeling, Probability & Statistics.
Briggs, a nice article on epistemology, but I’m not sure how useful it is in practice. Consider, for example, two experiments. First, the Michelson-Morley experiment, which showed (presumably) no ether drift. Quoting from the Wikipedia article on this:
And the Aspect experiments testing Bell’s inequalities were not without loopholes and limits set by experimental error. (Web search: Aspect experiments errors.)
If I understand you correctly, one would not have taken these experiments to disprove ether drift or local hidden variables. Or maybe I didn’t understand you and you can expand in the context of experiments such as those above.
“But few or none would reject X if the model was just a little off, like the weatherman was.” I was just asking a global warming believer how close is close enough? So far, the only answer I have ever gotten is “The models are close enough”. They dare not quantify. (Sorry, some things do need to be quantified when they affect everyone’s future.)
Bob: Interesting questions. Something for me to research, I guess. Unless you have a blog post! (Shameless self promotion is allowed here!)
All: I am continually amazed how much of statistics and math and even physics is closer to philosophy than anything else. So many different schools of thought. It’s nothing like classic Newtonian physics, where we can just drop something off a tall building, nor like classic research, where we do X and see if Y happens rather than predict how likely Y is. Science today is no longer black and white. Philosophy seems to decide which road a scientist travels.
I’ve been in discussions with people that insist a person is unjustified in believing that result Y occurred if the probability for Y is very small, given X. Pure nonsense if you actually think about it. Everything Briggs wrote about here needs to be considered.
What do you think about the placebo effect? If result Y occurs both with and without the medication what does that tell us, if anything?
I always understood the “falsification principle” to mean that, in order for a theory to be considered ‘scientific’, it should be possible in principle to falsify that theory by real world observation, independently of whether it has actually been falsified or not. If not, then that theory is not about the real world at all.
I might have that wrong. It was a long time ago.
If X = “data” (or a list of premises), what would it mean to say “data have not been falsified”?
If Pr(Y | X) = 0, it means that given the premise X, Y is false. If, given X, Y is observed, it implies the probability assignment of 0 is incorrect. However, X can still be true.
For example, P(the cat is male | a calico cat) = 0. If one observes a male calico cat, it doesn’t imply the premise X of “a calico cat” has been proved false. However, the probability assignment is proved wrong.
!X (not X) is simply not X. What else it might be is not specified or implied; the bounds of what it might be depend on its “universe” (2d, 3d; Venn diagram stuff).
Computers usually use Boolean algebra where only two states are possible; thus !0 == 1 and !1 == 0, always and only. But in a computer using a byte or integer, the bitwise NOT of zero is all 1’s in binary, which read as a signed value is typically -1.
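That is easy to check; for instance in Python, where the same logical-versus-bitwise split holds:

```python
# Logical NOT: two states only.
print(not 0)   # True
print(not 1)   # False

# Bitwise NOT flips every bit; an all-1's pattern read as a signed
# (two's-complement) integer is -1.
print(~0)      # -1
print(~1)      # -2
```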
Sometimes the question is itself fuzzy. “Are you a Christian?” has so much fuzz that after scraping it off not much is left. If I say “yes”, does it mean completely, entirely, a perfect alignment? If I say “no” does it mean complete exclusion? I believe it means whatever each person wants it to mean for whatever purposes are on the table.
Rich: The word “science” is itself rather fuzzy in meaning. If a claim is falsifiable, then it is probably scientific, but the converse does not necessarily follow; if a thing is not falsifiable it might still be “scientific”; just not falsifiable.
Bob Kurland: Your interest in experiments on ether resembles mine somewhat, having become aware that the outcome depends somewhat on where you are when you perform the experiment. Fortunately my daily life proceeds satisfactorily without knowing the complete answer.
JH: Your calico cat example is a good one. However, I am reading the P(Y|X) to mean that if it’s a male cat, then it is not possible for it to be a calico cat, rather than that there are no calico cats. If Y is true in every case, then there are no calico cats (nor any more cats thereafter, but we’ll overlook that for now). I guess how one reads this makes a difference in how it’s interpreted.
I am a bit concerned that I find myself agreeing with Briggs more often than not, at least on his statistical points. I have a couple of thoughts on this interesting topic. I think my major concern with the issue is that showing something is “false” seems to assume something can be shown to be or at least taken to be “true.”
One consideration is that, at least in physics, the statement that is usually sought is the following (to abuse Briggs’s notation even more): given a set of data X, a subset of R (all possible observations of the universe), we wish to say that Y = f(X) has utility, usually meaning it is not too wrong, or better, not “wronger” than another function, say g(*), where f(*) is some well-defined function and Y is some other subset of R. This also means that Y and X are not “premises” but observations or experimental results of some subset of the universe. It is important that we don’t ask whether Y = f(X) is True or False, simply whether it has utility. The usual topic of investigation is to determine if f(*) is as good or not as good as some other function, say Y = g(*), for now. The caveat “for now” is important because in the conduct of physics, any function we have to calculate one set of observations from another is always only useful (or not wrong) as long as another function has not come along to show itself better (meaning it applies more generally, to a larger set of observations). In Briggs’s notation this would be stated Pr(Y=f(X)|X) versus Pr(Y=g(X)|X), where in this case Y and X are classes of possible observations, not just a single observation. This means we are not seeking information about the specific Y and X but about the general function that relates various possible Y’s with X’s.
Now the question is: if we have two candidate functions f and g, how do we decide between wrong and wronger (maybe there is a movie title in that?). First, what do we mean by decide? In practice this is ultimately a personal decision of the scientist. (He or she may be able to convince others to make the same decision, but the decision is individual.) The question then becomes: how does that scientist make that decision? One way would be to take a particular pair of sets of observations, X1 and Y1, and mathematically express Y1 = f(X1), plot results, and then do just what Briggs says not to do: calculate some kind of statistical test and look at that value, p-value or some other. To me, this has little decisive power, because what we are seeking is the utility of the general function Y = f(X), not the specific utility of Y1 = f(X1). A better measure of utility would be looking at Yi = f(Xi) for as many different subsets of the observable universe, i, as one can find, the more the better, looking at different phenomena, different places, at different scales, etc. As the number of subsets for which the function f produces experimentally close predictions grows (what counts as close needs a little interpretation, again by the individual scientist), the function f might be claimed to be good, or not too wrong, but always “for now.” Part of this “for now” is because we always have experimental errors or limits in the observations we make, and another part is that the universe has a lot of things we have not measured yet, maybe even things we did not measure in the experiments we already did. But the other part of the individual scientist’s view is that f is less wrong than any other function g that has so far been proposed. So the true test of whether f is “right” or “correct” is simply that it is significantly less wrong than any other function that has been proposed, for the set of observations that exist so far, in the opinion of the individual scientist. Once another function is found that is not wrong where f is not wrong but is also not wrong for different subsets of the observable universe where f is more wrong, then that other function becomes not wrong, for now. Once the scientist makes his/her decision(s) they can apply them to things such as rocket launches, bridge building, etc. (A sketch of this tallying appears just below.)
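For concreteness, here is a sketch in Python of that tallying; the functions f and g, the pretend “world,” and the error measure are all invented:

```python
# Decide which of two candidate functions is "less wrong" across
# many subsets of observations. Everything here is invented.
import random

random.seed(1)

def f(x): return 2.0 * x          # candidate function 1
def g(x): return 2.0 * x + 0.5    # candidate function 2

# Ten "experiments", each a subset (X_i, Y_i) of observations.
subsets = []
for _ in range(10):
    xs = [random.uniform(0, 10) for _ in range(20)]
    ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]  # the "world"
    subsets.append((xs, ys))

def mean_abs_err(func, xs, ys):
    return sum(abs(func(x) - y) for x, y in zip(xs, ys)) / len(xs)

wins_f = sum(mean_abs_err(f, xs, ys) < mean_abs_err(g, xs, ys)
             for xs, ys in subsets)
print(f"f less wrong than g in {wins_f} of {len(subsets)} subsets (for now)")
```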
Toward the end of the “Essential vs Empirical Models” post I gave a little example of this process using gravitational fall (as Sheri mentions) to show how the process works, in my view at least. The ether/constant speed of light notion that BobK mentions also can be used as an example of this process, with a bit of elaboration.
Something I am involved with now gives an example of my own process of decision. There are physical and chemical phenomena to which bacteria and viruses can be exposed which damage the DNA of the organisms such that the organism can no longer replicate, i.e. it is dead, inactivated. Unfortunately, such bugs have a will to survive so that, even if their DNA is damaged, they can later repair certain amounts and kinds of damage. This means that one has to expose the bugs to larger and larger doses of insult to kill larger and larger fractions of the bugs, to overwhelm the repair capability. So the bugs seem to have a susceptibility to damage and an ability to repair.

Now, experiments on these bugs expose populations of the bugs to various dosages of the insult and plot the number of the bugs that survive, usually the ratio of living bugs in the exposed population to the number of living bugs in a control, unexposed population. It invariably turns out that the fraction killed increases as dosage increases, but then reaches a level at which increasing dose makes little difference in the amount killed. A prevailing interpretation, i.e. a function that “predicts” it, is that the bug population has a component that is susceptible and a component (i.e. a genetically distinct subpopulation) that is resistant. In this case the fraction at which the killing levels off with dose represents the fraction of the population that is resistant. Statistical manipulation of the data can yield “significant” estimates of the fraction of the population that is resistant and characteristics of its resistance.

I am not ready to decide that function (resistant subpopulation) is not wrong, for a few reasons. First, as far as one can tell, the gene sequences of the organisms do not demonstrate significant variations within the populations, but the bugs have millions of base pairs and finding this distinction is a bit iffy. Second, the resistant population fraction seems to depend on (i.e. it varies with) whether the bugs are insulted when suspended in various liquids or air, or dry on surfaces, or in various other media that should not affect the damage mechanism, despite the fact that the repair process occurs during incubation, long after exposure. Third, the levels of the resistant subpopulation always seem to be suspiciously close to the sensitivity limit of the experiments done.

Biological experiments like this are messy things (compared to physics) in which one has to put a little liquid with organisms down on agar plates, mush it around, incubate, and then count the dots that show up the next day. Worse, one has to guess at how much to dilute the original liquid, because one can’t count more than a certain number of dots, and fewer than a certain other number means the sampling of the small amount of liquid one puts down on the plates varies too much to give a tight error distribution. The result is that there is a bit of biological “art” required to get numbers that don’t bounce around too much. So, on this issue there are at least two camps on the decision: genetically distinct resistant subpopulations exist within the overall population, or they do not, in which case the observations simply show the signal-to-noise limit of the experimental procedures. Right now I find myself in the “do not” camp, but that is subject to change as more information becomes available. No amount of statistical testing will convince me one way or another; more data taken in other ways might.
But the answer does make a significant difference in the utility of the insulting phenomena, the adaptation of the organism to the environment, how it may have evolved, etc.
Falsification for me is not absolute. It is an attempt to point at the path to truth while also hinting at paths not pointing at Truth. It tells me that every morning when I get up, all of the things I knew from yesterday to be true are candidates for rejection. No matter how well I have ‘proved’ to myself that something is true, all I have ever truly done is fail to disprove it.
In order to lead people, part of what we have to do is give them an anchor point that is relatively easy for them to comprehend. Proving your point is ever so much easier to convey than failing to disprove it.
And we get to the conundrum. Do I espouse certainty or uncertainty? Certainty gets you followers of intent. Uncertainty gets you a ragtag crew. Intent followers aren’t uncertain about their cause and will beat me to a pulp, while the uncertainty crowd is quite certain that they don’t know and will sit on the sidelines waving their flags and getting nowhere. But truth is in the uncertainty. Getting things done is certainty.
Appearances do matter.
Sheri, your comment makes me realize that I have read Briggs’s statement incorrectly. If indeed P(the cat is male | a calico cat) = 0 AND we observe a male calico cat, then Briggs’s statement says that it must not be a calico cat.
JH: Thanks for letting me know.
There are plenty of models that are not stated in probability terms. Like: there is a solar eclipse tomorrow morning between 9 and 10, because of this planetary theory. If the eclipse starts at eleven, then that particular planetary theory is falsified. Even if there is some wiggle room, because it is based on measurements that result in parameters that are not perfectly known.
We might look at some famous examples of falsification.
If the earth revolves around the sun, there should be parallax among the fixed stars. Aristotle pointed out that there was none, and therefore (in our terms) the hypothesis was falsified. The Pythagoreans insisted that it did too because woo-woo; and Aristotle accused them of bending the facts to fit the theory when the theory ought to be bent to fit the facts.
Later, the Copernicans said that the parallax was certainly present, but it was too small to be seen because the stars were much farther away than previously believed. But you cannot save one unproven hypothesis by introducing a second unproven hypothesis. (And by then a second falsification had been introduced: if the earth is turning toward the east with a diurnal motion, objects dropped from a tower should fall east of the plumb line. No such consistent deflection was observed. Later, Newton would suggest dropping a musket ball, which Hooke did; but he reported no deflection. Case closed, modus tollens? Not on your life.)
Tycho Brahe estimated the distance to the fixed stars by using their apparent diameters. Procyon had about the same diameter as Saturn. If it was much more than 100x the distance of Saturn, it would be ginormous; larger than the sun, larger than the entire (solar) system. All the stars would be: a new class of objects of incredible size. If the earth were stationary, all the difficulties would go away.
The Copernicans responded “Goddidit!” Since God was infinite, who cared how big the stars might be?
It was not until Kepler replaced the Copernican model with his own elliptical model that the issue was settled; but it was settled on the grounds of mathematical elegance rather than physical evidence. The Rudolphine Tables were simply easier to use than either the Alfonsine Tables or the Prussian Tables. The lack of observable parallax and observable Coriolis effects were not overcome, but were assumed out of existence.
In the 1790s, a series of experiments by the Jesuits discovered and measured the Coriolis effect in plummeting objects. In 1803, stellar parallax was reported (a false alarm, as it turned out); it was finally measured by Bessel in 1838. Meanwhile, stellar aberration had shown that the earth was indeed moving relative to the fixed stars. In the mid-1800s, the apparent discs of the stars seen by eye and the telescopes of the era were shown by George Airy to be optical illusions, artifacts of diffraction.
So by the early-to-mid-1800s, the dual motions of the earth were established as physical fact and not merely as a mathematical convenience. The original falsification was falsified.
Which shows how difficult it is to falsify something. The basics are simple:
If P, then Q
Not Q
Therefore Not-P
But Duhem pointed out that there is never just one P, but a host of assumptions are built into the inference:
If P1 and P2 and… and Pn, then Q
Not Q
Therefore either not-P1 or not-P2 or… not-Pn
But which one?
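For the curious, the inference can be checked mechanically. A minimal sketch in Lean 4 (the formalization is mine, not Duhem’s):

```lean
-- Modus tollens falsifies only the conjunction of premises:
theorem duhem (P₁ P₂ Q : Prop) (h : P₁ ∧ P₂ → Q) (nq : ¬Q) :
    ¬(P₁ ∧ P₂) :=
  fun hp => nq (h hp)
-- Classically ¬(P₁ ∧ P₂) amounts to ¬P₁ ∨ ¬P₂, a disjunction only:
-- the logic never says which premise failed.
```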
The host of Pi’s comprise what we today call the Model; and as George Box famously observed, “all models are wrong.” (Which doesn’t mean they can’t be useful under some circumstances.) One of the reasons the models are always wrong is that we can never account for all the factors in the real world, and Ockham warned us that if we have too many entities in our models, we won’t understand the models. So, it’s always a matter of trimming the number of P’s while still getting “close” to Q and hope that you haven’t left out a factor that might prove crucial under some other circumstance. Like the speed of light for Newtonian mechanics.
See http://tinyurl.com/h2hxzz3
So many problems. Here’s just a few of them, in no particular order:
a) In all the best scientific circles, calling something ‘philosophical’ is an insult.
b) ‘Falsification’ may be even more resistant to re-evaluation than ‘p-values’, because only ‘rigorous’ scientists have even heard of ‘falsification’.
c) It can’t possibly be the case that we have forgotten/abandoned anything important.
d) We live in Modern Times; no, not the supposed Modern Times ten or one hundred years ago. THIS is the REAL Modern Times.
On the other hand, all is not lost. It is now possible to falsify the contention that Luke Skywalker destroyed the original Death Star in Episode IV:
http://www.liveleak.com/view?i=218_1363518864
“All the facts in this video are based on facts. Real facts. All events, names and places that are real, are real. This video exists and all the facts in it are, I swear to God, true.”
YOS: Excellent! Except All Models Are Not Wrong.
JohnK: a—maybe, but it is accurate.
b—wind opposition groups often refer to falsifiability as a requirement for a scientific theory
Fah,
I think you’d like the following two posts from an old blog hosted by Massimo Pigliucci, one of my favorite philosopher bloggers. I am not a philosopher, so I would rather quote what I have read. The criterion of falsifiability is not proposed to solve the problem of truth or acceptability or model selection, but of drawing a line between science and pseudoscience.
(May I suggest that you break your comments into separate posts? Submitting them all in one post makes it too long to entice me to read it.)
http://rationallyspeaking.blogspot.com/2014/01/sean-carroll-edge-and-falsifiability.html (by Massimo Pigliucci)
https://www.edge.org/response-detail/25322 (by Sean Carroll)
You bastards – my head exploded. And I’m dying of laughter. I’ll get you for that!
You bastards – my head exploded.
Maybe bugs like in the recent merger of “The Good Wife” and “The Puppet Masters”.
JH: Thank you for the references and the advice. I followed the references, but I fear I may fail at following the advice.
For me, the point of doing science or math is personal, not philosophical. I am not inclined to philosophy. Nor am I inclined to try to categorize what others do as science or pseudoscience. Nor am I inclined to prove I am right (although that ego monster rears its head sometimes until I am able to ignore it). In fact, my experience has been that being proven wrong means I learned something new and leads to a great deal of fun (as long as I didn’t build something I said would work, like a bridge).
That said, for me (emphasis for me not necessarily for you) there are two motivations to do science or math: if it is fun and/or if it is useful. Neither of these reasons includes determining Truth or Falsity, or the True Nature of Reality, or doing Science instead of Pseudoscience or any such thing. It is either fun or it is useful. Useful in a practical, everyday sense. Many find multiverses and strings and branes great fun. If they can support themselves doing that then more power to them. That is enough reason to do it. Hamilton invented quaternions and Feynman invented the path integral formulation of quantum mechanics. Both are great fun, but maybe not so useful, at least not directly. Dirac invented the delta-function out of thin air because it was useful and a multitude of mathematicians had great fun constructing mathematical foundations for it. I happen to like fiddling with number theory and prime numbers in quiet moments. Not so I can make a zillion dollars on prizes or breaking encryption open, but because it is fun. So worrying about falsification and demarcation and what is or is not scientific is just not part of the practice of science (and math) as I see it.
Another personal view. To me seeking Truth or Falsity or the True Nature of Reality is a spiritual activity not a scientific activity. It has different motivations and different methods than my view of science. My personal bent for most of my life has been Buddhism, the zen variety. Some say it is fun and some say it is useful, but that is not the point. The point is that such a quest satisfies the motivation to seek Truth. Maybe the motivation is the point and not the path taken.
I apologize for long posts. I multitask a good bit and am usually on deadlines to write several different things by certain dates. I find that if I stop writing something, there is a good chance I won’t start back at it for a while, so I tend to binge write. Sorry.
Not only is falsifiability “not that useful”, it is completely unfriendly to mathematism’s attempts to justify or rationalise scientism’s favourite superstitious dogmas.
Here we go again… trying to create a smokescreen behind which the impossible becomes not just possible but a certain “fact”.
Scientism abhors falsifiability, replicability, logical coherence with known facts and all that, because lots of “credible theories” and mathemagical “models” cannot stand that sort of scrutiny. Instead of chucking an ideologically convenient speculation because it is impossible, we chuck the tests that falsify the nonscience.
It’s one thing for a fantastic assumption like atheism or agnosticism to repudiate observation and logic to facilitate their prejudice, but it really gets my goat when supposed “Thomists” try to dismantle the very essence of scientific investigation (observation and logic) to rationalise ideological prejudices. The guts of Tom’s Scholastic Method is the logical falsification of “theories”. If you don’t like that, then you can’t claim to be a “Thomist”.
Pingback: Essence Is Of The Essence – William M. Briggs
Fah, no apology needed. (I was making a selfish suggestion.) My attitude is very similar to yours. I am blessed with a profession that affords me to do what I enjoy very much. Always learning! 🙂