Many people (thank you, everybody!) sent me the “Odds Are, It’s Wrong” article by Tom Siegfried and have asked me to comment. Below are the key points; I will assume that you have already read Siegfried’s article. And stay tuned, because I’ll have more on this soon.
P-Values
The inventor of p-values, R.A. Fisher, was an enormous intellectual fan of Karl Popper. Popper was also a huge fan of himself. Popper arrived at the idea that you could never know whether anything is true through induction.
You could only know if something is false through deduction. He also, somewhat oddly, said things like this: our degree of belief in the truth of any theory was just as high after we saw corroborative data as it was before we saw any data.
What really mattered, he said, was whether a theory was falsifiable; that is, whether it was contingent and whether observations could be imagined (not necessarily ever observed) that would prove the theory wrong. Most epistemologists nowadays, including the most famous ones like John Searle, no longer buy falsifiability. But Fisher did.
Sort of. Fisher knew that no probability theory was falsifiable. If a probabilistic theory said, “If theory M is true, X is improbable, with as low a probability as you like, but still greater than zero,” then whether or not X happens, M cannot be proved true or false. Specifically, it can’t be falsified. Ever.
But, Fisher reasoned, how about if we made a probability criterion that would indicate that something was “practically” falsified? This criterion can not—it must not!—say whether any theory was true or false, nor could it say a theory was probably true or probably false.
It would instead say how likely or unlikely the results we saw would be if the theory were true. A p-value sort of does that, but it does it conditional on the equality of unobservable parameters of probability models.
But never mind that. If you ran an experiment and received a p-value of 0.05 or less, it meant it was publishable. It meant some statistic was improbable, assuming that some “null hypothesis” about parameters was true.
What it did not mean was that your theory was likely true, or that it was likely false. P-values cannot be used in any way to assert probability statements about theories. It is forbidden to do so in classical statistics.
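A minimal sketch (my own illustration, assuming a one-sample z-test with known sigma) of what a p-value actually computes: the probability of a statistic at least as extreme as the one observed, given that the null hypothesis is true, and nothing more.

```python
# Minimal sketch: two-sided p-value for a one-sample z-test (known sigma).
# The p-value is Pr(statistic at least this extreme | null hypothesis true);
# it is NOT the probability that the null, or any theory, is true or false.
import math

def z_test_p_value(sample_mean, null_mean, sigma, n):
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # Upper-tail probability of the standard normal via the error function.
    one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * one_sided

print(z_test_p_value(sample_mean=10.4, null_mean=10.0, sigma=1.0, n=25))
# ~0.046: publishable by the usual criterion, yet silent on whether any
# theory is true.
```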
Siegfried’s point is that nobody ever remembers this, and that nearly everybody uses p-values incorrectly.
He is 100% right.
A small p-value is supposed to mean a study was “statistically significant.” But as we have talked about over and over, if you run a study and cannot find a publishable (<0.05) p-value, it only means that you haven’t tried hard enough. In other words, statistical significance is nearly meaningless with respect to the truth or falsity of any theory. “Statistical significance” should join the scrap heap of science, whose historical inhabitants (N-rays, cold fusion, phlogiston, ectoplasm, and others) will welcome it with open arms.
Randomization
Coincidentally (?), we talked about this the other day. In my comments, I lazily asked readers to look up some quotes from biostatistician Don Berry. Siegfried has done the work for us.
In an e-mail message, Berry points out that two patients who appear to be alike may respond differently to identical treatments. So statisticians attempt to incorporate patient variability into their mathematical models.
“There may be a googol of patient characteristics and it’s guaranteed that not all of them will be balanced by randomization,” Berry notes.
Note: that’s “googol” the number and not that internet company. A googol is 10^100. That seems right to me. Notice, too, that, as I said, Berry said, “it’s guaranteed” that imbalances between groups will exist. He goes on to say that control, whether in modeling or in the experimental design, is what is important.
Amen, brother.
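A quick simulation (my sketch, not Berry’s) of his point: randomize 200 patients into two arms and tally how many of a large number of binary characteristics end up noticeably imbalanced anyway.

```python
# Illustrative sketch: randomization does not balance every characteristic.
import random

random.seed(1)
n_patients, n_traits = 200, 1000
# Each patient has n_traits independent yes/no characteristics.
patients = [[random.random() < 0.5 for _ in range(n_traits)]
            for _ in range(n_patients)]
random.shuffle(patients)
arm_a, arm_b = patients[:100], patients[100:]

imbalanced = 0
for t in range(n_traits):
    diff = abs(sum(p[t] for p in arm_a) - sum(p[t] for p in arm_b))
    if diff >= 10:  # an arbitrary threshold for "noticeably imbalanced"
        imbalanced += 1

print(f"{imbalanced} of {n_traits} traits differ by 10 or more patients")
```

With a googol of characteristics rather than a thousand, imbalances are, as Berry says, guaranteed.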
Bayes
There is immense confusion about “prior” probabilities, or “prior distributions.” I won’t be able to end that confusion in one hundred words.
What I can tell you is that the central misunderstanding stems from forgetting that probability is conditional. Probability is a measure of information, a degree of logical support. Just as you cannot say whether the conclusion of some logical argument is true or false without first knowing its premises, you cannot know the probability of an argument’s conclusion without knowing its evidence (premises by another name).
Forgetting this simple fact is what leads some people to mistakenly believe probability is subjective. If probability is subjective, the critics say, then any prior can equal anything. They’re right. But if probability is conditional, then priors cannot equal anything, and must be fixed, conditional on evidence.
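A stock example (my illustration, with made-up premises) of what “conditional on evidence” means: Pr(a six shows | a six-sided object with exactly one side labeled six will be tossed and one side will show) = 1/6. Change the premises, say to two sides labeled six, and the probability becomes 2/6. Given fixed evidence the probability is fixed; it is never a matter of whim.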
The other difficulty comes in jumping to infinity too quickly. Probability measures on continuous spaces inevitably lead to grief. But more on that another day.
By definition, as I understand it, a p-value of 0.05 means that the distribution in the results would occur once in every twenty times by chance, if the null hypothesis applied. It is easy then to suspect that one in every twenty published studies with a p-value of 0.05 is reporting sheer chance rather than a correlation, but that does not really follow, does it? This p-value can only be applied to the total number of studies done, regardless of the p-value obtained, and people seldom write papers about studies yielding p-values greater than 0.05.
Can anything be concluded about groups of studies that have the same p-value?
The problem, as I understand it, is when you use a p-value of 0.05, but you have a study testing 400 separate variables or combinations of variables. In a case like that, you’d expect about 20 random “positive” results. Ideally they should do follow-up studies on those results, but what apparently happens more often than not is they take the most “statistically significant” result and run with it, particularly if that “statistically significant” result happens to jibe with one or more current theories about societal evils. And that’s how you end up with ludicrous results like “mothers who eat cereal while pregnant are more likely to give birth to boys”, a concept which illustrates a complete lack of understanding of what I believed was a fairly basic idea in genetics.
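The arithmetic behind that “about 20” is easy to check with a simulation (my sketch, not the commenter’s): run 400 tests on pure noise at the 0.05 threshold and roughly 5% of them, about 20, will come out “significant.”

```python
# Illustrative sketch: 400 tests of pure noise at the 0.05 threshold.
import math
import random

random.seed(2)
n_tests, n_per_group, false_positives = 400, 50, 0

for _ in range(n_tests):
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    z = diff / math.sqrt(2 / n_per_group)   # both groups have unit variance
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} 'significant' results out of {n_tests} null tests")
```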
Not only do vast numbers of published studies use p-values incorrectly, many of them calculate p-values incorrectly, and many report “findings” when the p-values are well above 5%. The phony litmus test is ignored when it is convenient to do so, anyway.
Popper was on to something with his falsifiability rule. He cannot be dismissed lightly. Furthermore, p-values falsify nothing, so they are NOT examples of Popperian science. Conflating Fisher p-values with Popper’s falsifiability is a little too facile for my taste. Let’s not throw the baby out with the bath water.
I think the issue is more profound: what we are trying to do is say something that will be empirically true so long as a certain set of conditions holds. But we may not know at the time of the experiment whether we actually control the relevant conditions. For well-bounded problems we may have more control of the relevant conditions, which yields clean, repeatable, unvarying experimental results. For more complex problems, like medical trials, our confidence in the results must be constrained by our knowledge of the conditions of the experiments.
For example, assume that you tested a new drug for acne on 1000 patients and it cleared up the acne in all 1000 patients – 100% success! If all your patients in the trial were males, what would you need to know in order to say that the drug would have the same effect for female patients? Now assume that it worked for 900 of the 1000 male patients.
In my mind and as I noted in the earlier post on randomized trials, too many papers are written where there is an unspoken reliance on miracles!! See: http://www.cartoonbank.com/I-think-you-should-be-more-explicit-here-in-step-two/invt/118181
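To put numbers on the second case (a sketch of mine, not the commenter’s argument): 900 successes in 1000 male patients pins down the success rate for males under the trial’s conditions fairly tightly, and by itself says nothing about females; extending the result requires extra premises.

```python
# Sketch: normal-approximation 95% interval for 900 successes in 1000 patients.
# Whatever it says about males under the trial's conditions, it says nothing
# by itself about female patients.
import math

successes, n = 900, 1000
p_hat = successes / n
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.3f} +/- {half_width:.3f}")  # roughly 0.900 +/- 0.019
```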
I pinched the following from Pat Frank’s email footer
These things are, we conjecture, like the truth;
But as for certain truth, no one has known it.
Xenophanes, 570-500 BCE
Ha… I chatted about how Google got its name with a colleague just yesterday.
I thought it might be helpful to reiterate one of the differences between Bayesian and classical statistics. Bayesian computes Prob(parameter | data), and classical statistics Prob(data | parameter). The former draws probability inferences about the parameter of interest conditional on the observed data. The latter finds the probability (p-value) of observing the data (or more extreme) assuming the particular parameter value under the null hypothesis; a rejection of the null doesn’t prove the alternative is true.
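A small numerical contrast (my example, not JH’s; the 62-of-100 data and the flat prior are assumptions for illustration):

```python
# Contrast: classical Prob(data at least this extreme | theta = 0.5)
# versus Bayesian Prob(theta > 0.5 | data) under a flat Beta(1, 1) prior.
from math import comb

x, n = 62, 100

# Classical: two-sided binomial p-value under the null theta = 0.5.
p_value = 2 * sum(comb(n, k) * 0.5**n for k in range(x, n + 1))

# Bayesian: posterior is Beta(x + 1, n - x + 1); integrate its mass above 0.5
# on a simple grid.
a, b = x + 1, n - x + 1
grid = [(i + 0.5) / 10_000 for i in range(10_000)]
dens = [t**(a - 1) * (1 - t)**(b - 1) for t in grid]
posterior_prob = sum(d for t, d in zip(grid, dens) if t > 0.5) / sum(dens)

print(f"p-value = {p_value:.3f}, Prob(theta > 0.5 | data) = {posterior_prob:.3f}")
```

The two numbers answer different questions, which is the whole point.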
I think the choice/modeling of a prior distribution is just part of Bayesian statistical modeling. Nothing more than that. I am not saying it’s easy to do though.
Of course, Statistics is much more than just p-values and the choice of a prior!
Always, it’s wise to consult a statistician when undertaking a project. You obviously know where to find one. ^_^
I propose we drop the phrase “95% confidence” and replace it with “at least 5% bullsh!t”.
I wouldn’t be too hasty in assigning “cold fusion” to the scrap heap of science. There are new findings being published in this area, now termed Low Energy Nuclear Reactions (LENR), that are repeatable, reproducible, and generate too much energy to be mere chemical reactions. Search YouTube for “Missouri LENR” for recent videos from a LENR conference.
Dr. Briggs,
I thought Neyman developed the concept of confidence interval. Am I having memory problems or misunderstanding?
The problem, as I understand it, is incomprehensible.
JH says:
“Bayesian computes Prob(parameter | data), and classical statistics Prob(data | parameter). The former draws probability inferences about the parameter of interest conditional on the observed data. The latter finds the probability (p-value) of observing the data (or more extreme) assuming the particular parameter value under the null hypothesis; a rejection of the null doesn’t prove the alternative is true.”
This is even more incomprehensible than the problem which, as I understand it, is incomprehensible. I make an inference: There are degrees of incomprehensibility. This, I find, is in itself incomprehensible. It is also unfalsifiable. It has a p-value but this p-value, I intuit, is unascertainable though, to my mind, not categorically unattainable if p is assumed to be √p where the parameters are known to be unknown conditional on unobservable data. Therefore I construct a premise: Let Popper be pr and Phlogiston be pn. Let pr be the baby and pn be the bathwater. Let the data be the bathtub (bb) within the parameters of galvanized ferrous metal. I ask a Question: When pn is subtracted from bb is pr also subtracted? I assume falsifiability.
I write a novel in which Statisticians by means of stealth blogs numb the minds of all mankind and take over the world. I endow it with a p-value of .005. It is publishable. I celebrate and, drunk as a Popperian Imperative, wander in front of what is probably the SUV of a Bayesian denouncing a Fisherian on his cell phone.
I become just another statistic.
I generally agree, though I am not sure that “M cannot be proved true or false.”
I have lost my lucky 4-leaf clover. My theory (M) is that I lost it in my back yard. If true, then the event of my finding it there (X) is very unlikely. If I do happen to find it there though, then M is proved true.
SteveBrooklineMA said: ‘I am not sure that “M cannot be proved true or false.” … If I do happen to find it there though, then M is proved true.’
Only if you can also prove the following:
1. it is the same lucky 4-leaf clover and not a replica thereof;
2. it was not moved to your back yard after it was lost by some other means, e.g., dropped in the front yard, stored in a keepsake box under the bed, or stolen by gypsies.
Ray,
He did indeed. Interestingly, Doug M may have the best interpretation of a CI.
Suppose, for your problem, you calculated a 95% CI of, say, 8 – 10. That is, of course, an interval of size 2 (the units are not important). What can you say about that interval?
a) Can you say there is a 95% chance that the true value of the parameter is in that interval? NO
b) Can you say that if you repeated the experiment a “large” number of times, then 95% of the time (assuming you calculate a new CI for each replication) those CIs will cover the true value of the parameter? NO
c) The only thing you can say about any CI is this: Either the true value of the parameter is in the interval or it isn’t. This being so, take any two numbers you like (I pick 7 and 13) and it will always be the case that for your experiment, either the true value of the parameter is in that interval or it isn’t.
That’s it, folks. And it shouldn’t be surprising, because it turns out p-values and CIs are mathematically/philosophically related (there is an equivalence in frequentist theory between testing and estimation). P-values say nothing about the truth or falsity of any theory or model. Neither do CIs.
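The testing/estimation equivalence is easy to see in a toy calculation (a sketch of mine, assuming a normal model with known sigma): the 95% CI excludes a hypothesized mean exactly when the two-sided p-value against that mean falls below 0.05.

```python
# Sketch: a 95% CI excludes mu0 exactly when the two-sided p-value < 0.05
# (normal model, known sigma).
import math

def ci_and_p(sample_mean, sigma, n, mu0):
    half_width = 1.96 * sigma / math.sqrt(n)
    ci = (sample_mean - half_width, sample_mean + half_width)
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return ci, p

ci, p = ci_and_p(sample_mean=9.0, sigma=2.0, n=16, mu0=10.0)
print(ci, p)  # mu0 = 10 falls just outside the CI and p is just under 0.05
```

Neither number says anything about the probability that the parameter, or any theory, is true.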
Liamascorcaigh,
Very funny! For your efforts, I present you with my new Avatar (an angry woman) in the upper-right corner. I stole it off the internet and will change it after tomorrow. Oh…I bet you’d enjoy this book. ^_^
Briggs, what do you think of Lubos Motl’s take on this article?
http://motls.blogspot.com/2010/03/defending-statistical-methods.html