Everything Wrong With P-Values Under One Roof

Update 8 January 2019. This post has been superseded! There is an official paper version of this material, greatly expanded and vetted. SEE THIS NEW POST.

THE NEW PAPER:

Here is a link to the PDF.

Briggs, William M., 2019. Everything Wrong with P-Values Under One Roof. In Beyond Traditional Probabilistic Methods in Economics, V Kreinovich, NN Thach, ND Trung, DV Thanh (eds.), pp 22–44. DOI 10.1007/978-3-030-04200-4_2

THE OLD MATERIAL STARTS HERE

Handy PDF of this post

See also: The Alternative To P-Values.

They are based on a fallacious argument.

Repeated in introductory texts, and begun by Fisher himself, are words very like these (adapted from Fisher, R. 1970. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, fourteenth edition):

Belief in a null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null hypothesis is false, or the p-value has attained by chance an exceptionally low value.

Fisher’s choice of words was poor. This is evidently not a logical disjunction, but can be made into one with slight surgery:

Either the null hypothesis is false and we see a small p-value, or the null hypothesis is true and we see a small p-value.

Stated another way, “Either the null hypothesis is true or it is false, and we see a small p-value.” Of course, the first clause of this proposition, “Either the null hypothesis is true or it is false”, is a tautology, a necessary truth, which transforms the proposition to “TRUE and we see a small p-value.” Or, in the end, Fisher’s dictum boils down to:

We see a small p-value.

In other words, a small p-value has no bearing on any hypothesis (unrelated to the p-value itself, of course). Making a decision because the p-value takes any particular value is thus always fallacious. The decision may be serendipitously correct, as indeed any decision based on any criterion might be, and it often is correct, likely because experimenters are good at controlling their experiments, but it is still reached by a fallacy.
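In symbols, the reduction just sketched runs like this, writing H for "the null hypothesis is true" and W for "we see a small p-value":

    (not-H and W) or (H and W)  =  (not-H or H) and W  =  TRUE and W  =  W.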

People believe them.

Whenever the p-value is less than the magic number, people believe or “act like” the alternate hypothesis is true, or very likely true. (The alternate hypothesis is the contradiction of the null hypothesis.) We have just seen this is fallacious. Compounding the error, the smaller the p-value is, the more likely people believe the alternate hypothesis true.

This is also despite the strict injunction in frequentist theory that no probability may be assigned to the truth of the alternate hypothesis. (Since the null is the contradiction of the alternate, putting a probability on the truth of the alternate also puts a probability on the truth of the null, which is also thus forbidden.) Repeat: the p-value is silent as the tomb on the probability the alternate hypothesis is true. Yet nobody remembers this, and all violate the injunction in practice.

People don’t believe them.

Whenever the p-value is less than the magic number, people are supposed to “reject” the null hypothesis forevermore. They do not. They argue for further testing, additional evidence; they say the result from just one sample is only a guide; etc., etc. This behavior tacitly puts a (non-numerical) probability on the alternate hypothesis, which is forbidden.

It is not the non-numerical bit that makes it forbidden, but the act of assigning any probability, numerical or not. The rejection is said to have a probability of being in error, but this is only for samples in general in "the long run", and never for the sample at hand. If it were for the sample at hand, the p-value would be putting a probability on the truth of the alternate hypothesis, which is forbidden.

They are not unique: 1.

Test statistics, which are formed in the first step of the p-value hunt, are arbitrary, subject to whim, experience, culture. There is no unique or correct test statistic for any given set of data and model. Each test statistic will give a different p-value, none of which is preferred (except by pointing to evidence outside the experiment). Therefore, each of the p-values is "correct." This is perfectly in line with the p-value having nothing to say about the alternate hypothesis, but it encourages bad and sloppy behavior on the part of p-value purveyors as they seek to find that which is smallest.
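A minimal sketch of the point, assuming Python with numpy and scipy, and with the data, the seed, and the particular pair of tests invented purely for illustration: the same data and the same "no difference" null hand back two different p-values from two equally defensible test statistics.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)                 # invented data, for illustration only
    x = rng.normal(loc=0.0, scale=1.0, size=30)
    y = rng.normal(loc=0.4, scale=1.0, size=30)

    # Same data, same null, two different test statistics:
    t_stat, p_t = stats.ttest_ind(x, y)                               # difference in means
    u_stat, p_u = stats.mannwhitneyu(x, y, alternative="two-sided")   # rank-based
    print(p_t, p_u)                                # two different "correct" p-values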

They are not unique: 2.

The probability model representing the data at hand is usually ad hoc; other models are possible. Each model gives different p-values for the same (or rather equivalent) null hypothesis. Just as with test statistics, each of these p-values is "correct," etc.
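Again a sketch with invented data, this time holding the question fixed (is there a difference in means?) while swapping the probability model: a pooled-variance normal model, an unequal-variance (Welch) model, and a permutation model each return a different p-value for the equivalent null.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)                 # invented data
    x = rng.normal(0.0, 1.0, 25)
    y = rng.normal(0.5, 2.0, 25)

    p_pooled = stats.ttest_ind(x, y, equal_var=True).pvalue    # pooled-variance normal model
    p_welch  = stats.ttest_ind(x, y, equal_var=False).pvalue   # Welch (unequal variances)

    # Permutation model: reshuffle the group labels and recompute the mean difference
    obs = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    perm = np.array([abs(s[:25].mean() - s[25:].mean())
                     for s in (rng.permutation(pooled) for _ in range(10000))])
    p_perm = (perm >= obs).mean()

    print(p_pooled, p_welch, p_perm)               # three different "correct" p-values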

They can always be found.

Increasing the sample size drives p-values lower. This is so well known in medicine that people speak of the difference between "clinical" and "statistical" significance. Strangely, this line is always applied to the other fellow's results, never one's own.
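A sketch of how this plays out, using a made-up effect so small nobody would care about it (a true mean of 0.02 against a null of 0, on a scale with standard deviation 1); the numbers are invented, but the pattern is the point.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    for n in (100, 1_000, 10_000, 100_000, 1_000_000):
        x = rng.normal(loc=0.02, scale=1.0, size=n)     # trivially small true effect
        p = stats.ttest_1samp(x, popmean=0.0).pvalue    # test "the mean is exactly zero"
        print(n, p)
    # The p-value typically marches toward zero as n grows, while the practical
    # ("clinical") size of the effect never changes.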

They encourage magical thinking.

Few remember the p-value's definition, which is this: Given the model used and the test statistic dependent on that model and given the data seen and assuming the null hypothesis (tied to a parameter) is true, the p-value is the probability of seeing a test statistic larger (in absolute value) than the one actually seen if the experiment which generated the data were run an indefinite number of future times and where the milieu of the experiment is precisely the same except where it is "randomly" different. The p-value says nothing about the experiment at hand, by design.
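That definition can be simulated. Here is a small sketch, under assumptions invented purely for illustration (a normal model, a one-sample t statistic, a null mean of exactly zero): the p-value is approximated by the fraction of hypothetical re-runs of the experiment, with the null forced true, that produce a test statistic more extreme than the one actually seen.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    observed = rng.normal(loc=0.3, scale=1.0, size=20)       # the one data set "at hand"
    t_obs, p_analytic = stats.ttest_1samp(observed, popmean=0.0)

    # Re-run the "same" experiment many times with the null (mean 0) forced true,
    # and count how often the test statistic beats the one actually seen.
    t_null = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, 20), 0.0).statistic
                       for _ in range(20000)])
    p_simulated = (np.abs(t_null) >= abs(t_obs)).mean()

    print(p_analytic, p_simulated)    # close to each other; neither is about this data set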

Since that is a mouthful, all that is recalled is that if the p-value is less than the magic number, there is success, else failure. P-values work as charms do. “Seek and ye shall find a small p-value” is the aphorism on the lips of every researcher who delves into his data for the umpteenth time looking for that which will excite him. Since wee p-values are so easy to generate, his search will almost always be rewarded.

They focus attention on the unobservable.

Parameters (the creatures which live inside probability models but which cannot be seen, touched, or tasted) are the bane of statistics. Inordinate attention is given them. People wrongly assume that null hypotheses ascribed to parameters map precisely to hypotheses about observables. P-values are used to "fail to reject" hypotheses which nobody believes true; i.e. that the parameter in a regression is precisely, exactly, to infinite decimal places zero. Confidence in real-world observables must always necessarily be lower than confidence in parameters. Null hypotheses are never "accepted", incidentally, because that would violate Fisher's (and Popper's) falsificationist philosophy.
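A sketch of that parameter/observable gap, with numbers invented for illustration: a "highly significant" difference in a parameter can sit alongside a predictive difference for new observables that is barely better than a coin flip.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 200_000                                    # huge, invented sample
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.03, scale=1.0, size=n)

    p = stats.ttest_ind(a, b).pvalue               # wee p-value about the parameters
    # But for observables: how often does a single new B-measurement beat
    # a single new A-measurement? (0.5 would mean no better than a coin.)
    pr_b_beats_a = (rng.normal(0.03, 1.0, n) > rng.normal(0.0, 1.0, n)).mean()

    print(p, pr_b_beats_a)    # typically a very wee p, yet Pr(B > A) only just above 0.5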

They base decisions on what did not occur.

They calculate the probability of what did not happen on the assumption that what didn't happen should be rare. As Jeffreys famously said: "What the use of P[-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred."

Fans of p-values are strongly tempted to this fallacy.

If a man shows that a certain position you cherish is absurd or fallacious, you multiply your error by saying, “Sez you! The position you hold has errors, too. That’s why I’m going to still use p-values. Ha!” Regardless whether the position the man holds is keen or dull, you have not saved yourself from ignominy. Whether you adopt logical probability or Bayesianism or something else, you must still abandon p-values.

Confidence intervals.

No, confidence intervals are not better. That is for another day.



44 Comments

  1. DAV

    Isn’t the definition of “confidence interval” the time between confidence games?

  2. Briggs

    DAV,

    You remind me I miss seeing the guys on lower Broadway and 6th playing Three-Card Monte. Giuliani (rightly) ran them all off. Most of them probably went into politics.

  3. Briggs, I understand your argument and it is certainly logically coherent (unlike the presentations in most statistics texts). However, as a working scientist (retired) and part-time statistician, p-values offer an operational means for credence. You agree to set your limits on the probability of what is unlikely to hold at a certain PRESET value. For example, if you are studying ESP, that p-value is going to be very low (unless you're a medium or a fortune-teller), say .000001 (or less). If you're looking at a neat elegant theory, for which the experimental verification (i.e. non-falsity) is going to be very difficult, you set your p-value for belief relatively high (say, .05, the standard for rejecting the null hypothesis). So, in my view the p-value is an operational parameter whose required value depends on context.
    Thanks for an interesting blog.

  4. Ray

    I always wondered how an unmeasurable statistical parameter could have a distribution which you have to assume to calculate a P value.

  5. DAV

    Bob,

    The true test of a model is its predictive power and not how well the parameters have been established, which is all the p-value is going to indicate. About all you can say with an NxN table is there might be something there. As you've indicated, it's highly subjective.

  6. For me the whole issue of P values is interesting regarding how it affects many of the arguments for inductive science over math and other deductive reasoning. Since most scientific knowledge is based upon inductive reasoning, and inductive reasoning relies on probability, the disparity of agreement about P values seems to knock the wind out of many righteous arguments for "scientific" claims of superiority as a methodology.

  7. Scotian

    What an odd thing to say, Bob (Mrotek). Look around you, the world is full of the products of scientific and engineering advances. Math is certainly very important but it is very limited in what it can produce on its own.

    If I follow you Briggs, you are saying that statistics is also of limited value on its own. It should only be applied to phenomena that have strong prior evidence, e.g. to determine the parameters of an empirical or theory-driven equation or to compare competing theories. There are still risks, but if you can't eyeball it or the effect is weak, then you are wasting your time. This is the Rutherford position.

    As you have said before, the abuse of statistics exists because of publish-or-perish pressures on the one hand and the human need for certainty in a chaotic world on the other – sort of like tea leaf reading. It is this pressure that allows people, even statisticians, to ignore what they otherwise would see as obvious.

  8. DAV, with respect to the predictive value of statistical tests (as distinguished from the predictive value of the theory, i.e. hypothesis they're meant to test), my opinion is that they are irrelevant. They do support or diminish one's confidence or belief in the particular theory, but other factors also enter in. Consider the eclipse tests of Einstein's general relativity theory in 1919 (see http://astro.berkeley.edu/~kalas/labs/documents/kennefick_phystoday_09.pdf ); one interpretation has it that Eddington threw out data falsifying the theory on the grounds that the larger mirror, which gave the rejecting data, was more likely to be in error than the smaller one, whose results confirmed the theory. As it happened, many more experiments also confirmed the theory, so Eddington's choice (based on his good intuitive sense of what was good science) was correct.
    Other experiments with “nice” p-values (I’m thinking of some of the cold fusion studies) have turned out to be trash.
    Where p-values might be useful in a clinical context is where there is a positive result for rare diseases or conditions (e.g. a misleading positive test for AIDS) such that Bayesian analysis will show that the probability of having the disease after a positive test might be very small, even though the test is fairly sensitive and specific, if the disease is rare (see http://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml ). Here I think p-values have a well-defined meaning as distinguished from their use in hypothesis tests.

  9. And is it correct to tell this non-statistician that probability values deduced from a 2×2 Bayesian analysis of a clinical test aren’t really p-values (as statisticians term them)?

  10. Nullius in Verba

    “This is evidently not a logical disjunction, but can be made into one with slight surgery”

    A logical disjunction is two predicates connected by the word “or”. It sure looks like a disjunction to me!

    Where it goes wrong is in the next step. The argument goes: Either the null is false or something very unlikely has occurred. It is probably not true that something very unlikely has occurred. Therefore it is probably true that the null is false.

    This is modeled after the *valid* logic: Either the null is false or something impossible has occurred. It is certainly not true that something impossible has occurred. Therefore it is certainly true that the null is false. It looks like the same argument, but with 'improbable' being used as an *approximation* of 'impossible', and therefore approximately right. However, it turns out to be a different sort of approximation.

    But apart from that particular common misunderstanding, I don’t see why any of the other reasons you offer should lead to p-values being abandoned. It’s a matter of interpreting them correctly. I don’t expect this is a point we’ll ever agree on, though. I only mention it because feedback on whether particular presentations work can sometimes be useful.

  11. DAV

    Bob,

    “Other experiments with “nice” p-values (I’m thinking of some of the cold fusion studies) have turned out to be trash”

    Isn’t that indicating that, when it comes to confirmation of theory, p-values are about as useful as reading tea leaves? Or maybe just spotty? If the latter, how would you know when they are indicating correctly?

    “Where p-values might be useful in a clinical context”

    A side issue perhaps: I don't think anyone has ever done a statistical study to determine if interception of high velocity pellets with the head is good or bad. But, if they have, it was largely a waste of time because the effects are rather obviously large. If the effect is so small it is only detectable through statistical analysis, one has to wonder about the point. For instance, of what use is a drug that only worked (i.e., was more effective than doing nothing) 1% of the time? True, desperate people do grasp at straws but I would call that ineffective.

    And to get there requires guessing the statistical threshold between effect/no-effect. There is also that built-in assumption that the only real difference between the dosed/undosed populations was the dosage. (This may not apply to you since you are a Bob).

    Take your earlier point about ESP. Suppose you were to run an experiment and you correctly guessed the threshold between possessing/not-possessing ESP and you found someone who could guess/see/experience the next card with 99.9% accuracy. The only thing your low p-value tells you here is that you may be on to something. It however tells you nothing about whether the observed prediction rate was due to ESP or an experimental protocol that delivered subtle clues.

    Years ago, when I was into neural networks, I recall a story about a firm that built a NN under contract. It performed really well during testing. It failed miserably during the first demonstration. It was eventually discovered that the operator during training was subtly (and unknowingly) supplying the answers to the tests. When a different operator was used (during the demo), the NN failed. The NN had faithfully learned the training operator's quirks.

    The point is, it is very difficult to remove extraneous information — particularly when it comes to dealing with people (and NNs). A good p-value tells you very little.

  12. Scotian

    Dav,

    The Clever Hans effect with computers. Amazing!

  13. JohnK

    Berger and Sellke's 1987 paper (which Matt has cited, at least in his 'Breaking the Law of Averages', if not elsewhere on his site) really did it for me regarding p-values.

    That is, even apart from logical problems with the ‘null hypothesis’ setup, there is an assumption upon which the whole p-value mechanism is founded.

    Put more clearly than we’re allowed to put it, we are told to ‘expect’ that a high p-value means that the two central parameters of the probability distributions of the two populations are (nearly) the same, and to ‘expect’ that a low p-value means that the central parameters of the two populations are different.

    (As Matt points out, you can’t really talk as clearly as I have just done about these questions, because that would violate some of the formalisms of classical statistics — hence the language of ‘rejecting’ the ‘null hypothesis’, rather than ‘accepting’ the hypothesis that the two central parameters are ‘different’, etc.).

    The ‘null hypothesis’ states that the central parameters of the probability distributions of the two populations are the same. Much of the framework for classical statistics relies on the belief that if a p-value is very low, that makes it very unlikely that the central parameters of the two populations are in fact the same. Therefore, if we see a satisfactorily ‘low’ p-value, we can ‘reject’ the statement that the central parameters of the probability distributions of the two populations are the same. We can ‘reject’ the ‘null hypothesis’.

    However, in 1987 — over 25 years ago — Berger and Sellke proved that to be FALSE. They showed that a low p-value does NOT necessarily make it 'very unlikely' that the two central parameters are the same.

    But this assumption is the central assumption of the entire p-value-null-hypothesis mechanism!

    Not only is it true that p-values don't make much sense, and are misused and misinterpreted every day far beyond what the rigorous formalism says is possible; but also, the very foundation of p-values is not rigorous. Berger and Sellke showed that the assumption behind the use of p-values fails, and fails unpredictably, in the sense that you can't tell from the p-value itself whether the underlying assumption has failed or not.

    Berger, J.O. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. JASA 82:112–122.

    As I said, that was the end of it for me, regarding p-values. Even aside from logical and other flaws in the classical treatment of statistics, Berger and Sellke proved over 25 years ago that the entire assumption under which p-values are considered useful within classical statistics holds only unpredictably.

    And the p-value itself cannot tell you when the assumption holds, and when it doesn’t.

    I don't know why everybody doesn't know this. I know from my research that since then, Berger has tried, not just informally but in written papers, to make peace with frequentists and, as far as possible, to extend his argument to capture the value in frequentist statistics. He's a bridge-builder, and good for him.

    But as far as I know, the original paper has never been refuted, and I could find nothing in Berger’s own writings and conversation in which he stepped back one iota from the 1987 paper.

    The assumption behind p-values is, provably, a chimera, a will-o-the-wisp. Formally, we CANNOT depend on p-values even to say the extremely limited formalist thing that it has been assumed that they can say.

    The whole p-value edifice is built on a mirage. And this is known.

  14. Dav, thanks for your insightful comments. We're possibly in more agreement than disagreement. I don't think p-values are an infallible measure of the validity of an experimental test of a theory, but I do believe they are more valuable than tea leaves. And my perspective is not really in terms of p-values but that of the significant error (the +/- number following the experimental result), which can be translated to a p-value.

    One more example comes to mind. In the 60's and 70's Bell's Theorem (which was to show whether hidden variables were possible in quantum mechanical theory) was being tested by optical experiments at Berkeley and Harvard. They gave two different answers (corresponding to big p-values and small p-values); see http://www.controversia.fis.ufba.br/index_arquivos/Freire-SHPMP-37-2006.pdf. Accordingly, the contradictory p-values themselves weren't useful, but the experiments then led to the further experiments by Aspect in 1981 and 1982, which did controvert Bell's Theorem and validate quantum mechanics (i.e. the experimental error in Aspect's work was sufficiently narrow and the experimental conditions were well controlled, removing ambiguities not resolved in previous work). And even then, there were some factors not covered in Aspect's work that were taken care of in subsequent experiments.

    My point here is that the small p-value–which in the context of physics experiments is a small experimental error–is a guide, and a better guide than tea leaves or casting organs, but it isn’t infallible, because the experiment itself may be confounded (is that an appropriate statistical term here?) by unaccounted factors. The experiment has to be well-designed (I’m speaking for physics, but I assume it’s true for other sciences also). In both examples I’ve cited, the tests of cold fusion and of Bell’s Theorem, it turned out that continuing experiments confirmed (or refuted) the hypotheses. The further experiments were carried on despite the contradictory p-value results so in that sense your distrust is justified, but it required an analysis of experimental conditions and repetitive experiments to show that these p-value results could be ignored.

    So here’s a toast to well-conceived experiments with small experimental errors (and small p-values) and to good intuitions that know when to disregard p-values, large or small.

  15. Francsois

    The post by Briggs makes sense to me, sort of. But then I read the comments where some defend the use of p-values, and that makes sense to me too. I page through my statistics (frequentist) textbooks, and THAT makes sense too. I conclude that I do not understand the issue. Since I cannot use my own reason to distinguish what is right from what is wrong, I have to have FAITH in the position Briggs or his detractors take. So who do I side with? It reminds me of the story of Galileo. When Galileo and Newton et al tried to tell people the real structure of the solar system, few people believed them (admittedly Newton did not try to win people over so much). This is because people could not critically evaluate Principia or Revolutions etc. What should those people have believed? Remember, it was common sense at the time that the sun revolved around the earth etc. So when Briggs tells us these things, what should the common man believe?

    Similarly, I cannot critically evaluate the evidence from the IPCC on global warming. But all the scientists seem to believe that man-made carbon dioxide causes blah blah…… Who do we believe, and why? P-values are used by all or at least most researchers in healthcare. Perhaps they are all wrong, and Briggs and others like him are like Galileo!?!

    Cheers

    Francsois

  16. Exactly Francsois! With things like P values we are dealing with "degree of uncertainty" and not "certainty" because we are using induction which is a bit like fortune telling. That is why, in order to claim they are closer to certainty than the next guy or gal, people keep piling up statistics. Professors like Dr. Briggs are the most believable because they use just the right amount of skepticism and common sense. That is the best that science can do without proof, which does eventually come but only after the fact. To know the outcome in advance one needs to use deductive reasoning. Sure we might say that science has led to many marvelous things to improve our lives by measuring the touchy-feely, hefty, sparkling, smelly, noisy, and tasty qualities of things to a certain degree but we cannot actually claim scientific proof that is entirely devoid of bias. On the other hand one could also demonstrate that some useful things have come into our lives through reason and mathematics, intuition, luck, and perhaps even divine intervention. Miracles can happen you know. George Bush could look into a man's eyes and see his soul and with nothing more than that and his gut feeling he could judge character 🙂 My faith is in God, Galileo, and Briggs in that order.

  17. Yo Francois… "But all the scientists seem to believe that man-made carbon dioxide causes blah blah……" That statement isn't correct. I can cite a large number of reputable scientists, including the endowed chair of meteorology at MIT, who do not credit that statement. Please let me know if you'd like references.

  18. Francsois

    Thanks to the two Bobs! Bob Kurland: no need to cite, I believe you. Even if I read the opposing views you suggest, I would not know who to believe, as I cannot sensibly judge what is right from wrong with the climate change issue. Let us hope that the people whose job it is to make decisions can make the distinction. Everything confuses me these days.

    Cheers

    Francsois

  19. JH

    They are not unique: 1.
    Yes, there may be more than one test statistic for a particular hypothesis, but test statistics are not arbitrary! What is a pivotal quantity! (I have mentioned this before.)

    They are not unique: 2.
    I am not sure what "ad hoc" and "different models" mean exactly here. Some "ad hoc" rules have theoretical bases, some don't. Different deterministic functions or different probabilities? For a different deterministic function, we would be testing a different set of parameters. If we want to use a different probability model, there are diagnostic tools. Though statistical models are never true, they are not and should not be chosen arbitrarily.

    They can always be found.
    Let me not repeat the great answer here: http://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing . A great site! I have found several well-known academic statisticians participating in the discussions, which we don’t usually see on a blog.

    They focus attention on the unobservable.
    It'd depend on what our question is. There are times we have to focus on the parameters. For example, Q1: how would you compare a new treatment to a placebo in the following? I am your boss; I need to make a decision on whether to apply for FDA approval.

    Here’s the setup. You’ve invented a new treatment to cure cancer of the albondigas and want to compare it to a placebo.

  20. JH

    They base decisions on what did not occur.
    It doesn't have to be that way. It can be done the way you have done in this post https://www.wmbriggs.com/blog/?p=9088&cpage=1. Admit it or not, you've used the reasoning behind hypothesis testing to make some inferences.

    Fans of p-values are strongly tempted to this fallacy.

    Whether you adopt logical probability or Bayesianism or something else, you must still abandon p-values.

    Q2: Can you show your readers how one can use logical probability or Bayesianism to analyze real data? I can lend you a nicely cleaned up temperature time series data set.

    Briggs, you have been repeatedly saying the same thing about p-values. How about showing your solutions to Q1 and Q2 to your readers? Perhaps Lucia would show up again. I think she would love to see your analysis of temperature data.

  21. Francsois

    Good one JH. If not p-values, what should one use?

  22. An Engineer

    I am entertained and educated by absorbing the post and comments. This place is what the Internet ought to aspire to be.

  23. Briggs

    All,

    I see we’re tempted to embrace the final fallacy, which may also be stated, “I only know how to do the wrong thing, so the wrong thing is what I shall do.”

  24. John Shade

    I quite like p-values. I like the sense they give of weighing just how convincing some set of data might be, in isolation, in support of a theory. Failing to reject the null (or rejecting it for that matter) is a formality that is perhaps made too much of, but having an assessment of how convincing some data is by itself seems useful to me. A 'high p' does not mean your theory is wrong, it merely means that using one particular test, using one particular data set and nothing else, the case for it is not convincing. Modest enough, and the main benefit is that it might encourage the search for better data, for a better experiment, and so on to more enlightenment. What's not to like about that? Please be gentle with my innocence …

  25. John Shade – There is nothing wrong with what you are asking if the real truth is what you are looking for. The problem is that in many of the arguments involving statistical data each side strives to pile up statistics that support their own particular pre-conceived dogma. If both sides could suspend judgement and work together to assess the data in support of a theory, the search for enlightenment would be much more efficient. However, taking clues from the current debates in congress, where the participants continually fling conflicting polls and statistics in each other's faces, the probability of that kind of cooperation is very low. Suspending judgement about dogma in search of truth, by the way, was the original Greek meaning of skepticism in the time of Sextus Empiricus.

  26. SteveBrooklineMA

    A friend hands me a coin, and I make a valiant effort to flip it 1000 times as fairly as I possibly can. It comes up heads 1000 times. I hypothesize that it is a fair coin… but reject that hypothesis because the chance of seeing such an extreme result given a fair coin is minuscule.

    If you think that’s reasonable, then welcome to the land of p-values!

  27. Steve, speaking as a physicist, I would look at both sides of the coin…i.e. experiment uber alles.

  28. Nullius in Verba

    “If you think that’s reasonable, then welcome to the land of p-values!”

    It’s reasonable given a couple of extra assumptions: that the prior probability of being given a biased coin is much greater than 2^-1000, and that the probability of a biased coin giving 1000 heads in a row is comparatively large.

    What an experiment like this does is to *shift* your confidence in the hypothesis (versus an alternative) from your prior belief to your posterior belief. A Bayesian would say the size of the jump is log(P(Obs|H_0)/P(Obs|H_1)), where confidence is represented by log-likelihood. The evidence *adds* or *subtracts* a fixed amount of confidence – but because nobody specifies where you start, the experiment alone doesn’t tell you where you end up.

    More precisely:
    log[P(H_0|Obs)/P(H_1|Obs)] = log[P(H_0)/P(H_1)] + log[P(Obs|H_0)/P(Obs|H_1)]
    or
    Posterior confidence = prior confidence + evidence.

    The p-value is the numerator of that expression p(Obs|H_0), the probability of the observation given that the null hypothesis is true, and when it is very small, and in particular is much smaller than P(Obs|H_1), the probability of the observation given the alternative hypothesis, then the confidence in the null hypothesis drops significantly. The 2^-1000 results in a very big drop, so you’d have to start off with a pretty extreme belief in the coin’s fairness for this evidence not to persuade you.

    However, the main point of this argument is that the journey is not the destination, and just because you have subtracted a large number from your initial confidence doesn’t *necessarily* mean that you’re now somewhere deep in the negative numbers. It depends on where you started, and it depends too on what the alternatives are.

    p-values are a rough measure of the weight of evidence. You assume that there are always alternative hypotheses that can potentially explain the result quite easily, so P(Obs|H_1) is somewhere close to 1, and then you approximate the Bayesian P(Obs|H_0)/P(Obs|H_1) as P(Obs|H_0) and treat it as a more easily calculable metric of how big a jump in confidence you should make.

    A low p-value, below 5% say, then just tells you that this is a piece of evidence worth paying attention to, that ought to make you revise your confidence in H_0. It doesn’t tell you what conclusion you therefore ought to conclude is true. An extraordinary claim still requires extraordinary evidence.
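    For a rough numeric version of that, assuming (purely for illustration) that the only alternative entertained is a two-headed coin and that you started out thinking it a million to one that the coin was fair:

        import math

        prior_log_odds = math.log(1e6)             # invented prior: a million to one on "fair"
        # log P(1000 heads | fair) minus log P(1000 heads | two-headed)
        log_evidence = 1000 * math.log(0.5) - 1000 * math.log(1.0)
        posterior_log_odds = prior_log_odds + log_evidence
        print(posterior_log_odds)                  # about -679: the fair-coin hypothesis is sunk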

  29. There’s nothing wrong with p-values any more than with Popeye. They is what they is and that’s that. To blame them for their own abuse is just a pale version of blaming any other victim.

    But if you are the kind of pervert who really enjoys abuse here goes:
    Let H0 be the claim that z ~ N(0,1) and let r = 1/z.
    Then P(|r| > 20) = P(|z| < .05) ≈ .04 < .05.
    So if z is within .05 of 0 then the p-value for r is less than .05
    and so at the 95% confidence level we must reject the hypothesis that mean(z)=0.
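    (For what it's worth, a quick check of that .04, assuming scipy is at hand:

        from scipy.stats import norm
        print(2 * norm.cdf(0.05) - 1)    # P(|z| < .05) for a standard normal, about 0.0399

    so the arithmetic holds up.)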

  30. Briggs

    All,

    A common-ish response is “There’s nothing really wrong with p-values, it’s the way you use them, etc.”

    This is odd because there are several proofs showing there just are many things wrong with them. Particularly that their use is always fallacious.

    Is using a fallacious argument knowing it is fallacious a call for help?

  31. SteveBrooklineMA

    I agree with what Nullius in Verba is saying, I think. I might consider taking as H1 the hypothesis that p is the maximum likelihood estimate #heads/#flips, with H0 being p=1/2. If we assume a prior where H0 and H1 are equal, then calculate the posterior, we could reject H0 if the posterior for H0 is less than .05 times the posterior of H1. Alternatively, we could figure out how many times smaller the prior for H1 would have to be to make the posteriors of H0 and H1 equal, and reject H0 if the prior for H1 would have to be less than .05 times the prior for H0. This isn't the same thing as p-value<.05 hypothesis testing, but I'm pretty sure you would end up with something similar. Since .05 is completely arbitrary anyway, it seems like 6 of one, 1/2 a dozen of the other to me.

  32. SteveBrooklineMA

    Allan Cooper is right, of course. This sort of p-value hypothesis testing only works when the concept of “extreme” is meaningful and in some sense implies “unlikely” due to a tail of the distribution at the extreme.

    If f(exp(i(t-t0))) for t in [0,2pi] is a family of probability densities on the circle parameterized by t0, then we could use data to test a t0=0 hypothesis (assuming f is known). But f need not have any “tail” and it’s not clear what “extreme” would mean on the circle.

  33. Nullius in Verba

    “Is using a fallacious argument knowing it is fallacious a call for help?”

    But we don’t consider it fallacious because we don’t agree that your proofs prove it to be so.

    You could think of it as a sort of Bayesian belief, in which the ‘probability’ of any particular statement being true, like “p-values are bunk”, depends on what you know, and what your priors are. There is no absolute probability, only a conditional probability depending on your assumptions and models and so on. You’re not thinking that your proofs are absolute quantities, the same for everyone, are you? 🙂

    P-values measure (approximately) the weight of evidence gained from an experiment or set of observations, and they’re generally fine as a simple heuristic when used for that purpose. Some people incorrectly think they measure the confidence you ought to have in the hypothesis as a result of the experiment, and I’ve got no problem with other people correcting them. But saying “you’re using it wrongly” is not the same thing as trying to claim that every possible use of it is wrong.

    Your first argument fails on the "disjunction" point. Your second is a description of the common misunderstanding. Your third describes a correct interpretation but complains that it doesn't match the misunderstanding. Your fourth is not a problem – different prior choices of statistic are fine, and different post hoc choices are bad for a different reason – that independence assumptions are violated. Your sixth applies to any attempt to interpret evidence, and is easily fixed by including the model in the hypothesis. The seventh is about a *different* misunderstanding – that a more sensitive test detecting smaller effects still calls them "significant", which people wrongly interpret as meaning there was still a big effect. Your eighth is just another complaint about people not knowing the correct definition. The ninth seems to be complaining that people misinterpret model parameters too. The tenth looks like the logic is scrambled – are you sure you quoted it correctly? The eleventh (?) is not presenting any problem with p-values, and your twelfth you leave for another day.

    Twelve arguments, and not one of them is convincing! So why should we think p-values are fallacious?

    However, I came to the conclusion some time ago that our opinions differ fundamentally and are unlikely to be resolved, so I let it pass. There’s no point in arguing, unless it allows for an interesting or entertaining digression along the way. I do find your point of view fascinating though, and I hope you’ll continue to promote it and expound on it. Just don’t mind us if we don’t all actually change our ways!

  34. Briggs

    Alan,

    Unless it says “Nothing”, no thank you. Have you embraced the “They’re good if you’re careful” fallacy? Their use, as proved above, is always a fallacy (unless you’re making decisions about p-values per se, which nobody does).

  35. ozero91

    “A logical disjunction is two predicates connected by the word “or”. It sure looks like a disjunction to me!”

    If this link below is to be believed, I don’t think this quote is necessarily true.

    http://philosophy.lander.edu/logic/conjunct.html

    It seems as though "The dog is dead or the dog is not dead" is not a disjunction. It appears Briggs claims the sentence in question is not a disjunction because, like my example, "The null hypothesis is false" and "the p-value has attained by chance an exceptionally low value" cannot both be true/cannot both be false. If one is false, the other is true and vice-versa. I think this is because of the "by chance" part, since it implies a Type I error. If "The null hypothesis is false" is true, then a Type I error could not have occurred.

  36. Briggs

    Ozero91,

    Very well. Not a valid disjunction, then.

    Here's another invalid disjunction: "Either ozero91's comment is true or vegetables grow on Mars." Two propositions connected by an "or".
