Another Reason To Abandon P-values; Or Another Way To Cheat

If you’re a scientist, soft or hard, who routinely uses statistics, it’s likely that your funding, and therefore your career, the very wellness of your being hinges on discovering statistically significant results. Small p-values, that is.

Sometimes, however, the data just won’t cooperate. You conjured a theory that just must be true, set up your experiment, and collected your data. You ran your test and—drat!—a p-value larger than the publishable limit!

Do you despair? Do you, God forbid, abandon your oh-so-beautiful theory? Do you chuck it all and even change your mind?

No, sir! Friend, I’m here to tell you there is hope. There is a cure for what ails you. There is sunshine waiting. Introducing the Sample Size Extender™, the only frequentist statistical procedure guaranteed—did he say guaranteed?—to produce a publishable p-value each and every time.

As long as your funding holds, the Sample Size Extender™ will provide you a small p-value or we will double your money back.

The Sample Size Extender™ is so easy a child could use it. Here’s how it works. All you need is a theory: any will do, the wilder the better. Begin collecting your data, put them through the works one at a time, then just wait for the publishable p-value, which is sure to come.

Here’s an example. Your theory supposes that a certain measure will be different between two groups. You’ll confirm this using a t-test—though any test will do. You begin by measuring two people, one from each group[1]. Put the measurements of this pair into the t-test and check whether your p-value is less than the magic number. If it is: stop! Your theory is proven. Go forth and write your paper.

But if the p-value is not small enough, measure two more people, one more from each group. Then run the augmented sample through the t-test and check for a publishable p-value again.

Iterate the Sample Size Extender™ and you will always—absolutely always—find a p-value less than the magic number. Yes, sir, friends: this method is foolproof. Fools prove it every day!

Now I know you’re doubtful, neighbor. I know you don’t believe. Why should you trust old Honest Matt? He’s trying to sell you something! Friend, I’m not asking you to believe. I’m not asking for your trust. I want you to convince yourself. I want you to see the truth with your own eyes.

Look and behold!

[Figure: P-values will kill you]

The picture is the Sample Size Extender™ in action. What do you suppose would happen if you grab two numbers from the air, numbers which are fiction, which are as real as Bigfoot, numbers which have no relation to one another? Would a t-test based on these numbers give you a small p-value? It might, friend, it might. But not often. It’s supposed to happen only one out of every twenty attempts.

But with the Sample Size Extender™ it can always happen.

What I did was to grab numbers from the air, one pair at a time and run them through a t-test. If the first pair through the t-test gave joy, then I stopped. If not, then I added another pair and checked again. I kept doing this until I got a publishable p-value, and then I noted how many pairs it took. That’s the sample size.

Then I did the whole thing over. I started with one pair, then two, and so on. I again noted how many pairs (up to 1000) it took to get a publishable p-value. I did the whole procedure 500 times and plotted up the results.
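For readers who want to try the procedure at home, here is a minimal sketch of it, not the code behind the picture. It makes some loud assumptions: the made-up numbers are standard normal draws, the "magic number" is 0.05, and a two-sample z-test with known unit variance stands in for the t-test so that only Python's standard library is needed (any test will do, as noted above). The function names are hypothetical, and the exact percentages it yields need not match the plot.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05               # assumed "magic number"
STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def two_sided_p(diff_sum, n_pairs):
    """Two-sided p-value of a two-sample z-test (known unit variance per group).

    diff_sum is the running sum of (group A - group B) over n_pairs pairs;
    under the null it has variance 2 * n_pairs.
    """
    z = diff_sum / math.sqrt(2.0 * n_pairs)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def pairs_until_significant(rng, max_pairs=1000, alpha=ALPHA):
    """Add one pair at a time, test after each, stop at the first p < alpha.

    Returns the number of pairs used, or None if max_pairs was reached first.
    """
    diff_sum = 0.0
    for n in range(1, max_pairs + 1):
        # Both numbers come from the same distribution: there is no real effect.
        diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if two_sided_p(diff_sum, n) < alpha:
            return n
    return None


def simulate(n_runs=500, max_pairs=1000, seed=1):
    """Repeat the whole procedure n_runs times and collect the stopping points."""
    rng = random.Random(seed)
    return [pairs_until_significant(rng, max_pairs) for _ in range(n_runs)]


def fraction_within(stops, k):
    """Share of runs that reached a publishable p-value in k or fewer pairs."""
    return sum(1 for s in stops if s is not None and s <= k) / len(stops)


stops = simulate()
for k in (5, 10, 100, 1000):
    print(f"significant within {k:4d} pairs: {fraction_within(stops, k):.1%}")
```

Raising `max_pairs` only pushes the final share upward, which is the whole trick: the longer you are allowed to keep sampling, the more runs eventually cross the line.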

Over 10% of the time it only took five or fewer pairs to prove my theory—which is no theory at all! Remember these are entirely made up numbers! How much easier will it be, friend, to prove your theory which must be true!

Talk about simple, neighbor. Talk about progress. Talk about limitless possibilities! Why, 20% of the time, it took only ten or fewer pairs to prove Nothing. Forty percent of the time it took fewer than 100 pairs. How little is needed to do so much!

This is statistics, friend. This is what it’s all about. No need to use that tired old phrase “More research is needed” when research each and every time will prove what you want—but only if you use the Sample Size Extender™.

The sharp-eyed among you will note the strangeness at the end of the graph, the spike at 1000. Well, friends, this is where I tuckered out. I could have gone on generating pairs beyond 1000 and—you’ll have to trust me now—eventually these experiments would all give me a small p-value. You’ll always get one, as long as you can keep taking measurements.

What’s that I hear? You can’t afford to take measurements indefinitely? Are you sure you can’t get a bigger grant? No? Then let me tell you about the miracle of our patented Sub-Group Analyzer™…

————————————————————————————————–

[1] Yes, sourpuss, technically you need to start with four people for a t-test. Yet another reason to make the switch to Bayes!

This post was inspired by reader Andrew Kennett and his link to the article Re-examining Significant Research: The Problem of False-Positives.

Comments

  1. Matt asked, “What’s that I hear? You can’t afford to take measurements indefinitely?”

    This is why the most advanced scientists use computer models instead of dirty, filthy, smelly rats or patients. Unlimited supply and they run all night for (almost) free. Numbers not good enough? Come back tomorrow. We’ll have more.

  2. Frederick Mosteller said once that it is easy to lie with statistics, but it is easier to lie without them. Anybody can do tricks like that, but it does not make them statistically valid (although sequential sampling is possible, if one adjusts well for multiplicity issues).

    Does the fact that it is possible to perform dirty tricks with (so-called frequentist) statistics justify switching to Bayesian statistics? I am not sure that is the case. I know that there are better and more sophisticated arguments to do that (and I do not want to go debating on that, not at this moment anyway). I do agree that there are cases in which using Bayesian statistics makes sense.

    Just curious to know – are you absolutely sure that there aren’t any dirty tricks out there that can abuse Bayesian statistics?
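The point in this comment about sequential sampling with a multiplicity adjustment can be sketched roughly as follows. This is only an illustration under stated assumptions: standard normal data with no true effect, a z-test with known unit variance, a hypothetical plan of five interim looks, and a plain Bonferroni split of the 0.05 threshold across those looks. Bonferroni is a deliberately conservative stand-in; real group-sequential designs typically use tailored boundaries such as Pocock or O'Brien-Fleming.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05                    # assumed overall significance threshold
LOOKS = [20, 40, 60, 80, 100]   # hypothetical pre-planned looks (pairs collected)


def false_positive_rate(looks, per_look_alpha, n_runs=2000, seed=7):
    """Share of no-effect experiments that hit p < per_look_alpha at any look."""
    rng = random.Random(seed)
    look_set = set(looks)
    last_look = max(looks)
    hits = 0
    for _ in range(n_runs):
        diff_sum = 0.0
        for n in range(1, last_look + 1):
            diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
            if n in look_set:
                z = diff_sum / math.sqrt(2.0 * n)
                p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
                if p < per_look_alpha:
                    hits += 1
                    break  # the experimenter stops and "publishes"
    return hits / n_runs


# Peeking five times at the full threshold versus splitting it across the looks.
naive_rate = false_positive_rate(LOOKS, ALPHA)
adjusted_rate = false_positive_rate(LOOKS, ALPHA / len(LOOKS))
print(f"unadjusted peeking:  {naive_rate:.3f}")
print(f"Bonferroni-adjusted: {adjusted_rate:.3f}")
```

Because both calls use the same seed, they see the same simulated data, so the adjusted rate can never exceed the unadjusted one. The key design choice is that the looks are fixed before any data arrive.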

  3. Joe,

    There are plenty of fantastic ways to cheat with Bayes, some of them subtle and lovely. But there are more ways to cheat with frequentist stats, mostly because so many more people use it and because of p-values. That’s what makes it so much fun!

    You can get away with more when talking about parameters and p-values, less when talking about parameter posteriors, and still less using predictive stats. But you can cheat with all. Any use of statistics should always be suspect.

  4. You might like this paper:

    False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant.
    http://www.ncbi.nlm.nih.gov/pubmed/22006061

    (I saw this on http://blogs.discovermagazine.com/gnxp/2011/11/the-problem-of-false-positives/)

    From the paper:

    We propose the following six requirements for authors.

    Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

    Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification. This requirement offers extra protection for the first requirement. Samples smaller than 20 per cell are simply not powerful enough to detect most effects, and so there is usually no good reason to decide in advance to collect such a small number of observations. Smaller samples, it follows, are much more likely to reflect interim data analysis and a flexible termination rule. In addition, as Figure 1 shows, larger minimum sample sizes can lessen the impact of violating Requirement 1.

    Authors must list all variables collected in a study. This requirement prevents researchers from reporting only a convenient subset of the many measures that were collected, allowing readers and reviewers to easily identify possible researcher degrees of freedom. Because authors are required to just list those variables rather than describe them in detail, this requirement increases the length of an article by only a few words per otherwise shrouded variable. We encourage authors to begin the list with “only,” to assure readers that the list is exhaustive (e.g., “participants reported only their age and gender”).

    Authors must report all experimental conditions, including failed manipulations. This requirement prevents authors from selectively choosing only to report the condition comparisons that yield results that are consistent with their hypothesis. As with the previous requirement, we encourage authors to include the word “only” (e.g., “participants were randomly assigned to one of only three conditions”).

    If observations are eliminated, authors must also report what the statistical results are if those observations are included. This requirement makes transparent the extent to which a finding is reliant on the exclusion of observations, puts appropriate pressure on authors to justify the elimination of data, and encourages reviewers to explicitly consider whether such exclusions are warranted. Correctly interpreting a finding may require some data exclusions; this requirement is merely designed to draw attention to those results that hinge on ex post decisions about which data to exclude.

    If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate. Reporting covariate-free results makes transparent the extent to which a finding is reliant on the presence of a covariate, puts appropriate pressure on authors to justify the use of the covariate, and encourages reviewers to consider whether including it is warranted. Some findings may be persuasive even if covariates are required for their detection, but one should place greater scrutiny on results that do hinge on covariates despite random assignment.

    We propose the following four guidelines for reviewers.

    Reviewers should ensure that authors follow the requirements. Review teams are the gatekeepers of the scientific community, and they should encourage authors not only to rule out alternative explanations, but also to more convincingly demonstrate that their findings are not due to chance alone. This means prioritizing transparency over tidiness; if a wonderful study is partially marred by a peculiar exclusion or an inconsistent condition, those imperfections should be retained. If reviewers require authors to follow these requirements, they will.

    Reviewers should be more tolerant of imperfections in results. One reason researchers exploit researcher degrees of freedom is the unreasonable expectation we often impose as reviewers for every data pattern to be (significantly) as predicted. Underpowered studies with perfect results are the ones that should invite extra scrutiny.

    Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions. Even if authors follow all of our guidelines, they will necessarily still face arbitrary decisions. For example, should they subtract the baseline measure of the dependent variable from the final result or should they use the baseline measure as a covariate? When there is no obviously correct way to answer questions like this, the reviewer should ask for alternatives. For example, reviewer reports might include questions such as, “Do the results also hold if the baseline measure is instead used as a covariate?” Similarly, reviewers should ensure that arbitrary decisions are used consistently across studies (e.g., “Do the results hold for Study 3 if gender is entered as a covariate, as was done in Study 2?”). If a result holds only for one arbitrary specification, then everyone involved has learned a great deal about the robustness (or lack thereof) of the effect.

    If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication. If a reviewer is not persuaded by the justifications for a given researcher degree of freedom or the results from a robustness check, the reviewer should ask the author to conduct an exact replication of the study and its analysis. We realize that this is a costly solution, and it should be used selectively; however, “never” is too selective.

  5. Another article that includes the Simmons paper.
    Fraud Scandal Fuels Debate Over Practices of Social Psychology

    … What the researchers omitted, as they went on to explain in the rest of the paper, was just how many variables they poked and prodded before sheer chance threw up a headline-making result—a clearly false headline-making result.

    … rather than establish from the outset how many subjects they would test, they tested until they obtained the false result.

    They even mention one of Briggs’ favorite psychologists, Daryl Bem.

  6. Let’s hope that this paper does not become a “how to” guide in psychology. I think part of the problem is that if others are massaging the data, and nobody is interested in lower p-values in unpublished papers, you might as well join in. The more people join in, the worse it gets.

  7. To demonstrate if certain songs can change listeners’ age, which we know is false, the authors ask “how old do you feel right now?” after listening to the songs. Hmm…

    I feel like a billionaire! ^_^

    The simulations are also just as interesting… uh… at least not well explained! The problem is not new at all, and it’s known that the mentioned sampling method is not valid. And there are statistical techniques for analyzing data sequentially. Google “sequential trials.”

    Misleading people with the abuse of statistics and misinterpretation of p-values doesn’t imply that there are more ways to cheat with frequentist statistics. Imagine the number of possible prior distributions.

  8. JH,

    Fiddle faddle. Even if you used “sequential” statistics, you’d still be left with a p-value in the end. And what’s that? What is the definition of one?

    Speed/Genemachine,

    Great links!

  9. What is the definition of p-value? Google it.

    I would still suggest looking at the statistical significance. The conclusion of statistical significance is often associated with dichotomous decision making. One might or might not be able to select a better treatment. Or I would be on the alert if, e.g., my dad’s average glucose level for the past month has a large standardized score. Lack of statistical significance might mean that further information is needed.

    How about helping your readers understand how they can use a Bayesian method to make a dichotomous decision?

  10. JH,

    Good idea! I googled “definition of p-value” and the first site that came up was Wikipedia. According to them, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.”

    This is false. Now what?

    To answer your second question, the one of real importance. Bayes (in a predictive sense) will give your dad the actual probability that (say) this treatment will work. It will not give him a p-value (or posterior), which greatly, vastly over-estimates the probability the treatment will work.

    He can then take the actual probability (which still assumes the model itself is true, and that he is “like” the sample that went into the model) and use it in a decision-analytic sense. Of course, the so-called “actual probability” will still be too certain in the sense that your data doesn’t know if the model is true and if he is really “like” the sample that went into the model. How much less certain nobody knows.

    Unless, of course, he’d have some good information that the model was true if it demonstrates skill (in the technical sense). Then he’d only have to guess if he was “like” the sample that went into the model and the sample used to demonstrate skill.

    Yes, the answer is: people are too certain of themselves.

  11. Why is this true, specifically for t-tests? If a t-test is supposed to give a low p-value for a significant difference in mean for two sets, ever increasing sets of random numbers with the same range should trend towards the same mean.

  12. mt
    try large scale flip of 2 (identical) coins ?

    Isn’t the whole thing a point for skeptics (Sextus)?

  13. mt,

    Excellent question. First, it has nothing to do with t-tests per se. Any frequentist test statistic will do.

    While it’s true, as you say, the simulated numbers will converge, we are also doing a large number of tests. It’s not exactly equivalent, but think of a lottery ticket. Low probability to win, but buy enough tickets and you’ll win eventually.
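The lottery analogy can be put into rough numbers. The sketch below assumes each test is an independent "ticket" with a 0.05 chance of a small p-value, which, as the comment says, is not exactly equivalent to the real procedure: successive looks at a growing sample are correlated, so the actual chance of an eventual win grows more slowly than this. The function name is hypothetical.

```python
ALPHA = 0.05  # assumed chance that a single test "wins" when there is no effect


def chance_of_at_least_one_hit(n_tests, alpha=ALPHA):
    """Probability that at least one of n_tests independent tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Chance of at least one win for a few ticket counts.
table = {k: chance_of_at_least_one_hit(k) for k in (1, 5, 20, 100)}
for k, prob in table.items():
    print(f"{k:3d} tests: {prob:.3f}")
```

Under that independence assumption, twenty tickets already give better-than-even odds of at least one publishable result.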

  14. I don’t know about the behavior of other tests, it just seems like t-tests would be the exception that proves the rule. As the two sets get larger, it would seem less likely that there will be a sufficiently large run of biased random numbers that would move the means apart and give the small p-value. Conversely, it’s much easier to be randomly biased with small sets. It would be the opposite of the lottery, your best chance is to win early, and the more you play without winning, the harder it is to win.

  15. Wikipedia. According to them, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This is false.

    What is false in the definition?

    To answer your second question, the one of real importance. Bayes (in a predictive sense) will give your dad the actual probability that (say) this treatment will work. It will not give him a p-value (or posterior), which greatly, vastly over-estimates the probability the treatment will work.

    By saying that it would give the actual probability, you are too certain of yourself! (See how easy for me to say this.) It would give you an estimated probability based on the postulated model. No one knows the actual probability.

    Give me a rule for a decision of “YES” or “NO”… in a predictive sense. How would my decision of yes-or-no based on the posterior distribution differ from the one based on the predictive distribution that depends on the posterior distribution?

    If the FDA needs to decide whether the treatment is safe and effective, should the FDA use the predictive result from my DAD only?

    So the probability value (p-value) is not analytical?

    Unless, of course, he’d have some good information that the model was true if it demonstrates skill (in the technical sense). Then he’d only have to guess if he was “like” the sample that went into the model and the sample used to demonstrate skill.

    No one has any information as to whether the model is true, but I can help him judge whether the model might be adequate though.

    How am I supposed to demonstrate if the model is skillful at the moment when I need to make a yes-or-no decision? Do I suggest that he keep taking the combination of medication for another week or month knowing the data indicate the treatment has not lowered his BGL… so I can collect new data to verify if the appropriate model has any skills?

  16. JH,

    I can say, but I do not just “say”, using the p-value is too certain because it is: it is mathematically provable. It is obvious, and true, that a small p-value does not translate into high probability that a given hypothesis is true. It is guaranteed that the p-value will overstate the evidence if any attempt is made at using the p-value to inform the probability of any decision based on an observable. This is not opinion, but a statement of deduced fact.

    I defined the “actual probability” conditional on the truth of the model and that your dad is “like” the sample used to fit the model. Given these, this is the definition of the actual probability. This is not opinion, but a statement of deduced fact.

    You’re quite wrong that “no one has any information as to whether the model is true.” We can have such information; further, we know exactly how to look for it. As Schervish showed us, models should be calibrated, for example. He showed that any model that is calibrated (on data not used to fit the model) is better than the non-calibrated model for any decision function. And as even Fisher introduced, they should have skill (in the formal sense). I have a decision rule for “YES” or “NO” questions: see my decision calculator, which factors in the things I just mentioned.

    Further, since you are a statistician and often use models, isn’t it a strange thing to say that you have no information whether the models you use are true? How do you pick a model if you don’t believe it? I hope you don’t answer “tradition” (or the like), because the infinite regress (and ultimate fallacy) will be obvious.

    Your last question also suffers from a fallacy. “I have to do something now, so why shouldn’t I use this model (with small p-value)?” That gives leave for anybody to form any model and say, “Why shouldn’t I use it, it’s all I have?”

    Update: Forgot to answer your first question. Can anybody help us out here? This is a homework assignment. What is wrong with Wikipedia’s p-value definition? (And there is something non-trivial wrong.)

  17. I can define ‘JH people’ to be people who can speak Hakkanese. Can you say my definition is false? I can’t say your definition of ‘actual probability’ is false, hey, it’s your definition.

    Still, fitting a statistical model to data yields estimated values of an unknown quantity.

    Oh… there are differences between true, assumed to be true and believed to be true.

    I don’t see why my question (definitely not I) suffers from the fallacy you state; question (4) below restates it.

    I await your answers to the following.

    1) What’s false about the definition of p-value?

    2) Yes, models should and can be calibrated. A statistical model is either ‘true’ or not. So, what information would give me an answer to the question of whether a model is true?

    3) A rule for a decision of yes or no. How about using the example in which the BGL were observed daily for a month?

    4) How am I supposed to demonstrate if the model is skillful at the moment when I need to make a yes-or-no decision?

    BTW, in reality, what’s wrong with ‘why shouldn’t I use it, it’s all I have’ (this is not suggested by me, just to be clear)?

  18. “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. This is false.”

    It’s false because the p-value does not reflect the probability of the “test statistic” being drawn by random chance. It is the probability of obtaining the actual sample statistic (or regressor/coefficient).

  19. Scott,

    Excellent, but not quite right.

    First, the “null hypothesis” is not a hypothesis about any observable, real-life event. It is a statement that some unobservable, non-existent parameters take some specified value. Second, the p-value also conditions on the truth of the model. That is, in making the statement that the p-value is this-and-such, it also assumes that the model is true and without error. How do we know that?

    We do not. We can, however, quantify the uncertainty of the model in Bayesian theory. But not in frequentist.

  20. A math definition can’t be false. Whether it’s any good is a different story, so is whether you like it or not.

    The Bayesian approach quantifies the uncertainty about observables and the uncertainty in the parameter of a model assumed to be true. Which, at least to me, is different from saying that it quantifies the uncertainty of the model as if it involves the probability of your model being true.

  21. hi to all wmbriggs.comers this is my frst post and thought i would say hi –
    regards speak again soon
    gazza

  22. Model misspecification is always a threat to internal validity. But can’t p-values and t-tests/F-tests also help with that (say, by testing whether non-linear independent variables are more statistically significant than linear variables in a regression)?
    Speaking for myself, I still think it’s a bit early to give up on p-tests.