Why So Many (Medical) Studies Based On Statistics Are Wrong

This was inspired by the (unfortunately titled) article Lies, Damned Lies, and Medical Science, published in this month’s Atlantic (thanks A&LD!).

The article profiles the work of John Ioannidis, who has spent a career trying to show the world that the majority of peer-reviewed medical research is wrong, misleading, or of little use. Ioannidis “charges that as much as 90 percent of the published medical information that doctors rely on is flawed…he worries that the field of medical research is so pervasively flawed, and so riddled with conflicts of interest, that it might be chronically resistant to change—or even to publicly admitting that there’s a problem.”

“The studies were biased,” he says. “Sometimes they were overtly biased. Sometimes it was difficult to see the bias, but it was there.” Researchers headed into their studies wanting certain results—and, lo and behold, they were getting them. We think of the scientific process as being objective, rigorous, and even ruthless in separating out what is true from what we merely wish to be true, but in fact it’s easy to manipulate results, even unintentionally or unconsciously. “At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded,” says Ioannidis. “There is an intellectual conflict of interest that pressures researchers to find whatever it is that is most likely to get them funded.”

Most medical studies—and most studies in other fields—rely on statistical models as primary evidence. The problem is that the way these statistical models are used is deeply flawed. That is, the problem is not really with the models themselves. The models are imperfect, but the errors in their construction are minimal. And since (academic) statisticians care primarily about how models are constructed (i.e. the mathematics), the system of training in statistics concentrates almost solely on model construction; thus, the flaw in the use of models is rarely apparent.

Without peering into the mathematical guts, here is how statistical studies actually work:

  1. Data are gathered in the hopes of proving a cherished hypothesis.
  2. A statistical model is selected from a toolbox which contains an enormous number of models, yet it is almost invariably the hammer, or “regression”, that is pulled out.
  3. The model is then fit to the data. That is, the model has various drawstrings and cinches that can be used to tighten itself around the data, in much the same way a bathing suit is made to form-fit around a Victoria’s Secret model.
  4. And to continue the swimsuit-modeling analogy, the closer the data can be made to fit, the more beautiful the results are said to be. That is, the closer the data can be made to fit the statistical model, the more confident a researcher is that his cherished hypothesis is right.
  5. If the fit of the data (swimsuit) on the model is eye popping enough, the results are published in a journal, which is mailed to subscribers in a brown paper wrapper. In certain cases, press releases are disseminated showing the model’s beauty to the world.

Despite the facetiousness, this is it: statistics really does work this way, from start to finish. What matters most is the fit of the data to the model. That fit really is taken as evidence that the hypothesis is true.

But this is silly. At some point in their careers, all statisticians learn the mathematical “secret” that any set of data can be made to fit some model perfectly. Our toolbox contains more than enough candidate models, and one can always be found that fits to the desired, publishable tightness.
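
Here is a minimal sketch of that secret, in Python with invented data. The numbers below are pure noise, yet a model flexible enough (a polynomial with as many coefficients as data points) fits them perfectly, completing steps 1 through 5 above with flying colors:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    x = np.arange(n, dtype=float)
    y = rng.normal(size=n)               # the "data": nothing but noise

    # A polynomial with as many coefficients as data points always fits exactly.
    coefs = np.polyfit(x, y, deg=n - 1)
    y_hat = np.polyval(coefs, x)

    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print("R^2 =", 1 - ss_res / ss_tot)  # 1.0, up to rounding: a "perfect" fit to noise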

And still this wouldn’t be wrong, except that after the fit is made, the statistician and researcher stop. They should not!

Consider physics, a field which has far fewer problems than medicine. Data and models abound in physics, too. But after the fit is made, the model is used to predict brand new data, data nobody has yet seen; data, therefore, that is not as subject to researcher control or bias. Physics advances because it makes testable, verifiable predictions.

Fields that make use of statistics rarely make predictions with their models. The fit is all. Since any data can fit some model, it is no surprise when any data does fit some model. That is why so many results that use statistical models as primary evidence later turn out to be wrong. The researchers were looking in the wrong direction: to the past, when they should have been looking to the future.
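
The difference is easy to see in a sketch (Python, data invented): a flexible model fit to noise produces a flattering in-sample fit, but when the frozen model is asked to predict brand-new observations, the apparent skill evaporates.

    import numpy as np

    def r2(y, y_hat):
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(2)
    x_old = np.linspace(0, 1, 15)
    y_old = rng.normal(size=15)               # invented "outcomes": pure noise

    coefs = np.polyfit(x_old, y_old, deg=6)   # a moderately flexible model
    print("fit to old data, R^2 =", round(r2(y_old, np.polyval(coefs, x_old)), 2))

    # The physics-style step: freeze the model and predict data nobody has seen.
    x_new = rng.uniform(0, 1, size=15)
    y_new = rng.normal(size=15)               # new draws from the same signal-free source
    print("new data,        R^2 =", round(r2(y_new, np.polyval(coefs, x_new)), 2))
    # The first number flatters; the second is typically near zero or negative.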

This isn’t noticed because the published results are first filtered through people who practice statistics in just the same way.

Though scientists and science journalists are constantly talking up the value of the peer-review process, researchers admit among themselves that biased, erroneous, and even blatantly fraudulent studies easily slip through it. Nature, the grande dame of science journals, stated in a 2006 editorial, “Scientists understand that peer review per se provides only a minimal assurance of quality, and that the public conception of peer review as a stamp of authentication is far from the truth.” What’s more, the peer-review process often pressures researchers to shy away from striking out in genuinely new directions, and instead to build on the findings of their colleagues (that is, their potential reviewers) in ways that only seem like breakthroughs…

Except, of course, for studies which examine the influence of climate change, or for other studies which are in politically favorable fields: stem cell research, AIDS research, drug trials by pharmaceuticals, “gaps” in various sociological demographics, and on and on. Those are all OK.

Incidentally, predictions can be made from statistical models, just as in physics. It’s just that nobody does it. Partly this is because it is expensive (twice as much data has to be collected), but mostly it’s because researchers wouldn’t like it. After all, they’d spend a lot of time showing that what they wanted to believe is wrong. And who wants to do that?

Comments

  1. dearieme

    My own dictum is:
    “All medical research is wrong” is a better approximation to the truth than almost all medical research.

  2. Stuart Buck

    What about the practice of splitting up your data, fitting a model to half of it, and then testing how well it predicts the other half?
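
    A minimal sketch of Stuart Buck’s suggestion, in Python with invented data: the split is made once, and the held-out half is scored once.

        import numpy as np

        def r2(y, y_hat):
            return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

        rng = np.random.default_rng(3)
        x = rng.uniform(0, 10, size=60)
        y = 2.0 + 0.5 * x + rng.normal(scale=2.0, size=60)  # made-up data with a real signal

        idx = rng.permutation(60)
        train, test = idx[:30], idx[30:]

        coefs = np.polyfit(x[train], y[train], deg=1)       # fit on the training half only
        print("train R^2:", round(r2(y[train], np.polyval(coefs, x[train])), 2))
        print("test  R^2:", round(r2(y[test], np.polyval(coefs, x[test])), 2))   # the honest number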

  3. Briggs

    Stuart Buck,

    A lovely idea. But nobody that I know can resist tweaking their model after seeing how it fits on the other half. Usually the process is iterated many times so that the model comes to fit the “other half” nicely, too.

    This process of course is really no different from just fitting the model on all the data.

    Plus, there is the problem that the data you collected are limited, whereas most people believe their models accurately explain/predict data of wider scope. In a medical trial, you might have one hospital supplying patients. Are those patients truly representative of all human beings, now and forever? Or do they only represent that hospital during the very short period of time the study was run?
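
    A sketch of the failure mode Briggs describes, in Python with invented, signal-free data: letting the analyst keep tweaking (here, greedily adding whichever candidate variable most improves the score on the “other half”) makes the holdout number look impressive even though nothing real is being predicted.

        import numpy as np

        def r2(y, y_hat):
            return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

        rng = np.random.default_rng(4)
        n, p = 40, 60
        X = rng.normal(size=(n, p))      # 60 candidate predictors, all pure noise
        y = rng.normal(size=n)           # outcome unrelated to any of them
        train, test = slice(0, 20), slice(20, 40)

        chosen, best_test, improved = [], -np.inf, True
        while improved:                  # "iterate many times", peeking at the test half
            improved = False
            for j in range(p):
                if j in chosen:
                    continue
                cols = chosen + [j]
                A_tr = np.column_stack([np.ones(20), X[train][:, cols]])
                A_te = np.column_stack([np.ones(20), X[test][:, cols]])
                beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
                score = r2(y[test], A_te @ beta)
                if score > best_test:
                    best_test, best_j, improved = score, j, True
            if improved:
                chosen.append(best_j)

        print("apparent holdout R^2 after tweaking:", round(best_test, 2))
        # Far better than it has any right to be: every variable was noise, so the
        # true predictive skill is zero. The "other half" has been fit, not tested.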

  4. Gary P

    A guy I work with found an even more fun way to fool himself. We are working on a quick test for a finished product; the test has a lot of variation, but it actually seems a little better than the production test. In order to show better numbers, he actually started adding data on the materials used to make the product. He threw the whole mess into a statistical-package meat grinder and came up with amazing correlations (on the original data only!).
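
    Gary P’s meat grinder is easy to imitate in Python; all numbers below are invented. Keep piling in unrelated inputs and the in-sample correlation looks better and better, on the original data only:

        import numpy as np

        rng = np.random.default_rng(5)
        n = 25                                    # 25 hypothetical production runs
        quick_test = rng.normal(size=n)           # the quick-test result (pure noise here)

        for k in (2, 10, 24):                     # keep adding "materials" columns
            X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
            beta, *_ = np.linalg.lstsq(X, quick_test, rcond=None)
            fitted = X @ beta
            r2 = 1 - np.sum((quick_test - fitted) ** 2) / np.sum((quick_test - quick_test.mean()) ** 2)
            print(f"{k:2d} unrelated inputs -> in-sample R^2 = {r2:.2f}")
        # R^2 climbs toward 1.0 as columns are added, even though every input is noise.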

  5. JH

    If a swimsuit represents a statistical tool, perhaps the problem is that almost every swimsuit will look good to men (statistics practitioners) as long as there is a significant woman in it. Oh, my, I’m talking nonsense!

    Why So Many (Medical) Studies Based On Statistics Are Wrong

    My guess is that some practitioners don’t know what they are doing. Let’s double click on the SAS or SPSS icon, click, click, and click. There, some statistical results are spit out. Oooops, the results are not what we want. Another three clicks on different choices later, darn, still not what we want. Let’s delete a couple of data points. Clickclickclick, yeah, statistically significant! Done.

    Of course, I am not one of the coauthors of those studies, which could be the reason they are wrong! ^_^

  6. DAV

    JH,

    one wonders what characteristics distinguish between a significant and an insignificant woman. Perhaps an example would help. Which of the following, for instance, are significant: Twiggy, Mama Cass, Michelle Obama, Bonnie (of Bonnie & Clyde), Paris Hilton, Mother Theresa, Olive Oyl, Betty Boop?

  7. StephenPickering

    As toxicology is related to medicine, I offer the following observation.

    A recent toxicological study on lead (Pb) in food by the European Food Safety Authority (a governmental body) states that the epidemiological data (dose-response data for chronic kidney disease) provided little or no evidence for the existence of thresholds, i.e. that there is no safe dose. EFSA analysed the data using the US EPA software BMDS 2.1.1, which is a package of 7 data-fitting models. EFSA applied them all and simply decided to use the one with the best goodness-of-fit.

    And indeed the best-fitting models (i.e. Weibull, Log-probit, Multistage) all provide a curve that passes through the origin of the dose-response plot, hence no safe dose. Yet even a cursory inspection of the fit shows that the curves do not match the data at all well. In fact, the curves look as though they are constrained to go through zero. I re-plotted the data (there are only 4 points) using Excel and found that a log curve fitted the data very much better (R² = 0.9988). That fit also suggests that there is a safe dose.

    It looks as though some people would rather believe the output of a fancy software package than the evidence of their eyes.
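
    A toy version of the comparison Stephen describes, in Python. The four points below are made up, not the EFSA data; the sketch only illustrates how a family of curves constrained through the origin rules out a threshold by construction, while an unconstrained log curve (of the Excel-trendline kind, y = a*ln(dose) + b) can fit better and imply one.

        import numpy as np

        dose     = np.array([4.0, 8.0, 16.0, 32.0])    # hypothetical doses
        response = np.array([0.05, 0.11, 0.14, 0.16])  # hypothetical responses

        def r2(y, y_hat):
            return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

        # Constrained family: power law through the origin, so the response vanishes
        # only as the dose goes to zero (no threshold possible, by construction).
        b_pow, log_a = np.polyfit(np.log(dose), np.log(response), deg=1)
        fit_origin = np.exp(log_a) * dose ** b_pow

        # Unconstrained log curve: crosses zero at a positive dose (an apparent threshold).
        a, b = np.polyfit(np.log(dose), response, deg=1)
        fit_log = a * np.log(dose) + b

        print("through-origin fit R^2:", round(r2(response, fit_origin), 3))
        print("log-curve fit      R^2:", round(r2(response, fit_log), 3))
        print("apparent threshold dose (log curve):", round(float(np.exp(-b / a)), 2))
        # On these made-up points the unconstrained curve fits better and implies a
        # threshold; the "no safe dose" conclusion comes from the constraint, not the data.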

  8. DAV

    StephenPickering,

    There’s not much funnier than answers looking for problems. Or rather it would be funny if things such as these weren’t used for nefarious control.

    Common sense says that everything has a safe dosage. Name one chemical for which contact with a single molecule is fatal or highly detrimental. OTOH, nearly every substance has a fatal dosage; water, for example. “Dosage makes the poison” is a long-standing concept. But then, that idea arose before statistics (as we know it) came to be used to prove the obvious wrong.

  9. Benjamin Kuipers

    In Machine Learning (ML), the first commandment is that training and testing data must be kept scrupulously separate, and reviewers are quite sensitive to ways that testing data might be used as part of the training set. Furthermore, an early part of every ML course is a demonstration of “overfitting”: how getting *too good* a fit to the training data almost guarantees a *terrible* fit to the testing data.

    At a broader level, if your goal is to be a successful scientist, cooking the data to get a better fit to get a publication is a very short-sighted strategy. Success depends on getting other people to build on your results. When they do this, they will replicate your experiment, and if the results don’t pan out, your reputation suffers, possibly very dramatically. An intelligently ambitious scientist will try hard to *refute* his/her own hypothesis (hoping to fail), because he/she knows that others will do the same thing. If bad news is possible, you want to get it at home, where you can still do something about it.

    This gives the scientific enterprise as a whole a self-correcting dynamic. Erroneous results, whether from fraud, bias, or honest error, will be uncovered if the result is important enough for others to build on. The only way to avoid detection is to publish results that are so unimportant that no one tries to build on them. This self-correcting dynamic is responsible for the success of science over the past several centuries.
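
    The classroom overfitting demonstration Benjamin mentions, sketched in Python with made-up data: training error keeps falling as the model gets more flexible, while the error on held-out test data eventually gets much worse.

        import numpy as np

        rng = np.random.default_rng(9)
        truth = lambda x: np.sin(2 * np.pi * x)            # hypothetical true signal
        x_tr, x_te = rng.uniform(0, 1, 20), rng.uniform(0, 1, 20)
        y_tr = truth(x_tr) + rng.normal(scale=0.3, size=20)
        y_te = truth(x_te) + rng.normal(scale=0.3, size=20)

        for deg in (1, 3, 9, 15):                          # increasing flexibility
            # numpy may warn that the highest-degree fits are poorly conditioned;
            # that is rather the point.
            coefs = np.polyfit(x_tr, y_tr, deg)
            mse = lambda x, y: np.mean((y - np.polyval(coefs, x)) ** 2)
            print(f"degree {deg:2d}  train MSE {mse(x_tr, y_tr):7.3f}  test MSE {mse(x_te, y_te):10.3f}")
        # Training MSE shrinks with every added degree; test MSE typically blows up.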

  10. Doug M

    Stephen Pickering,

    4 observations! I can find a curve that will fit any 4 observations.

    At first, I thought this was a problem of throwing enough shytt at the wall until something sticks: if I have a dozen data sets, I should be able to find a “significant” fit between 2 of them.

    But it is a little deeper than that. People want to see patterns. I was driving to work, and the thought occurred to me that when I am feeling optimistic about my Giants, the stock market seems to be up. The market had a great September, and the Giants’ fortunes really turned around in the last week of August.

    I plotted the Dow vs. the Giants distance from 1st place, and calculated the regression statistics.

    Rsquared = 60%
    T-stat = 15
    F-stat = 238

    So, do the Giants drive the market, does the market drive the Giants, or are both subject to a 3rd influence, such as Baltic Freight or Global Temperature?

    Go Giants!
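
    Doug M’s Dow-versus-Giants fit is easy to reproduce in spirit. The sketch below is Python with simulated series (no real market or baseball data): regress one random walk on another, completely independent one, and the standard regression statistics come out looking far more impressive than they should.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(10)
        n = 120                                        # e.g. 120 trading days (hypothetical)
        dow_like    = np.cumsum(rng.normal(size=n))    # random walk #1
        giants_like = np.cumsum(rng.normal(size=n))    # random walk #2, independent of #1

        res = stats.linregress(giants_like, dow_like)
        print("R^2    :", round(res.rvalue ** 2, 2))
        print("p-value:", res.pvalue)
        # Independent random walks routinely yield "significant" regressions like this;
        # trending series violate the independence assumptions behind the t- and F-statistics.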

  11. JH

    DAV,

    I can’t speak for you or for men in general; they are all significant in their own way. Well, to be honest, women in swimsuits don’t interest me. I’d like to see a picture of Mother Teresa in a swimsuit, though.

    Have a great weekend.

  12. harleyrider1978

    The Dow market: that game has been manipulated for the last 3 years by the Fed. Bernanke and his lot have dumped 28 trillion into it just to keep the game alive… now how sweet it is when all you have to do is create money/create evidence to make your grand illusion believable to the masses… until the lil dutch boy comes along and sees the leak in the dyke. Does he plug it or go for help?

  13. harleyrider1978

    Factual reality: does the emperor have clothes on? Who dares tell the truth!

  14. Grumbler

    “DAV says:
    15 October 2010 at 9:56 am
    JH,

    one wonders what characteristics distinguish between a significant and an insignificant woman. Perhaps an example would help. Which of the following, for instance, are significant: Twiggy, Mama Cass, Michelle Obama, Bonnie (of Bonnie & Clyde), Paris Hilton, Mother Theresa, Olive Oyl, Betty Boop?”

    It’s only ever a choice between Wilma or Betty [Rubble].

  15. DAV

    Benjamin Kuipers,

    In the medical field, it seems no one really tries to replicate previous results. There’s not much glory in being the second “discoverer”. How much worse is it to be third or more? Even when someone does try, it’s hard to get published. Space is at a premium and editors naturally prefer cutting edge to re-hashes. The first finding of “a good thing” is too often self-perpetuating in practice and will receive its share of citations in other research.

    Holdout and newly found data are without doubt the best bets but even then one needs to consider quality. I remember a book on neural networks [Masters 99, I think] where it was related that a “Clever Hans” effect was discovered in one of the networks created by the author. Of course, this discovery occurred at the most inopportune time — during a demo.

  16. DAV

    Grumbler,

    Pebbles could be in the running if she ever grows up.

  17. sylvain

    I’m presently studying the history of science. Besides that course, I have also read a couple of books on the subject. The main conclusions of the history-of-science field are that:

    1) Science is not as pure as some want to believe.
    2) Every scientist is biased toward their preconceived conclusions.
    3) Debates between scientists are rarely civil and are often vicious.
    4) Scientists are wrong more often than they are right.
    5) The truth does not always prevail, or can take a long time to do so.
    6) What is believed scientific at one moment (example: eugenic theory) might not stand the test of time and may be downgraded in the future (ex: the importance of trace gases in our atmosphere).

  18. GoneWithTheWind

    Don’t forget the infamous data dredge. That is where agendized groups search statistical data for something that seems to fit their agenda while throwing out anything that contradicts it. Often these groups report their “results” with great hoopla to a bowing media that screams the bias in a headline, with little or nothing in the body of the story to let the reader know what has just happened to them. The biased results are ballyhooed in every newspaper and every TV news show and become “fact” in everyone’s mind.
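
    The dredge is easy to simulate in Python; everything below is generated noise. Test enough unrelated associations, keep only the ones that clear p < 0.05, and report those:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(18)
        n_subjects, n_hypotheses = 50, 100
        outcome   = rng.normal(size=n_subjects)                    # hypothetical outcome
        exposures = rng.normal(size=(n_hypotheses, n_subjects))    # 100 unrelated "exposures"

        hits = [j for j in range(n_hypotheses)
                if stats.pearsonr(exposures[j], outcome)[1] < 0.05]
        print(f"'significant' associations found: {len(hits)} of {n_hypotheses}")
        # Around 5 of 100 pure-noise exposures clear the threshold by chance alone;
        # the dredge reports the hits and quietly discards the rest.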

  19. Benjamin Kuipers

    DAV: Even if a replication won’t be published, I want to check a method I’m going to build on and bet my own career on. If it fails the test, I look more carefully, and that analysis might be worth publishing. Or might result in blowing the whistle, if need be. In the neural net case you cite, note that the author describes how he caught his own error! No fraud there.

    sylvain: I seriously doubt that you, or anyone else, have the data to back up the claims about relative frequencies of events in your statements 2, 3, and 4. IMHO, historians of science are far more likely than working scientists to select the cases they describe because they want to make a splashy case.

  20. Teflon93

    I recently ran into a situation at work where we were examining some metrics associated with risk management. One useful question we’d like to be able to answer is, “How much risk do we have of X occurring?” over a time horizon. X might be noncompliance with a regulatory requirement in a given area, for example.

    We built a simple model for a metric of interest for use in root cause analysis. As part of the effort, we thought it might be useful to predict the new value post-fix. Would addressing these causes improve compliance over the next year?

    It turned out one of the executives ran away in horror from the very thought. “We’re not predicting anything!”

    Which makes me wonder how risk management works, you know?

  21. Dr Briggs: I enjoy your BLOG with its careful evaluations and witticisms. It is the first BLOG I click on after WUWT. This topic of why statistics can mislead people, and can be used to mislead people, is timely for the many examples you gave, i.e. global warming. I remember an example submitted to me by one of my students who got carried away using uncertainty limits in a statistical study of a set of steel tensile-strength measurements. As often happens when a young student evaluates experimental results using computer statistical software, they come up with nonsensical results. My student had carried out an uncertainty prediction to 9 sigma. I don’t know why he chose that range for the error band. He concluded that the steel was inferior because at 9 sigma the lower-bound value of the strength was so low that the bar would break under its own weight. I asked how many tests it would take to prove that result. I never got an answer. Conclusions about value and quality and the merits of statistical quality control have invited similar errors that fly in the face of reality.
    I think one reason some scientists rely on the beauty of fit of a correlation is that the promise of additional research funding outweighs the quest for truth in the beauty of the fit: “If it fits you must acquit.” I wonder what the effect on future research grants would be if, after making a correlation, the scientist had to prove that their correlation predicted the results of a new set of experimental data before they could apply for a renewal.
    I am elated that you singled out physics as a field usually interested in predictive results. While I was in attendance at Case Institute I had a chance to see the laboratory of the famous Michelson-Morley experiment, which set out to prove that light traveled in a medium, the aether. Of course they obtained a negative result after a long time and a lot of very hard work. Scientists were still striving to disprove the constancy of the speed of light clear into the late 1920’s, starting from that null result in 1887. I’ll bet someone could find a statistical model that would prove them correct by fitting their data.
    Fifty years ago, when I was employed in the nuclear energy business, we used a technique which I was told was a very reliable way of estimating the consequences of a risk, called the Adelphi Risk Method, developed at UCLA. In this method experts are consulted for their opinion about the probability that an unplanned event can occur, e.g. a stuck nuclear fuel rod. Based on the consultants’ opinions, an average probability is determined and used to estimate risk. Have you ever encountered such a methodology? I have long since lost any references to the method. I think UCLA gave it to Jane Fonda.
    Your students do not know how lucky (probability = 1) they are to have you as a teacher of statistical methods and critical reasoning about what it means.

  22. John Galt

    Good morning, Mr. Briggs,

    Good post. I work in aerospace. We’re usually pretty conservative; reality seems to rear its ugly head on a regular basis. Our problem is usually the lack of enough data. I remember once we had to determine the mechanical alignment error for a missile installed on an aircraft. This was fairly critical: the missile is launched blind and data-linked the location of the target for most of its flight. Only near the end of its flight does it acquire the target with its own RADAR. Any alignment error is then multiplied by the length of the blind flight. We were budgeted to measure the alignment on four aircraft. Not much of a sample size. Even then, we messed it up. On later review, I discovered that the systematic errors were not isolated from the random errors. We had used the total variance in computing three sigma. It turned out that half the total error was a repeatable tooling error, not random. It fooled the analysis engineer because it didn’t appear as pitch, roll and yaw; it was left/right symmetrical. It was obvious if you looked at the data as caster, camber and toe-out. It turned out that the larger error was still acceptable; the missile would still find the target.
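
    A sketch of the mistake described above, in Python with invented numbers (nothing here comes from any real alignment program): a repeatable left/right tooling offset lumped in with the random scatter inflates the three-sigma figure, while separating it out gives a smaller random term plus a correctable bias.

        import numpy as np

        rng = np.random.default_rng(22)
        side     = np.array([+1, -1, +1, -1])          # left/right stations on 4 aircraft
        offset   = 1.0 * side                          # repeatable tooling error (mrad, invented)
        noise    = rng.normal(scale=0.4, size=4)       # genuinely random alignment error
        measured = offset + noise

        # Wrong: treat the whole spread as random scatter.
        print("3-sigma, everything as random:", round(3 * np.std(measured, ddof=1), 2), "mrad")

        # Better: estimate the repeatable per-side offset, then size the random part.
        left  = measured[side == +1].mean()
        right = measured[side == -1].mean()
        residual = measured - np.where(side == +1, left, right)
        print("3-sigma, random part only    :", round(3 * np.std(residual, ddof=2), 2), "mrad")
        print("repeatable offsets to budget separately:", round(left, 2), round(right, 2), "mrad")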

  23. This is very interesting, as we may well be seeing a transition of climate science into being like medical science as far as the importance of statistics (the discipline) is concerned. Statistics is important for medical research to help protect us, and researchers, from assertions insufficiently supported by test and experiment. These assertions will be commonplace since we all have an interest in our health, may even have our own theories about it, and are willing to spend money to do things about it. We all have an interest in the climate, may even have our own theories about it, and are, it would seem, at least willing to be taxed and otherwise interfered with in order to ‘do things about it’.

    But where are the agencies and protocols and other statistically-sophisticated constraints on what can be claimed and sold as good for us with regard to climate? At one time, here in the UK, one might have turned to the Met Office for an impartial view, or even have commissioned it (if ‘one’ were the government) to give things a good review ahead of any major expenditure. But not now. The governor of the UK Met Office moved there after having been in large part responsible for leading the WWF away from the birds and the beasts and into climate campaigning – an activity full of injunctions and ‘things to do about it’ (http://www.englishpartnerships.co.uk/robertnapier.htm). One might have turned to Imperial College, but only to find there the Grantham Institute, created by an American financier keen to ‘do things about it’, and whose London institute is the current resting place of PR specialist for CO2-alarmism, one Bob Ward (previously holding the same sort of remit for the Royal Society) (http://bishophill.squarespace.com/blog/2010/9/29/pielke-jnr-on-bob.html). So where can one turn?

    It seems likely that assertions about climate, and the perverse unpredictability of near-term (next few hundred years, say) climate, will both continue to be with us for some time to come. Surely we, the poor suffering taxpaying public, need some kind of FDA to protect us from the likes of the IPCC? A building, perhaps to be named the ‘McIntyre Building’, containing many statisticians dedicated to reviewing with the skill and persistence of the potentially eponymous McIntyre, the claims and the testability of the assertions of those ostensibly troubled by the presence of more CO2 in the atmosphere.

  24. bill r

    I’ve always liked Ehrenberg’s take on law-like relationships:

    A result can be regarded as routinely predictable when it has recurred consistently under a known range of different conditions. This depends on the previous analysis of many sets of data, drawn from different populations. There is no such basis of extensive experience when a prediction is derived from the analysis of only a single set of data. Yet that is what is mainly discussed in our statistical texts….

    from Ehrenberg and Pounds, JRSS-A, 1993. A freely distributable copy of his book Data Reduction is available here.
