No, the title of today’s post is not a joke, even though it has often been used that way in the past. The title was inspired by yesterday’s Wall Street Journal article “Analytical Trend Troubles Scientists.”
Thanks to the astonishing fecundity of the p-value and our ridiculous practice of reporting on the parameters of models as if those parameters represented reality, we have stories like this:
In 2010, two research teams separately analyzed data from the same U.K. patient database to see if widely prescribed osteoporosis drugs [such as Fosamax] increased the risk of esophageal cancer. They came to surprisingly different conclusions.
One study, published in the Journal of the American Medical Association, found no increase in patients’ cancer risk. The second study, which ran three weeks later in the British Medical Journal, found the risk for developing cancer to be low, but doubled.
How could this be!
Each analysis applied a different methodology and neither was based on original, proprietary data. Instead, both were so-called observational studies, in which scientists often use fast computers, statistical software and large medical data sets to analyze information collected previously by others. From there, they look for correlations, such as whether a drug may trigger a worrisome side effect.
And, surprise, both found “significance.” Meaning publishable p-values below the magic number, which is the unquestioned and unquestionable 0.05. But let’s not cast aspersions on frequentist practices alone, as problematic as these are. The real problem is that the Love Of Theory Is The Root Of All Evil.
Yes, researchers love their statistical models too well. They cannot help thinking reality is their models. There is scarcely a researcher or statistician alive who does not hold up the parameters from his model and say, to himself and us, “These show my hypothesis is true. The certainty I have in these equals the certainty I have in reality.” Before I explain, what do other people say?
The WSJ suggests that statistics can prove opposite results simultaneously when models are used on observational studies. This is so. But it is also true that statistics can prove a hypothesis true and false with a “randomized” controlled trial, the kind of experiment we repeatedly hear is the “gold standard” of science. Randomization is a red herring: what really counts is control (see this, this, and this).
Concept 1
There are three concepts here that, while known, are little appreciated. The first is that there is nothing in the world wrong with the statistical analysis of observational data (except that different groups can use different models and come to different conclusions, as above; but this is a fixable problem). It is just that the analysis is relevant only to new data that is exactly like that taken before. This follows from the truth that all probability, hence all probability models (i.e. statistics), is conditional. The results from an observational study are statements of uncertainty conditional on the nature of the sample data used.
Suppose the database is one of human characteristics. Each of the human beings in the study has traits that are measured and a near infinite number of traits which are not measured. The collection of people which makes up the study is thus characterized by both the measured traits and the unmeasured ones (which include time and place etc.; see this). Whatever conclusions you make are thus only relevant to this distribution of characteristics, and only relevant to new populations which share—exactly—this distribution of characteristics.
And what is the chance, given what we know of human behavior, that new populations will match—exactly—this distribution of characteristics? Low, baby. Which is why observational studies of humans are so miserable. But it is why, say, observational astronomical studies are so fruitful. The data taken incidentally about hard physical objects, like distant cosmological ones, is very likely to be like future data. This means that the same statistical procedures will seem to work well on some kinds of data but be utter failures on others.
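To make Concept 1 concrete, here is a minimal simulation (a sketch in Python, with every number invented): an unmeasured trait drives both drug-taking and the outcome, so the risk ratio estimated from one sample does not carry over to a new population in which that trait is distributed differently.

```python
# A sketch of Concept 1 (all numbers invented): the outcome depends only on an
# unmeasured trait, which also makes people likelier to take the drug.
import numpy as np

rng = np.random.default_rng(1)

def draw_population(n, p_hidden):
    """Draw drug exposure, an unmeasured trait, and an outcome for n people."""
    hidden = rng.random(n) < p_hidden                       # the unmeasured trait
    drug = rng.random(n) < np.where(hidden, 0.7, 0.3)       # trait drives exposure
    outcome = rng.random(n) < np.where(hidden, 0.20, 0.05)  # trait drives outcome; drug does nothing
    return drug, outcome

def risk_ratio(drug, outcome):
    """Outcome rate among drug takers divided by the rate among everyone else."""
    return outcome[drug].mean() / outcome[~drug].mean()

# "Study" sample: the trait is split 50/50, and the drug looks plainly risky.
print(risk_ratio(*draw_population(50_000, p_hidden=0.50)))   # apparent risk ratio well above 1

# New population: same mechanism, but the trait is nearly universal, and the
# apparent "drug effect" all but vanishes.
print(risk_ratio(*draw_population(50_000, p_hidden=0.95)))   # apparent risk ratio close to 1
```

Nothing in the analysis is broken in either run; its conclusions are simply conditional on a distribution of characteristics the new population does not share.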
Concept 2
Our second concept follows directly from the first. Even if an experiment with human beings can be controlled, it cannot be controlled exactly or precisely. There will be too many circumstances or characteristics which will remain unknown to the researcher, or the known ones will not be subject to control. However well you design an experiment with human beings, it will never be good enough to make your conclusions relevant to new people, because again those new people will be unlike the old ones in some ways. And I mean, above and here, in ways that are probative of or relevant to the outcome, whatever that happens to be. This explains what a sociologist once said of his field, that everything is correlated with everything.
Concept 3
If you follow textbook statistics, Bayesian or frequentist, your results will be statements about your certainty in the parameters of the model you use and not about reality itself. Click on the Start Here tab and look to the articles on statistics to read about this more fully (and see this especially). And because you have a free choice in models, you can always find one which lets you be as certain about those parameters as you’d like.
But that does not mean, and it is not true, that the certainty you have in those parameters translates into the certainty you should have about reality. The certainty you have in reality must always necessarily be less, and in most cases a lot less.
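A toy illustration of the gap (a sketch with made-up measurements): as the sample grows, the interval for the parameter (the mean) shrinks toward nothing, while the interval for a new observation, which is what reality actually hands you, stays about as wide as the data themselves.

```python
# A sketch of Concept 3 with made-up measurements: certainty about a parameter
# is not certainty about a new observation.
import numpy as np

rng = np.random.default_rng(2)

for n in (20, 200, 20_000):
    x = rng.normal(loc=10.0, scale=3.0, size=n)    # invented measurements
    m, s = x.mean(), x.std(ddof=1)

    # Approximate 95% interval for the parameter (the mean): shrinks with n.
    param_halfwidth = 1.96 * s / np.sqrt(n)

    # Approximate 95% interval for a brand-new observation: barely budges.
    new_obs_halfwidth = 1.96 * s * np.sqrt(1 + 1.0 / n)

    print(f"n={n:6d}  mean: {m:.2f} ± {param_halfwidth:.2f}   "
          f"new observation: {m:.2f} ± {new_obs_halfwidth:.2f}")
```

The first interval can be driven as small as you like by collecting more data; the second cannot, and the second is the one that speaks about reality.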
The only way to tell whether the model you used is any good is to apply it to new data (i.e. never seen by you before). If it predicts that new data well, then you are allowed to be confident about reality. If it does not predict well, or you do not bother to collect statistics about predictions (which is 99.99% of all studies outside physics, chemistry, and the other hardest of hard sciences), then you are not allowed to be confident.
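Here is a small sketch (made-up data, plain least-squares polynomial fits) of why the new-data check is the one that counts: a flexible model can look nearly perfect on the data used to build it and still predict fresh data badly.

```python
# A sketch with invented data: in-sample fit versus performance on data the
# model has never seen.
import numpy as np

rng = np.random.default_rng(3)

def noisy_line(n):
    """The truth is a simple line plus noise."""
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.5, n)

x_old, y_old = noisy_line(20)   # data the researcher has
x_new, y_new = noisy_line(20)   # data the researcher will see later

for degree in (1, 10):
    coefs = np.polyfit(x_old, y_old, degree)                 # fit on old data only
    in_err = np.mean((np.polyval(coefs, x_old) - y_old) ** 2)
    out_err = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: in-sample MSE {in_err:.3f}, new-data MSE {out_err:.3f}")
```

The degree-10 fit hugs the old data and stumbles on the new, which is exactly the check most published studies never perform.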
Why don’t people take this attitude? It’s too costly and time consuming to do statistics the right way. Just look at how long it takes and how expensive it is to run any physics experiment (about genuinely unknown areas)! If all scientists did their work as physicists must do theirs, then we would see about a 99 percent drop in papers published. Sociology would slow to a crawl. Tenure decisions would be held in semi-permanent abeyance. Grants would taper to a trickle. Assistant Deans, whose livelihoods depend on overhead, would have their jobs at risk. It would be pandemonium. Brrr. The whole thing is too painful to consider.
Amen, brother. Say it!
Frequentist or no, I have always been keenly aware that the results of any sample apply only to the population from which the sample was taken; that is, only to new data whose parent distribution is the same.
Most of my career was spent dealing with something halfway between the hard sciences and the social sciences: namely, manufacturing processes. There, our objectives quite often were a) to maintain the process in a stable distribution with the mean on target and the variation comfortably within specs; and b) to get a signal when this was no longer true. Even so, there were situations, like those dealing with raw materials, where a constant mean was unrealistic and unattainable in practice.
I wouldn’t blame the poor little p-value for the abuse it suffers at the hands of amateurs. Its conditionality should be quite clear: IF the mean and variance of the population have remained such and such, THEN the probability of obtaining this sample result is p. But I have seen even physicists and the like take it to mean that “the probability that the hypothesis is true is p.” Eek.
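The commenter’s IF-THEN reading is easy to spell out in code (a sketch, every number invented): the p-value is a probability computed under the assumption that the process parameters are still what they were, not the probability that they are.

```python
# A sketch of the conditional reading (all numbers invented): IF the process
# mean and standard deviation are still mu0 and sigma0, THEN how probable is a
# sample mean at least this far from mu0?
import math
import numpy as np

mu0, sigma0 = 50.0, 2.0   # the conditioned-on ("such and such") parameters
sample = np.array([50.8, 51.2, 49.9, 52.1, 50.5, 51.7, 50.9, 51.3])

z = (sample.mean() - mu0) / (sigma0 / math.sqrt(len(sample)))
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability

print(f"z = {z:.2f}, p = {p:.3f}")
# This p is NOT the probability that the mean has stayed at mu0.
```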
Briggs – Like many/most people I have struggled with statistics for years. I always wondered about p-values, and your Bayesian discussions make hypothetical sense to me when I get into a theoretical statistics-studying mode.
However, I still struggle – and I can see why people continue to use p-values – they just pick them off the shelf as statistical fast-food.
It would be useful if you could provide a simple scenario for some kind of (fairly realistic) “scientific” study where a Bayesian approach is used, with a direct comparison with the perhaps erroneous p-value version.
As a statistical idiot, I would like to use the correct techniques, but I need them in a “take-away” format!
I love the little p-value. It, along with confidence intervals, allows anyone to make their graphs exciting. A single line on its own is sleep inducing, but add some nice curves of a t distribution, a little < 0.05, maybe an * or two, and you’ve got yourself some sex appeal. Even a flat autocorrelation plot starts to look pretty hot once the CI has some sexy Gaussian displacement.
All kidding aside: Dr. Briggs, if folks actually tested their predictions then you wouldn’t have anything left to write about.
According to the National Cancer Institute the cause of cancer is unknown. Since nobody knows what causes cancer, how could the researchers determine osteoporosis drugs increase the risk of cancer?
Read here. http://training.seer.cancer.gov/disease/war/
“Our slogan is: end the slavery of reification! (I’ll speak more on this another day.)”
I’m eagerly waiting for Dr. Briggs to expound on this topic.
“fast computers”: oh for heaven’s sake.
Do you have the primary references for the two studies? This would make a wonderfully thought-provoking reading or analysis assignment for an undergrad statistics course.
“The only way to tell whether the model you used is any good is to apply it to new data (i.e. never seen by you before)”
I learned that the hard way. Years ago, when I was young and single, I read Andrew Beyer’s book on handicapping horse races, and figured I could pick up a little extra spending cash, like Andrew Beyer did in “My $50,000 Year at the Track”.
I pored through Daily Racing Forms, finding significant statistical predictors almost guaranteed to make me money. My actual result was a loss of about 9% on my wagers, less than the track take of 15% but no better than blindly betting on favorites would have done.
I realized after the fact that, in looking for statistical predictors (recency of last X races, times of races run, recency of workouts, weight carried, average purse won, etc.), I was BOUND to get some results with p-values less than 5% just based on the number of variables I was looking at. The only way to verify that my results were significant, rather than merely the result of data mining, was to check the predictions on a series of future races before I wagered any more actual money.
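The data-mining trap described above is easy to simulate (a sketch, every quantity invented): screen enough unrelated handicapping factors against pure-noise results and the chance that at least one clears p < 0.05 is large.

```python
# A sketch of the multiple-comparisons trap (all quantities invented): with 20
# useless predictors, a spurious p < 0.05 turns up most of the time.
import math
import numpy as np

rng = np.random.default_rng(4)
n_races, n_factors, n_trials = 100, 20, 2_000
hits = 0

for _ in range(n_trials):
    results = rng.normal(size=n_races)                  # pure-noise race results
    factors = rng.normal(size=(n_factors, n_races))     # pure-noise handicapping factors
    for f in factors:
        r = np.corrcoef(f, results)[0, 1]
        t = r * math.sqrt((n_races - 2) / (1 - r * r))  # test statistic for the correlation
        p = math.erfc(abs(t) / math.sqrt(2))            # normal approximation; n is large
        if p < 0.05:
            hits += 1
            break                                       # at least one "discovery" this trial

print(f"Trials with at least one spurious p < 0.05: {hits / n_trials:.0%}")
```

With 20 independent looks at noise, the chance of at least one “significant” factor is about 1 - 0.95^20, roughly two chances in three, and that is about what the simulation reports.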