# What’s Wrong With Hypothesis Testing: Reader Question

I received this email from long-time reader Ivin Rhyne, which is so well put that I thought we should all see it (I’m quoting with permission):

Matt,

I just got back from a conference on historical economics and was absolutely bowled over by the repeated usage of t-tests and p-values as the arbiter of whether an hypothesis is false or not. Allowing for the subtleties of “reject” vs. “unable to reject” my question is more numerical.

My personal understanding of how to use regression analysis to “fit” a model to the data and then test it is as follows:

1. Form a hypothesis

2. Gather some data to test your hypothesis

3. Translate your hypothesis into a form that is mathematically testable (let’s assume OLS regression is a good mathematical expression of your original hypothesis)

4. Using part of your data, calibrate the OLS by running it to get some numerical parameters that then become an intrinsic part of your hypothesis

5. Using the rest of your data (the part you DIDN’T use to calibrate the model) you actually insert the data points for the independent variables and then compare how closely the predicted values of the dependent variable match the actual values.

ALL of the papers presented at the conference stopped at step 4. Their test of the hypothesis was simply whether the model could “calibrate” to the data in a way that generated coefficients that had “acceptable” p-values.
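Steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration: the data are simulated, and the coefficients, sample size, even split, and the naive benchmark are all assumptions that do not come from any of the conference papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): y is linear in x plus noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

# Step 4: calibrate the OLS model on the first half only.
half = n // 2
X_cal = np.column_stack([np.ones(half), x[:half]])
beta, *_ = np.linalg.lstsq(X_cal, y[:half], rcond=None)

# Step 5: predict the held-out half and compare with what actually happened.
X_new = np.column_stack([np.ones(n - half), x[half:]])
pred = X_new @ beta
rmse_model = np.sqrt(np.mean((y[half:] - pred) ** 2))

# Naive benchmark: always predict the calibration-sample mean of y.
rmse_naive = np.sqrt(np.mean((y[half:] - y[:half].mean()) ** 2))

print(f"intercept={beta[0]:.2f}, slope={beta[1]:.2f}")
print(f"held-out RMSE: model={rmse_model:.2f}, naive={rmse_naive:.2f}")
```

The comparison against a naive benchmark is one possible choice for Step 5, not the only one. The point is simply that the judgment rests on data the calibration never saw, rather than on the p-values of the fitted coefficients.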

My questions are as follows:

1. Have I missed something and in fact theirs is the correct approach to hypothesis testing of social science data?

2. If I am right (in principle) about how to test hypotheses, can you point me toward (or perhaps even better lay out in your blog) what kind of test is appropriate for step 5 described above for an OLS regression?

As always, I appreciate your insights.

Ivin

This nails it. I have rarely seen a sociological or other “soft science” paper venture beyond Step 4. A few make a stab at Step 5, but usually in such a way as to dissolve the force of this Step.

It’s cheating, really, and done by formulating several models, usually the same underlying OLS but with different sets of “regressors” for each model, and then each is tested (crudely) via a Step 5. The one that’s best, or the one that is best within the subset matching an author’s desires, is the one that makes its way to print.

I hope you can see that doing this is just the same as skipping Step 5. Or it’s equivalent with Steps 1 – 4, but with a “meta” model. Regardless, it is using the data you have in hand to massage a model into a shape that is lovelier to the eye. It is *not* an independent test of your model’s goodness.
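A small simulation shows why picking the winner of several Step 5 contests is not an independent test. Everything in it is an assumption made for illustration: the outcome is pure noise, each candidate “model” is an OLS fit on a different irrelevant regressor, and the sample sizes and number of candidates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_hold, n_fresh, n_models = 50, 50, 5000, 200

# Assumption: y is pure noise, so no regressor has any real predictive value.
y_cal, y_hold, y_fresh = (rng.normal(size=m) for m in (n_cal, n_hold, n_fresh))

def fit(x, y):
    """OLS intercept and slope of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(beta, x, y):
    return np.mean((y - (beta[0] + beta[1] * x)) ** 2)

best = None
for _ in range(n_models):
    # Each candidate model uses a different, equally irrelevant regressor.
    x_cal, x_hold, x_fresh = (rng.normal(size=m) for m in (n_cal, n_hold, n_fresh))
    beta = fit(x_cal, y_cal)
    score = mse(beta, x_hold, y_hold)          # the "Step 5" score
    if best is None or score < best[0]:
        best = (score, mse(beta, x_fresh, y_fresh))

# Benchmark: ignore every regressor and predict the calibration mean.
base_hold = np.mean((y_hold - y_cal.mean()) ** 2)
base_fresh = np.mean((y_fresh - y_cal.mean()) ** 2)

print(f"winner vs benchmark on the held-out data: {best[0] / base_hold:.3f}")
print(f"winner vs benchmark on genuinely new data: {best[1] / base_fresh:.3f}")
```

A ratio below 1 means the winning model beat the benchmark. Because the winner was chosen for its held-out score, that score flatters it; the same model has no such advantage on data that played no part in the selection.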

Ideally, there is no Step 5—all data you have in hand should be used to construct your model—but there should be a Step 6, which is “Wait until new data comes in and test the model *predictively*.” All physical sciences do Step 6—with the exception, perhaps, of climatology, where it’s the “seriousness of the charges” that counts.

Passing a Step 6 does not, of course, guarantee the truth of a model. Just look at the Ptolemaic systems of cycles, epicycles, semi-cycles, and so on ad infinitum. Wrong as can be, but still useful. The model passed Step 6 for centuries, which is one of the reasons few thought to question its truth. Don’t mess with what works!

From this history we learn that passing Step 6 is a necessary but not sufficient condition in ascertaining a model’s truthfulness. Spitting out a p-value (Steps 1 – 4) that is less than the magic number is not even a necessary condition; and anyway, the p-value was *purposely designed not* to say anything about a model’s truthfulness.

We must remember that, for any set of evidence (data), any number of models can be made to explain that data; that is, you can always find models which fit that data. Simply touting fit—as in Steps 1 – 4, and the p-value’s main job—is thus very weak evidence for a model’s truth.
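To make the weakness of fit concrete, here is a hedged sketch with two rival models of the same eight data points. The data generating process, the noise level, the polynomial degrees, and the location of the new observation are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated evidence (assumption): eight noisy points from a straight line.
x = np.linspace(0, 1, 8)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=x.size)

# Two rival models of the same data.
line = np.polynomial.Polynomial.fit(x, y, deg=1)
wiggle = np.polynomial.Polynomial.fit(x, y, deg=7)  # enough terms to hit every point

# Worst in-sample miss of each model.
fit_line = np.max(np.abs(line(x) - y))
fit_wiggle = np.max(np.abs(wiggle(x) - y))

# A new observation arrives just outside the old range.
x_new = 1.25
y_new = 1.0 + 2.0 * x_new  # noise-free value under the assumed process
err_line = abs(line(x_new) - y_new)
err_wiggle = abs(wiggle(x_new) - y_new)

print(f"worst in-sample miss: line={fit_line:.3f}, degree-7={fit_wiggle:.2e}")
print(f"miss on the new point: line={err_line:.3f}, degree-7={err_wiggle:.3f}")
```

The degree-7 polynomial reproduces the data in hand essentially perfectly, so on fit alone it wins. Only the new observation separates the two models.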

Why aren’t more “Step 6”s being done in statistics? It’s not that it’s difficult computationally, but it is expensive and time consuming. It’s expensive because it costs money to collect data. And it’s time consuming because you have to wait, wait, wait for that new data. And while you’re waiting, you’re wasting opportunities for “proving” new theories.

Much more to this, of course. For example, why do some models work even when people flub the steps? Because models are chosen with reference to external probative information. We’re obviously just at the beginning of a discussion.

Are weather models the only ones we treat with skepticism or are they the only ones where “step 6” is performed routinely? Every day, in fact.

Answering my own question … models for roulette, craps, poker, economics and stock markets also undergo routine “step 6” testing and are found wanting.

I think I know how to evaluate a model’s usefulness using certain criteria, but how does one evaluate its truthfulness?

Perhaps the problem is that we have to WAIT. What can we do while waiting? (I recommend taking a nap.) How much data are we supposed to wait for? How to measure and test a model’s predictability in the moment of decision? We can only use the information/data on hand.

As George W. Bush says,

“Perceptions are shaped by the clarity of hindsight. In the moment of decision, you don’t have that advantage.” (Goodness, I am quoting Bush!)

One may want to use predictive data mining techniques, but still, we can use only currently available information. I am more interested in how I can make the best use of existing data, under what criteria it means “best” and how to measure the uncertainty or accuracy of a prediction, again, based on observed data (only because I don’t know how to use unobserved data).

More importantly, before employing any statistical tools, what are the key explanatory variables?

“Just look at the Ptolemaic systems of cycles, epicycles, semi-cycles, and so on ad infinitum.”

Now just a minute Dr. Briggs. There was a trial and the judges declared the earth was the center of the universe. Are you claiming those well educated lawyers didn’t know what they were talking about? Next you’ll be claiming the phlogiston theory is wrong.

Matt,

Thanks for the post and answer to my question. Your answer regarding a step 6 is insightful. But perhaps I failed to explain what I was trying to capture with step 5. The purpose of having a “step 5” was to test immediately whether the formulation under 1-4 had any “statistical” validity. By this I mean that the relationships between variables described in steps 1-4 and estimated explicitly in step 4 actually continue to be a valid description of the world using information that wasn’t included in the calibration step. This is essentially what Step 6 does in your construction and I wholeheartedly agree with your philosophical approach.

I also want to say that your example of the Ptolemaic model is apropos because it was a “model” that could accurately predict behaviors of natural systems but that was eventually shown to be based on completely false pretenses. The “value” of the Ptolemaic system wasn’t whether it was objectively “true” but in the fact that it could reliably predict what people saw in the sky each night. Those of us who build these models (be they economic models or models of biological diversity) must not claim the mantle of “truth” even when repeated tests of the original model’s predictions continue to be accurate.

Finally a thought that always troubles me – in an Ordinary Least Squares (OLS) regression, there is one overriding assumption that nearly always gets overlooked and makes steps 5 (and 6) a requirement; OLS assumes a stable, linear relationship between variables. It is ALWAYS possible that the data generating process is neither stable nor linear, but that over the range of data available there is a spurious result that produces “acceptable” p-values for the regression coefficients. The only way to test to see if this stable, linear relationship is present is to conduct tests using data that is outside the calibration data set.
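The worry about a relationship that only looks linear over the available range can be illustrated with a short simulation. The sine-shaped data generating process, the noise level, and the two ranges below are assumptions picked purely to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed data generating process (illustrative only): y = sin(x) + noise.
def draw(lo, hi, n):
    x = rng.uniform(lo, hi, size=n)
    return x, np.sin(x) + rng.normal(0, 0.1, size=n)

# Calibration data cover only a range where sin(x) looks nearly linear.
x_cal, y_cal = draw(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x_cal), x_cal])
beta, *_ = np.linalg.lstsq(X, y_cal, rcond=None)

# Classical t statistic for the slope.
resid = y_cal - X @ beta
s2 = resid @ resid / (len(y_cal) - 2)
cov = s2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov[1, 1])

def rmse(x, y):
    return np.sqrt(np.mean((y - (beta[0] + beta[1] * x)) ** 2))

x_in, y_in = draw(0.0, 1.0, 100)    # new data, same range as calibration
x_out, y_out = draw(3.0, 6.0, 100)  # new data, outside the calibration range

print(f"slope t statistic on calibration data: {t_slope:.1f}")
print(f"RMSE on new data inside the range:  {rmse(x_in, y_in):.2f}")
print(f"RMSE on new data outside the range: {rmse(x_out, y_out):.2f}")
```

Nothing in the calibration output warns that the linear form is wrong. Only predictions for data outside the calibration set reveal it, and even then only if the new data stray beyond the range where the straight line happens to work.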

As always, your help and comments are appreciated.

Ivin

One of the great revelations to me from the skeptics with numerical understanding was that noise could have a long period. My contests with noise had always been to get it out of my audio equipment and I’d never appreciated that high-frequency wasn’t the only possibility.

Somewhat off topic:

“Ottawa journalist Dan Gardner’s new book examines the folly of putting too much trust in people who make predictions. The book is called Future Babble: Why Expert Predictions Fail and Why We Believe Them Anyway.”

http://www.cbc.ca/video/news/audioplayer.html?clipid=1691502507

The book relies on the research of Philip E. Tetlock.

http://press.princeton.edu/titles/7959.html

The bottom line: People should run away screaming from the predictions of people like Al Gore.

Here’s a link to a story where Dan Gardner shreds David Suzuki: http://www.dangardner.ca/index.php/articles/item/49-so-is-the-world-predictable-or-not?-the-environmentalists-contradiction

So the Ptolemaic model was wrong and yet made predictions useful enough to be relied on for centuries. In what sense, then, was it wrong? If it’s formulated as, “the planets behave as if they follow epicycles around the Sun” and the resulting output corresponds closely to the actual behaviour of the planets then it’s not wrong. From a certain point of view – “geocentric” – everything does go round the Earth.

In fact, modern astronomy’s model of the solar system is that the planets behave as if acted on by a centrally directed force proportional to their mass and the mass of the Sun and inversely proportional to the square of their distance from it. But general relativity tells us that that isn’t ‘true’ either, however useful.

Can we ever discover ‘truth’ through science and statistics or do we just get useful models? Is there such a thing as ‘proof’ or only persuasion?

Seems it’s not just me: http://www.bigquestionsonline.com/columns/michael-shermer/stephen-hawking%E2%80%99s-radical-philosophy-of-science

I firmly believe that the problem described — overreliance & belief in p- & t-tests is due to superficial training in statistics. We’re (us non-statisticians are) taught regression &, maybe, a few other curve-fitting/modeling approaches, some basic validity tests & are sent on our merry way with a cursory appreciation. What’s clearly needed is a greater mandatory depth in statistics course graduation requirements.

We’ve all seen parallels in other disciplines as well; most noteworthy is the art & science of finance. I recall having a chat, at an informal gathering of profs & students after a semester ended, with a couple of finance profs from George Wash. U. around 2000 sometime. It stands out because one [young] finance prof remarked at length & in frustration that if anybody did just a basic analysis of the sales required to support the stock valuations of any of the then-popular dot.com firms one would easily find that in most cases such sales would be impossible to achieve by wide margins, or at least not for many years. Hence the values were extremely inflated, all on emotion. Another finance prof was expounding on his emphasis on the use of off-balance-sheet financing tools (e.g. leasing, etc.), which just happened to be the very sort of issues that were found to underlie Enron’s shenanigans.

The point is that many people are ignorant out of ignorance…but many out of the desire to self-delude for a variety of reasons, some selfish some not.

Rich

I very much agree with you. But what a strange article!

It starts by arguing that it is pointless to ask whether a model is real, and that models merely agree to a greater or lesser extent with observation. But it concludes that a good model: “makes detailed predictions about future observations that can disprove or falsify the model if they are not borne out”. Is that not a contradiction?

Surely, “falsification” does not disprove such models, it merely shows that they have limitations. The models are still useful within their limitations. Talk of disproving/falsifying models is redolent of the idea that knowledge is composed of statements that are either true or false – which in the case of science is not so. Partial/approximate knowledge is also knowledge.

Another desirable, but not essential, characteristic for a scientific model/theory would be coherence with existing knowledge structures. Reconciling quantum mechanics with classical physics still causes headaches.

Horseshoes and hand grenades.

I think Rich has nailed the issue on the head. Attempting to use statistics to discover truth is impossible. Statistics is NOT a tool that lets us peer inside the black box that is the “data generating process”. Instead, it helps us to understand what will come out of the box for a given set of inputs. While that insight will certainly help us judge the value of insights gained through other means, statistics seems wholly inadequate to the task of revealing objective truth.

Can I make a suggestion on how to check step 5? Please shoot me down (I’m sure people will but how else do I learn?).

You carry out an OLS assessment on half the data – this gives you a best estimate value for the slope and intercept.

A confidence ellipse can be calculated using standard methods. Plot the ellipse. This ellipse contains all the values for slope and intercept that would be acceptable at (say) 95% confidence based on the first half of the data.

Carry out the OLS on the other half of the data and plot the best fit slope and intercept. Do they fall inside the ellipse? If so, the two data sets are consistent in OLS analysis.

Does this work?
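The proposed check can be written down as follows. This is a sketch under stated assumptions: the data are simulated from a single stable line, the ellipse is the usual joint confidence region for OLS intercept and slope based on the F distribution, and the critical value uses the closed form available when the numerator degrees of freedom equal 2.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (assumption): one stable linear relationship in both halves.
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, size=n)

def ols(x, y):
    """Return coefficients, X'X, and the residual variance estimate."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    return beta, X.T @ X, s2

half = n // 2
b1, XtX1, s2_1 = ols(x[:half], y[:half])
b2, _, _ = ols(x[half:], y[half:])

# 95% quantile of F(2, d2); with numerator df = 2 it has a closed form.
d2 = half - 2
f_crit = (d2 / 2) * (0.05 ** (-2 / d2) - 1)

# The second-half estimate lies inside the first-half ellipse when stat <= f_crit.
d = b2 - b1
stat = d @ XtX1 @ d / (2 * s2_1)
inside = bool(stat <= f_crit)

print(f"first half:  intercept={b1[0]:.2f}, slope={b1[1]:.2f}")
print(f"second half: intercept={b2[0]:.2f}, slope={b2[1]:.2f}")
print(f"statistic={stat:.2f}, critical value={f_crit:.2f}, inside ellipse: {inside}")
```

One caveat about the check itself: the second-half estimate has its own sampling error, which the first-half ellipse does not account for. Even a perfectly stable relationship will therefore fall outside the ellipse more often than 5% of the time.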

Dear Mr. Bates,

Why do half and half? There is a procedure known as “bootstrapping” that randomly excludes some of the data, repeatedly in Monte Carlo multiple runs. Ersatz “confidence” intervals can be generated.
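For reference, a minimal version of the bootstrap for an OLS slope might look like this. The simulated data, the number of resamples, and the choice of a percentile interval are assumptions for illustration; other bootstrap variants exist.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (assumption): a linear relationship plus noise.
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, size=n)

def slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Resample (x, y) pairs with replacement and refit each time.
n_boot = 2000
slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    slopes[b] = slope(x[idx], y[idx])

# Percentile interval from the resampled slopes.
lo, hi = np.percentile(slopes, [2.5, 97.5])
slope_hat = slope(x, y)

print(f"full-sample slope: {slope_hat:.3f}")
print(f"bootstrap 95% percentile interval: ({lo:.3f}, {hi:.3f})")
```

Each resample draws n points with replacement, so on average roughly a third of the original observations are left out of any given refit. Every refit still uses only the data already in hand.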

But it all proves nothing. Might as well build your model with all the data, and validate (or invalidate) the model with future predicted data. The past is known. If your model does not fit the past, it is truly a lame duck that will fit the future only by extreme coincidence, akin to religious prophecy.

Some models are functional and worthwhile. For instance, when I cruise timber I use a model, with tree measurements made in the field as inputs, to predict the cut-out volume of the logs as scaled at the mill. There are many uncertainties, such as measurement error, sampling error, how the logger might buck the tree, etc. But in almost all cases, my model estimates are pretty darn close, surprisingly so even to me. I have cruised large tracts and my estimates have been within a few board feet on the final cut-out tally. Really impressed the client, and myself!

There are other types of models that are not built with data — theoretical models that attempt to describe inner workings of the black box. Those have some value in helping to understand the system in the box, but they have little predictive utility, because they are not data-driven. Climate models are a good example.

The Ptolemaic model was also based on the notion of perfect figures, the circle being the most perfect figure there is. As heaven is perfect, everything in heaven must be described using perfect things. An ellipse is not perfect and it is therefore impossible to have ellipses describing things in heaven, like the orbit of the sun and the planets. That is a physical argument, not a mathematical one.

The idea that there is no mathematical difference is a modern one; Ptolemy himself and his contemporaries would not use that argument.

The same for the Earth being the center of the universe. Heavy things, earth, water, would fall down to the center of the universe. Light things, fire and air, would rise up. The idea that things on earth could behave the same as things in heaven did not exist. On the contrary, things in heaven, the sun, the moon, the planets, kept moving, but on earth all movement stopped after some time. So heaven was clearly, obviously, different from Earth. No way Earth could move, and even if it could move, it would have stopped anyway.

As an econometrician I would argue that your experience speaks more loudly of the nature and content of the seminar than of the use of statistics in economics in general.

Out-of-sample testing for all manner of things (parameter stability, functional form, etc.) would normally be a standard requisite for any paper its authors wished to get published.

Geckko,

What I see in economics/econometrics is more akin to “Step 5”, with fudging to assure the best model(s) fits the out-of-sample sample. I have yet to come across the economics paper in which a true “Step 6” has been done predictively. As in, “Last year, we built model M. We made these (probabilistic) predictions. We waited for new data. Here is the verification using proper scores.” If you find one of these papers, please let me know about it. I’ll be amazed.
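What such a verification could look like is sketched below, using the Brier score as one example of a proper score. The setup is entirely hypothetical: model M, its stated probabilities, the assumed true probabilities, and the base-rate reference forecast are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def brier(p, outcome):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return np.mean((p - outcome) ** 2)

# "Last year": hypothetical model M issued probabilities for events not yet observed.
n = 500
signal = rng.normal(size=n)
true_p = 1 / (1 + np.exp(-1.5 * signal))      # assumed true event probabilities
forecast_p = 1 / (1 + np.exp(-1.0 * signal))  # model M's imperfect stated probabilities

# "This year": the new data arrive.
outcome = (rng.uniform(size=n) < true_p).astype(float)

# Reference forecast: always state the base rate (assumed to be 0.5 here).
reference_p = np.full(n, 0.5)

bs_model = brier(forecast_p, outcome)
bs_ref = brier(reference_p, outcome)
skill = 1 - bs_model / bs_ref  # positive when the model beats the reference

print(f"Brier score, model M:   {bs_model:.3f}")
print(f"Brier score, reference: {bs_ref:.3f}")
print(f"skill score:            {skill:.3f}")
```

The essential feature is the ordering in time: the probabilities are fixed before the outcomes exist, and the score is computed only afterwards. Lower Brier scores are better, and the skill score says how much the model improves on the simple reference.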

Modelling economic series: readings in econometric methodology

CWJ Granger

“Selecting models by the goodness of their ‘dynamic simulation tracking record’ is invalid in general and tells us more about the structure of the models than about their correspondence with reality”

Matt, I presume you are catching up on some reading. Here are a few thousand more to look at.

http://scholar.google.com/scholar?hl=en&q=ex+ante+forecast+validation+economics&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

“Innocent until proven guilty” may be a myth, but in classical statistical hypothesis testing, before any conclusion is made, the data evidence is evaluated under the assumption that the null (innocent) is true.

The purposes of statistical modeling, classical or Bayesian, are to understand the uncertainty in data and relationships among variables, to make a decision or prediction about what we haven’t yet seen, and others. Just because some bad statistical practitioners do certain things, it doesn’t mean that “classical Statistics” does those things.

I have a lot to say about this post, for example, cross validation, but I am too lazy today.

Geckko,

Thank you. You’re right. I did not mention the large literature (especially from J. Scott Armstrong et alia’s journal) on forecast validation. I have read hundreds of these papers, and though I disagree with many of the methods (mostly in treating as points forecasts that should be evaluated as probabilities), I love that they exist. Much of my published theoretical work is on similar lines.

But I still claim that you do not see, in the same paper, explications of a model and (truly) independent verifications of it. But I do applaud that many investigate models in an *ex ante* sense.

There is a paper by Persi Diaconis (that I can’t lay my hands on for the moment) that highlights the type of problem in economics I mean. A model with dozens of regressors, small p-values, but no sense. And certainly no attempt at verification. These papers are surely the most common.

JH,

Cross validation (bootstrapping, etc.) is “Step 5.” It is not an independent test.

Matt,

I would agree that you do not see papers in econometrics or economics of the form with which you would be most familiar, viz.:

1. Here is some data we collected from some process/trial

2. Here is our model for the process and empirical analysis of this model

3. Here is the additional data we generated by subsequently re-running/extending the process/trial

4. Here is the empirical analysis of our model against the new data

The problem, of course, is that the bulk of econometrics is based on time series. Most likely it is at a frequency no greater than monthly, more commonly quarterly and, in search of any data integrity, annually.

So for economics the process above has to be adjusted through necessity:

1. …

2. …

3. Here is us sitting on our bottoms for 30 years until we have any new data to use.

4. … the world has moved on….

OK I exaggerate, but it is true.

Now this is not dissimilar to the field of climate science in its desire to “validate” intertemporal models of the earth system.

However, (good) economists are very well aware of the extreme limitations of their empirical work and use it in an appropriate manner. For example, it can give indications that certain hypotheses are not well supported by available data. More generally, empiricism is used to try to find any useful additional insight that can be used with as much other information as possible to form some fuzzy understanding.

So generally I feel your criticism of the use of statistics in economics is not well founded, although, as in any discipline, there is no shortage of poor work. Economists are generally pretty pragmatic about data and empiricism; in fact a well worn phrase in the field is “stylised facts” where others might use the term “data”.

Very interesting discussion and here are a few “my 2 cents” comments:

1) I’d suggest that statistics (whether the usual frequentist version or the Bayesian) has to do with analyzing data (emphasis on data) and bugger-all with scientific hypotheses (emphasis on ‘scientific’). In that respect, statistics has little, if anything, to contribute to discussions of the philosophy of science. If you elevate every speculative assertion to the status of a “scientific hypothesis” then, indeed, you may think that you’re dealing with testing ‘scientific hypotheses’. Scientists may use arithmetic, but arithmetic is not science and it is not what makes something “science”. Thus statistics has little to do with deciding between a Ptolemaic vs Copernican view of the world. In fact, a tweak to the various epicycles of the Ptolemaic view would have sufficed to make it fit the data; and data was not necessarily the crucial issue that made the Copernican view prevail. A lovely corrective to the common misconception that it is merely “data” that drives (or is even decisive in) acceptance of theories is an old article in Science, written by Stephen Brush, titled “Should the History of Science be Rated X?”. It is not a coincidence that hypothesis testing prevails in the social sciences and is nearly absent from tracts written for physicists. I think this has its roots in “physics envy” and the (convenient) view that as long as you’re manipulating numbers and formulae, you “must” be doing “real” science. Hence we have Wigner using a distribution with no mean (cousin to the Cauchy) because it “fits” (and it was pragmatically a success), whereas in the view obsessed with conventional “hypothesis testing” such a deed would be unacceptable. Dirac invented the positron because it “fit” his mathematical view, not because of some data-based imperative. Another example is Special Relativity. In college physics texts, the story is told about the famous Michelson-Morley experiments and then, it is told, Einstein came up with the Special theory to “explain” them. BS!
It is true that the Special theory explains away the M-M results. But the Michelson-Morley experiments had nothing to do with Einstein’s inception of the Special theory (in fact, the M-M experiment is relegated to a footnote in the 1905 article with the mention that the proposed theory, as an incidental, explains these results).

2) Even Fisher (yes, Fisher!) did not think that hypothesis testing was appropriate for testing ‘scientific’ hypotheses, even though he is viewed as the originator of this bad habit. Fisher viewed statistical hypothesis testing as a screening test and ridiculed Neyman’s view that a scientific hypothesis should be rejected (or not) based on a decision driven by a p-value. Fisher’s view was, in that respect, that of a scientific administrator: I get tons of ‘results’ poured on my desk each week; I can’t pay attention and follow up everything; so what should I pay attention to? I’ll screen them! If p<0.05, I think these results are worthy of note and I'll look into them more closely (e.g. maybe repeat the experiment or repeat it differently, etc.). This is not a quote but a paraphrase of Fisher's statement; but he did say that he'd "even" consider publication if p<0.01 (the implication being that this stricter criterion might suggest that the result of the screening procedure might be worthy of wider dissemination to a general audience to follow through with). Fisher was, in addition to being an excellent mathematician, a scientist. Neyman wasn't and came from a mathematical "place", one, moreover, derived from a 'quality control' type thinking.

3) Statistical hypothesis testing (even in randomized trials, let alone observational studies) shows us, at best (!), an association between, say, 2 attributes. Do they, e.g., come from the same distribution, or is it, for example, 'noise'? Insofar as a scientific hypothesis is about causes, mechanisms etc., HT have nothing to say. At best, they provide hints. We all know that "correlation is not causation". I think that this is true about ALL statistical hypothesis testing, not just literally about correlation coefficients. I.e. they provide heuristics. When we see a statistically significant finding (even if it was replicated in a 'new' study population, e.g. Step 6 in the previous discussion) it is up to us to decide whether that's "just" an association (albeit a possibly useful one) or it's suggestive of "scientific" pursuit. There's a famous dataset from Germany whereby a (statistically significant) relation was found between stork sightings and rates of birth. Even if it was replicated elsewhere (or at a different time), I'd suggest that it wouldn't be worthwhile following this if one is interested in the mechanisms of procreation. The explanation is "obvious" – both births and stork migration are subject to seasonal fluctuations. But it's NOT statistics that will tell us that (although a different experiment, if done, will). Let's take a converse example, in this case a hypothetical one. Suppose a scientist in Ireland in the 1930's, say, finds that the fair-haired have a lower IQ than the dark-haired with p<0.05. Is this worthwhile following up as to why? Before she's given a multimillion dollar grant to pursue this lead, the study is replicated in another country (say the US) and NO association is found when the studies are of similar size. I.e. this failed the 'predictive' step of testing. So it might appear that this is not worth pursuing! Actually, this would have been very worthwhile to pursue (of course, with hindsight).
Looking into why fair-haired people in Ireland have a lower IQ might have (hypothetically) led to the singling-out of a particular sub-group of the fair-haired that are "responsible" for the lower IQ and the delineation of PKU (phenylketonuria, a disease that, if untreated, is associated with fair hair and lower IQ, and that has a higher prevalence in Ireland) and the finding of its gene. Now, in hindsight, the lack of replicability was likely due to lower prevalence in the US (i.e. lower power), but the point is that statistical analysis would not, a priori, shed light on whether there's 'science' behind this or whether it's a dead end. Randomization might tilt the odds in favor of an association being 'substantive', but it's no guarantee. Does anyone believe that all randomized studies will lead to the same conclusion in all cultures, all circumstances etc.? There are (always) ancillary factors, and finding these ancillary factors might be a pursuit of science, but not necessarily statistics. If a randomized trial showed that drug A is better than B in a certain population but a similar randomized trial in a different population showed that there was no difference (or that B was better than A), I'll have to decide between: a) a fluke/error, b) there's something different in the populations etc. My decision might be informed by some statistical re-analysis of the data etc. but, basically, the decisive parts of the decision might be based on other biological data etc., and the reasoning will NOT be statistical. It would be more related to the same reasoning that I use when, driving up to an intersection, I hazard a guess that turning right will lead me to a different place than were I to turn left!

Well, 'nuff said !

Gadi