I received this email from long-time reader Ivin Rhyne, which is so well put that I thought we should all see it (I’m quoting with permission):

Matt,

I just got back from a conference on historical economics and was absolutely bowled over by the repeated usage of t-tests and p-values as the arbiter of whether an hypothesis is false or not. Allowing for the subtleties of “reject” vs. “unable to reject”, my question is more numerical.

My personal understanding of using regression analysis to fit and then test a model against the data is as follows:

1. Form a hypothesis

2. Gather some data to test your hypothesis

3. Translate your hypothesis into a form that is mathematically testable (let’s assume OLS regression is a good mathematical expression of your original hypothesis)

4. Using part of your data, calibrate the OLS by running it to get some numerical parameters that then become an intrinsic part of your hypothesis

5. Using the rest of your data (the part you DIDN’T use to calibrate the model), insert the actual values of the independent variables and compare how closely the predicted values of the dependent variable match the actual values.

ALL of the papers presented at the conference stopped at step 4. Their test of the hypothesis was simply whether the model could “calibrate” to the data in a way that generated coefficients that had “acceptable” p-values.
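Steps 4 and 5 can be sketched numerically. This is a minimal illustration with made-up data (the linear relationship, sample size, and 50/50 split are all assumptions for the example, not anything from the conference papers): calibrate OLS on half the data, then check predictions on the half the model never saw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x, plus noise.
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)

# Step 4: calibrate OLS using only the first half of the data.
train, test = slice(0, n // 2), slice(n // 2, n)
X_train = np.column_stack([np.ones(n // 2), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Step 5: predict on the held-out half and compare to the actual values.
X_test = np.column_stack([np.ones(n - n // 2), x[test]])
y_pred = X_test @ beta
rmse = np.sqrt(np.mean((y[test] - y_pred) ** 2))
print(round(float(rmse), 2))  # out-of-sample error, in units of y
```

The out-of-sample RMSE, not the in-sample p-values, is what speaks to whether the calibrated model predicts anything.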

My questions are as follows:

1. Have I missed something, and is theirs in fact the correct approach to hypothesis testing of social science data?

2. If I am right (in principle) about how to test hypotheses, can you point me toward (or perhaps even better lay out in your blog) what kind of test is appropriate for step 5 described above for an OLS regression?

As always, I appreciate your insights.

Ivin

This nails it. I have rarely seen a sociological or other “soft science” paper venture beyond Step 4. A few make a stab at Step 5, but usually in such a way as to dissolve the force of this Step.

It’s cheating, really, and done by formulating several models, usually the same underlying OLS but with different sets of “regressors” for each model, and then each is tested (crudely) via a Step 5. The one that’s best, or the one that is best within the subset matching an author’s desires, is the one that makes its way to print.

I hope you can see that doing this is just the same as skipping Step 5. Or, equivalently, it is running Steps 1–4 with a “meta” model. Regardless, it is using the data you have in hand to massage a model into a shape that is lovelier to the eye. It is *not* an independent test of your model’s goodness.
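The cheat is easy to demonstrate by simulation. In this hypothetical setup (my own construction, not from any of the papers), the dependent variable is pure noise with no relation to any of fifty candidate regressors; yet if each candidate model gets its own crude Step 5 and we report only the winner, the winner looks better than it deserves, by luck alone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure noise: y has NO relationship to any candidate regressor.
n, k = 100, 50
y = rng.normal(size=n)
X = rng.normal(size=(n, k))
train, test = slice(0, n // 2), slice(n // 2, n)

def holdout_rmse(j):
    # Calibrate a one-regressor OLS on the train half, score on the test half.
    A = np.column_stack([np.ones(n // 2), X[train, j]])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    B = np.column_stack([np.ones(n - n // 2), X[test, j]])
    return float(np.sqrt(np.mean((y[test] - B @ beta) ** 2)))

scores = [holdout_rmse(j) for j in range(k)]
best, typical = min(scores), float(np.mean(scores))
print(round(best, 2), round(typical, 2))
```

The best-of-fifty score beats the typical score even though every model is worthless: once the holdout data has been used to pick a winner, it is no longer an independent test.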

Ideally, there is no Step 5—all data you have in hand should be used to construct your model—but there should be a Step 6, which is “Wait until new data comes in and test the model *predictively*.” All physical sciences do Step 6—with the exception, perhaps, of climatology, where it’s the “seriousness of the charges” that counts.

Passing a Step 6 does not, of course, guarantee the truth of a model. Just look at the Ptolemaic systems of cycles, epicycles, semi-cycles, and so on ad infinitum. Wrong as can be, but still useful. The model passed Step 6 for centuries, which is one of the reasons few thought to question its truth. Don’t mess with what works!

From this history we learn that passing Step 6 is a necessary but not sufficient condition in ascertaining a model’s truthfulness. Spitting out a p-value (Steps 1–4) that is less than the magic number is not even a necessary condition; and anyway, the p-value was *purposely designed not* to say anything about a model’s truthfulness.

We must remember that, for any set of evidence (data), any number of models can be made to explain that data; that is, you can always find models which fit that data. Simply touting fit—as in Steps 1–4, and the p-value’s main job—is thus very weak evidence for a model’s truth.

Why aren’t more “Step 6”s being done in statistics? It’s not that it’s difficult computationally, but it is expensive and time consuming. It’s expensive because it costs money to collect data. And it’s time consuming because you have to wait, wait, wait for that new data. And while you’re waiting, you’re wasting opportunities for “proving” new theories.

Much more to this, of course. For example, why do some models work even when people flub the steps? Because models are chosen with reference to external probative information. We’re obviously just at the beginning of a discussion.