New Paper: Don’t Test, Decide — Updated: Now With Paper Link!

Paper link.

By moi. Abstract:

There is no reason to use traditional hypothesis testing. If one’s goal is to assess past performance of a model, then a simple measure of performance with no uncertainty will do. If one’s goal is to ascertain cause, then because probability models can’t identify cause, testing does not help. If one’s goal is to decide in the face of uncertainty, then testing does not help. The last goal is to quantify uncertainty in predictions; testing is not needed there and is instead unhelpful. Examples in model selection are given. Use predictive, not parametric, analysis.

Some meat from the Testing Versus Deciding section, mostly de-LaTeXified.

Suppose we have the situation just mentioned, two normal models with different priors for the observable $y$. We’ll assume these models are probative of $y$; they are obviously logically different, and practically different for small $n$. At large $n$ the difference in priors vanishes.

A frequentist would not consider these models, because in frequentist theory all parameters are fixed and ontologically exist (presumably in some Platonic realm), but a Bayesian might work with these models, and might think to “test” between them. What possible reasons are there to test in this case?

First, what is being tested? It could be which model fits D, the past data, better. But because it is always possible to find a model which fits past data perfectly, this cannot be a general goal. In any case, if this is the goal—perhaps there was a competition—then all we have to do is look to see which model fit better. And then we are done. There is no testing in any statistical sense, other than to say which model fit best. There is no uncertainty here: one is better tout court.

The second and only other possibility is to pick the model which is most likely to fit future data better.

Fit still needs to be explained. There are many measures of model fit, but only one that counts. This is that which is aligned with a decision the model user is going to make. A model that fits well in the context of one decision might fit poorly in the context of another. Some kind of proper score is therefore needed which mimics the consequences of the decision. This is a function of the probabilistic predictions and the eventual observable. Proper scores are discussed in [paper]. It is the user of the model, and not the statistician, who should choose which definition of “fit” fits.
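To make “proper score” concrete, here is a minimal Python sketch of one standard proper score, the Brier score for probabilistic predictions of a binary observable. The forecasts and outcomes below are invented purely for illustration; the paper's own examples may use different scores.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the
    observed 0/1 outcome. Lower is better. The score is proper: in
    expectation it is minimized by reporting one's true probability."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Two forecasters scored against the same outcomes (toy data)
outcomes = [1, 0, 1, 1, 0]
calibrated = [0.8, 0.2, 0.7, 0.9, 0.1]      # honest probabilities
overconfident = [1.0, 0.0, 0.0, 1.0, 0.0]   # certain, and wrong once

print(brier_score(calibrated, outcomes))     # small: well calibrated
print(brier_score(overconfident, outcomes))  # larger: one confident miss is costly
```

A log score, or a score built directly from the user's actual loss function, would serve the same role; the point is only that “fit” is judged by the consequences the decision maker cares about.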

There is a sense that one of these models might do better at fitting, which is to say predicting, future observables. This is the decision problem. One, or one subset of models, perhaps because of cost or other considerations, must be chosen from the set of possible models.

There is also the sense that if one does not know, or know with sufficient assurance, which model is best at predictions, or that decisions among models do not have to be made, that the full uncertainty across models should be incorporated into decisions.

The two possibilities are handled next.

You may download this peer-reviewed wonder here. It will soon appear (March?) in a Springer volume Behavioral Predictive Modeling in Econometrics. I don’t have the page numbers for a citation yet, but it will be in the “Springer-Verlag book series ‘Studies in Computational Intelligence’ ISSN: 1860-949X (SCOPUS)”.

When I have some more time, I’ll post the R code and make it a statistics class post so those interested can follow along.

24 Thoughts

  1. In machine learning the concept of how well a model predicts unseen data is inherent in the model design. In all models, it is impossible to know what future data may bring. You can test a model against future datasets and, while how well the model predicts those may increase your confidence in the model, you just never really know. The scientific method is based on this idea.

    Yes, what constitutes a good fit needs to be known. But it needs to be specified before the model is built. Some people might want race cars and others trucks, but using a race car as a truck is often disappointing from a performance perspective, and vice versa.

    There are several ways to get a handle on future performance, but all are really variations of the same procedure:
    1) hold out some data for validation
    2) train on the remaining data
    3) see how well the model predicts the held out data

    Doing this repeatedly with different holdouts gives a better estimate of performance of training on the entire dataset. The most accurate estimate is when the holdout consists of one data point (leave one out approach). This still leaves open the possibility that the data at hand are biased in some way.
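    The three steps above, taken to the leave-one-out extreme, can be sketched as follows (an illustrative Python version with stdlib only; the toy data and helper names are mine, not from any library):

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def loo_mse(xs, ys):
    """Leave-one-out cross-validation: hold out each point once,
    train on the rest, average the squared prediction errors."""
    errs = []
    for i in range(len(xs)):
        tx = xs[:i] + xs[i + 1:]   # training fold: all but point i
        ty = ys[:i] + ys[i + 1:]
        a, b = fit_line(tx, ty)
        errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(errs) / len(errs)

# Nearly linear toy data: a line should generalize well here
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
print(loo_mse(xs, ys))  # small average held-out squared error
```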

    Ensemble modelling can be used if multiple models are available. Using ensembles tends to improve model generalization but, again, only time would tell.
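    The simplest ensemble is just an average of the member predictions. A minimal sketch (the three toy “models” below are invented stand-ins for separately trained models):

```python
def ensemble_mean(models, x):
    """Average the predictions of several models at input x."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Three biased approximations of the same target f(x) = 2x
models = [lambda x: 2 * x + 0.3,
          lambda x: 2 * x - 0.2,
          lambda x: 2 * x - 0.1]
print(ensemble_mean(models, 10.0))  # individual biases partly cancel
```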

    How well the model fits the training data is not so important but there is still the problem of underfitting (aka model bias) so some testing against the training data is needed. But this would involve seeing how well the training data are predicted as if they were unseen data.

    A brief overview on fitting:
    https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/

    A brief discussion on ensembles:
    https://www.analyticsvidhya.com/blog/2015/09/questions-ensemble-modeling/

    Building models in economics is like building models in psychology. It’s mostly wishful thinking. There really is no way to test hypotheses. That said, there may be indicators of future behavior in, say, stock prices that can be measured.

  2. I found these typos using p-values:

    assoociated
    uncertanity
    probabtive
    strenghtened
    acknowldgement
    quandrant
    penality
    onotlogically

    I’m not going to believe the approach you propose is any better than frequentist or Bayesian approaches until real problems are solved. 🙂 Any real examples?

    Justin

  3. Justin,

    All typos provided free of charge. I’m hoping Springer catches just as many as you did.

    Also, I’m sure you worked through all the class problems, and are still preparing rebuttals for the formal arguments against p-values in the papers I wrote.

  4. … is like building models in psychology. It’s mostly wishful thinking. There really is no way to test hypotheses.

    Nope, works just fine in consumer psychology (e.g. testing and selling physical products), if you have skill. Google paired comparisons or discrete choice sometime, where you observe actions and predict future actions. Problems pop up when you pretend you’re a physical scientist or an engineer, assign numbers and try to work with those as if they represent things.

    Drop that physics/engineer mindset and use orders and partial orders on longitudinal actions/choices and things start to work. Briggs has a number of columns that touch on “measuring” emotions on the real line…

  5. A few more:

    “veritical”
    “there import”

    Seriously though, I know literally no one who believes a parameter being constant means that parameters exist in a Platonic realm.

    Justin

  6. I was hoping it just hadn’t been posted, because based on personal experience, it could have been my own device hiding a “malicious” link. Thank you for updating the post.

  7. If weights of individual people are a real thing, then the total weight of everyone in the US (parameter) is a real thing, for example, at a time t. I can learn about that parameter by taking a sample at time t.

    Justin

  8. If one’s goal is to assess past performance of a model, then a simple measure or performance with no uncertainty will do

    Maybe a p-value?
    – Prior model: check,
    – simple measure: check,
    – compare observed performance: check,
    – no uncertainty: check (for exact calculations or bounds, anyway)

  9. Bill_R,
    – compare observed performance: check
    P-values don’t tell you how well a model performs.

    A simple R example:

    x = 1:1000
    y = sin(x)*x - 0.5*x
    mdl = lm(y ~ x)
    plot(x, y); abline(mdl)
    summary(mdl)

    p-value of slope: <2e-16

    But since it is a linear model, it captures only the trend component of Y, which is -0.5*x; IOW, what most would call the trend. It doesn't predict even the training data well.

    Residuals:
    Min 1Q Median 3Q Max
    -978.58 -259.64 0.12 258.70 987.92

    (I’m guessing pre works here. Not sure if the blog code allows it.)

    The residuals (err= Y-mdl(X)) do show performance. Presumably, abs(err)==0 would be better. The wonderfully low p-value is useless information.
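    The same demonstration can be reproduced in Python (a sketch of ordinary least squares by hand, stdlib only; not the original R code): the slope's t-statistic is enormous, hence the vanishing p-value, while the residuals stay huge.

```python
import math

# y has a linear trend of -0.5*x plus an oscillation sin(x)*x whose
# amplitude grows with x; a straight line can capture only the trend.
x = [float(i) for i in range(1, 1001)]
y = [math.sin(xi) * xi - 0.5 * xi for xi in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
intercept = my - slope * mx

resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rss = sum(e * e for e in resid)
se_slope = math.sqrt(rss / (n - 2) / sxx)
t = slope / se_slope

print(slope)                       # near -0.5: the trend is captured
print(abs(t))                      # large t, hence the tiny p-value
print(max(abs(e) for e in resid))  # yet residuals run into the hundreds
```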

  10. DAV,

    Nice troll.

    It worked precisely as it’s supposed to. Like many, you never bothered to specify the prior model, and blindly chose a summary statistic. The p-value indicates quite clearly that the prior model for that summary statistic does not fit, and if you bothered to look at the distribution of the slope under the prior model (e.g. your unspecified starting or null model) you would see that. The p-value doesn’t give you the magic answer, it just indicates that your starting assumption was incorrect for the specific statistic you chose.

    “It’s a poor workman who blames his tools.”

  11. Bill R,

    Don’t be confused by the summary(mdl). R only computes the p-value in the summary function.

    So, p-value < 2e-16 means bad model?
    Oddly when Y=X, the p-value is the same but the residuals are effectively zero.

  12. your unspecified starting or null model

    In case you’re wondering, for a linear regression:
    Y = α + βX + ε (model)

    H0: β = 0 (null hypothesis: no trend)
    HA: β ≠ 0 (alternative hypothesis: has trend)

    Clearly the trend was captured, but the resulting model (for y ~ sin(x)*x - 0.5*x) fails to predict Y accurately even on the training data. If all I cared about was how well the average Y value might be predicted (at least against training data), then it's a good model; but future performance is unknown, and would be even if the p-value somehow indicated performance on the training set.

    The point is though, the p-value of the slope (or p-value of any parameter) doesn’t really address the performance of a model.

  13. Hmm,

    Didn’t take the Unicode characters as I expected.
    Read:

    Y=alpha + beta *x + epsilon
    H0: beta==0
    HA: beta!=0

  14. DAV

    Yes, a low p-value means the default prior/null/reference distribution, which you didn’t specify, grossly mis-specifies the sample slope (which you got by default when you pushed the `lm` button).

    The teeny-tiny p-value just means the default approximation has hit its limits and you are chasing noise. On my laptop .Machine$double.eps is about 2.2e-16.
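    For reference, the same limit seen from Python (IEEE double precision, so the value matches R's .Machine$double.eps):

```python
import sys

# Machine epsilon: the smallest x for which 1.0 + x is distinguishable
# from 1.0 in double precision.
eps = sys.float_info.epsilon
print(eps)                    # 2.220446049250313e-16
print(1.0 + eps > 1.0)        # True
print(1.0 + eps / 2 == 1.0)   # True: anything smaller is absorbed
```

So a reported p-value of “< 2e-16” is at the edge of what the arithmetic can distinguish from zero.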

  15. DAV,
    The p-value doesn’t say anything about how well the model predicts the data, nor is it supposed to. It only compares the observed sample statistic with the predicted reference distribution.

    Talking about HA suggests you’re conflating p-values with NP. Fisherian P-values are purely about the reference distribution and the observed statistic.

    If you want in-sample predictive assessments, consider split samples, jackknives, (generalized) cross-validation, bootstraps, etc. All are standard statistical tools from way back, at least the late ’60s in my experience. Google has lots of pointers to those.

  16. The p-value doesn’t say anything about how well the model predicts the data, nor is it supposed to.

    So you were drunk or otherwise insane in the following? What exactly do you think “performance” means?

    If one’s goal is to assess past performance of a model, then a simple measure or performance with no uncertainty will do
    Bill R https://wmbriggs.com/post/28972/#comment-183810
    Maybe a p-value?
    – Prior model: check,
    – simple measure: check,
    – compare observed performance: check,
    – no uncertainty: check (for exact calculations or bounds, anyway)

    BTW: I think Briggs actually meant to say:
    a simple measure of performance with no uncertainty will do

  17. DAV,

    Nah, I stopped drinking about 15-20 years ago. You might want to try it.

    This is what happens when you don’t clearly specify your starting model Pr(Y|X,M). Briggs pounds the pulpit on that for a reason.

    – The implied starting model in your lm example is that “the sample slope wrt x should be about 0” (along with tacit distributional assumptions).
    – The simple measure is a rank of the slope within that sampling distribution of slopes,
    – the observed performance is the p-value (the observed rank of the sample slope),
    – no uncertainty: the initial model, the statistic, and the data are all fixed.

    If the observed performance of the model on some aspect of the data is poor, then it’s reasonable to conclude the starting model is a poor approximation to that aspect of the data or data generating process.

    If the question is about future performance, then your updated model becomes the new null model, and some future, unseen, data becomes the “sample” value. And that is uncertain. P-values are not of much interest here.

    It might help to consider the difference between deduction (a p-value) and induction (a statement about future behavior). If your starting model had been “y has an unspecified linear component in x” then the problem would have been a simple estimation/induction, with no need for a p-value. If it had been “y is strictly linear in x” then the exercise should use a different sample statistic (e.g. a smoothing spline, or kernels, depending on the physical problem) to assess the performance for prespecified slopes. You could add an “analyst degree of freedom” and specify the slope after you get the data, and use things like subsamples or bootstraps to assess performance. (Are you familiar with the work of Patrick Laurie Davies?)

  18. Umm, yeah, sure.

    Them durn ML guys have really missed the boat what with their obsession with reducing prediction error in training and validation data and all that while not caring a whit about model parameters and/or p-values. Explains their poor models.
