Class - Applied Statistics

Choose Predictive Over Parametric Every Time

Gaze and wonder at the picture which heads this article, which I lifted from John Haman’s nifty R package ciTools.

The numbers in the plot are made up out of whole cloth to demonstrate the difference between parameter-centered versus predictive-centered analysis. The code for doing everything is listed under “Poisson Example”.

The black dots are the made-up data, and the central dark line is the result of the point estimate of a Poisson regression of the fictional x and y. The darker “ribbon” (from ggplot2) is the frequentist confidence interval around that point estimate. Before warning against confidence intervals, which every frequentist alive interprets in a Bayesian sense every time, because frequentism fails as a philosophy of probability (see this), look at the wider, lighter ribbon, which is the 95% frequentist prediction interval, which again every frequentist interprets in the Bayesian sense every time.

The Bayesian interpretation is that, for the confidence (called “credible” in Bayesian theory) interval, there is a 95% chance the point estimate will fall inside the ribbon—given the data, the model, and, in this case, the tacit “flat” priors around the parameters. It’s a reasonable interpretation, and written in plain English.

The frequentist interpretation is that, for any confidence interval anywhere and anytime, all that you can say is that the Platonic “true” value is in the interval or it is not. You may not assign any probability or real-life confidence that the true value is in the interval. It’s all or nothing—always. The same interpretation holds for the prediction interval.

It’s because of that utter uselessness of the frequentist interpretation that everybody switches to Bayesian mode when confronted by any confidence (credible) or prediction interval. And so we shall too.

The next and most important thing to note is that, as you might expect, the prediction bounds are very much greater than the parametric bounds. The parametric bounds represent uncertainty of a parameter inside the model. The prediction bounds represent uncertainty in the observables; i.e. what will happen in real life.

Now almost every report of results which uses statistics uses parametric bounds to convey uncertainty in those results. But people who read statistical results think in terms of observables (which they should). They therefore wrongly assume that the narrow uncertainty in the report applies to real life. It does not.

You can see from Haman’s toy example that, even when everything is exactly specified and known, the predictive uncertainty is three to four times the parametric uncertainty. The more realistic Quasi-Poisson example of Haman’s (which immediately follows) even better represents actual uncertainty. (The best example is a model which uses predictive probabilities and which is verified against actual observables never ever seen before.)

The predictive approach, as I often say, answers the questions people have. If my x is this, what is the probability y is that? That is what people want to know. They do not care about how a parameter inside an ad hoc model behaves. Any decisions made using the parametric uncertainty will therefore be too certain. (Unless in the rare case one is investigating parameters.)

So why doesn’t everybody use predictive uncertainty instead of parametric? If it’s so much better in every way, why stick with a method that necessarily gives too-certain results?

Habit, I think.

Do a search for (something like) “R generalized linear models prediction interval” (this assumes a frequentist stance). You won’t find much, except the admission that such things are not readily available. One blogger even wonders “what a prediction ‘interval’ for a GLM might mean.”

What they mean (in the Bayesian sense) is that, given the model and observations (and the likely tacit assumption of flat priors), if x is this, the probability y is that is p. Simplicity itself.
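
To put numbers on that, here is a minimal sketch using ciTools (the fitted model `fit`, the scenario values, and the threshold are my inventions for illustration, not Haman’s): `add_probs` attaches, for each new x, the predictive probability that y falls below a stated value.

```
# Minimal sketch, assuming `fit` is a Poisson regression glm(y ~ x, family = poisson)
# like the toy example in the plot; scenario x values and the threshold are made up.
library(ciTools)

scenarios <- data.frame(x = c(0.5, 1.0, 1.5))
add_probs(scenarios, fit, q = 10, comparison = "<")
# returns the scenarios plus a predicted mean and Pr(y < 10 | x) for each row
```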

Even in the Bayesian world, with JAGS and so forth, there is not an automatic response to thinking about predictions. The vast, vast majority of software is written under the assumption one is keen on parameters and not on real observables.

The ciTools package can be used for a limited range of generalized linear models. What’s neat about it is that the coding requirements are almost nil. Create the model, create the scenarios (the new x), then ask for the prediction bounds. Haman even supplies lots of examples of slick plots.
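
For flavor, here is a minimal sketch of that workflow (the simulated data, coefficients, and column names below are mine, not Haman’s; his vignette has the real Poisson example):

```
# Sketch of the ciTools workflow: invent data, fit the model, ask for
# confidence and prediction bounds, then plot both ribbons with ggplot2.
library(ciTools)
library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(100, 0, 2))
df$y <- rpois(100, lambda = exp(1 + 0.5 * df$x))   # made-up Poisson data

fit <- glm(y ~ x, family = poisson, data = df)

df_ints <- add_ci(df, fit, names = c("lcb", "ucb"))        # parametric (confidence) bounds
df_ints <- add_pi(df_ints, fit, names = c("lpb", "upb"))   # predictive bounds

ggplot(df_ints, aes(x, y)) +
  geom_ribbon(aes(ymin = lpb, ymax = upb), fill = "grey80") +  # wide, light: prediction
  geom_ribbon(aes(ymin = lcb, ymax = ucb), fill = "grey60") +  # narrow, dark: confidence
  geom_line(aes(y = pred)) +
  geom_point()
```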

Homework: The obvious. Try it out. And then try it on data where you only did ordinary parametric analysis and contrast it with the predictive analysis. I promise you will be amazed.

11 replies

  1. We can only see the plot above around 11.5-12. The bottom is hidden.

    Interesting that of what we can see, almost none of the data points fit the curve.

  2. As McChuck said, we can’t see the plot. I wasn’t sure if this was intentional—some kind of Jedi mind exercise or something, or just a usual cyberspace thing. The light gray behind the post text overshadows it, so I went with cyberspace as the most likely. I’d do one of those fancy analyses on the probability of this being the case, but I have to go peel boiled eggs. Maybe later…

    It is correct that people only want to know the probability of an event—sadly, statistics is very weak on this. I personally go with the crystal ball, dart board and sometimes I ask my dog. (Sorry, Matt, it is what it is.)

  3. McChuck,
    You are comparing observables (the y’s) with a mean curve conditional on x. A discrete response doesn’t hit a continuous mean, generally. Consider:
    table(df_ints[, 2] == df_ints[, 3])
    None of the predictions hit the data. Instead look at
    updown <- sign(df_ints[, 3] - df_ints[, 2])
    table(updown)
    to get a better idea of how the mean curve performs. In this case, 93/100 observables are in the prediction bounds.

    “So why doesn’t everybody use predictive uncertainty [what will happen in real life] instead of parametric [what the model has]? If it’s so much better in every way, why stick with a method that necessarily gives too-certain results?

    “Habit, I think.”

    THAT is only partly/sometimes true. Closely aligned with, but separate from, convenience.

    Ignorance is another (e.g., N. Taleb and his emphasis on “black swan” events, which occasionally occur but are impossible to know of because there is no information about them in the existing data — especially the perspective of turkeys seeing human interest in their welfare reinforced for 1,000 days straight…and then the big Thanksgiving surprise).

    One BIG reason predictive is sidestepped is corruption — where dire risks are clearly possible and readily imaginable, willful deception is repeatedly observed to downplay the risks of potential extreme events. With some regularity, markets are contrived, the trends of increasing wealth are contrived, and investments exploiting the “sure thing” trends are invented. If a market trend itself wasn’t contrived, there will still be someone there using selectively chosen info to sell you/someone something based, in part, on data that says the good times will continue.

    There’s something about human nature that enjoys being fooled (e.g., a magician’s motto, “People like to be fooled,” and Penn & Teller’s “Fool Us” show) that short-circuits objectivity & renders most of us susceptible to a myriad of scams and bandwagon appeals because the prospects look so sure. And because nobody wants to be the one who missed out on such an obvious trend, many of us will choose to assume the trend will continue indefinitely (or forever). The real estate bubble, worldwide (except Germany for the most part), is such an example. It goes to the old expression, ‘if you don’t know who the fool in the market is, it’s probably you.’

    The problem with Briggs’ essays on philosophy & statistics is that too often a failure in logic is attributed to flawed reasoning — not the corrupt intent underlying the analysis & presentation. From the perspective of the salesman applying the “faulty logic” their logic is impeccable — it works to dupe the prospect into buying.

    One of the keys to ensuring one is either applying good logic or not getting duped is to understand the behavioral/psychological factors that induce one into being duped. Part of that is recognizing the key underlying factors. See, for example, Lovallo and Kahneman’s “Delusions of Success: How Optimism Undermines Executives’ Decisions.”

    Bitcoin cryptocurrency, for example, lacks intrinsic value — buying it is based on some other sucker coming along and paying you more. Now at about $8500, it recently reached about $16,000. Based on what??? Chances are many readers here have talked to someone advocating buying bitcoin because the price keeps rising — the “Greater Sucker” theory — and chances are, if/when you encounter this, no matter how hard you try to inject logic into the analysis you cannot get them to realize that they may be the sucker left holding a worthless bag.

    Such “bubbles” were observed nearly two centuries ago; see Charles Mackay’s “Extraordinary Popular Delusions and the Madness of Crowds,” first published in 1841. One might not understand the financial implications, but there is a pattern to such kinds of trendy, and wrong, thinking that results in “bubbles” and other fads.

    Briggs can say what he will of logic & stats, but people have not and show no signs of ever changing where it matters. Understanding the logic, or flaws in logic, of stats will not solve this problem.

  5. For your reading pleasure:

    Geisser, S. (2017). Predictive Inference. Routledge.

    Clarke, B. S., & Clarke, J. L. (2018). Predictive Statistics: Analysis and Inference Beyond Models (Vol. 46). Cambridge University Press.

    @ken
    And sometimes you are interested in population (e.g. lot, market, etc.) averages, particularly when the response is physically constrained. Take the above model and predict the mean (or total) of 1000 units. The predictive probabilities in the above examples are for one-offs, the hardest case.
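
    A rough sketch of what I mean, assuming the `fit` from the Poisson example (the 1000 new x values and the simulation scheme are my own invention): simulate the average of 1000 new units many times, drawing plausible coefficients each time, and the predictive bounds on that average come out far tighter than the one-off bounds.

    ```
    # Sketch: predictive distribution of the MEAN of 1000 new units, assuming
    # `fit` is the Poisson glm(y ~ x) from the example above.
    library(MASS)   # for mvrnorm

    set.seed(2)
    new_x <- data.frame(x = runif(1000, 0, 2))     # hypothetical new units
    X <- model.matrix(~ x, data = new_x)

    sim_means <- replicate(5000, {
      beta <- mvrnorm(1, coef(fit), vcov(fit))     # draw plausible coefficients
      mu <- as.vector(exp(X %*% beta))             # Poisson means for each unit
      mean(rpois(length(mu), mu))                  # average of 1000 simulated y's
    })

    quantile(sim_means, c(0.025, 0.975))           # predictive bounds on the average
    ```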

  6. Greetings,

    I’d like to refer to these two paragraphs you posted first:

    [quote] The Bayesian interpretation is that, for the confidence (called “credible” in Bayesian theory) interval, there is a 95% chance the point estimate will fall inside the ribbon—given the data, the model, and, in this case, the tacit “flat” priors around the parameters. It’s a reasonable interpretation, and written in plain English.

    The frequentist interpretation is that, for any confidence interval anywhere and anytime, all that you can say is that the Platonic “true” value is in the interval or it is not. You may not assign any probability or real-life confidence that the true value is in the interval. It’s all or nothing—always. The same interpretation holds for the prediction interval. [/quote]

    The trouble is that the frequentist interpretation you cite is not how most researchers/consumers of research understand it. Many, ironically, think of the CIs in the way the first paragraph describes.

    Now, I’d like to know what exactly it means to have 95% (or any other value) probability as described in the Bayesian interpretation (1st paragraph). The reality, in both cases, is that the point estimate will or will not fall within the band. Nothing can “fall within” anything to the extent of 95%.

    73% probability of some disease does not mean I’ll get 73% of it, but not the other 27%. I’ll either get it or not. Either way the outcome is dichotomous. All of this makes sense only if (theoretically) the same experiment is performed many times over (until the end of time), under the same conditions. In 95% of those cases the point estimate will have been captured by the interval, and in 5% of cases it will not. Basically, in the long run. The catch is, nobody knows how long ‘in the long run’ is.

    It may be useful to use the probabilities to make life choices, but it is a far cry from guaranteed outcomes. If I am faced with two medical procedures, one with a 91% success rate and the other with 95%, I can choose the less risky one (95%). However, if I am one of those that the other 5% is made up of, all these numbers are useless for me personally, at that one time. I may go for the riskier one (91%) and be fine, or choose the ‘safer’ one and have complications.

    As you said in another post: ‘nothing HAS a probability’. It’s us, perceivers, who assign probabilities to events, given our astronomically small experience of ‘events’.

  7. With apologies to Groucho Marx:

    > These are my probabilities. If you don’t like them, I have others…

  8. Years ago I tried to make Matt enthusiastic about making such a graph even more descriptive by using a gradient instead of a solid color for the graphing of the credible interval.

    I couldn’t quickly find a link to a picture so you can see what I mean, but the idea is that within the 95% credible interval (given the evidence, the model, etc., and all else being copacetic), not everything has the same probability. Yes, all of those values are inside the 95% interval, but some values within the credible interval are “more probable” than others. Values nearer the ‘center’ of the calculated credible interval (near the calculated regression “line”) are more probable, while values near the fringes of the credible interval are decidedly less probable. That’s the reality.

    And no graph, so far as I know, including the ones generated by this awesome R package, depicts this reality.

    So, to reflect that, you don’t color the entire credible interval with one color. You start in the ‘center’ of the credible interval, and make the color there (say) 60% grey, and then gradient out that grey so there’s less and less grey and more and more white the nearer to the fringe of the credible interval that you go, in rough proportion to how probable the value is there.

    Yes, Matt, I know it gets trickier for multi-dimensions, but still: there are a lot of regressions like the graphs in your post, and they could all benefit from introducing a color gradient within the credible interval.

    Just getting this idea out there for enterprising programmers.
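
    For any enterprising programmer, here is a rough sketch of the idea, assuming the `fit` and `df` from the post’s Poisson example (the levels and colors are arbitrary): stack several nested prediction ribbons with a translucent fill, so the overlap is darkest near the center and fades toward the fringes. The same trick works for the confidence (credible) ribbon.

    ```
    # Sketch of a "gradient" ribbon: nested prediction intervals at several
    # levels, each drawn translucent so the overlap darkens toward the center.
    # Assumes `fit` and `df` from the Poisson example; not part of ciTools itself.
    library(ciTools)
    library(ggplot2)

    lvls <- c(0.50, 0.80, 0.90, 0.95)
    bands <- do.call(rbind, lapply(lvls, function(lvl) {
      b <- add_pi(df, fit, alpha = 1 - lvl, names = c("lpb", "upb"))
      b$level <- lvl
      b
    }))

    ggplot(df, aes(x, y)) +
      geom_ribbon(data = bands, aes(ymin = lpb, ymax = upb, group = level),
                  fill = "grey20", alpha = 0.15) +
      geom_point()
    ```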

  9. Bill –
    I am observing the pictured chart at the top of the blog post – what I can see of it. Of the 9 observables, 2 lie outside the light grey area, 1 just touches the dark grey, and 1 lies on the line.

    When 22% of the data points lie entirely outside the predicted area, and 78% lie outside the 1st SD prediction, the model isn’t very good.

    As for the rest of your comment, I will admit to having no idea what it means. I am not a professional statistician, and I do not “R”.

  10. @McChuck,

    Please excuse the delay in replying.

    Matt’s link leads to some R code that produces the plot. If you run that R code, it produces a dataset named `df_ints`. The second column of that dataset contains the observed values (integers), while the third contains the predicted means at that observable’s x value. If you test the equality of those two numbers, and display them as a table, you’ll find that none of the observables (integers) are exactly equal to the continuous mean.

    The second piece of code was a copying error on my part. It should have been
    ```
    inout <- (df_ints[, 2] >= df_ints[, 6]) * (df_ints[, 2] <= df_ints[, 7])
    sum(inout)
    ```
    This checks whether the observed value (column 2) is greater than or equal to the lower prediction bound AND less than or equal to the upper bound. For these data, 93 of 100 are in the prediction bounds.

    The part I originally cut and pasted looked at whether an observable is above or below the predicted mean.

    Contrary to current conventional wisdom, no ‘thing’ has an inherent value economically. Value is in the mind of the economic actor, who makes decisions under uncertainty. That’s right: an economic value is an opinion, unique to each actor, and that opinion is subject to change, perhaps very quickly. It isn’t measurable, either, any more than any other opinion is measurable, until the actor makes the decision.

    The actor has an end in mind. The actor evaluates the alternatives available, which are: 1. do nothing, 2. make it myself, or 3. buy it from someone else, where buying means trading with another actor. To make the decision, the actor has an internal rate of return, so to speak (consider: a bird in the hand is worth two in the bush). At the margin, the actor will act if and only if the price is right, for him/her, at that point in time and at that point in space. Only the factors that actor has in mind matter; nothing else does. The reification of mathematical expressions doesn’t help, either. So now you know why so many economic results are “unexpected”. The opinions held by “economic experts” will only rarely be held by the rest of us.
