How To Do Predictive Statistics: Part VIII — Starting Stan

Review! You must at least review the first lessons—all class material is on one page for ease. I’ll have more words about the mysticism of simulation, but I’ve said it all before and better in the previous links.

Last time we finished up with the MCMCpack extensions. I chose this for its ease and speed, and because of it was mostly automated. Like I said, one reason for the lack of adoption of predictive methods is automatic, out-of-the-box software. Most people won’t or can’t code. It isn’t easy.

We’re now adding complexity. We’ll be able to handle more complicated models—with all the extra confusion that entails. Run these lines of code:

install.packages('rstanarm', dependencies=TRUE) install.packages('betareg', dependencies=TRUE)

These are get-a-cup-of-coffee lines of code. If you’re on Linux (and who isn’t?), it will take a good while (run and go to lunch). You’ll see lots of warnings as the c code compiles. Be patient, even when it seems like it’s stuck. It isn’t.

We last talked last time about the possibility of JAGS, which is external software that R can hook to. It’s fine, but it comes with a heavy price. It takes forever to learn. Mistakes are easy. It can run any kind of model you can envision, which is a great benefit. But it can suck up inordinate time, which is a great detriment.

Instead of jumping into a canicular caldera of code, we’ll softly stroll over to Stan. Like JAGS, it uses MCMC type methods (but not everywhere). Meaning simulations. Meaning mystical thinking.

As I’ve written (many times!), simulations provide approximations to non-analytic integrals (we use integrals because of insisting on approximate models using infinitely valued, i.e., continuous parameters). As long as we keep in mind they are nothing but numerical methods to difficult problems, we’re fine. But as soon as we allow any metaphysical import to the simulation, like talking about “random” numbers, we are lost.

We simply will not progress as a field until we can discover closed-form analytical solutions to the problems we now give to simulations. As Jaynes said (and as I quote in my award-eligible book) “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.”

Amen amen amen and amen.

Even when we can—and it is difficult if you have been brought up in the frequentist or Bayesian faith—manage to think of simulations as nothing but numerical approximations, simulations suffer from another huge flaw. They are slow. Try the package brms, which is Bayesian Regression Models using Stan. Even the simplest regression takes minutes (because it exports the model to external C++ code, which first compiles then runs: ugh).

Very little attention (in statistics, anyway) has been paid to the problem of finding analytic approximations to non-analytic integrals. The applied math guys have done tons here, and it would be worth spending time to see the natural crossovers. I am only one guy with no students, no grants, no facilities (I am a reactionary and no Western university will hire me), no nothing except some spare time. So I can’t do this. It will be up to you, dear reader.

Enough of the necessary and, I pray to God, not futile rant. On with the show!

Stan

Has the code completed yet? First thing we’re going to do is run some comparisons of simple models, regression and logistic regression, using MCMCpack and rstanarm. This is only for the purposes of orientation, and to show how differences in simulation produce differences in answers. Next time we’ll try models not available in MCMCpack. I am so sick of the CGPA data I couldn’t use it again even if Cardinal Dolan promised to resign. So we’ll use some built-in datasets. Be sure to first download the latest versions of mcmc.pred.R, mcmc.pred.examples.R.

Regression

First do ?mtcars to learn all the exciting details about the MPGs of 32 cars. And recall the “.” after the tilde tells R to use all the measures in a dataset (except, of course, for the observable on the left-hand-side).

x = mtcars # only to make life smooth

fit.s = stan_glm(mpg ~ ., data = x, QR = TRUE) fit.m = MCMCregress(mpg ~ ., data = x)

As ever, you can try summary(fit.s) etc. to look at the parameter summaries. Of which, as predictivists, we have zero interest. Strike that: make it epsilon interest. Because these are simulations, there are all kinds of niceties to attend to so that we know the approximation has “converged.” Things can be learned about this convergence by examining the posteriors of the parameters. However, we must always remind ourselves that the parameters are not real, they are not ontic. They are nothing but mathematical contrivances that make the model work. Where do parameters come from? All those who read Uncertainty know they are the result of taking finite discrete parameterless models to the limit—simply for the ease of approximation. Real finite discrete probability models have no parameters at all. They are natively predictive.

“Okay, Briggs. Suppose you’re right. Then why aren’t you using these super good native finite discrete models here, instead of these continuous-parameter based ones?”

Say, that’s a good question. Because I’ve only worked out one. Again, I have no students, etc. etc. As a mathematician, I make a great philosopher. If you are a student in need of a problem, boy howdy, have I got some good ones.

Anyway, we don’t care about the parameters, but the predictions. First MCMCpack, as we did before (if you can’t remember everything, review! you can lead a student to code, but you can’t make him learn):


# all the old x
q.m = NA
g = 25  # 25 mpg; why 25? why not?

for(i in 1:nrow(x)){
  p.m = MCMCregress.pred(fit.m,x[i,])
  q.m[i]= sum(p.m>=g)/length(p.m)
}

plot(x$disp,q.m,ylab='Pr(MPG>g|old data,M)',main='MCMCpack')

That plots the probability of MPG>25, for all future scenarios that—hey presto!—look just like the old data. These are predictions! Save this plot. Now let’s to rstanarm.



p.s = posterior_predict(fit.s)

q.s = NA
for(i in 1:ncol(p.s)){
  q.s[i]= sum(p.s[,i]>=g)/length(p.s[,i])
}

plot(x$disp,q.s,ylab='Pr(MPG>g|old data,M)',main='Stan')

The object p.s contains the simulated posterior predictions: 4,000 is the default, one column for each row in the newdata which, since it defaults to all the old data, is one column for each of the original 32 observations. A long-winded way to say dim(p.s) is 4,000 x 32. We have to work with those 4,000 posterior predictions for every scenario.

We calculate the probability of future scenarios having MPGs greater than 25 in exactly the same was in rstanarm as with MCMCregress.pred.

No need to be coy about the comparisons. We can put both predictions on one plot (and the plot I used to head the post).

plot(x$disp,q.m,ylab='Pr(MPG>g|old data,M)',main='MCMC vs. Stan')
  points(x$disp,q.s,pch=2,col=1)
  grid()
  legend('topright',c('MCMC','Stan'), pch=1:2, col=1:2, bty='n')

There are differences, but all are such that, I'm guessing, no decision would be changed using one model for the other ---- at least decision point, i.e. at MPG> 25.

You must always remember that it is decision that counts in deciding between good and bad models, whether a measure (an 'x') goes into the model, etc.

The dataset has 9 (or whatever) other measures, like horsepower. You can play with those. rstanarm makes it easy to put in custom scenarios, which is done in pretty much the same way as before. For example:


y = x[1,] # take as scenario the first observation
y$hp = 400 # bump up the horsepower to some gigantic level
p.s = posterior_predict(fit.s, newdata=y)
hist(p.s,100,xlab='MPG')

That's the distribution of possible MPGs for the scenario y. Look at y so you know what it is. Change values to your taste and compare outputs. The histogram took advantage that the new data only had one row. If there is more than one row, you have to index p.s (i.e., p.s[,i]).

You'll see there is already probability leakage with this scenario, but not much; only about 1% (sum(p.s<0)/4000). Your homework is to discover, for reasonably configured cars, you can find scenarios where the leakage is substantial. If it exists. I don't know: I didn't check.

This means as it did before: the normal model is breaking down sometimes. Well, we always knew it was impossible that MPG was "normally distributed". Nothing in the world is "normally distributed". Nothing "has" a normal distribution, or any other kind of distribution. We only use these models to quantify our uncertainty in values of observables.

Before we quit, let's note a neat feature of rstanarm. Mixing code and output:


> p.s = predictive_interval(fit.s)
> p.s
                           5%      95%
Mazda RX4           17.029272 27.77522
Mazda RX4 Wag       16.849237 27.30586
Datsun 710          21.024821 31.37435
Hornet 4 Drive      15.768537 26.32141
Hornet Sportabout   12.460326 22.79026
Valiant             15.107833 25.50869
Duster 360           9.098461 19.97714
Merc 240D           17.095323 27.90923
Merc 230            18.379516 30.58885
Merc 280            13.046051 24.22801
Merc 280C           13.786114 24.79934
...

The predictive_interval() first calls posterior_predict() and then does a sort of quantile() on the columns of the output (of each scenario). The default is a 90% interval, as you can see. You can certainly write your own quantile() function if you want more than just the interval.

That's it! Next we quickly do logistic regression, which I'd skip, except that there is a small twist with rstanarm. Then we move to so-call beta-regression. That's a modle type MCMCpack doesn't have. On the other hand, rstanarm doesn't have my favorite multinomial regression. Proving only that no one package exists to do all you want.

5 Comments

JohnK

September 19, 2018, 2:39 pm

> If you are a student in need of a problem, boy howdy, have I got some good ones.
Matt has made a convincing case that replacing the continuous-parameter based models that are known to current statistics with finite discrete models will permanently improve statistics as a field.

So how about listing those problems and offering a Swell Prize (a bounty) for solving each one?

OK, the Fields Medal is in a different universe. But bounties for math problems are a long tradition. And it’s mathematicians, right? If the problem is intriguing, it’s a fact that they’ll work cheap.

I’m making an educated guess that neither Matt nor I would be good candidates to make the following come true. But somebody can!

To say it again: replacing the continuous-parameter based models that are known to current statistics with finite discrete models will permanently improve statistics as a field.

In essence:
1. Collect some prize money (through donations).
2. Very important. Get some publicity. Make it widely known that money is being collected for solving the following list of problems: [ ]
3. Once the money is collected, email every math department in the world with the list of problems and the bounty money allotted for solving it.
4. Repeat the email every year.

Ronald Graham now administers and judges the Erd?s Prizes. It’s possibly not a big deal to do such a thing fair and right.

“Cash for Math: The Erd?s Prizes Live On”
https://www.quantamagazine.org/cash-for-math-the-erdos-prizes-live-on-20170605/

“Over the years it was kind of a habit he had to say, ‘Here’s a nice problem, I thought about it for a while, and I don’t see how to solve it. Maybe it’s a $25 problem or possibly a $100 problem,’” said Ronald Graham, a mathematician at the University of California, San Diego, and a longtime friend of Erd?s’s.
McChuck

September 21, 2018, 6:11 am

A little statistical humor at XKCD.
https://xkcd.com/2048/
Tim Simmons

September 22, 2018, 3:52 pm

These applied posts are great. Glad there are more in the way.

On the subject of finite predictive models, I extended the finite predictive model to multiple categories last summer. (Briggs, I could send you the calculations if you’re interested.) The really hard problem is working with “continuous” measurements, since it’s hard to say just what it is we assume about them that makes them different from categorical measurements.
Briggs

September 22, 2018, 6:33 pm

Tim,

Exactly so. And the answer is: there is no difference, they are all categorical. It’s just that some measures have lots more categories than others. Continuity is always an approximation. Be happy to see what you’ve done.
Pingback: How To Do Predictive Statistics: Part X: Survival Analysis – William M. Briggs

How To Do Predictive Statistics: Part VIII — Starting Stan

Related

5 Comments

Leave a Reply

Share this:

Related

5 Comments

Leave a Reply