## How To Do Predictive Statistics: Part V New (Free) Software Multinomial Regression

*Previous post in the series (or click above on Class). REVIEW!*

Download the code: `mcmc.pred.R`, `mcmc.pred.examples.R`. If you downloaded before, download again. This is version 0.22! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different than classical views, get this book: *Uncertainty*.

**Mandatory rant**

We’ll use another built-in dataset, the Netherlands voting data. Accessed like this:

```
data(Nethvote)
x = Nethvote
```

As before, assigning it to `x` is for simplicity’s sake. Find out all about the data with `?Nethvote`. Essentially, voters could pick one of several parties. Supposing those parties last until the NEXT election, and given some demographic information, we want the probability:

Pr(Y | new X, old X&Y, Model & Assumptions)

Which—surprise!—is identical to the same probability we want in every predictive model! The emphasis on the NEXT election cannot be stressed too highly. Why? Glad you asked. Let me shout the answer:

**There is NEVER a need to model what happened, only what might happen**.

We do not need a probability model to tell us what we saw. We need only use our eyes. If we want to know whether more religious people voted for vote (i.e. party) = CDA, WE JUST LOOK. There is no need to do a “hypothesis test”, which is insane. Either more religious people voted for CDA, or they didn’t. AND THAT IS IT.

The classical idea, frequentist or Bayes, p-value or Bayes factor, of ascertaining whether more religious “really” voted more frequently for CDA is nuts. About the future? Well, that’s what the model is for: to quantify the probability more religious will vote CDA *accepting as an assumption* religion is probative. It is *our decision* whether we choose religion as probative or not; two people looking at the same data, and even the same model, can come to *different conclusions*.

I shout, because these ideas are central to the (old, ancient) predictive approach. They are foreign to the hypothesis testing classical methods, which aim to uncover occult forces in data. We will have none of that. Our concern is only observables and measures.
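“Just looking” can be as plain as a table of counts. Here is a minimal sketch in base R. The helper `observed.shares` is hypothetical (it is not part of `mcmc.pred.R`), and the tiny data frame is made up so the snippet runs on its own; it is not the Nethvote data.

```r
# Hypothetical helper: observed share of each outcome within each group.
# No model and no test, only counts of what already happened.
observed.shares = function(d, by, outcome){
  counts = table(d[[by]], d[[outcome]], dnn = c(by, outcome))
  prop.table(counts, margin = 1)   # each row sums to 1
}

# Made-up toy data, for illustration only (NOT the Nethvote data)
toy = data.frame(relig = c(0, 0, 1, 1, 1, 0),
                 vote  = factor(c('PvdA', 'D66', 'CDA', 'CDA', 'VVD', 'PvdA')))
shares = observed.shares(toy, 'relig', 'vote')
shares

# With the data in this post, assuming x is loaded:
# observed.shares(x, 'relig', 'vote')
```

Whatever the table says is what happened. Nothing more is needed to describe the past.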

On to the data!

**The real beginning**

Because of a limitation (out of my control) of `MCMCmnl`, we have to keep track of the model formula. So we call the method a little differently than when we did ordinary or logistic regression.

```
form = formula('vote ~ relig + class + income + educ + age * urban')
lv = levels(x[, as.character(form[[2]]) ])
fit = MCMCmnl(form, mcmc.method="IndMH", B0=0, mcmc=5000, thin=10, tune=0.5, baseline='D66', data=x)
```

Notice `form` is a standard R formula. This one was chosen to match the one in the native help function. Fool around with others. The object `lv` holds the levels of the “y” observable. It’s written in a generic way, so that it works with any data set. We could have, of course, just written `lv = levels(x$vote)`, but that works only on data frames with vote as an outcome. Notice, too, that we can change the baseline. We don’t have to: it will default to the normal R base level. We keep track of the levels because you’re allowed to change them, and `MCMCmnl` doesn’t save the model formula. Ah, well.
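If `form[[2]]` looks cryptic: a two-sided R formula is stored as a call with three parts, and the second part is the left-hand side. A small base R sketch, using a shortened formula and a made-up three-row data frame (both are illustrations, not the model or data above):

```r
form = formula('vote ~ relig + class')
form[[1]]    # the `~` operator
form[[2]]    # the left-hand side: the symbol vote
form[[3]]    # the right-hand side: relig + class

# The name of the outcome as a plain string
yname = as.character(form[[2]])

# Made-up toy data, for illustration only
toy = data.frame(vote  = factor(c('PvdA', 'D66', 'CDA')),
                 relig = c(0, 1, 1),
                 class = c(0, 0.5, -0.3))

# Same generic trick as in the post: levels of whatever the outcome is called
lv = levels(toy[, yname])
lv
```

Change the outcome in the formula and the same line still picks up the right levels, which is the point of writing it generically.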

Predictions are somewhat different than before, too. We have to pass in the model formula and levels of the y. We also need, as ever and as core of the predictive method, a scenario. How about this one? Mixing code and output, and ignoring the ‘dist’ measures, which we don’t use.

```
> x[1,]
vote distD66 distPvdA distVVD distCDA relig class income educ age urban
PvdA 2.669695 2.335121 4.109881 6.45008 0 0 1 2 5 1
```

Then

```
p = MCMCmnl.pred(fit,form,x[1,],lv)
p
```

I get

```
> p
D66 CDA PvdA VVD
0.076580 0.067476 0.822900 0.033044
```

So, given non-religious, class of 0, and so on, the probability a NEW voter will go D66 is about 8%. Your results will vary a bit, since as ever this is a numerical approximation. But they’ll be close. The most likely vote for NEW voters of this sort is PvdA at 82%, and the least likely is VVD at 3%. I don’t know Dutch politics, so I offer no opinions on what this means.
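Why do results vary a bit? Predictions of this kind are typically built by averaging over simulated draws, and different draws give slightly different averages. The sketch below is a generic illustration of that averaging for a multinomial logit. It is an assumption about the general technique, not the code inside `MCMCmnl.pred`, and every number in it is invented.

```r
# Sketch: approximate Pr(Y = k | new X, old data, model) by averaging
# softmax probabilities over simulated draws.
# ASSUMPTION: the draws below are invented stand-ins, not output from a real fit.
set.seed(1)
S = 500   # number of simulated draws

# Linear predictor for ONE scenario, one column per non-baseline party
eta = cbind(CDA  = rnorm(S, -0.1, 0.2),
            PvdA = rnorm(S,  2.4, 0.2),
            VVD  = rnorm(S, -0.8, 0.2))
eta = cbind(D66 = 0, eta)   # the baseline party has linear predictor 0

# Softmax within each draw: every row becomes a set of probabilities
prob.draw = exp(eta) / rowSums(exp(eta))

# Average over draws: one probability per party for this scenario
p.hat = colMeans(prob.draw)
p.hat
```

Change the seed and `p.hat` moves a little, for the same reason your numbers will not match the ones above to the last decimal.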

The idea, if it isn’t clear, is that you get a probability for each possible category, because why? Because that’s what we wanted!
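A quick sanity check on any such prediction: the categories exhaust the possibilities, so the probabilities should sum to one. Using the numbers printed above, typed in by hand (your own run will differ slightly):

```r
# The prediction reported above, entered by hand
p = c(D66 = 0.076580, CDA = 0.067476, PvdA = 0.822900, VVD = 0.033044)

sum(p)                       # should be 1, up to rounding
names(which.max(p))          # most likely party for this scenario
sort(p, decreasing = TRUE)   # parties from most to least likely
```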

The `form` and `lv` ensure everything is labeled correctly at the end. Pain in the keister. But as yet there are no wrappers for any of these methods to make things easier.

How about all the scenarios in the data? You bet:

```
p = MCMCmnl.pred(fit,form,x[1,],lv)
for(i in 1:nrow(x)){
  # this preserves the proper names for p's columns
  if(i>1) p=rbind(p,MCMCmnl.pred(fit,form,x[i,],lv))
}
p = as.data.frame(p, row.names=FALSE)
par(mfrow=c(2,2))
for (i in 1:4){
  plot(x$class,p[,i],main=names(p)[i], ylab='Pr(Vote|D,M)',col=x$relig+1)
}
```

Notice we stacked the answers one on top of the other, and turned `p` into a data.frame. The plot is for each category or level of vote, as a function of class (which really does have all those odd values; probably the output of some other model). For fun, I colored the points by religion yes/no.
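The stacking loop can also be written without the special case for the first row. A sketch, assuming only that the prediction function returns a named vector (or one-row matrix) of probabilities. The helper `stack.pred` and the stand-in `fake.pred` are hypothetical names introduced here so the snippet runs by itself; neither is in `mcmc.pred.R`.

```r
# Hypothetical helper: call a one-row prediction function on every row
# of newdata and stack the results into a data frame.
stack.pred = function(pred.fun, newdata){
  rows = lapply(seq_len(nrow(newdata)),
                function(i) pred.fun(newdata[i, , drop = FALSE]))
  out = as.data.frame(do.call(rbind, rows))
  rownames(out) = NULL   # plain 1, 2, 3, ... row labels
  out
}

# Stand-in predictor with fixed, invented probabilities (illustration only)
fake.pred = function(z) c(D66 = 0.1, CDA = 0.2, PvdA = 0.6, VVD = 0.1)
toy = data.frame(class = c(-0.5, 0, 0.5), relig = c(0, 1, 1))
p.toy = stack.pred(fake.pred, toy)
p.toy

# With the objects from this post (assumes fit, form, lv and x exist):
# p = stack.pred(function(z) MCMCmnl.pred(fit, form, z, lv), x)
```

The column names come from the first prediction, just as in the loop, so the plotting code works unchanged on the result.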

This is only one possible plot of many. Other obvious ones will suggest themselves to you. Do them as homework.

Everything is more complex because the model itself is more complex. There isn’t any real or general way to make this easy, either. Nor should there be!

“But, Briggs, can’t I do an average probability for each class level, using all the old scenarios? That way I can tell the impact of ”

Sure you can. But, first, why would you say *impact* when you meant *influence*? Second, it would be fooling yourself. Because your model included all those other things, you have to state probability *only with regard to and conditional on all those other things*. Otherwise you’re talking weird.

If you want to discuss only class, then build a model with only class.

```
form = formula('vote ~ class')
lv = levels(x[, as.character(form[[2]]) ])
fit = MCMCmnl(form, mcmc.method="IndMH", B0=0, mcmc=5000, thin=10, tune=0.5, baseline='D66',data=x)
```

Then you can say what you want about class considered only by itself. Or whatever.

The key lesson is that *you* specified a model with all those measures, so you can only speak of the model *with all those measures.* If you don’t want to speak of them, remain model-silent of them.

**Mini rant**

We are done with multinomial. But not really. It should be used in place of ordinary regression almost always. Why? Because *all* measures are discrete and finite, thus all Y are, thus all Y are better approximated by multinomials. Now, all Y are *approximated* by continuity, which is an ENORMOUS assumption, and untrue. No measure can be continuous, and none infinite in actuality.

All data should be transformed into the units of decision. We talked about this before with regard to CGPA data. If you are a dean only interested in counting numbers of students scoring 3 or higher in CGPA (or whatever), then you have naturally created an analysis where the Y is dichotomous. Or maybe you want 3 or above, which naturally implies under 3s are of interest, and then 4s (to be given special recognition, say). Then we have a trichotomy. Multinomial can handle this, ordinary regression cannot.

Two people can have the same data and come to different conclusions about it, as happens all the time in real life. People have different decisions to make, and different consequences to face about those decisions. Therefore, *every* analysis, i.e. model, should be tailored to the decision at hand. Since *every* decision, like *every* measure, is discrete and finite in act, then so should be *every* model.

“But Briggs, if I quash the data into buckets like you say, then I lose information. I won’t know the difference, in this case, between a CGPA of 2.876543 and 2.876544. I’m losing power or whatever. Besides, I’ve heard discretizing data is bad.”

You heard wrong. I remind you that *there is no difference between 2.876543 and 2.876544—not one bit! nor between 0 and 2, or 0 and 2.9—when* any *decision you make recognizes no difference between these CGPAs!* If you are going to make *different decisions*, then you will have different buckets, and thus a different model, and different results.
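Bucketing by decision takes one line of base R. A sketch with made-up CGPA values (illustration only, not the CGPA data from the earlier post); the cut points are the dean’s from above.

```r
# Made-up CGPA values, for illustration only
cgpa = c(0, 2, 2.876543, 2.876544, 2.9, 3.2, 4)

# Decision 1: only "3 or higher" matters, so Y is dichotomous
y2 = cgpa >= 3

# Decision 2: under 3, 3 up to (not including) 4, and 4s: a trichotomy
y3 = cut(cgpa, breaks = c(-Inf, 3, 4, Inf), right = FALSE,
         labels = c('under 3', '3 to under 4', '4'))

table(y2)
table(y3)

# 2.876543 and 2.876544 land in the same bucket under either decision
y3[cgpa == 2.876543] == y3[cgpa == 2.876544]
```

Change the decision and you change the `breaks`, and with them the Y you model.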

This is not a bug, it is a feature. Just like the conditionality of all probability.

Next is Poisson regression.