Class - Applied Statistics

How To Do Predictive Statistics: Part V New (Free) Software Multinomial Regression

Previous post in the series (or click above on Class). REVIEW!

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.22! Only the example code changed since last time.

For an explanation of the theory behind all this, which is markedly different from classical views, get this book: Uncertainty.

Mandatory rant

We’ll use another built-in dataset, the Netherlands voting data. Accessed like this:

library(MCMCpack)  # provides MCMCmnl and the Nethvote data
data(Nethvote)
x = Nethvote

As before, assigning it to x is for simplicity's sake. Find out all about the data with ?Nethvote. Essentially, voters could pick one of several parties. And, supposing those parties last until the NEXT election, and given some demographic information, we want the probability that

     Pr(Y | new X, old X&Y, Model & Assumptions)

Which—surprise!—is identical to the probability we want in every predictive model! The emphasis on the NEXT election cannot be stressed too highly. Why? Glad you asked. Let me shout the answer:

There is NEVER a need to model what happened, only what might happen.

We do not need a probability model to tell us what we saw. We need only use our eyes. If we want to know whether more religious people voted for vote (i.e. party) = CDA, WE JUST LOOK. There is no need to do a “hypothesis test”, which is insane. Either more religious would have voted for CDA, or they wouldn’t have. AND THAT IS THAT.
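To see what “just look” means in code, here is a minimal sketch (using the x loaded above) that cross-tabulates vote by religion. Counting, not testing:

# Counts of each vote by religion (rows are relig = 0, 1)
table(x$relig, x$vote)

# The same, as proportions within each religion group
prop.table(table(x$relig, x$vote), margin = 1)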

The classical idea, frequentist or Bayes, p-value or Bayes factor, of ascertaining whether more religious “really” voted more frequently for CDA is nuts. About the future? Well, that’s what the model is for: to quantify the probability more religious will vote CDA, accepting as an assumption that religion is probative. It is our decision whether we choose religion as probative or not; two people looking at the same data, and even the same model, can come to different conclusions.

I shout, because these ideas are central to the (old, ancient) predictive approach. They are foreign to classical hypothesis-testing methods, which aim to uncover occult forces in data. We will have none of that. Our concern is only observables and measures.

On to the data!

The real beginning

Because of a limitation (out of my control) of MCMCmnl, we have to keep track of the model formula. So we call the method a little differently than when we did ordinary or logistic regression.

form = formula('vote ~ relig + class + income + educ + age * urban')

lv = levels(x[, as.character(form[[2]]) ])

fit = MCMCmnl(form, mcmc.method="IndMH", B0=0, mcmc=5000, thin=10, tune=0.5, baseline='D66', data=x)

Notice form is a standard R formula. This one was chosen to match the one in the native help file. Fool around with others. The object lv holds the levels of the “y” observable. It’s written in a generic way, so that it works with any data set. We could have, of course, just written lv = levels(x$vote), but that works only on data frames with vote as an outcome. Notice, too, that we can change the baseline. We don’t have to: it will default to R’s base level. We keep track of the levels because you’re allowed to change them, and MCMCmnl doesn’t save the model formula. Ah, well.
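If you want to peek at the bookkeeping, a quick check like this shows what form and lv hold (none of it is required for the model to run):

form[[2]]               # the response symbol in the formula: vote
as.character(form[[2]]) # "vote", used to pull the right column from x
lv                      # the party names, i.e. the levels of x$vote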

Predictions are somewhat different than before, too. We have to pass in the model formula and the levels of the y. We also need, as ever and as the core of the predictive method, a scenario. How about this one? Mixing code and output, and ignoring the ‘dist’ measures, which we don’t use.


x[1,]

  vote  distD66 distPvdA  distVVD distCDA relig class income educ age urban
1 PvdA 2.669695 2.335121 4.109881 6.45008     0     0      1    2   5     1

Then

p = MCMCmnl.pred(fit, form, x[1,], lv)
p

I get


> p
     D66      CDA     PvdA      VVD 
0.076580 0.067476 0.822900 0.033044 

So, given non-religious, class of 0, and so on, the probability a NEW voter will go D66 is about 8%. Your results will vary a bit, since as ever this is a numerical approximation. But they’ll be close. The most likely vote for NEW voters of this sort, at 82%, is PvdA, and the least likely is VVD at 3%. I don’t know Dutch politics, so I offer no opinions on what this means.

The idea, if it isn’t clear, is that you get a probability for each possible category, because why? Because that’s what we wanted!

The form and lv ensure everything is labeled correctly at the end. Pain in the keister. But as yet there are no wrappers for any of these methods to make things easier.
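Scenarios need not come from the old data, either. Here is a sketch with a made-up voter; the values below are hypothetical, and the unused ‘dist’ columns are simply carried along from row 1:

s = x[1, ]    # start from an old row so all the column names are right
s$relig = 1   # now suppose a religious voter
s$class = 0.5 # with a different (made-up) class value
MCMCmnl.pred(fit, form, s, lv)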

How about all the scenarios in the data? You bet:


p = MCMCmnl.pred(fit,form,x[1,],lv)
for(i in 1:nrow(x)){
  # this preserves the proper names for p's columns
  if(i>1) p=rbind(p,MCMCmnl.pred(fit,form,x[i,],lv))
}
p = as.data.frame(p, row.names=FALSE)

par(mfrow=c(2,2))
for (i in 1:4){
    plot(x$class,p[,i],main=names(p)[i], ylab='Pr(Vote|D,M)',col=x$relig+1)
}

Notice we stacked the answers one on top of the other, and turned p into a data.frame. The plot is for each category or level of vote, as a function of class (which really does have all those odd values; probably the output of some other model). For fun, I colored the points by religion yes/no.

This is only one possible plot of many. Other obvious ones will suggest themselves to you. Do them as homework.
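For instance, here is one such plot as a sketch: the same probabilities against income, colored this time by urban (assuming urban is a 0/1 indicator, as relig is):

par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(x$income, p[, i], main = names(p)[i],
       xlab = 'income', ylab = 'Pr(Vote|D,M)', col = x$urban + 1)
}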

Everything is more complex because the model itself is more complex. There isn’t any real or general way to make this easy, either. Nor should there be!

“But, Briggs, can’t I do an average probability for each class level, using all the old scenarios? That way I can tell the impact of ”

Sure you can. But why would you say impact when you meant influence? Second, it would be fooling yourself. Because your model included all those other things, you have to state probability only with regard to, and conditional on, all those other things. Otherwise you’re talking weird.

If you want to discuss only class, then build a model with only class.


form = formula('vote ~ class')
lv = levels(x[, as.character(form[[2]]) ])
fit = MCMCmnl(form, mcmc.method="IndMH", B0=0, mcmc=5000, thin=10, tune=0.5, baseline='D66',data=x)

Then you can say what you want about class considered only by itself. Or whatever.
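As a sketch, and assuming MCMCmnl.pred needs only the columns named in the formula, you could run the class-only predictions across every observed class value and plot the four probabilities as curves:

s = data.frame(class = sort(unique(x$class)))
p = MCMCmnl.pred(fit, form, s[1, , drop = FALSE], lv)
for (i in 2:nrow(s)) {
  p = rbind(p, MCMCmnl.pred(fit, form, s[i, , drop = FALSE], lv))
}
p = as.data.frame(p, row.names = FALSE)

matplot(s$class, p, type = 'l', lty = 1, col = 1:4,
        xlab = 'class', ylab = 'Pr(Vote|D,M)')
legend('topleft', legend = names(p), col = 1:4, lty = 1)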

The key lesson is that you specified a model with all those measures, so you can only speak of the model with all those measures. If you don’t want to speak of them, remain model-silent about them.

Mini rant

We are done with multinomial. But not really. It should be used in place of ordinary regression almost always. Why? Because all measures are discrete and finite, thus all Y are, thus all Y are better approximated by multinomials. As it is, all Y are approximated by continuity, which is an ENORMOUS assumption, and an untrue one. No measure can be continuous, and none infinite in actuality.

All data should be transformed into the units of decision. We talked about this before with regard to the CGPA data. If you are a dean only interested in counting numbers of students scoring 3 or higher in CGPA (or whatever), then you have naturally created an analysis where the Y is dichotomous. Or maybe you want under 3s, 3s and above, and then 4s (to be given special recognition, say). Then we have a trichotomy. Multinomial can handle this; ordinary regression cannot.
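A minimal sketch of the transformation, with a made-up vector of CGPAs (the breakpoints are the dean's decision, not the data's):

# Hypothetical CGPAs, bucketed into the dean's trichotomy:
# under 3, 3 to under 4, and 4 (special recognition)
cgpa = c(2.1, 2.876543, 2.876544, 3.2, 3.9, 4.0)
y = cut(cgpa, breaks = c(-Inf, 3, 4, Inf), right = FALSE,
        labels = c('under 3', '3 to under 4', '4'))
table(y)
# Note 2.876543 and 2.876544 land in the same bucket, as they should.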

Two people can have the same data and come to different conclusions about it, as happens all the time in real life. People have different decisions to make, and different consequences to face about those decisions. Therefore, every analysis, i.e. model, should be tailored to the decision at hand. Since every decision, like every measure, is discrete and finite in act, then so should be every model.

“But Briggs, if I quash the data into buckets like you say, then I lose information. I won’t know the difference, in this case, between a CGPA of 2.876543 and 2.876544. I’m losing power or whatever. Besides, I’ve heard discretizing data is bad.”

You heard wrong. I remind you that there is no difference between 2.876543 and 2.876544—not one bit! nor between 0 and 2, or 0 and 2.9—when any decision you make recognizes no difference between these CGPAs! If you are going to make different decisions, then you will have different buckets, and thus a different model, and different results.

This is not a bug, it is a feature. Just like the conditionality of all probability.

Next is Poisson regression.

Comments

  1. Briggs,

    Your enemies are at work again, only this time they’re removing data so as to compromise the semantic integrity of your code. In the section of code towards the end where you are discussing only the class of the voter:

    form = formula(‘vote ~ class) # <— !!!! missing a closing apostrophe !!!!
    lv = levels(x[, as.character(form[[2]]) ])
    fit = MCMCmnl(form, mcmc.method="IndMH", B0=0, mcmc=5000, thin=10, tune=0.5, baseline='D66',data=x)

    Those of us who valiantly copied and pasted that chunk of code get a lovely 'Error: unexpected symbol…' as the interpreter freaks out.

    Great series so far! I wish I knew R better so I could help contribute to writing the function wrappers; it's a bit tough for me to follow at times as R is not one of my strong languages and the setup work to get your code to evaluate properly given the constraints of the given MCMCpack package creates a bit more mental overhead for me. Great posts though! You should totally rework "Breaking the Law of Averages" or distill that to a shorter version that echoes all the fantastic points made in Uncertainty while also incorporating this new foray into explaining predictive statistics via R software.

  2. Matt’s “mini-rant” above created the classic embarrassed face-palm in me. Why, just the other day I was idly wondering why so much analysis assumes continuous variables, when, as Matt has pointed out over and over, in real life, there ain’t any. Or at least, almost none.

    So I felt especially stupid to hear Matt say that OF COURSE multinomial regression is to be preferred in almost all cases. I should have thought of that, a long time ago.

    I just don’t have the time or energy anymore (and both seem to be asymptotically dwindling by the day) even to re-install R on my box, let alone think all these things through.

    Obviously.

    But NOTE: Matt could not be clearer: [Multinomial] “should be used in place of ordinary regression almost always. ”

    Yes, “probability| X,Y,model, assumptions”; don’t “model” what already happened; predict stuff and then see how you did; make your model and assumptions as explicit as you can, etc.

    But that’s a lot to remember (and even harder to do). And as Matt knows, at least as a practical matter, nearly all ‘real’ statisticians disagree with some or all of that (even if in theory they admit that they should). Just for instance, I have yet to come across the statistics text that tells beginners: ALMOST ALWAYS, USE MULTINOMIAL REGRESSION. (or anything even remotely like that).

    So perhaps as a start, the student should remember that multinomial analysis is, almost always, very, very much to be preferred. That in itself is a very important point.

  3. If you are a dean only interested in counting numbers of students scoring 3 or higher in CGPA (or whatever), then you have naturally created an analysis where the Y is dichotomous….

    If you are interested in counting the number of students scoring 3 or more, why don’t you just count? A multinomial model doesn’t count any better than another regression model appropriate for a continuous response variable.

    A multinomial model gives you an estimate of the probability of each of the discrete values. To get the probability of 3 or higher, you will need to add the probabilities of those discrete values that are 3 or higher. Note that a discrete value, say, 3.75 that’s not observed in the data or tailored/modified away would have a probability of zero by theory, which is how those estimates are obtained. By theory.

    Again, running a multinomial model on the tempered/bucketed/modified/tailored data violates the original data structure. It is not about the silly difference between 2.876543 and 2.876544. For example, if a GPA of 2.55 is bucketed into a bucket of 2.5 or 3, the information that GPA can be 2.55 is wasted. An (ordered or not) multinomial model will produce a zero probability of GPA=2.55 after your data modification.

  4. We are done with multinomial. But not really. It should be used in place of ordinary regression almost always. Why? Because all measures are discrete and finite, thus all Y are, thus all Y are better approximated by multinomials.

    By running a multinomial model, you are estimating/approximating the probabilities of Y values (given…), not Y. So no, you don’t approximate Y by multinomials, better or not.
    https://www.merriam-webster.com/dictionary/multinomial

    Therefore, every analysis, i.e. model, should be tailored to the decision at hand. Since every decision, like every measure, is discrete and finite in act, then so should be every model.

    You are in fact suggesting to tailor/temper/modify/bucketize the data to fit the decision at hand. 

    Just like using alternative facts to fit one’s agenda?!

    What is a finite model? A model appropriate for a response variable that can only assume a finite number of values? Then the Poisson model is not finite.

    I do know that the data structure dictates how one models the data appropriately. Changing the data structure could also mean changing the underlying assumptions.
