
Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

February 7, 2018 | 1 Comment

Free Probability-Statistics Class: Predictive Case Study 1, Part XI

Review!

We left off with comparing the standard, out-of-the-box linear regression with our multinomial predictive observable model. The great weaknesses of the regression were probability leakage (giving positive probability to impossible values) and that the normal gives densities and not probabilities. We can’t fix the leakage (it’s a built-in shortcoming of the regression), but we can generate probabilities.

Now the densities are predictions from the normal regression model, but they are not in a form that can be used. In order to create probabilities from densities, we need to make a decision. The densities are of course easily transformed into cumulative distributions, which are probabilities, but they will give positive probabilities to an infinity of results (all numbers along the continuum). We only care about our decision points, which for our fictional Dean are the five values 0, 1, 2, 3, 4.

The decision we need to make is how to cut the infinity into 5 blocks. There is of course no unique way to do this. But it may be reasonable to cut the values at the midpoints of the five values. For example, given the regression, the decision probability of having a CGPA = 0 would be the regression probability of being between 0 and 0.5. For 1, it would be 0.5 to 1.5, and so on. That’s easy to do.

Mixing code and output in the obvious way:

d = as.numeric(levels(x$cgpa))     # the five decision points: 0 1 2 3 4
caps = d[-length(d)] + diff(d)/2   # midpoints between adjacent points
caps = c(min(d), caps, max(d))     # close the ends at the min and max

# output
> caps
[1] 0.0 0.5 1.5 2.5 3.5 4.0

I used caps because R’s native cut() function is inconvenient here (it outputs factors, and we want numbers). Let’s next recreate the picture of the predictions for good and bad students. Then we’ll add on top of it the regression probabilities.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')$p # our multinomial model

plot(levels(x$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
  lines(levels(x$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
  grid()
  legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2) 
 
s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])
 
m = s1$central - s2$central  # difference in centers (not used below)
s = sqrt(s1$scale^2 + s2$scale^2)
w = seq(min(s1$central, s2$central) - 3*s, max(s1$central, s2$central) + 3*s, length.out = 100)
# overlay the predictive DENSITIES, NOT PROBABILITIES; the .7 is only visual scaling
lines(w, .7*dt((w - s1$central)/s1$scale, s1$df)/s1$scale, lty=1, lwd=2)
lines(w, .7*dt((w - s2$central)/s2$scale, s2$df)/s2$scale, lty=2, col=2, lwd=2)

# regression probabilities at the caps
p = matrix(0,2,length(levels(x$cgpa))) # storage
for (i in 1:2){
     b=obs.glm(fit.n,y[i,])
     d=pnorm(caps,b$central,b$scale) # predictive probabilities at each cap (normal approximation of the predictive t)
     p[i,] = diff(d) # the five block probabilities; d[1] is Pr(CGPA < 0) and 1-d[length(d)] is Pr(CGPA > 4): the leakage
     lines(as.numeric(levels(x$cgpa))+.05,p[i,],col=i+2,type='h',lty=i,lwd=3)
}

Green solid is for the bad student, regression model; blue dashed is for the good student, regression model. You can see the spikes follow, more or less, the densities. The black solid is our original multinomial model for the bad student, and the red dashed the good student.

Which model is better? There is no way to tell, not with this plot. They are simply two different predictions from two different models. The only way to gauge goodness is to—all together now—wait for new data which has never been seen or used in any way, and then compare both models’ predictions against what happened.

Nevertheless, we can gather clues and build another, different predictive model, which predicts how well our original models will perform. Be sure you understand the distinction! We have an original predictive model. Then we create a model of how well this original predictive model will perform. These are not the same models! This right here is another major departure of the predictive method from its classical counterparts.

Let’s first look at the regression predictions a little more closely (mixing code and output):

> p
             [,1]        [,2]       [,3]        [,4]         [,5]
[1,] 2.152882e-01 0.568404758 0.12265228 0.002511169 4.024610e-06
[2,] 1.109019e-06 0.001036698 0.07554629 0.511387677 2.646423e-01

> rowSums(p)
[1] 0.9088604 0.8526141

The first row is the bad student (low SAT and HGPA) at each of the five CGPAs (here labeled by their matrix notation), and the second the good student (high SAT and HGPA). The rowSums(p) gives the total probability of all possibilities. This should be 1 (right?). It isn’t; it is less, and that is because of probability leakage.

You can see that leakage is conditional on the assumptions, just like probability. Leakage isn’t constant. (It also depends on how we define our caps/cut points.) We earlier guessed, looking only at the densities, that it wasn’t that bad. But it turns out to be pretty large. We’re missing about 10% probability from the first prediction, and about 15% from the second. These are the probabilities for impossible events.
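We can also compute the leakage directly. A quick sketch of my own (not in the class code), reusing p, caps, y, and obs.glm from above; the two leakage values follow from the row sums already printed:

leak = 1 - rowSums(p) # probability given to impossible CGPAs

# output
> leak
[1] 0.0911396 0.1473859

# split the leakage into its two tails, below 0 and above 4
for (i in 1:2){
     b = obs.glm(fit.n, y[i,])
     d = pnorm(caps, b$central, b$scale)
     cat('student', i, ': below 0 =', d[1], '; above 4 =', 1 - d[length(d)], '\n')
}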

The leakage does not mean the model is useless, but it is evidence the model is suboptimal. It also does not mean our multinomial model is better, but at least our multinomial model does not have leakage. The multinomial model, for instance, can give large probabilities to events that did not happen, while the leaky regression model gives decent probabilities to what happened. We’ll have to see.

And that’s what we’ll do next time. It’s too much to try and begin formal verification this week.

Homework Assume the Dean wants to do CGPAs at 0, 0.5, …, 3.5, and 4. Rerun the code from the beginning, and end with the plot seen above, complete with the regression model using the obvious cut points. Notice anything different?
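A hint to get started: only the cut points change. Something like this, once CGPA has been re-cut into the nine buckets:

d = seq(0, 4, 0.5)                 # the Dean's nine decision points
caps = d[-length(d)] + diff(d)/2   # midpoints, as before
caps = c(min(d), caps, max(d))

The rest follows the same pattern.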

January 31, 2018 | 4 Comments

Free Probability-Statistics Class: Predictive Case Study 1, Part X

Review!

Last time we created four models of CGPA. Which is correct? They all are. Why? I should ask as a homework question, but I’ll remind us here. Since all probability is conditional, and these are all ad hoc models, all are correct, given the assumptions. Which one is best? Depends on the decisions you want to make and on what you mean by “best.” Let’s see.

We discovered last time that by itself HGPA was relevant to CGPA, but not by much. SAT was also by itself relevant. The contour plots (which I have decided not to redo, since we did them last week) showed that SAT and HGPA when considered together are also relevant. We also created the “null” model (remember our terminology does not match the classical usage) which only used past data (and grading rules, etc.). We now have to see how useful each of these models is. (If you can’t remember what all these terms mean — review!)

In one sense, we cannot do too much more because we have the models and have made predictions using them. That was our goal, remember? Now we have to sit back and wait for new values of HGPA/SAT and CGPA to come in. Then we can see how each of these models’ predictions match reality. This is the True Test.

Here, for instance, is an excerpt of predictions of the full model (HGPA/SAT) (I’m assuming we left off just where we were last time, so that y and a are still in R’s memory; if not, go over last week’s code):

> cbind(y[,1:2],round(a,2))
    sat hgpa    0    1    2    3    4
1   400    0 0.55 0.26 0.19 0.00 0.00
2   500    0 0.52 0.27 0.21 0.01 0.00
3   600    0 0.47 0.28 0.24 0.01 0.00
4   700    0 0.42 0.30 0.28 0.01 0.00
5   800    0 0.35 0.32 0.32 0.01 0.00
6   900    0 0.28 0.33 0.37 0.01 0.00
7  1000    0 0.22 0.35 0.41 0.02 0.00
8  1100    0 0.16 0.36 0.45 0.03 0.01
9  1200    0 0.11 0.36 0.48 0.04 0.01
10 1300    0 0.08 0.34 0.51 0.06 0.02
11 1400    0 0.05 0.31 0.52 0.08 0.03
12 1500    0 0.04 0.28 0.53 0.11 0.04
13  400    1 0.52 0.24 0.23 0.01 0.00
14  500    1 0.46 0.25 0.27 0.01 0.00
15  600    1 0.40 0.27 0.32 0.01 0.00
16  700    1 0.32 0.29 0.37 0.02 0.00
...
57 1200    4 0.00 0.04 0.22 0.58 0.15
58 1300    4 0.00 0.04 0.18 0.63 0.15
59 1400    4 0.00 0.04 0.14 0.66 0.15
60 1500    4 0.00 0.05 0.12 0.69 0.14

We could publish this, or the whole table, and we’d be done! Anybody could take these predictions and implement them. They wouldn’t have to know the details of your model, or of your original data. There is your bold theory, exposed for the world to see! That, after all, is how science should work.

Of course, all the shortcomings of your model will be obvious to anybody who tries to use it, too. Which, again, is just how it should be.

Make sure you understand what we’ve done so far. If you are the Dean and want to classify students into one of five CGPA buckets, we have a predictive model accounting for HGPA and SAT. But suppose you didn’t want to account for HGPA. Well, we have a model of just SAT: use that. And so on. The breakdowns of SAT (every 100) and HGPA (every 1) we used were also geared to the decision. Change the decision, change the breakdown, change the model.

In any case, this is it! This is what we wanted. We wanted to know, given the grading rules and old obs, and values of SAT and HGPA, what is the probability of having CGPA in one of the buckets. And this is what we did! We are done. All those people who wanted practical examples of the predictive way, well, here you go. In the end, it’s pretty simple, isn’t it?

But we can do two more things. (1) We can compare our predictive model (perhaps varying it here and there) with old-fashioned NHST/parameter-estimating models, and (2) we can create a new model that predicts the performance of our current model. Number (2) is the really important step, but we won’t get to it today. Let’s do (1).

What model would most people use in this situation? A linear regression. Here it is (mixing code and output again). The cgpa.o was the original CGPA, not classified into buckets. It is the raw score (the “o” is for original).

fit.n= glm(cgpa.o ~ sat + hgpa , data=x) # note the cgpa.o! which is the original data
summary(fit.n)

Call:
glm(formula = cgpa.o ~ sat + hgpa, data = x)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.12153  -0.44120   0.00954   0.38198   1.80356  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0881312  0.2866638  -0.307 0.759169    
sat          0.0012167  0.0003011   4.041 0.000107 ***
hgpa         0.4071133  0.0905946   4.494 1.94e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Wee p-values galore! The asterisks give the glory. Feel it. So, the classical statistician would say, SAT and HGPA are “linked to” CGPA. Researchers will speak of “statistical significance” and say “SAT drives high college grade point” and so on. How much does SAT influence (they’ll say “impact”) CGPA? Well, they might say for every increase in SAT by 1, CGPA goes up on average 0.0012 points. And so on.

Not much more would be done with this model, especially since everything is “significant”. Maybe the modeler throws in an interaction. Whatever. No matter what, this model exaggerates the evidence we have, and is in substantial error, even though it doesn’t look like it. Here’s why.

This model implies a prediction: all models imply predictions, even though they are not routinely made. It’s written in classical form, but the prediction is there, hidden away. Let’s look at it. We do so by integrating out the parameters, picking a “flat” prior, which, for this model anyway, gives us the exact same results for the parameters as the frequentist estimates.

Recall fit is our multinomial (like) model, including both SAT and HGPA. Let’s pick two hypothetical students, a good one and a bad one, and compare our predictive model with the prediction based on the ordinary regression. Before running this, first re-download the class code, which has been updated to include the code which calculates probability predictions from normal regression models. This is the obs.glm, which outputs a central, spread, and degrees-of-freedom parameter for the predictive distribution of the regression (this turns out to be a non-central T).
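For the curious, here is a minimal sketch of what such a flat-prior predictive calculation involves. This is my guess at the internals, not a copy of obs.glm; treat briggs.class.R as authoritative.

# sketch: flat-prior predictive for a normal regression (assumed internals, not obs.glm itself)
obs.glm.sketch = function(fit, newdata){
  X  = model.matrix(fit)                                   # design matrix of the old obs
  x0 = model.matrix(delete.response(terms(fit)), newdata)  # the new measures
  df = fit$df.residual                                     # degrees of freedom: n - p
  s2 = sum(residuals(fit)^2)/df                            # residual variance estimate
  central = as.numeric(x0 %*% coef(fit))                   # center of the predictive t
  scale = sqrt(s2 * (1 + x0 %*% solve(t(X) %*% X) %*% t(x0))) # predictive scale
  list(central = central, scale = as.numeric(scale), df = df)
}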

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')$p # our multinomial model

plot(levels(x$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
  lines(levels(x$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
  grid()
  legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2) 
 
s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])
 
m = s1$central - s2$central  # difference in centers (not used below)
s = sqrt(s1$scale^2 + s2$scale^2)
w = seq(min(s1$central, s2$central) - 3*s, max(s1$central, s2$central) + 3*s, length.out = 100)
# overlay the predictive DENSITIES, NOT PROBABILITIES; the .7 is only visual scaling
lines(w, .7*dt((w - s1$central)/s1$scale, s1$df)/s1$scale, lty=1, lwd=2)
lines(w, .7*dt((w - s2$central)/s2$scale, s2$df)/s2$scale, lty=2, col=2, lwd=2)

Our multinomial-like model gives the spikes; the regression densities, which are not probabilities, are the curves. We’ll fix the densities into probabilities later. But it’s densities, because the normal lives on the continuum, and so does the predictive distribution from the normal (the t). Notice anything odd? The regression gives probabilities to impossible values, i.e. CGPA less than 0 and greater than 4. I call this probability leakage.

It’s not terrible here, but it does exist. It means the model is predicting impossibilities. The standard regression model is at the least inefficient. Interestingly, this leakage model cannot be falsified. It gives positive probability to events we’ll never see, but it never gives 0 probability anywhere! Falsifiability isn’t that interesting.
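If you want to put a rough number on the leakage now, a quick sketch of my own using the predictive t parameters already computed (s1 and s2 from the code above):

# predictive probability outside the possible interval [0, 4]
for (s in list(s1, s2)){
     below = pt((0 - s$central)/s$scale, s$df)      # Pr(CGPA < 0 | ...)
     above = 1 - pt((4 - s$central)/s$scale, s$df)  # Pr(CGPA > 4 | ...)
     print(below + above)
}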

That’s enough for this time. Next time, we turn the densities into probabilities, and make modifications to our multinomial model.

Homework Try the normal predictions for the singular models with SAT and HGPA alone, and for the “null” model, which is had by glm(cgpa.o ~ 1 , data=x). Which has the most leakage? Are there huge differences?
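To get started, the fits themselves are one-liners (the names are my own):

fit.n.sat  = glm(cgpa.o ~ sat , data=x)  # SAT alone
fit.n.hgpa = glm(cgpa.o ~ hgpa, data=x)  # HGPA alone
fit.n.0    = glm(cgpa.o ~ 1   , data=x)  # the "null" regression
# feed each to obs.glm as above and see how much probability falls outside [0, 4]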

Because you asked You can now download briggs.homework.R, which contains all the code used in the lectures. Note that this is different than briggs.class.R, which can be treated like black-box helper code.

January 26, 2018 | 7 Comments

Pew’s New Survey on Religious Groups’ Views on Abortion

Stream: Pew’s New Survey on Religious Groups’ Views on Abortion

According to a new survey by Pew Research, Unitarian Universalists are the strongest proponents of legal abortion, and Jehovah’s Witnesses the least supportive.

About 90% of Unitarians support abortion, beating out even atheists at 87%, though the difference is within the margin of predictive error of the 2017 survey. The variation in spiritual beliefs between many Unitarians and atheists is also not large.

Rounding out the top five most supportive groups are agnostics (87%), Jews (83%), and Buddhists (82%).

The top five least supportive groups are Jehovah’s Witnesses (18%), Assemblies of God (26%), Church of the Nazarene (27%), Mormons (27%), and Church of God (29%).

Except for Jehovah’s Witnesses, support for abortion in every group is thus at least about 1 in 4; it averages 57% across all Americans.

Mainline Protestants in the Episcopal Church, United Church of Christ, Presbyterian Church (USA), Evangelical Lutheran Church in America, and the United Methodist Church all admit to favoring abortion at rates greater than the national average.

Contrast this with traditional Evangelical churches, such as the Southern Baptist Convention and the Churches of Christ, along with those already mentioned, all of which show less support than the average.

Members of the Roman Catholic Church support legal abortion at 48%.

Official positions

Pew has also compiled the official teachings on abortion from each of the religious groups. It is of interest to contrast these with what members of each group profess.

That nearly half of Catholics support legal abortion is most remarkable because the Catholic Church, alone among these groups, opposes abortion in all circumstances. The correct number of supporters, assuming parishioners are attentive to the Church’s formal rules, should be zero. The enormous discrepancy in practice can only be the result of a “law on the books” Catholic priests and bishops are largely happy to ignore.


All In

Finally come those groups who endorse legal abortion in all or most cases. Most mainline Protestant churches support abortion, giving only small to tepid caveats and warning the women who want to kill their offspring to first consider the matter deeply and seriously.

The Episcopal Church is a fair representative. It says “All human life is sacred from its inception until death”, and it admits the killing has a “tragic dimension”. But then it allows that abortion “should be used only in extreme situations.” This leaves its members plenty of leeway in deciding what “extreme” means. As long as “individual conscience is respected”, the Episcopal Church says go ahead.


Three out of four surveyed would click here and read the rest.

January 23, 2018 | 4 Comments

Free Probability-Statistics Class: Predictive Case Study 1, Part IX

Review!

Last time we successfully computed our model (8):

    (8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,4.

The model we selected was an ad hoc multinomial probit regression, with default prior selected by the MNP package. If we’re happy with that model, we’re done except for exploring other scenarios of SAT and HGPA. We can easily generate all possible combinations of decisionable values.

w1 = seq(400,1500,100) # sat 400 - 1500, every 100
w2 = seq(0,4,1) # hgpa 0 - 4, every 1
y = as.data.frame(expand.grid(w1,w2)) # R loves data frames
y$cgpa = "4" # this automatically creates the right number of cgpas
names(y) = c('sat','hgpa','cgpa') # these have to match spelling as in x!

a=predict(fit, newdata = y, type='prob')$p # takes a couple of seconds to run; assumes fit is in memory!

I’ve decided the decisionable values of SAT and HGPA would be every 100 and 1, but you could easily change these in the seq() function. There’s no easy way to visualize this, since we have 2 measurements taking 60 combinations (at values I chose) in 5 dimensions (CGPA = 0-4). But we can do things like this:

contour(w2,w1,matrix(a[,3],length(w2),length(w1),byrow=TRUE), xlab='HGPA',ylab='SAT', main='Pr(CGPA=2|E)')

# remember a[,1] Pr(CGPA=0|evidence), a[,2] Pr(CGPA=1|evidence), etc.

# Or for something fancier and colorful, try this
#install.packages('plotly', dependencies=TRUE) # if you don't have it; downloads many packages!
library(plotly)

plot_ly(x=w2,y=w1,z=matrix(a[,3],length(w1),length(w2)),type='contour') %>%
  layout(title = 'Pr(CGPA=2|E)',
         xaxis = list(title = 'HGPA'),
         yaxis = list(title = 'SAT'))

Note: Cut and paste everything from plot_ly on down.

These plot the predicted probabilities of CGPA as contours by the values of SAT and HGPA. As with everything, whether these plots are of any value depends on the decisions you will make. They are probably way too specific, especially given the evidence the numerical approximations are only “close enough.” Our knowledge of that numerical approximation is part of our model; I mean, part of the right hand side of (8). Keep that in mind!

One decision is whether SAT or HGPA are relevant, in the presence of the other. If you cannot recall the definition of relevant — review! Let’s check!

fit.null <- mnp(cgpa ~ 1, data=x, burnin = 1000, n.draws=50000)
fit.a <- mnp(cgpa ~ sat , data=x, burnin = 2000,n.draws=50000)
fit.b <- mnp(cgpa ~ hgpa, data=x, burnin = 1000, n.draws=50000)

The fit.null is the model (5) we did many lessons ago, or the form of it as a probit model with all the numerical approximation baggage. It is simply the predictive model not taking into account anything but past observations, grading rules, and the model and approximations. SAT and HGPA should be relevant, or else fit.null is good enough.

# we'll reuse the same y as above; measurements not used in the model aren't used in the prediction
a.null=predict(fit.null, newdata = y, type='prob')$p
a.a=predict(fit.a, newdata = y, type='prob')$p
a.b=predict(fit.b, newdata = y, type='prob')$p

Recalling y is of length 60, here are the first 10 results:

> a.null
               0         1         2         3          4
 [1,] 0.05869388 0.1568980 0.5764082 0.1632245 0.04477551
 [2,] 0.06053061 0.1583878 0.5749388 0.1609796 0.04516327
 [3,] 0.05902041 0.1564694 0.5805918 0.1601020 0.04381633
 [4,] 0.06016327 0.1583469 0.5750408 0.1605510 0.04589796
 [5,] 0.06004082 0.1562857 0.5766327 0.1621020 0.04493878
 [6,] 0.05904082 0.1574694 0.5773265 0.1610612 0.04510204
 [7,] 0.05961224 0.1603469 0.5763673 0.1597347 0.04393878
 [8,] 0.05844898 0.1585918 0.5771224 0.1618367 0.04400000
 [9,] 0.06110204 0.1580816 0.5772041 0.1587347 0.04487755
[10,] 0.06006122 0.1548571 0.5791224 0.1607755 0.04518367

This shows you the limits of the numerical approximation technique. All of the numbers in each column should be the same, since this is the predicted probability of CGPA unconditioned on SAT and HGPA. They're not wildly variable in the context of most decisions I can think of, but the variability is there. We could average each column and reduce some of the approximation error (why this is so I won't prove, but it should be intuitive enough). Mixing code and output:

a.null = colMeans(a.null)

> a.null
         0          1          2          3          4 
0.05942313 0.15791837 0.57627857 0.16167925 0.04470068 
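If the intuition doesn’t come, here is a toy demonstration of my own: the spread of an average of K noisy estimates shrinks like 1/sqrt(K).

# toy: 1000 sets of 10 noisy estimates of a probability near 0.576
est = matrix(rnorm(1000*10, 0.576, 0.002), 1000, 10)
sd(est[,1])       # spread of a single estimate: about 0.002
sd(rowMeans(est)) # spread of the average of 10: about 0.002/sqrt(10)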

So, according to (8a), the probability of CGPA = 2 is highest at 58%. Column 1 of y repeats the sequence w1 length(w2) times. It does this because we were lazy and didn't want to create a new y for every model, content to use the original y with all possible (decisionable) combinations of SAT and HGPA. If you look at each of the blocks of SAT from 400 to 1500, you'll again see variability where none should ideally be. We can average the blocks to reduce variability if we want to be clever. This is getting tedious because we could have created new y, but on the other hand, we would not have seen the variability of the numerical approximation. Thus our laziness has given us (theoretically, anyway) superior guesses.

# hack hack hack
# hack, but it should work whatever levels cgpa has, and whatever breakdowns of w1 and w2
# hack hack hack

b.1 = matrix(1:dim(y)[1],length(w1),length(levels(x$cgpa)))
b.2 = b.1 *0
# each consecutive block of length(w1) rows of a.a holds every SAT at one fixed HGPA;
# summing the blocks and dividing averages over HGPA
for(i in 1:length(w2)){
  b.2 = b.2 + a.a[((i-1)*length(w1)+1):((i-1)*length(w1)+length(w1)), ]
}
b.2 = b.2/length(w2)
a.a = b.2 # replacement happens here!

b.1 = matrix(1:dim(y)[1],length(w2),length(levels(x$cgpa)))
b.2 = b.1 *0
# rows sharing an HGPA value sit in consecutive blocks of length(w1) (SAT varies
# fastest in expand.grid), so average within each block, one block per HGPA level
for(i in 1:length(w2)){
  b.2[i,] = colMeans(a.b[((i-1)*length(w1)+1):((i-1)*length(w1)+length(w1)), ])
}
a.b = b.2 # replacement happens here!

We use length(levels(x$cgpa)) so often that it would be better to create a variable for it and use it in all the code. I didn't. You do it as homework. As I said at the outset, this code is not optimized. It is written to be plain and simple to follow. Now let's check relevance of SAT!

plot(w1,a.a[,1],type='b',col = 1, xlab='SAT', ylab='Pr')
for (i in 2:length(levels(x$cgpa))){
 lines(w1,a.a[,i],type='b',col = i)
}
points(rep(w1[1],length(levels(x$cgpa))), a.null,pch=15, col=1:length(levels(x$cgpa)),cex=2 )
legend('topright',levels(x$cgpa),lty=1,col=1:length(levels(x$cgpa)), bty='n')

This shows the predicted probabilities of each level of CGPA for the decisionable levels of SAT we picked. It also shows, at the far left in colored boxes corresponding to the colored lines, the predicted probabilities of the null model. The box probabilities differ from the line probabilities, so SAT is relevant. It is also in the causal chain, as we decided earlier. So this model, call it (8b), is different from (8a), the null model. (Note I do not use "null" in quite the same way as classical statistics.) The last check is whether the probabilities are different enough that our decisions would change. I say they are, because I am the decision maker. You might say, with your decisions in mind, the differences are not big enough. There is no universal magic number!

Now HGPA:

plot(w2,a.b[,1],type='b',col = 1, xlab='HGPA', ylab='Pr',ylim=c(0,.6))
for (i in 2:length(levels(x$cgpa))){
  lines(w2,a.b[,i],type='b',col = i)
}
points(rep(w2[1],length(levels(x$cgpa))), a.null,pch=15, col=1:length(levels(x$cgpa)),cex=2 )
legend('topright',levels(x$cgpa),lty=1,col=1:length(levels(x$cgpa)), bty='n')

Notice I altered the default y-axis limits. The line probabilities are different from the boxed ones. So HGPA is relevant. But they aren't terribly different. Regardless of the value of HGPA the probabilities for CGPA are mostly flat. The only "big" discrepancy (where "big" relates to a decision) is the probability of CGPA = 2. That difference alone is decision-worthy, to coin a phrase. But it makes me wonder how useful HGPA is with SAT in the model. Back to the contour plots!

Which we'll do next time.

Homework See if you can figure out how to make the 2-D relevance plots; one possible starting point is sketched below. Remember the one above is only at one level of CGPA. Is HGPA still relevant? Yes. But would you, putting yourself in the mind of a Dean, find it useful for decisions?
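As a sketch (a.null here is the averaged vector from above):

# contour how much the full model's Pr(CGPA=2|E) differs from the null's
contour(w2, w1, matrix(a[,3] - a.null[3], length(w2), length(w1), byrow=TRUE),
        xlab='HGPA', ylab='SAT', main='Pr(CGPA=2|full) - Pr(CGPA=2|null)')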