# Category: Class – Applied Statistics

March 6, 2018 | 35 Comments

## The Solution To The Doomsday Argument

Today, a classic re-post which deserves to be far better known.

The Doomsday Argument! No, not global warming. The one that predicts the total number of humans who will ever live. It’s also called the Carter catastrophe; the same Carter famous for the anthropic principle. Here’s the Wiki article (HT to reader Nate West).

To solve this problem, the only rule we need is this: All probability is conditional—and conditional only on the information provided. The idea is that you’re born, you notice your birth, and you reason that your place in the order of all human births is nothing special. From that, can we conclude how many more of us we expect? This situation is analogous, at first, to balls in a bag.

Our evidence is X = “There are N balls labeled 1 through N in a bag, from which only one will be removed.” The probability of Y = “The ball will have label j, where j is from 1 to N inclusive” is 1/N, via the statistical syllogism. We deduce via the language used that N is finite (no bag can hold an infinite amount of any real thing).

Reach into the bag and pull out the ball B. It will have a label; call it B = j. Our evidence is now augmented: we have in toto X’ = “X and The ball has label j”. What can we say about N? Well, given X’, the probability N is less than j is 0, and the probability N is at least j is 1, both of which are obvious. But what about these interesting and relevant probabilities (both given X’, naturally): “N equals j”, “N is greater than j”?

We do not know.

Why? Because there is no information in X or X’ about the possible values of N, except that N must be at least equal to j (given X and not X’), information which is deduced. Now mentally you might add information that is not provided, by, say, thinking to yourself, “This j is awfully low and that’s such a big bag; therefore, surely N is large.” Or “I know this Briggs, who is a trickster. He made the bag big on purpose. N is small.” Or anything, endlessly. None of these additions are part of the problem (the stated evidence), however, and all such moves are “illegal” in probability. You cannot use information not provided. It is against the law!

Now suppose we legally augment our X and, for fun, say that N is some number in the set S. We don’t need to know much about S, except that it exists, is finite, and contains only natural numbers. Thus, X now equals “There are N balls labeled 1 through N in a bag, from which only one will be removed; and N is a number in the set S.” Given X, the probability “N is s_i (one of the set S)” is 1/#S, where “#S” stands for the number of elements in S (its cardinality, if you like); thus, the probability “N = s_i” is 1/#S, where I’ll assume the s_i are increasing in i. What about the probability that the ball withdrawn has label j? Here it gets tricky, so let’s be careful.

The key lies in realizing the bounds of j are between 1 and the largest value of S. First suppose N = s_1. We want:

Pr(B = j | N = s_1, X).

This is 1/s_1 for j = 1 to s_1, and 0 for all those j up to s_I (the largest value of S). Now

Pr(B = j | N = s_2, X)

equals 1/s_2 for j = 1 to s_2, and 0 for all values up to s_I. From this, we notice we have to be careful about specifying j precisely. From total probability we know

Pr(B = j | X ) = Pr(B = j | N = s_1, X) * Pr(N=s_1|X) + … + Pr(B = j | N = s_I, X) * Pr(N=s_I|X)

and where knowledge of j is relevant to the probability. If j = 1, then

Pr(B = 1 | X ) = [(1/s_1) + … + (1/s_I)] * (1/#S)

but if j a number larger than, say, s_1 but smaller than s_2, then (call this j’)

Pr(B = j’ | X ) = [0 + (1/s_2) + … + (1/s_I)] * (1/#S)

and so forth for other j (don’t forget S is known).

The ball is withdrawn and B = j. Can we now say anything more about N? As before, there is 0 probability N is less than j, and so if j is greater than some s_i, there is 0 probability N equals those s_i. We can do more, using the good reverend’s rule, but it’s still tricky:

Pr(N = s_i | B = j, X) = Pr( B = j | N = s_i, X) * Pr( N = s_i | X) / Pr( B = j | X).

First suppose j = 1, then

Pr(N = s_i | B = 1, X) = [(1/s_i) * (1/#S)] / ([(1/s_1) + … + (1/s_I)] * (1/#S))

= (1/s_i) / [ 1/s_1 + 1/s_2 + … + 1/s_I]

If you stare at that fraction for a moment, and recalling that the s_i are given in increasing number, you realize that values of smaller N are more probable than larger values. As a for-instance, suppose S = {20,21,…,40}, which has cardinality 21. Given X, the probability “B = 1″ is (1/20 + 1/21 + … + 1/40) * (1/21) = 0.02761295. Thus Pr(N = 20 | B = 1, X) = 0.04416451, Pr(N = 21 | B = 1, X) = 0.04206144, etc. out to Pr(N = 40 | B = 1, X) = 0.01472150. Notice that these probabilities do not change for j between 1 and 20.

In this same example, next let j = 21, then

Pr(N = s_i | B = 21, X) = Pr( B = 21 | N = s_i, X) * Pr( N = s_i | X) / Pr( B = 21 | X).

For “N = 20″, the first term on the right equals 0, and so Pr(N = s_i | B = 21, X) = 0, as desired. For “N = 21″, we have

Pr(N = 21 | B = 21, X) = Pr( B = 21 | N = 21, X) * Pr( N = 21 | X) / Pr( B = 21 | X).

= [ (1/21) * (1/21) ] / ([0 + 1/21 + 1/22 + … + 1/40] * (1/21))

= (1/21) / [0 + 1/21 + 1/22 + … + 1/40] = 0.06994537,

and for Pr(N = 22 | B = 21, X) = 0.06676604, out to Pr(N = 40 | B = 21, X) = 0.03672132.

Collecting all these tidbits leads to the conclusion that smaller (but not impossible) values of N are always more likely than larger, regardless of the value of j. Why? That’s easy. Before we see B, the possible values of N are s_1, s_2, and so on up to S_I, each equally likely. After we see B, some values of N (from S) might now be impossible, but since j will always be less than any remaining possible larger members of S, smaller values of N are closer to j than larger, thus smaller values are more likely. Simple as that.

What does this have to do with Doomsday? Everything. The crucial step was in conjuring the set S. Where did that come from? I made it up. S was known throughout second part of the calculations and unknown through the first part. When S was unknown, N was unknown, and there was nothing we could say about N except that it had to be as large as j. I mean nothing in its literal, logical sense. In that case, given only that you witness your birth order, your B = j that is, we are blind about the future of humanity.

When S was known, we had a rough idea of what N was, which we tightened slightly by learning where N might not be (by removing the ball). But for an S with large cardinality, we aren’t learning much by viewing B. S is what we started with, and something very like S is what we ended with. But this is cheating because I made the S up. We wanted N, of which we are ignorant, and then we pretend we know an S that tells us something but not everything about N! All the other solutions to the Doomsday argument I have seen also make up S, but then they add an extra layer of cheating. We posited a discrete finite S, from which deduced that N might equal any of its members with equal probability (before seeing B). But those who conjure up more creative S often fix the set so that smaller values of S are more likely (hence smaller values of N are more likely, even before we see B). Some form of exponential “distribution” for S is popular. Some even use non-probability arguments (called “improper priors”), which is triply cheating.

Once S is fixed, however it is fixed, the calculations flow in the same manner as above, but it’s easy to see that smaller values of N are always going to be more likely than larger, and that’s because the j will always be smaller (or no greater) than the maximum value of S. And given that some let S toodle out to infinity, it’s no shock at all to discover that N is not expected to be big.

Thus the Doomsday Argument is really a non-problem which includes its own answer in its formulation, which is cheating. Of course, it makes perfect sense to ask the question of how many of us there will be left, but trying to discover the answer using only your birth order is doomed to failure (beyond proving that N must be at least as large as j). Since all probability is conditional on only the information supplied, many different answers for our future numbers are possible. It’s easy to think of probative information: demographics, politics, epidemics, apocalypses (rocks from the sky, Christ’s return, etc.), and on and on. (Of course, some of these sets of information may lead to the guesses people have made about S.) I do not (now) have a good answer how to use these to put uncertainty on (the real) N.

Update Bayes’s theorem isn’t all that.

The difficulty lies in misunderstanding Bayes’s theorem, which some mistakenly write like this:

Pr(N = s_i | B = j) = Pr( B = j | N = s_i) * Pr( N = s_i ) / Pr( B = j ),

where the evidence about N in X is left off (finding the denominator is no problem because Pr( B = j ) = SUM_i Pr( B = j | N = s_i) * Pr( N = s_i )). Pr( N = s_i ) is thus “naked” (and violates the rule that all probability is conditional), yet users of Bayes’s theorem are trained to posit “priors” like this, and so posit one they do. It seems, say critics of the theory, that these priors are pulled from thin air. The critics are right. It’s completely arbitrary to conjure a Pr( N = s_i ), and so the resulting Pr(N = s_i | B = j) cannot be trusted. (I have much more about this kind of thing in my forthcoming book.)

Of course, I made up my own “prior”, but referenced as being a deduction from X. The probability Pr(N = s_i | B = j, X) is thus true. The attention then focuses on X, where it belongs. Why this X? No reason at all. If we’re after the best information about N, that is what should go into X. But it has to be information that is not N itself, like my S was. My S was merely a presumption that I already knew a lot about N; it was N by proxy, but a fuzzy proxy. Cheating, like I said.

It’s not Bayes’s theorem that’s the problem. It works just fine when we supplied information in X about S. But it also worked dandy when X was just “There are N balls labeled 1 through N in a bag, from which only one will be removed.” I didn’t display the equation at the time, but it’s there. I’ll leave it as homework for you to show.

Update I’m graduating a comment I made in reply to Steve Brookline to the main post, because it highlights what I think is the central error people make in the DA. SB’s comments should be examined for orientation. I’m repeating them here in concise form.

A standard application of the DA starts by asking for this:Pr(N < 20j) (the 20 comes from the magic number in statistics). Note the missing conditions. Accepting the bare notation, then Pr(N < 20j) = Pr(N/20 < j) = Pr(j > N/20) = 1 – Pr(j <= N/20) = 1 – 0.05 = 0.95. It is said Pr(j <= N/20) = 0.05 because j is “uniform” or is “uniformly distributed”, as if probability has life. The fatal error has been made, because we notice that this result appears to hold regardless what value N or j has. But there just is no such thing as “Pr(N < 20j)”.

We have to be careful with the notation. There is no such thing as unconditional probability, and when you drop the conditions, which often makes manipulating the equations easier, you run the risk of introducing error, which is what happens in the standard doomsday argument. Here’s what we want.

Pr(N < 20*j | B = j, X) = Pr(B = j | N < 20*j, X) * P(N < 20*j | X) / Pr(B = j | X).

(For why we want this, see SB’s comments.) Now X can be anything relevant; it as least says there are balls 1 through N, but it must also say something about N (directly or implied).

Suppose X contains information that N is in the set {1, 2, …, 19}. Then Pr(N < 20*j |X) = 1 for any j. Never forget j runs from 1 to N, which is where things go awry: j is (in the classical language) dependent on N; in the new (and proper) language, knowledge of N is relevant to knowledge of j.

This is it: it appears, because of loose notation, many forget that j and N are related. Steve used the notion of cutting a string; but of course, that can only be done quantumly (i.e. discretely), so the example is the same. Knowledge of the place j where you cut depends on knowledge of the length of the string N, and vicesy versey.

You can work it out, but the result is the right-hand-side is 1/1, and thus Pr(N < 20*j | B = j, X) = 1, as expected. So right here is all the proof I need to show that at least one “prior” on N ruins that 95% finding.

Here’s another one. Suppose X says N = 20. Then Pr(N < 20*j |X) = 0 for j = 1, and Pr(N < 20*j |X) = 1 for j > 1. Again, you can work it out, but it amounts to the same thing, that Pr(N < 20*j | B = j, X) = 0 when j = 1, else it equals 1 for all other j.

Again, suppose X says N is in set {20, 21, …, 40}. Starts to get interesting. I leave this one as a homework, too.

More about the DA is in my book Uncertainty.

February 7, 2018 | 1 Comment

## Free Probability-Statistics Class: Predictive Case Study 1, Part XI

Review!

We left off with comparing the standard, out-of-the-box linear regression with our multinomial predictive observable model. The great weaknesses of the regression were probability leakage (giving positive probability to impossible values) and that the normal gives densities and not probabilities. We can’t fix the leakage with this model (it’s a built-in shortcoming of this model), but we can generate probabilities.

Now the densities are predictions from the normal regression model, but they are not in a form that can be used. In order to create probabilities from densities, we need to make a decision. The densities are of course easily transformed into cumulative distributions, which are probabilities, but they will give positive probabilities to an infinity of results (all numbers along the continuum). We only care about our decision points, which for our fictional Dean are the five values 0, 1, 2, 3, 4.

The decision we need to make is in how to cut the infinity into 5 blocks. There is of course no unique way to do this. But it may be reasonable to cut the values at the midpoints of the five values. For example, given the regression, the decision probability of having a CGPA = 0 would be the regression probability of being between 0 and 0.05. For 1, it would be 0.5 to 1.5, and so on. That’s easy to do.

Mixing code and output in the obvious way:

```d=as.numeric(levels(x\$cgpa))
caps = d[-length(d)] + diff(d)/2
caps = c(min(d),caps,max(d))

# output
> caps
[1] 0.0 0.5 1.5 2.5 3.5 4.0
```

I used `caps` because R has a native `cut` function inconvenient here (it outputs factors, and we want numbers). Let’s next recreate the picture of the predictions for good and bad students. Then we’ll add on top of it the regression probabilities.

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')\$p #our mulitnomial moels

plot(levels(x\$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
lines(levels(x\$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
grid()
legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2)

s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])

m = s1\$central-s2\$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
s = sqrt(s1\$scale^2+s2\$scale^2)
w = seq(min(s1\$central,s2\$central)-3*s, max(s1\$central,s2\$central)+3*s, length.out = 100)
lines(w,.7*dt((w-s1\$central)/s1\$scale,s1\$df)/s1\$scale,lty=1,lwd=2)
lines(w,.7*dt((w-s2\$central)/s2\$scale,s2\$df)/s2\$scale,lty=2,col=2,lwd=2)

# regression probabilities at the caps
p = matrix(0,2,length(levels(x\$cgpa))) # storage
for (i in 1:2){
b=obs.glm(fit.n,y[i,])
d=pnorm(caps,b\$central,b\$scale) # the predictions
p[i,] = diff(d) # d[1] prob < 0; 1-(sum(diff(d))+d[1]) prob >4
lines(as.numeric(levels(x\$cgpa))+.05,p[i,],col=i+2,type='h',lty=i,lwd=3)
}

```

Green solid is for the bad student, regression model; blue dashed is for the good student, regression model. You can see the spikes follow, more or less, the densities. The black solid is our original multinomial model for the bad student, and the red dashed the good student.

Which model is better? There is no way to tell, not with this plot. They are simply two different predictions from two different models. The only way to gauge goodness is to—all together now—wait for new data which has never been seen or used in any way, and then compare both models’ predictions against what happened.

Nevertheless, we can gather clues and build another, different predictive model, which predicts how well our original models will perform. Be sure you understand the distinction! We have an original predictive model. Then we create a model on how well this original predictive model will perform. These are not the same models! This right here is another major departure of the predictive method over classical counterparts.

Let’s first look at the regression predictions a little more closely (mixing code and output):

```> p
[,1]        [,2]       [,3]        [,4]         [,5]
[1,] 2.152882e-01 0.568404758 0.12265228 0.002511169 4.024610e-06
[2,] 1.109019e-06 0.001036698 0.07554629 0.511387677 2.646423e-01

> rowSums(p)
[1] 0.9088604 0.8526141
```

The first row is the bad student (low SAT and HGPA) at each of the five CGPAs (here labeled by their matrix notation), and the second the good student (high SAT and HGPA). The `rowSums(p)` gives the total probability of all possibilities. This should be 1 (right?). It isn’t; it is less, and that is because of probability leakage.

You can see that leakage is conditional on the assumptions, just like probability. Leakage isn’t constant. (It also depends on how we define our caps/cut points.) We earlier guessed, looking only at the densities, that it wasn’t that bad. But it turns out to be pretty large. We’re missing about 10% probability from the first prediction, and about 15% from the second. These are the probabilities for impossible events.

The leakage does not mean the model is useless, but it is evidence the model is sub optimal. It also does not mean our multinomial model is better, but at least our multinomial model does not have leakage. The multinomial model, for instance, can give large probabilities to events that did not happen, while the leakage regression model gives decent probabilities to what happened. We’ll have to see.

And that’s what we’ll do next time. It’s too much to try and begin formal verification this week.

Homework Assume the Dean wants to do CGPAs at 0, 0.5, …, 3.5, and 4. Rerun the code from the beginning, and end with the plot seen above, complete with the regression model using the obvious cut points. Notice anything different?

January 31, 2018 | 4 Comments

## Free Probability-Statistics Class: Predictive Case Study 1, Part X

Review!

Last time we created four models of CGPA. Which is correct? They all are. Why? I should ask as a homework question, but I’ll remind us here. Since all probability is conditional, and these are all ad hoc models, all are correct, given the assumptions. Which one is best? Depends on the decisions you want to make and on what you mean by “best.” Let’s see.

We discovered last time that by itself HGPA was relevant to CGPA, but not by much. SAT was also by itself relevant. The contour plots (which I have decided not to redo, since we did them last week) showed that SAT and HGPA when considered together are also relevant. We also created the “null” model (remember our terminology does not match the classical usage) which only used past data (and grading rules, etc.). We now have to see how useful each of these models are. (If you can’t remember what all these terms mean — review!)

In one sense, we cannot do too much more because we have the models and have made predictions using them. That was our goal, remember? Now we have to sit back and wait for new values of HGPA/SAT and CGPA come in. Then we can see how each of these models’ predictions match reality. This is the True Test.

Here, for instance, is an excerpt of predictions of the full model (HGPA/SAT) (I’m assuming we left off just where we were last time, so that `y` and `a` are still in R’s memory; if not, go over last week’s code):

```> cbind(y[,1:2],round(a,2))
sat hgpa    0    1    2    3    4
1   400    0 0.55 0.26 0.19 0.00 0.00
2   500    0 0.52 0.27 0.21 0.01 0.00
3   600    0 0.47 0.28 0.24 0.01 0.00
4   700    0 0.42 0.30 0.28 0.01 0.00
5   800    0 0.35 0.32 0.32 0.01 0.00
6   900    0 0.28 0.33 0.37 0.01 0.00
7  1000    0 0.22 0.35 0.41 0.02 0.00
8  1100    0 0.16 0.36 0.45 0.03 0.01
9  1200    0 0.11 0.36 0.48 0.04 0.01
10 1300    0 0.08 0.34 0.51 0.06 0.02
11 1400    0 0.05 0.31 0.52 0.08 0.03
12 1500    0 0.04 0.28 0.53 0.11 0.04
13  400    1 0.52 0.24 0.23 0.01 0.00
14  500    1 0.46 0.25 0.27 0.01 0.00
15  600    1 0.40 0.27 0.32 0.01 0.00
16  700    1 0.32 0.29 0.37 0.02 0.00
...
57 1200    4 0.00 0.04 0.22 0.58 0.15
58 1300    4 0.00 0.04 0.18 0.63 0.15
59 1400    4 0.00 0.04 0.14 0.66 0.15
60 1500    4 0.00 0.05 0.12 0.69 0.14
```

We could publish this, or the whole table, and we’d be done! Anybody could take these predictions and implement them. They wouldn’t have to know the details of your model, or of your original data. There is your bold theory, exposed for the world to see! That, after all, is how science should work.

Of course, all the shortcomings of your model will be obvious to anybody who tries to use it, too. Which, again, is just how it should be.

Make sure you understand what we’ve done so far. If you are the Dean and want to classify students into one of five CGPA buckets, we have a predictive model accounting for HGPA and SAT. But suppose you didn’t want to account for HGPA. Well, we have a model of just SAT: use that. And so on. The breakdowns of SAT (every 100) and HGPA (every 1) we used were also geared to the decision. Change the decision, change the breakdown, change the model.

In any case, this is it! This is what we wanted. We wanted to know, given the grading rules and old obs, and values of SAT and HPGA, what is the probability of having CGPA in one of the buckets. And this is what we did! We are done. All those people who wanted practical examples of the predictive way, well, here you go. In the end, it’s pretty simple, isn’t it?

But we can do two more things. (1) We can compare our predictive model (perhaps varying it here and there) with old-fashioned NHST/parameter-estimating models, and (2) we can create a new model that predicts the performance our current model. Number (2) is the really important step, but we won’t get to it today. Let’s do (1).

What model would most use in this situation? A linear regression. Here it is (mixing code and output again). The `cgpa.o` was the original CPGA, not classified into buckets. It is the raw score (the “o” is for original).

```fit.n= glm(cgpa.o ~ sat + hgpa , data=x) # note the cgpa.o! which is the original data
summary(fit.n)

Call:
glm(formula = cgpa.o ~ sat + hgpa, data = x)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.12153  -0.44120   0.00954   0.38198   1.80356

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0881312  0.2866638  -0.307 0.759169
sat          0.0012167  0.0003011   4.041 0.000107 ***
hgpa         0.4071133  0.0905946   4.494 1.94e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Wee p-values galore! The asterisks give the glory. Feel it. So, the classical statistician would say, SAT and HPGA are “linked to” CGPA. Researchers will speak of “statistical significance” and say “SAT drives high college grade point” and so on. How much does SAT influence (they’ll say “impact”) CGPA? Well, they might say for every increase in SAT by 1, CGPA goes up on average 0.0012 points. And so on.

Not much more would be done with this model, especially since everything is “significant”. Maybe the modeler throws in an interaction. Whatever. No matter what, this model exaggerates the evidence we have, and is in substantial error, even though it doesn’t look like it. Here’s why.

This model implies a prediction: all models imply predictions, even though they are not routinely made. It’s written in classical form, but the prediction is there, hidden away. Let’s look at it. We do so by integrating out the parameters, picking a “flat” prior, which, for this model anyway, gives us the exact same results for the parameters as the frequentist estimates.

Recall `fit` is our mutlinomial (like) model, including both SAT and HGPA. Let’s pick two hypothetical students, a good one and a bad one, and compare our predictive model with the prediction based on the ordinary regression. Before running this, first re-download the class code, which has been updated to include the code which calculates probability predictions from normal regression models. This is the `obs.glm`, which outputs a central, spread, and degrees-of-freedom parameter for the predictive distribution of the regression (this turns out to be a non-central T).

```y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')\$p #our mulitnomial moels

plot(levels(x\$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
lines(levels(x\$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
grid()
legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2)

s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])

m = s1\$central-s2\$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
s = sqrt(s1\$scale^2+s2\$scale^2)
w = seq(min(s1\$central,s2\$central)-3*s, max(s1\$central,s2\$central)+3*s, length.out = 100)
lines(w,.7*dt((w-s1\$central)/s1\$scale,s1\$df)/s1\$scale,lty=1,lwd=2)
lines(w,.7*dt((w-s2\$central)/s2\$scale,s2\$df)/s2\$scale,lty=2,col=2,lwd=2)
```

Our multinomial-like model gives the spikes; the regression densities, which are not probabilities, are the curves. We’ll fix the densities into probabilities later. But it’s densities, because the normal lives on the continuum, and so does the predictive distribution from the normal (the t). Notice anything odd? The regression gives probabilities to impossible values, i.e. CGPA less than 0 and greater than 4. I call this probability leakage.

It’s not terrible here, but it does exist. It means the model is predicting impossibilities. The standard regression model is at the least inefficient. Interestingly, this leakage model cannot be falsified. It gives positive probability to events we’ll never see, but it never gives 0 probability anywhere! Falsifiability isn’t that interesting.

That’s enough for this time. Next time, we turn the densities into probabilities, and make modifications to our multinomial model.

Homework Try the normal predictions for the singular models with SAT and HGPA alone, and for the “null” model, which is had by `glm(cgpa.o ~ 1 , data=x)`. Which has the most leakage? Are there huge differences?

Because you asked You can now download `briggs.homework.R`, which contains all the code used in the lectures. Note that this is different than `briggs.class.R`, which can be treated like black-box helper code.

January 23, 2018 | 4 Comments

## Free Probability-Statistics Class: Predictive Case Study 1, Part IX

Review!

Last time we successfully computed our model (8):

(8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,5.

The model we selected was an ad hoc multinomial probit regression, with default prior selected by the MNP package. If we’re happy with that model, we’re done except for exploring other scenarios of SAT and HGPA. We can easily generate all possible combinations of decisionable values.

```w1 = seq(400,1500,100) # sat 400 - 1500, every 100
w2 = seq(0,4,1) # hgpa 0 - 4, every 1
y = as.data.frame(expand.grid(w1,w2)) # R loves data frame
y\$cgpa = "4" # this automatically creates the right number of cgpas
names(y) = c('sat','hgpa','cgpa') # these have to match spelling as in x!

a=predict(fit, newdata = y, type='prob')\$p # takes a couple of seconds to run; assumed fit is in memory!
```

I’ve decided the decisionable values of SAT and HGPA would be every 100 and 1, but you could easily change these in the `seq()` function. There’s no easy way to visualize this, since we have 2 measurements taking 60 combinations (at values I chose) in 5 dimensions (CGPA = 0-4). But we can do things like this:

```contour(w2,w1,matrix(a[,3],length(w2),length(w1),byrow=TRUE), xlab='SAT',ylab='HGPA', main='Pr(CGPA=2|E)')

# remember a[,1] Pr(CGPA=0|evidence), a[,2] Pr(CGPA=1|evidence), etc.

# Or for something fancier and colorful, try this
library(plotly)

plot_ly(x=w2,y=w1,z=matrix(a[,3],length(w1),length(w2)),type='contour') %>%
layout(title = 'Pr(CGPA=2|E)',
xaxis = list(title = 'HGPA'),
yaxis = list(title = 'Brain Weight (g)'))
```

Note: Cut and paste everything from `plot_ly` on down.

These plot the predicted probabilities of CGPA as contours by the values of SAT and HGPA. As with everything, whether these plots are of any value depends on the decisions you will make. They are probably way too specific, especially given the evidence the numerical approximations are only “close enough.” Our knowledge of that numerical approximation is part of our model; I mean, part of the right hand side of (8). Keep that in mind!

One decision is whether SAT or HGPA are relevant, in the presence of the other. If you cannot recall the definition of relevant — review! Let’s check!

```fit.null <- mnp(cgpa ~ 1, data=x, burnin = 1000, n.draws=50000)
fit.a <- mnp(cgpa ~ sat , data=x, burnin = 2000,n.draws=50000)
fit.b <- mnp(cgpa ~ hgpa, data=x, burnin = 1000, n.draws=50000)
```

The `fit.null` is the model (5) we did many lessons ago, or the form of it as a probit model with all the numerical approximation baggage. It is simply the predictive model not taking account anything but past observations, grading rules, and the model and approximations. SAT and HGPA should be relevant, or else `fit.null` is good enough.

```# we'll reuse the same y as above; measurements not used in the model aren't used in the prediction
a.null=predict(fit.null, newdata = y, type='prob')\$p
a.a=predict(fit.a, newdata = y, type='prob')\$p
a.b=predict(fit.b, newdata = y, type='prob')\$p
```

Recalling `y` is of length 60, here are the first 10 results:

```> a.null
0         1         2         3          4
[1,] 0.05869388 0.1568980 0.5764082 0.1632245 0.04477551
[2,] 0.06053061 0.1583878 0.5749388 0.1609796 0.04516327
[3,] 0.05902041 0.1564694 0.5805918 0.1601020 0.04381633
[4,] 0.06016327 0.1583469 0.5750408 0.1605510 0.04589796
[5,] 0.06004082 0.1562857 0.5766327 0.1621020 0.04493878
[6,] 0.05904082 0.1574694 0.5773265 0.1610612 0.04510204
[7,] 0.05961224 0.1603469 0.5763673 0.1597347 0.04393878
[8,] 0.05844898 0.1585918 0.5771224 0.1618367 0.04400000
[9,] 0.06110204 0.1580816 0.5772041 0.1587347 0.04487755
[10,] 0.06006122 0.1548571 0.5791224 0.1607755 0.04518367
```

This shows you the limits of the numerical approximation technique. All of the numbers in each column should be the same, since this is the predicted probability of CGPA unconditioned on SAT and HGPA. They're not wildly variable in the context of most decisions I can think of, but the variability is there. We could average each column and reduce some of the approximation error (why this is so I won't prove, but it should be intuitive enough). Mixing code and output:

```a.null = colMeans(a.null)

> a.null
0          1          2          3          4
0.05942313 0.15791837 0.57627857 0.16167925 0.04470068
```

So, according to (8a), the probability of CGPA = 2 is highest at 58%. Column 1 of `y` repeats the sequence `w1` length(w2) times. It does this because we were lazy and didn't want to create a new `y` for every model, content to use the original y with all possible (decisionable) combinations of SAT and HGPA. If you look at each of the blocks of SAT from 400 to 1500, you'll again see variability where none should ideally be. We can average the blocks to reduce variability if we want to be clever. This is getting tedious because we could have created new `y`, but on the other hand, we would not have seen the variability of the numerical approximation. Thus our laziness has given us (theoretically, anyway) superior guesses.

```# hack hack hack
# hack, but it should work whatever levels cgpa has, and whatever breakdowns of w1 and w2
# hack hack hack

b.1 = matrix(1:dim(y)[1],length(w1),length(levels(x\$cgpa)))
b.2 = b.1 *0
for(i in 1:length(w2)){
b.2 = b.2 + a.a[((i-1)*length(w1)+1):((i-1)*length(w1)+length(w1)), ]
}
b.2 = b.2/length(w2)
a.a = b.2 # replacement happens here!

b.1 = matrix(1:dim(y)[1],length(w2),length(levels(x\$cgpa)))
b.2 = b.1 *0
for(i in 1:length(w1)){
b.2 = b.2 + a.b[((i-1)*length(w2)+1):((i-1)*length(w2)+length(w2)), ]
}
b.2 = b.2/length(w1)
a.b = b.2 # replacement happens here!
```

Now we use `length(levels(x\$cgpa))` so often, it would be better to create a variable for it, and use it in all the code. I didn't. You do it as homework. As I said at the outset, this code is not optimized. It is written to be plain and simple to follow. Now let's check relevance of SAT!

```plot(w1,a.a[,1],type='b',col = 1, xlab='SAT', ylab='Pr')
for (i in 2:length(levels(x\$cgpa))){
lines(w1,a.a[,i],type='b',col = i)
}
points(rep(w1[1],length(levels(x\$cgpa))), a.null,pch=15, col=1:length(levels(x\$cgpa)),cex=2 )
legend('topright',levels(x\$cgpa),lty=1,col=1:length(levels(x\$cgpa)), bty='n')
```

This shows the predicted probabilities of each level of CGPA for the decisionable levels of SAT we picked. It also shows, at the far left in colored boxes corresponding to the colored lines, the predicted probabilities of the null model. The box probabilities differ from the line probabilities, so SAT is relevant. It is also in the causal chain, as we decided earlier. So this model, call it (8b) is different from (8a), the null model. (Note I do not use "null" in quite the same way as classical statistics.) The last check is whether the probabilities are different enough that our decisions would change. I say they are, because I am the decision maker. You might say, with your decisions in mind, the differences are not big enough. There is no universal magic number!

Now HGPA:

```plot(w2,a.b[,1],type='b',col = 1, xlab='HGPA', ylab='Pr',ylim=c(0,.6))
for (i in 2:length(levels(x\$cgpa))){
lines(w2,a.b[,i],type='b',col = i)
}
points(rep(w2[1],length(levels(x\$cgpa))), a.null,pch=15, col=1:length(levels(x\$cgpa)),cex=2 )
legend('topright',levels(x\$cgpa),lty=1,col=1:length(levels(x\$cgpa)), bty='n')
```

Notice I altered the default y-axis limits. The line probabilities are different from the boxed ones. So HGPA is relevant. But they aren't terribly different. Regardless of the value of HGPA the probabilities for CGPA are mostly flat. The only "big" discrepancy (where "big" relates to a decision) is the probability of CGPA = 2. That difference alone is decision-worthy, to coin a phrase. But it makes me wonder how useful HGPA is with SAT in the model. Back to the contour plots!

Which we'll do next time.

Homework See if you can figure out how to make the 2-D relevance plots. Remember the one above is only at one level of CGPA. Is HGPA still relevant? Yes. But would you, putting yourself in the mind of a Dean, find it useful to decisions?