
How To Do Predictive Statistics: Part III New (Free) Software Regression 2

Mandatory! Read Part I, Part II. I will ignore all comments already answered in Parts I & II.

Download the code: mcmc.pred.R, mcmc.pred.examples.R. If you downloaded before, download again. This is version 0.2!

Warning: all the results shown below were produced by an earlier version of the code, in which one of the parameters to the normal was off. The code below is correct, but you will get (slightly) different results. I am on the road and haven't time to redo the pictures. They are not far off, though.

Relevancy

Relevancy is what we'll use to help decide what information to include in our model. Relevancy is conditional, like all probability is conditional. A piece of evidence is relevant if it changes the probability of Y, given all the other pieces of evidence. Otherwise it is irrelevant.

     If Pr(Y|XW) ≠ Pr(Y|X), then W is relevant given X.

     If Pr(Y|XW) = Pr(Y|X), then W is irrelevant given X.

It could be that W is irrelevant given X, but relevant given Z, i.e. a different piece of evidence (that is not X). The kind of evidence W is does not matter.

If Pr(Y|new X, old X&Y_i, Model) ≠ Pr(Y|new X, old X&Y_i+1, Model), then the additional old data point is relevant given X and the Model; if the probabilities are equal, then the additional point is irrelevant. The change in probability upon adding data points is thus a useful measure of the relevance of new data.
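Here is a sketch of that check in code (fit and x are the CGPA objects from Parts I & II; dropping the last row and using "y > 3" are my arbitrary choices):

# Refit with one old data point removed, then compare predictions.
fit.m1 = MCMCregress(cgpa~hgpa*sat, data=x[-nrow(x),])
p.all = MCMCregress.pred(fit, x[1,])     # all n old points
p.m1 = MCMCregress.pred(fit.m1, x[1,])   # n - 1 old points
sum(p.all>=3)/length(p.all) - sum(p.m1>=3)/length(p.m1)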

Relevancy is not the same as structural model change. For example, if Pr(Y|new X, old X&Y, Model_prior_1) ≠ Pr(Y|new X, old X&Y, Model_prior_2), i.e. everything the same except for two different priors, we don't say the prior is relevant. Obviously the prior, which is part of the model along with all of its assumptions, changes the probability of Y.

Now we often write Y = “y in s” when y is numeric. Then it could be that

     If Pr(y in s|XW) = Pr(y in s|X), for some s, and

     If Pr(y in s|XW) ≠ Pr(y in s|X), for other s.

A decision maker interested in some s_1 can come to a different conclusion about relevancy than a second decision maker who is keen on s_2.

This, like probability, is not a bug, it is a feature!
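For instance, a sketch with the CGPA model of Parts I & II (the two s are my inventions):

# Same model and data; two decision makers care about different s.
p = MCMCregress.pred(fit, x[1,])
sum(p>=3.5)/length(p)        # decision maker one: Pr(CGPA > 3.5 | evidence)
sum(p>2 & p<3)/length(p)     # decision maker two: Pr(2 < CGPA < 3 | evidence)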

Relevancy, also like probability, is not decision. It could be

     If Pr(Y|XW) = Pr(Y|X) + epsilon,

where adding W changes the probability of Y (or y in s) only a tiny bit. W is then relevant as long as epsilon ≠ 0. It is not—as in NOT—the statistician's job to say which threshold of epsilon causes W to become useful. Usefulness is up to the decision maker. Let there be no standard threshold! We see the havoc wrought by insisting on the magic number of 0.05 for all classical analyses. Let us not make the same mistake.

Decisions do, however, have to be made. Is epsilon so small that it makes no difference to a decision that will be made using the model? If so, we leave W out; if not, we keep W. We often have causal knowledge, or something approaching it, about W. If we assume—and this therefore makes the assumption part of the right hand side of (2); ALL assumptions are part of the r.h.s. of (2)—W cannot possibly lie in the causal path of Y, then we judge W irrelevant regardless of any changes it makes to the probability of Y. A handful of thrown rice on the floor may resemble the face of President Taft, but since we know there is no causal mechanism in the rice that could do that, we judge rice to be irrelevant in judging probabilities of faces.

CGPA

Recall eq. (2): Pr(Y|new X, old X&Y, Model and assumptions). We need to supply our model with supposes or surmises of what new values of HGPA and SAT will be. We have some idea of what these will be from the old data. We can thus use the old data to make a stab, though a weak one, at gauging how good the model is. Predictive statistics does not solve the problem of data reuse. Fooling ourselves is, as ever, a real possibility.

One strategy, then, is to step through the old data and pretend each row is a new value, calculate the probability in (2), and then make various plots, calculations, and so forth. Like this:

# Step through all the old x, treating each row as if it were new;
# collect approximate 10%, 50%, and 90% predictive quantiles of CGPA.
q = matrix(0,nrow(x),3)
for(i in 1:nrow(x)){
  q[i,]=quantile(MCMCregress.pred(fit,x[i,]),c(0.1,0.5,0.9))
}

A limitation of R is that there is no mechanism for stepping through a data.frame while maintaining the class of each variable. The function apply coerces the data.frame into a matrix, turning everything to character (if any column is a factor). The others, like lapply, mapply, Vectorize and so on, turn the result into unworkable lists. I have searched, without success, for anyone who has solved this problem. So we're stuck with a loop; at least, for now.

This bit of code calculates the approximate quantiles for every old value of HGPA and SAT. There is nothing especially important in the values 0.1 and 0.9; but, really, calculating 95% intervals is rather severe. Do we need that much certainty in the results? Choose or add whichever you like, but be sure not to forget the definition of q.

A KEY NECESSARY MUST-HAVE notion is that since our model specified the probability of Y given HGPA and SAT, we must ALWAYS specify a value of HGPA and SAT when calculating (2). There is NO notion (besides relevancy) of studying HGPA in isolation of SAT, nor vice-versa. This applies to your models, too. If you have x1, x2, …, xp, each and every calculation, from now until forever, must specify values of x1, x2, …, xp.

There is NO notion of any variable in (complete) isolation of the others. This is an impossibility! If you don’t want to consider a measurement, then don’t put it in the model. This represents another major departure from the old ways of doing things.

What can we do with the q?

# Median predictions (dots) with 80% predictive intervals (lines) against HGPA.
plot(x$hgpa,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$hgpa[i],x$hgpa[i]),c(q[i,1],q[i,3]),col=3)
}

Do the same for x$sat; a spelled-out version is just below. It all looks better in ggplot2, but since that code can be distracting, we'll stick to simple plots whenever we can.
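Here is that sat version, with the same q as above:

# Median predictions and 80% predictive intervals against SAT.
plot(x$sat,q[,2],ylab='cgpa',ylim=c(min(q),max(q)))
for(i in 1:nrow(x)){
  lines(c(x$sat[i],x$sat[i]),c(q[i,1],q[i,3]),col=3)
}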

As I just stressed, I hope sufficiently, each hgpa has a corresponding sat, but you don’t see sat plotted; only hgpa. So we only have a slice of an idea what is going on. You want to see both at the same time, in all their glory? We can, for 2D data, use 3D plots. But in general it’s tough cookies, mister. There is no general solution.

One of the key problems with classical analysis was that it made model analysis seem so easy. Get a wee p? Success! Bah, humbug. Real analysis, predictive analysis, is hard work. In predictive analysis the bigger the model gets, the harder the work becomes because complexity increases fast. This is, or rather should be, expected. But in classical analysis, it’s wee ps no matter how many terms in the model, each investigated in isolation (or mostly). It’s no wonder so many mistakes were made. Over-certainty was guaranteed.

What about, say, probabilities of CGPA > 3? How about this? You’ll need to install the plot3D package if you don’t have it.

library(plot3D)

q = NA   # reuse q, now as a plain vector of probabilities
g = 3    # the threshold: we want Pr(CGPA >= g | old data, M)
for(i in 1:nrow(x)){
  p = MCMCregress.pred(fit,x[i,])
  q[i]= sum(p>=g)/length(p)
}

# try these, too, for fun
#plot(x$sat,q,ylab='Pr(CGPA>g|old data,M)')
#plot(x$hgpa,q,ylab='Pr(CGPA>g|old data,M)')

scatter3D(x$sat, x$hgpa, q, xlab='SAT',ylab='HGPA', zlab=paste0('Pr(CGPA>',g,'|D,M)'), phi = 0, bty ="g", ticktype = "detailed")

Contour (flat) plots can also be used, and perhaps are superior here, though this 3D plot does show the dramatic rise in probability for high values of both measurements.
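Here is one way to get the contour version (a sketch; the grid and its resolution are my own choices):

# Predict Pr(CGPA >= g) on a regular grid of HGPA and SAT, then draw contours.
h.grid = seq(min(x$hgpa), max(x$hgpa), length.out=20)
s.grid = seq(min(x$sat), max(x$sat), length.out=20)
q.grid = matrix(0, 20, 20)
for(i in 1:20) for(j in 1:20){
  p = MCMCregress.pred(fit, data.frame(cgpa=0, hgpa=h.grid[i], sat=s.grid[j]))
  q.grid[i,j] = sum(p>=g)/length(p)
}
contour(h.grid, s.grid, q.grid, xlab='HGPA', ylab='SAT')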

It is obvious both HGPA and SAT are relevant, at least for most values of "y in s", and that most decision makers would judge the change in probabilities as useful. I did not do a search of all s, but you can. Instead, let's add recomm to the model and see what it does. We already calculated the probability of CGPA greater than 3 for the model without recomm, stored in the object q. Now let's do it again with recomm.

fit.r = MCMCregress(cgpa~hgpa*sat+recomm,data=x)
q.r = NA
for(i in 1:nrow(x)){
  p = MCMCregress.pred(fit.r,x[i,])   # fit.r, not fit: use the recomm model
  q.r[i]= sum(p>=g)/length(p)
}

plot(x$recomm, q-q.r, ylim=c(-0.05,0.05))

summary(abs(q-q.r))

Since I am the Lord & Master of this site, and the decision maker, it was up to me to choose a level or threshold above which I judged recomm useful. I picked 0.05. That's why I set the limits in the plot that way. Any dot sticking out beyond +/- 0.05 proves recomm is useful. To me. Maybe not to you.

As it is, the maximum (in absolute value) change in probability was 0.0026950. The median change was 0.0000925. Recommendation is relevant because the probability changes are non-zero, and because I can't imagine it could fail to be in the causal path of Y. But a change in probability of 0.0001 isn't even peanuts. This analysis only applies to g = 3, i.e. "y in s" equals "y > 3". Different s may lead to different conclusions. Your homework is to check.
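Here is the shape of that homework (a sketch; the thresholds and the object names are mine):

# Repeat the recomm comparison over several thresholds, not just g = 3.
for(thresh in c(2, 2.5, 3, 3.5)){
  qs = qs.r = NA
  for(i in 1:nrow(x)){
    p = MCMCregress.pred(fit, x[i,])
    p.r = MCMCregress.pred(fit.r, x[i,])
    qs[i] = sum(p>=thresh)/length(p)
    qs.r[i] = sum(p.r>=thresh)/length(p.r)
  }
  cat('threshold', thresh, ': max |change in Pr| =', max(abs(qs - qs.r)), '\n')
}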

Recall we included the multiplicative interaction between HGPA and SAT. Is that relevant? Is it useful? We now know how to check. So do so.
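A sketch of that check (fit.noint is my name; everything else mirrors the recomm comparison):

# Refit without the interaction and compare to q, the probabilities under fit.
fit.noint = MCMCregress(cgpa~hgpa+sat, data=x)
q.noint = NA
for(i in 1:nrow(x)){
  p = MCMCregress.pred(fit.noint, x[i,])
  q.noint[i] = sum(p>=g)/length(p)
}
summary(abs(q - q.noint))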

One word of caution. We have been using old data, on the reasonable theory that new values of HGPA etc. will look like the old. But the old data does not fill in all the gaps of possible values of HGPA, SAT, and Recommendation. This is no trouble. We can create our own data set—and we even must do so! That was the point of the analysis. We wanted this model to give uncertainty in Y (CGPA) given X (values of HGPA and SAT). So the analysis is not concluded until we make those predictions.

We already know how, too. It’s as simple as passing in a new data.frame with the values of HGPA and SAT in which we are interested.


y = MCMCregress.pred(fit, data.frame(cgpa=0, hgpa=4, sat=1200))
quantile(y, c(0.1, 0.5, 0.9))

Remember, at least for now, we have to pass in the Y with some value, but it is ignored.

Enough already! I started off sickened by the CGPA data, and now I loathe it and never want to see it again. Try the software on new data, and compare the results of the predictive method to classical hypothesis testing. You will discover (a) that sometimes the decisions are similar and sometimes wildly dissimilar, (b) that the predictive results are less certain, and (c) that the predictive analysis is much harder.

Next time we start logistic regression with the low birth weight data.

Replies »

  1. "The others, like lapply, mapply, Vectorize and so on, turn the result into unworkable lists. I have searched, without success, for anyone who has solved this problem. So we're stuck with a loop; at least, for now."

    I use lapply a lot.

    Normally, you would do something like:

    L = lapply(1:nrow(x), function(i) {
      # do something with i and return it
      quantile(MCMCregress.pred(fit,x[i,]),c(0.1,.5,.9))
    })
    q = do.call('rbind', L) # for matrix-row and data.frame returns
    q = do.call('cbind', L) # for column returns
    q = do.call('c', L)     # for vector returns

    Drawbacks:
    -) the code does not execute in the parent environment, so you can't store a result in an outer variable.
    -) takes more memory — for a while, anyway
    -) the debugger (browser) treats the entire lapply as a single statement

    Advantages:
    +) you don't need to pre-allocate storage.
    +) the outer environment is protected against side effects

    It’s not any faster than a loop.
    In your case, the loop is cleaner and more legible.

