## Free Data Science Class: Predictive Case Study 1, Part V

*This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!*

We have all we need if we want to characterize our uncertainty in future CGPAs given *only* the grading rules, the old observations, and the simple math notions of the multinomial model. I.e., this:

(5) Pr(CGPA = 4 | grading rules, old observations, fixed math notions).

Given a set of observations of the number in each bucket, we can predict the probability of a new person having each possible CGPA. We are using the data found in *Uncertainty*, which is in a CSV file here.

CGPA comes with too much precision (I found the file online many, many years ago, and cannot rediscover its origins), with measurements to the hundredth place. It is therefore useful to have a function that rounds to the nearest specified decisionable fraction. I modified this `roundTo` function to do the job (we're using R, obviously).

```r
roundTo <- function(y, num, down = FALSE) {
  resto = y %% num
  # to round down use '<='
  if (down) {
    i = which(resto <= (num/2))
  } else {
    i = which(resto < (num/2))
  }
  # if you don't think you need binary subtract, try these,
  # which should both give 0; try other numbers in (0,1)
  # a = .88; a + 1 - a%%1 - 1
  # a = .89; a + 1 - a%%1 - 1
  y = y + `-`(num, resto)
  if (length(i)) y[i] = `-`(y[i], num)
  return(y)
}
```

The reason for the back ticks is given in the comments. Since we're classifying into buckets, floating point math can make buckets which should be 0 into something like 10^-16, which is not 0, and which is also not interesting. Use of the binary subtract function fixes this. If you don't understand the code, don't worry about it, just use it.
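To see the kind of floating-point residue being warned about here, a standalone illustration (mine, not part of the original code):

```r
# Floating-point residue: numbers that are algebraically 0 can come out
# as tiny non-zero values, which would create spurious buckets in table().
(0.1 + 0.2) - 0.3    # a tiny non-zero number on IEEE machines, not 0
a <- .89
a + 1 - a %% 1 - 1   # the example from the comments above
```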

Read the data into R (put your path to the csv file into `path`):

```r
# path = 'C:/mydrive/mypath/'  # windows; note direction of slashes
# path = '/home/me/mypath/'    # unix, mac
x = read.csv(paste(path, 'cgpa.csv', sep = ''))
```

Then apply our function:

```r
table(roundTo(x$cgpa, 1))

 0  1  2  3  4 
 4 17 59 16  4 
```

I'm mixing code and output here, but you should be able to get the idea. There are n = 100 observations, most of which are CGPA = 2. The model (5) in code (mpp for multinomial posterior predictive):

```r
mpp <- function(x, nd = 3){
  # nd = number of significant digits
  x = (1 + x)/(sum(x) + dim(x))
  return(signif(x, nd))
}
```

This is model (5) in all its glory! Note that this is a bare-bones function. All code in this class is for illustration only, for ease of reading; nothing is optimized. This code does no error checking, doesn't handle missing values; it only spits out the answer given a `table` as input, like this (the `signif` rounds to significant digits):

```r
mpp(table(roundTo(x$cgpa, 1)))

     0      1      2      3      4 
0.0476 0.1710 0.5710 0.1620 0.0476 
```
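To see where these numbers come from, look inside `mpp`: for bucket j with count c_j, n the total count, and k the number of buckets, the predictive probability is (c_j + 1)/(n + k). A quick hand-check, recomputing outside `mpp`:

```r
# Hand-check of mpp(): predictive probability for bucket j is
# (c_j + 1) / (n + k), with c_j the count, n the total, k the buckets.
counts <- c(4, 17, 59, 16, 4)       # the table above, buckets 0..4
n <- sum(counts)                    # 100
k <- length(counts)                 # 5
signif((counts + 1) / (n + k), 3)   # 0.0476 0.1710 0.5710 0.1620 0.0476
sum((counts + 1) / (n + k))         # sums to 1, as probabilities must
```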

Notice there is less than a 59% chance of a *new* CGPA = 2, but more than a 4/100 chance of a new CGPA = 4. The future is less certain than the past! Suppose we wanted finer gradations of CGPA, say to the nearest 0.5:

```r
table(roundTo(x$cgpa, 1/2))

  0 0.5   1 1.5   2 2.5   3 3.5   4 
  2   3   8  21  33  20   7   4   2 

mpp(table(roundTo(x$cgpa, 1/2)))

     0    0.5      1    1.5      2    2.5      3    3.5      4 
0.0275 0.0367 0.0826 0.2020 0.3120 0.1930 0.0734 0.0459 0.0275 
```

Play with other values of `num` in `roundTo()`. We're done, really, with what we can do with (5), except, of course, for checking it on real *new* measurements. Which I don't have. And which brings up an important point.

The point of the predictive method is to make testable predictions, which we have just done. But we can't test them until we get new measurements. Yes, we can and will check the old data as if it were new, but this is always cheating, because as everybody does or should know, it is always possible to derive a model which fits data arbitrarily well. Schemes which split data into "training" and "testing" sets cheat too if they ever in *any* way use the results of the testing data to tweak the model. That just is to use *all* the data in fitting/training. Though there are attempts and supposed techniques to reuse data, the *only* way to assess the performance of any model is to compare it against data that has never before been seen (by the model).
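As a sketch of what such a test could look like once new measurements arrive, here is one way to score (5)'s predictions, using a logarithmic score; the "new" CGPAs below are invented purely for illustration, since we have no real new data:

```r
# Sketch only: scoring (5)'s predictions against genuinely NEW data.
# 'newcgpa' is made up for illustration; we do not have new measurements.
counts <- c(4, 17, 59, 16, 4)        # old observations, buckets 0..4
p <- (counts + 1) / (sum(counts) + length(counts))
names(p) <- 0:4
newcgpa <- c(2, 2, 3, 1, 2)          # pretend new students
# Logarithmic score: sum of log probabilities the model gave to what
# actually happened; closer to 0 (less negative) is better.
sum(log(p[as.character(newcgpa)]))
```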

Model (5) can't be pushed further. But we do have other *formal*, measured information at hand, about which more in a moment. Of *in*formal, non-quantifiable evidence, we are loaded. We can easily do this:

(6) Pr(CGPA = 4 | grading rules, old observations, fixed math notions, E),

where E is a joint proposition carrying what you know about CGPA; things like, say, majors, schools, age, etc. Things which are not formally measured and may even be unmeasurable. After all, to what schools, times, places, people does (5) apply? Pay attention: this is the big question! It *by itself* says *all* schools, *all* times, *all* places, *all* peoples---as long as they conform to the formal grading rules.

Pause and consider this. (5) is universal. If the old observations came from, say, Sacred Heart Institute of Technology and we knew that, which we don't (recall I found this data maybe twenty years ago from a place only the Lord knows), then we might insist E = "The school is Sacred Heart only". Or E = "The school is *like* Sacred Heart." *Like* is not quantifiable, and will differ widely in conception between people. Well, and that means (6) will be different for each different E. Each conception gives a different model!

Again, this is not a bug *it is a feature*.

Notice that (6) is not (5), a trivial point, perhaps, but one that can be forgotten if it is believed there is a "true" model out there somewhere, where "true" is used in the sense that probability is real or that we can identify cause. We've already discussed this, so if you don't have it clear in your mind, review!

Next time we introduce SAT, HGPA, and time spent studying, and see what we can do with these formal measurements.

**Homework**: Using (5) and the data at hand, suppose there are n = 20 new students. What can you say about the predictions of the numbers of new students having CGPA = 4, etc.?
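One possible route for the homework (a hint, not the only way): simulate the joint predictive for the 20 new students implied by the same uniform-prior multinomial setup behind `mpp()`, then read off whatever counts interest you. The seed and number of simulations are arbitrary choices of mine.

```r
# Homework hint (one route among several): simulate classes of 20 new
# students from the predictive implied by the old observations.
set.seed(42)                            # arbitrary seed
counts <- c(4, 17, 59, 16, 4)           # old observations, buckets 0..4
B <- 10000                              # number of simulated classes
sims <- matrix(0, B, 5, dimnames = list(NULL, as.character(0:4)))
for (b in 1:B) {
  g <- rgamma(5, shape = 1 + counts)    # normalized gammas give the
  sims[b, ] <- rmultinom(1, 20, g / sum(g))  # bucket chances per draw
}
# e.g. the predictive chance that exactly one of the 20 has CGPA = 4:
mean(sims[, "4"] == 1)
# and the distribution of the number with CGPA = 2:
table(sims[, "2"]) / B
```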