
Free Data Science Class: Predictive Case Study 1, Part V

Review!

This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

We have all we need if we want to characterize our uncertainty in future CGPAs given only the grading rules, the old observations, and the simple math notions of the multinomial model. I.e., this:

    (5) Pr(CGPA = 4 | grading rules, old observations, fixed math notions).

Given a set of observations of the number in each bucket, we can predict the probability of a new person having each possible CGPA. We are using the data found in Uncertainty, which is in a CSV file here.

CGPA comes with too much precision (I found the file online many, many years ago, and cannot rediscover its origins), with measurements to the hundredth place. It is therefore useful to have a function that rounds to the nearest specified decisionable fraction. I modified this roundTo function to do the job (we're using R, obviously).

# roundTo: round each element of y to the nearest multiple of num;
# values exactly halfway round up by default, down when down = TRUE
roundTo <- function(y, num, down = FALSE) {
    resto = y%%num   # remainder after dividing by the bucket width
    # to round down use '<='
    if(down){
        i = which(resto <= (num/2))
    } else {
        i = which(resto < (num/2))
    }
    # if you don't think you need binary subtract, try these,
    # which should both give 0; try other numbers in (0,1)
    # a=.88; a + 1 - a%%1 - 1
    # a=.89; a + 1 - a%%1 - 1
    y = y + `-`(num , resto)             # push everything up to the next multiple
    if(length(i)) y[i] = `-`(y[i] , num) # pull the round-down cases back by one bucket
    return(y)
}

The reason for the back ticks is given in the comments. Since we're classifying into buckets, floating point math can make buckets which should be 0 into something like 10^-16, which is not 0, and which is also not interesting. Use of the binary subtract function fixes this. If you don't understand the code, don't worry about it, just use it.
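
If you want to see the residue for yourself, run the two lines from the comments (a minimal check; the exact values printed depend on your machine's floating-point arithmetic, so take the annotations as a guide, not a guarantee):

a = .88; a + 1 - a%%1 - 1   # mathematically 0, but floating point may leave a tiny residue near 1e-16
a = .89; a + 1 - a%%1 - 1   # mathematically 0; compare the two results on your machine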

Read the data into R (put your path to the csv file into path):

# path = 'C:/mydrive/mypath/' #windows; note direction of slashes
# path = '/home/me/mypath/' # unix, mac
x = read.csv(paste(path,'cgpa.csv',sep=''))

Then apply our function:

table(roundTo(x$cgpa,1))

 0  1  2  3  4 
 4 17 59 16  4 

I'm mixing code and output here, but you should be able to get the idea. There are n = 100 observations, most of which are CGPA = 2. The model (5) in code (mpp for multinomial posterior predictive):

mpp <- function(x, nd = 3){
  # x = table of counts per bucket; nd = number of significant digits
  # posterior predictive: add 1 to each count, divide by (n + number of buckets)
  x = (1 + x)/(sum(x)+dim(x))
  return(signif(x,nd))
}

This is model (5) in all its glory! Note that this is a bare-bones function. All code in this class is for illustration only, for ease of reading; nothing is optimized. This code does no error checking and doesn't handle missing values; it only spits out the answer given a table as input, like this (the signif rounds to significant digits):

mpp(table(roundTo(x$cgpa,1)))

     0      1      2      3      4 
0.0476 0.1710 0.5710 0.1620 0.0476 
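
To see where these come from: each bucket's count gets 1 added and the total gets bumped by the number of buckets, so for CGPA = 2 the prediction is (59 + 1)/(100 + 5) = 60/105 ≈ 0.571, and for CGPA = 4 it is (4 + 1)/105 ≈ 0.0476.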

Notice there is less than a 59% chance of a new CGPA = 2, but more than a 4/100 chance of a new CGPA = 4. The future is less certain than the past! Suppose we wanted finer gradations of CGPA, say to the nearest 0.5:

table(roundTo(x$cgpa,1/2))

  0 0.5   1 1.5   2 2.5   3 3.5   4 
  2   3   8  21  33  20   7   4   2  

mpp(table(roundTo(x$cgpa,1/2)))

     0    0.5      1    1.5      2    2.5      3    3.5      4 
0.0275 0.0367 0.0826 0.2020 0.3120 0.1930 0.0734 0.0459 0.0275 
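
The same recipe works for any bucket width; for instance, quarter-point buckets (output omitted; run it against the data):

mpp(table(roundTo(x$cgpa, 1/4)))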

Play with other values of num in roundTo(). We're done, really, with what we can do with (5), except, of course, for checking it on real new measurements. Which I don't have. And which brings up an important point.

The point of the predictive method is to make testable predictions, which we have just done. But we can't test them until we get new measurements. Yes, we can and will check the old data as if it were new, but this is always cheating, because as everybody does or should know, it is always possible to derive a model which fits data arbitrarily well. Schemes which split data into "training" and "testing" sets cheat too if they ever in any way use the results of the testing data to tweak the model. That is just using all the data in fitting/training. Though there are attempts and supposed techniques to reuse data, the only way to assess the performance of any model is to compare it against data that has never before been seen (by the model).

Model (5) can't be pushed further. But we do have other formal, measured information at hand, about which more in a moment. Of informal, non-quantifiable evidence, we are loaded. We can easily do this:

    (6) Pr(CGPA = 4 | grading rules, old observations, fixed math notions, E),

where E is a joint proposition carrying what you know about CGPA; things like, say, majors, schools, age, etc. Things which are not formally measured and may even be unmeasurable. After all, to what schools, times, places, people does (5) apply? Pay attention: this is the big question! It by itself says all schools, all times, all places, all peoples---as long as they conform to the formal grading rules.

Pause and consider this. (5) is universal. If the old observations came from, say, Sacred Heart Institute of Technology and we knew that, which we don't (recall I found this data maybe twenty years ago from a place only the Lord knows), then we might insist E = "The school is Sacred Heart only". Or E = "The school is like Sacred Heart." Like is not quantifiable, and will differ widely in conception between people. Well, and that means (6) will be different for each different E. Each conception gives a different model!

Again, this is not a bug; it is a feature.

Notice that (6) is not (5), a trivial point, perhaps, but one that can be forgotten if it is believed there is a "true" model out there somewhere, where "true" is used in the sense that probability is real or that we can identify cause. We've already discussed this, so if you don't have it clear in your mind, review!

Next time we introduce SAT, HGPA, and time spent studying, and see what we can do with these formal measurements.

Homework: Using (5) and the data at hand, suppose there are n = 20 new students. What can you say about the predictions of the numbers of new students having CGPA = 4, etc.?

Replies

  1. Given the data at hand, we may predict that there will more likely than not be one new student to achieve a ‘4’ CGPA. There may be two or more, of course, but more than one becomes increasingly unlikely, and the existence of the one is uncertain. This assumes a great deal – homogeneity of teachers and teaching and grading styles between class years, as well as the quality of students and their backgrounds, talents, and skills, and prior educational experiences. Has the school suddenly suffered a Title IX or affirmative action incident, which forced them to admit students other than by the school’s traditional acceptance methods? Had the public schools feeding this college adopted “common core” educational methods some years ago, the first crop of graduates from which are now just entering college? Have the students unionized after occupying the Dean’s office, demanding that all students receive not less than a ‘3’ CGPA? Have SJWs successfully infiltrated the school’s administration, and suddenly put into force new policies designed to achieve educational ‘fairness’?

  2. All you can say about the distribution of CGPA for the 20 new students is that it might be similar to the distribution of the 100 that fueled the model because they are college students. We have no idea about the differences between the 20 and the 100. Brighter/dimmer, younger/older, from a rich suburban HS or a poor urban one, English-speaking natives or creole-speaking Haitian immigrants (who by the way are taught in French in their home country), etc.? Whatever the distribution, the fraction in each grade bin is affected by the sample size being only one fifth as large as the original 100.

  3. Gary,

    I put the question badly. Of 20 new students, what is the probability that all, none, or any number in between of them have a CGPA of 4 (of 3, of 2, of 1, of 0)?

  4. More than 50 percent of the students were on probation. And the average CGPA is about 2. One has to question how the data were ascertained to avoid garbage-in-garbage-out. (Data Science Basic 0.)

    (mpp for multinomial posterior predictive):

    So, what you have here is an approximation via the Bayesian framework.

    As I stated before, your answers depend on how you round the data, iow, how you waste data information or how you manipulate data.

    If you round to the nearest integer, the probability of CGPA=2.5 is 0.

    If you want to round to the nearest 0.5, the probability of CGPA=2.5 is 0.193.

    Same data but big difference in the results of 0 and 0.193.  A red flag?!

    A simple, non-parametric frequentist method is to use the empirical probability distribution, i.e., the relative frequencies. Since the data size is 100, the relative frequencies are just the counts divided by 100.

    If the data of size 100 are rounded to integers,
    0    1     2     3     4
    4   17   59   16   4

    The empirical probability distribution for CGPA is
    0         1         2        3        4
    0.04   0.17   0.59   0.16   0.04

    If the data are rounded to the nearest 0.5, the empirical probability distribution for CGPA is
    0          0.5     1       1.5      2         2.5      3      3.5     4
    0.02   0.03   0.08   0.21   0.33   0.20   0.07  0.04   0.02

    The results are not much different from the ones in this post.

    Is the relative frequency distribution not as good? It is better in terms of its simplicity and ease of calculation.

    “(5) is universal”. What does this mean? The results here are not universally applicable regardless of what it means, especially when the results may vary depending on the rounding scheme of the perfectly content data and when one has no idea how the data were obtained. Why disturb or modify the data? Rounding of data values is a modification of data and is usually only recommended at the reporting stage.

    To round a value y to the nearest num, use
    ans=round(y / num)* num

  5. JH,

    If you round to the nearest integer, the probability of CGPA=2.5 is 0.

    If you want to round to the nearest 0.5, the probability of CGPA=2.5 is 0.193.

    Same data but big difference in the results of 0 and 0.193. A red flag?!

    No, merely the result of probability being conditional on the assumptions made. These are both the right answers to different problems, in the same way x + y = 17 is different for x given different values of y.

    “Is the relative frequency distribution not as good?”

    The past observations are by definition good. Predicting what will happen uses them, so naturally the observations are good at helping predict the future.

    “‘(5) is universal’. What does this mean?”

    It means exactly what it says. Since the right hand side does not specify where and when to use (5), it must necessarily be useful everywhere (where the grading rules apply). To limit its use, we need to change the model, i.e. by adding to the right hand side. The same answer applies to your comment about rounding.

    All,

    Think more about what “all probability is conditional on the assumptions made” means.

    Update: Here’s an example where the relative frequency is no good. Because there is no relative frequency. Before you have made any measurements, you can still predict the future.

    mpp(as.table(c(0,0,0,0,0)))
    
      A   B   C   D   E 
    0.2 0.2 0.2 0.2 0.2 
    

    We deduced that answer in an earlier lesson.

    Rounding also is no problem. Try thinking of the grading buckets as actual buckets into which actual students must fit. If there are only 5 (rounded to nearest 1), then all students must fit into one of these buckets. If there are more (rounded to nearest specified fraction), again students must fit into one of the buckets.

  6. Did I somehow manage to raise the question of whether the past observations are good?

    Sure, different premises (information) may lead to different conclusions. No need to use any calculations to demonstrate this point. No calculations needed to mash tofu, either.

    Let me offer you a solution to the red flag of having drastically different probability estimates for CGPA=2.5 even though the same data evidence is used.

    If rounding to the nearest integer is performed, one would conclude that the (estimated) probabilities that CGPA falls in [0, 0.5), [0.5, 1.5), …, [3.5, 4] are (insert the numbers here to reflect the rounding of the raw data). Basically, “0” groups the data values from 0 to 0.49, “1” from 0.5 to 1.49, and so on.

    Think of how a relative frequency histogram (distribution) is constructed and reported.

    Anyway, one usually doesn’t modify the data to suit the statistical model or method he would like to use. It’s the other way around.

  7. JH,

    “Anyway, one usually doesn’t modify the data to suit the statistical model or method he would like to use. It’s the other way around.”

    Yes! Exactly so! This is it!

    What we’re doing here is not the usual way. It’s something so new it’s old: the original uses of probability.

    Things do not have a probability. Maybe we can all get this rounding example by thinking of a more concrete decision. Suppose the Dean is assigning sophomores to one of five dorms based on their first-year grades, 0-4. That’s the decision, and we use certain premises (such as the grading rules, the ad hoc formal math assumptions, and so on) and the old observations as they fit into this decision scheme to deduce our probabilities. “Rounding” (and the probability model) necessarily follows.

    This is exactly the opposite of the frequentist-Bayesian classical analysis, which begins by assuming (and then forgetting the assumption) that CGPA “has” a distribution which remains only to be discovered, and then does all sorts of other things involving parameters which never answer the real questions of interest. Further proof of that is in Uncertainty and in future lessons.

  8. So the “multinomial posterior predictive probability” is built thusly:
    – create a table of frequencies from the data
    – add one to all the frequencies
    – calculate the new relative frequencies

    All that beta-binomial stuff was much harder. Where did we use the "Dirichlet prior"?

  9. Rich,

    Good question. The PDF linked in the lesson before derives all the math.

    Don’t forget, the simple model (5) only takes past observations, the grading rules and the math you’ll look up, and makes predictions. It turns out not to be that difficult!
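
    In symbols: with a Dirichlet prior whose parameters are all 1, the posterior predictive probability of bucket k works out to (count in bucket k + 1)/(n + number of buckets), which is exactly what mpp() computes. The "add one to each count" step is where that prior shows up.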

  10. It is not the usual way because tampering with data is wrong. The consequence of tampering with data is not on a par with that of misusing p-values. It is much worse.

    Why would you need to deduce your probability distribution for the dorm-assignment example? The computer can automatically produce the dorm assignment once the CGPA data are available. So, tell me the real question of YOUR interest here.

    But it is true that to use the approximation you’d need to GROUP the data values… multinomial, that is. After all, it is derived from the Bayesian framework of using the multinomial likelihood and a convenient conjugate Dirichlet prior with fixed parameter values. There is nothing deductive about the approximated probability distribution when the results would depend on the data and the rounding scheme.

    (Note the word distribution means how the probabilities are distributed among possible values. )

    And if the results clearly contradict the observations (e.g., P(CGPA=2.5)=0, yet there is an observation value of 2.5), think about what is wrong.

    Nothing is assumed in the construction of relative frequency histogram/distribution. A simple relative frequency distribution (aka, empirical probability distribution (EPD) in this case) is just fine for summarizing the univariate CGPA data, and for making appropriate generalizations if there is sufficient background information, and for whatever purpose the approximated probability distribution you love is meant to serve.

    Just a univariate data set. A simple EPD will do. You don’t need a dagger to slice tofu.

    “Things do not have a probability.”
    Is it possible to go beyond such rhetoric?!

  11. JH,

    I don’t think I’ve explained well the example or the way I’m using terminology, which I announced at the outset would be different than is usual. We cannot frame everything with respect to the usual way.

    (1) The example. Note first that in the way we are using pure probability, we look at the measurements with respect to the decision we will make. Things do not have probabilities. Two people with different decisions can (and usually will) come to different probabilities, even if they use the same “base” model (more next). Here is the example we must all comment on. There are 5 houses, 0-4, into which a student will reside his sophomore year based on his first-year grade. What is the chance, given the grading rules, past obs, and ad hoc model, he’ll be assigned to each house? The rounding necessarily follows. There is no 2.5 house so the probability is 0. Change the decision to increase the number of houses, 0, 0.5, etc. The decision has changed. The probability has changed. Again, the new rounding necessarily follows (there could be no 2.5s if the decision is 0-4).

    (2) The model. Unfortunately, I used “deduction” twice in different contexts, which caused confusion. Now given any pr(y|w,x,z), the probability of y is “deduced”. This is so even if z = “ad hoc assumptions”, as with the Dirichlet prior etc. But there is the deeper sense of “deduced” in which the model itself, the z, is deduced based on the measurement of the observables. I should call that something else. Model deduction, perhaps; versus probability deduction. Again, change the measurement, change the model.

  12. Mr Briggs,
    I have looked for the link to the pdf you mentioned but not found it. Could you put it in a comment perhaps?

  13. Thanks for the link.

    Homework.

    I suggest that the number of students in the new group of twenty with a cgpa of 4 will follow a binomial distribution with n=20 and p=0.0476. Similarly with cgpa’s of 0-3 and the corresponding value of p in [0.0476 0.1710 0.5710 0.1620 0.0476].
    You can get the same result by summing all the multinomial probabilities for each value of cgpa=4 in all possible combinations of 20 students’ cgpa’s.

  14. (1) When you use terms like “pure probability” and “non-real parameters”, it sends question marks to my brain as to whether there are “non-pure” and “real” ones. Are those terms necessary? OK… things do not have probabilities, and so what?

    0.5 house! I know people who count with the alphabet, but you are the first person I know who doesn’t use whole numbers to count. (I always count in Chinese, in counting numbers though.) It is obvious that if there are more houses available the dorm assignment will be different. I imagine the number of houses would be pre-determined at the planning stage before the data collection stage.

    Furthermore, you may label or number the houses any way you wish, and the labels are treated as categorical not numerical data. Changing the label doesn’t change the house a label is supposed to represent. CGPA data are numerical. If you change/modify the CGPA from 2.5 to 3 or 2.49 to 2, you’ve changed the information contained in the data. Oh, also a rounding of CGPA from 2.46 to 2 may unjustly send a student to a Dog House.

    (2) Deeper sense of “deduced”? Here I go again. What is the shallower sense then? It’s either deduced or not. If all premises are true, deductive reasoning would lead to true conclusions.
    Yes, change the measurement, change the model. And depending on the situation, instead of “change”, the word “fabricate” or “tamper” or “modify” may be used.

  15. “There are 5 houses, 0-4, into which a student will reside his sophomore year based on his first-year grade. What is the chance, given the grading rules, past obs, and ad hoc model, he’ll be assigned to each house?”

    If students are to be assigned to House Kirk, House Picard, House Sisko, House Janeway, or House Archer based on their 1st year grade, then give the rules and use the grade stored in the computer to sort them into different houses.

    Why would you need an ad hoc model or need to know the chances? And how and what data are analyzed? (I don’t see the house labels or numbers contained in the data to be analyzed.)

    If you set up the fake scenario correctly, or if for some reason you’d like to compute the chances, how would each student be assigned to each house based on the probability results? I imagine you will need a rule (just like if p-value < 0.05 then…) to assign a house to each of the students.
