Category: Class – Applied Statistics

December 12, 2017 | 18 Comments

Free Data Science Class: Predictive Case Study 1, Part V

Review!

This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

We have all we need if we want to characterize our uncertainty in future CGPAs given only the grading rules, the old observations, and the simple math notions of the multinomial model. I.e., this:

(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Given a set of observations of the number in each bucket, we can predict the probability of a new person having each possible CGPA. We are using the data found in Uncertainty, which is in a CSV file here.

CGPA comes with too much precision (I found the file online many, many years ago, and cannot rediscover its origins), with measurements to the hundredths place. It is therefore useful to have a function that rounds to the nearest specified decisionable fraction. I modified this roundTo function to do the job (we’re using R, obviously).

```
roundTo <- function(y, num, down = FALSE) {
  resto = y %% num
  # to round down use '<='
  if (down) {
    i = which(resto <= (num/2))
  } else {
    i = which(resto < (num/2))
  }
  # if you don't think you need binary subtract, try these,
  # which should both give 0; try other numbers in (0,1)
  # a = .88; a + 1 - a%%1 - 1
  # a = .89; a + 1 - a%%1 - 1
  y = y + `-`(num, resto)
  if (length(i)) y[i] = `-`(y[i], num)
  return(y)
}
```

The reason for the back ticks is given in the comments. Since we're classifying into buckets, floating point math can make buckets which should be 0 into something like 10^-16, which is not 0, and which is also not interesting. Use of the binary subtract function fixes this. If you don't understand the code, don't worry about it, just use it.
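To see that residue for yourself, run the two examples from the function's comments. In exact arithmetic both lines give 0, but on most machines floating point leaves a leftover near 10^-16 for at least one of them:

```r
# both of these are 0 in exact arithmetic; floating point may disagree
a = .88
a + 1 - a %% 1 - 1
a = .89
a + 1 - a %% 1 - 1
```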

Read the data into R (put your path to the csv file into `path`):

```
# path = 'C:/mydrive/mypath/' # windows; note direction of slashes
# path = '/home/me/mypath/'   # unix, mac
x = read.csv(paste0(path, 'cgpa.csv')) # substitute your csv's actual file name
```

Then apply our function:

```
table(roundTo(x$cgpa, 1))

 0  1  2  3  4 
 4 17 59 16  4 
```

I'm mixing code and output here, but you should be able to get the idea. There are n = 100 observations, most of which are CGPA = 2. The model (5) in code (mpp for multinomial posterior predictive):

```
mpp <- function(x, nd = 3){
  # nd = number of significant digits
  # (1 + counts) / (n + number of buckets)
  x = (1 + x)/(sum(x) + dim(x))
  return(signif(x, nd))
}
```

This is model (5) in all its glory! Note that this is a bare-bones function. All code in this class is for illustration only, for ease of reading; nothing is optimized. This code does no error checking, doesn't handle missing values; it only spits out the answer given a `table` as input, like this (the signif rounds to significant digits):

```
mpp(table(roundTo(x$cgpa, 1)))

     0      1      2      3      4 
0.0476 0.1710 0.5710 0.1620 0.0476 
```

Notice there is less than a 59% chance of a new CGPA = 2, but more than a 4/100 chance of a new CGPA = 4. The future is less certain than the past! Suppose we wanted finer gradations of CGPA, say to the nearest 0.5:

```
table(roundTo(x$cgpa, 1/2))

  0 0.5   1 1.5   2 2.5   3 3.5   4 
  2   3   8  21  33  20   7   4   2 

mpp(table(roundTo(x$cgpa, 1/2)))

     0    0.5      1    1.5      2    2.5      3    3.5      4 
0.0275 0.0367 0.0826 0.2020 0.3120 0.1930 0.0734 0.0459 0.0275 
```

Play with other values of `num` in `roundTo()`. We're done, really, with what we can do with (5), except, of course, for checking it on real new measurements. Which I don't have. And which brings up an important point.

The point of the predictive method is to make testable predictions, which we have just done. But we can't test them until we get new measurements. Yes, we can and will check the old data as if it were new, but this is always cheating, because, as everybody does or should know, it is always possible to derive a model which fits data arbitrarily well. Schemes which split data into "training" and "testing" sets cheat too if they ever in any way use the results of the testing data to tweak the model. That is just to use all the data in fitting/training. Though there are attempts and supposed techniques to reuse data, the only way to assess the performance of any model is to compare it against data that has never before been seen (by the model).

Model (5) can't be pushed further. But we do have other formal, measured information at hand, about which more in a moment. Of informal, non-quantifiable evidence, we are loaded. We can easily do this:

(6) Pr(CGPA = 4 | grading rules, old observation, fixed math notions, E),

where E is a joint proposition carrying what you know about CGPA; things like, say, majors, schools, age, etc. Things which are not formally measured, and which may even be unmeasurable. After all, to what schools, times, places, people does (5) apply? Pay attention: this is the big question! It by itself says all schools, all times, all places, all peoples---as long as they conform to the formal grading rules.

Pause and consider this. (5) is universal. If the old observations came from, say, Sacred Heart Institute of Technology and we knew that, which we don't (recall I found this data maybe twenty years ago from a place only the Lord knows), then we might insist E = "The school is Sacred Heart only". Or E = "The school is like Sacred Heart." Like is not quantifiable, and will differ widely in conception between people. Well, and that means (6) will be different for each different E. Each conception gives a different model!

Again, this is not a bug, it is a feature.

Notice that (6) is not (5), a trivial point, perhaps, but one that can be forgotten if it is believed there is a "true" model out there somewhere, where "true" is used in the sense that probability is real or that we can identify cause. We've already discussed this, so if you don't have it clear in your mind, review!

Next time we introduce SAT, HGPA, and time spent studying, and see what we can do with these formal measurements.

Homework: Using (5) and the data at hand, suppose there are n = 20 new students. What can you say about the predictions of the numbers of new students having CGPA = 4, etc.?
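One way to begin the homework is simulation: draw bucket probabilities from the Dirichlet posterior implied by (5), then draw counts for the 20 new students. This is only a sketch, not the only route, and it uses the bucket counts tallied above:

```r
set.seed(1)
counts <- c(4, 17, 59, 16, 4)   # observed buckets 0,1,2,3,4 from above
sims <- replicate(5000, {
  g <- rgamma(5, shape = 1 + counts)        # Dirichlet(1 + counts) via gammas
  as.vector(rmultinom(1, size = 20, prob = g / sum(g)))
})
# distribution of the number of the 20 new students with CGPA = 4
table(sims[5, ]) / 5000
```

Each column of `sims` is one possible future class of 20; tabulating any row gives the predictive distribution for the count in that bucket.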

December 5, 2017 | 9 Comments

Free Data Science Class: Predictive Case Study 1, Part IV

Review!

Code coming next week!

Last time we decided to put ourselves in the mind of a dean and ask for the chance of CGPA falling into one of these buckets: 0, 1, 2, 3, 4. We started with a simple model to characterize our uncertainty in future CGPAs, which was this:

(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Now the “fixed math notions” means, in this case, that we use a parameterized multinomial probability distribution (look it up anywhere). This model, via the introduction of non-observable, non-real parameters (little bits of math necessary for the equations to work out), gives the probability of belonging to one of the buckets, of which there are here 5: 0-4.

The parameters themselves are the focus of traditional statistical practice, in both its frequentist and Bayesian flavors. This misplaced concentration came about for at least two reasons: (a) the false belief that probabilities are real and thus so are parameters, at least “at” infinity and (b) the mistaking of knowledge of the parameters for knowledge of observables. The math for parameters (at infinity) is also easier than looking at observables. Probability does not exist, and (of course) we now know knowledge of the parameters is not knowledge of observables. We’ll bypass all of this and keep our vision fixed on what is of real interest.

Machine learning (and AI etc.) models have parameters, too, for the most part, but these are usually hidden away and observables are primary. This is a good thing, except that the ML community (we’ll lump all non-statistical probability and “fuzzy” and AI modelers into the ML camp) created for themselves new errors in philosophy. We’ll start discussing these this time.

Our “fixed math notions”, an assumption we made, but with good reason, include selecting a “prior” on the parameters of the model. We chose the Dirichlet; many others are possible. Our notions also selected the model. Thus, as is made clear in the notation, (5) is dependent on the notions. Change them, change the answer to (5). But so what? If we change the grading rules we also change the probability. Changing the old observations also changes the probability.

There is an enormous amount of hand-wringing about the priors portion of the notions. Some of the concern is with getting the math right, which is fine. But much is because it is felt there are “correct” priors somewhere out there, usually living at Infinity, and if there are right ones we can worry we might not have found the right ones. There are also many complaints that (5) is reliant on the prior. But (5) is always reliant on the model, too, though few are concerned with that. (5) is dependent on everything we stick on the right hand side, including non-quantifiable evidence, as we saw last time. That (5) changes when we change our notions is not a bug, it is a feature.

The thought among both the statistical and ML communities is that a “correct” model exists, if only we can find it. Yet this is almost never true, except in those rare cases where we deduce the model (as is done in Uncertainty for a simple case). Even deduced models begin with simpler knowns or assumptions. Any time we use a parameterized model (or any ML model) we are making more or less ad hoc assumptions. Parameters always imply lurking infinities, either in measurement clarity or numbers of observations, infinities which will always be lacking in real life.

Let’s be clear: every model is conditional on the assumptions we make. If we knew the causes of the observable (here CGPA; review Part I) we could deduce the model, which would supply extreme probabilities, i.e. 0s and 1s. But since we cannot know the causes of grade points, we can instead opt for correlation models, as statistical and ML models are (any really complex model may have causal elements, such as in physics etc., but these won’t be completely causal and thus will be correlational in output).

This does not mean that our models are wrong. A wrong model would always misclassify and never correctly classify, and it would do so intentionally, as it were. This wrong model would be a causal model, too, only it would purposely lie about the causes in at least some instances.

No, our models are correlational, almost always, and therefore can’t be wrong in the sense just mentioned; neither can they be right in the causal sense. They can, however, be useful.

The conceit by statistical modelers is that, once we have in hand correlates to our observables, which in this case will be SAT scores and high school GPAs, if we make our sample size large enough, we’ll know exactly how SAT and HGPA “influence” CGPA. This is false. At best, we’ll sharpen our predictive probabilities to some extent, but we’ll hit a wall and go no further. This is because SAT scores do not cause CGPAs. We may know all there is to know about some non-interesting parameters inside some ad hoc model, but this certainty will not transfer to the observables, which may be as murky as ever. If this doesn’t make sense, the examples will clarify it.

The similar conceit of the ML crowd is that if only the proper quantity of correlates are measured, and (as with the statisticians) measured in sufficient number, all classification mistakes will disappear. This is false, too. Because unless we can measure all the causes of each and every person’s CGPA, the model will err in the sense that it will not produce extreme probabilities. Perfect classification is a chimera—and a sales pitch.

Just think: we already know we cannot know all the precise causes of many contingent events. For example, quantum measurements. No parameterized model nor the most sophisticated ML/AI/deep learning algorithm in the world, taking all known measurements as input, will classify better than the simple physics model. Perfection is not ours to have.

Next time we finally—finally!—get to the data. But remember we were in no hurry, and that we purposely emphasized the meaning and interpretation of models because there is so much misinformation here, and because these are, in the end, the only parts that matter.

December 1, 2017 | 5 Comments

Parameters Aren’t What You Think

I was asked to comment on a post by Dan Simpson exploring the Bernstein-von Mises theorem.

This post fits in with the Data Science class, a happy coincidence, and has been so categorized.

A warning. Do not click on the link to a video by Diamanda Galas. I did. It was so hellishly godawful that I have already scheduled a visit with my surgeon to have my ear drums removed so that not even by accident will I have to listen to this woman again.

Now the Bernstein-von Mises theorem says, loosely and with a list of caveats given by Simpson, that for a parameterized probability model and a given prior, the posterior on the parameter converges (in probability) to a multivariate normal with a covariance matrix that is an inverse function of n and of the Fisher Information Matrix, centered around the “true” parameter.

It doesn’t matter here about the mathematical details. The rough idea is that, regardless of the prior used but supposing the caveats are met, the uncertainty in the parameter becomes like the uncertainty a frequentist would assess of the parameter. Meaning Bayesians shouldn’t feel too apologetic around frequentists frightened that priors convey information. It’s all one big happy family out at The Limit.

There is no Limit, though. It doesn’t exist for actual measures.

You know what else doesn’t exist? Probability. Things do not “have” probabilities or probability distributions (we should never say “This has a normal distribution”). It is only our uncertainty that can be characterized using probability (we should say “Our uncertainty in this is quantified by a normal”). And also non-existent, as a deduction from that truth, are parameters. Since parameters don’t exist, there can’t be a “true” value of them. Yet parameters are everywhere used and they are (or seem to be) useful. So what’s going on?

Recall our probabilistic golden rule: all probability is conditional on the assumptions made, believed, deduced or measured.

Probability can often be deduced. The ubiquitous urn models are good examples. In an urn are n_0 0s and n_1 1s. Given this information (and knowledge of English grammar and logic), the chance of drawing a 1 is n_1/(n_1+n_0). In notation:

(1) Pr(1 | D) = n_1/(n_1+n_0),

where D is a joint proposition containing the information just noted (the deduction is sound and is based on the symmetry of logical constants where there is no need to talk of drawing mechanisms, randomness, fairness, or whatever; See Uncertainty for details).

If all (take the word seriously) we know of the urn are its constituents, we have all the probability we need in (1). We are done. Oh, we can also deduce answers to questions like, “What are the chances of seeing 7 1s given we have already taken such-and-such from the urn.” But the key is that all is deduced.

So what if we don’t know how many 0s and 1s there are, but we still want:

(2) Pr(1 | U) = ?,

where U means we know there are 1s and 0s but we don’t know the proportion (plus, as usual and forever, we also in U know grammar, logic, etc.). Well, it turns out the answer is still deducible as long as we assume a value n = n_1 + n_0 exists. We don’t even need to know it, really; we just need to assume it is less than infinity. Which it will be. No urn contains an infinite number of anything. Intuitively, since we have no information on n_1 or n_0, except that they must be finite, we can solve (2). The answer is 1/2. (Take a googol to the googol-th power a googol times; this number will be finite and bigger than you ever need.)
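That 1/2 can be checked numerically for any finite n: weight each possible composition n_1 = 0, 1, …, n equally and average the deduced probability n_1/n (a small check, not the full deduction):

```r
# average the deduced chance n_1/n over all equally weighted compositions
pr_one <- function(n) mean((0:n) / n)
sapply(c(2, 10, 1000), pr_one)   # 0.5 for every finite n
```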

As above, we can ask any kind of question, e.g. “Given I’ve removed 18 1s and 12 0s, and I next grab out 6 balls, what are the chance at least 3 will be 1s?” The answer is deducible; no parameter is needed.
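For the curious, that last question can be computed: under a uniform prior over finite compositions the predictive for future draws works out to a beta-binomial with a = 18 + 1 and b = 12 + 1. That reduction is my gloss, assumed here rather than derived; a sketch:

```r
a <- 19; b <- 13; m <- 6          # 18 1s + 1, 12 0s + 1; 6 future draws
k <- 0:m
pk <- choose(m, k) * beta(a + k, b + m - k) / beta(a, b)  # beta-binomial pmf
sum(pk[k >= 3])                   # chance at least 3 of the 6 are 1s
```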

So what if we don’t know n? It turns out not to be too important after all. We can still deduce all the probabilities we want, as long as n is finite. What if n is infinite, though? Well, it can’t be. But what if we assume it is?

We have left physical reality and entered the land of math. We could have solved the problem above for any value of n short of infinity, and since we can let n be very large indeed, this is no limitation whatsoever. Still, as Simpson rightly says, asymptotic math is much easier than finite, so that if we’re willing to close our eyes to the problem of infinitely sized urns, maybe we can make our lives computationally easier.

Population increase

Statistics revolves around the idea of “samples” taken from “populations”. Above, when n was finite, our population was finite, and we could deduce probabilities for the remaining members of the population given we’ve removed so much sample. (Survey statistics is careful about this.)

But if we assume an infinite population, no matter how big a sample we remove, we always have an infinite number left. The deductions we produced above won’t work. But we still want to do probability—without the hard finite population math (and it is harder). So we can try this:

(3) Pr(1 | Inf) = θ,

where Inf indicates we have an infinite urn and θ is a parameter. The value of this parameter is unknown. What exactly is this θ? It isn’t a probability in any ontological sense, since probabilities don’t exist. It’s not a physical measure as n_1/(n_1+n_0) was, because we don’t know what n_1 and n_0 are except that there are an infinite number of each of them and, anyway, we can’t divide infinities so glibly. (The probability in (1) is not the right hand side; it is the left hand side. The right hand side is just a number!)

The answer is that θ isn’t anything. It’s just a parameter, a placeholder. It’s a blank spot waiting to be filled. We cannot provide any answers to (3) (or questions like those above based on it) until we make some kind of statement about θ. If you have understood this last sentence, you have understood all. We are stuck, the problem is at a dead end. There is nowhere to go. If somebody asks, “Given Inf, what is the probability of a 1?” all you can say is “I do not know” because you understand saying “θ” is saying nothing.

Bayesians of course know they have to make some kind of statement about θ, or the problem stops. But there is no information about θ to be had. In the finite-population case, we were able to deduce the probability because we knew n_1 could equal 0, 1, …, n, with the corresponding adjustments made to n_0, i.e. n, n-1, …, 0. No combination (this can be made more rigorous) was privileged over any other, and the deduction followed. But when the population is infinite, it is not at all clear how to specify the breakdowns of n_1s and n_0s in the infinite urn; indeed, there are an infinite number of ways to do this. Infinities are weird!

The only possible way out of this problem is to do what the serial writer of old did: with a mighty leap, Jack was free of the pit! An ad hoc judgment is made. The Bayesian simply makes up a guess about θ and places it in (3). Or not quite; but suppose one did, which would give us

(4) Pr(1 | Inf; θ = 0.5) = θ (= 0.5).

Hey, why not? If probability is subjective, which it isn’t, then probability can equal anything you feel. Feelings…whoa-oh-a feelings.

No, what the Bayesian does is invoke outside evidence, call it E, which sounds more or less scientific or mathematical, and information about θ, now called the prior, is given. The problem is then solved, or rather it is solvable. But it’s almost never solved.

The posterior is not the end

Having become so fascinated by θ, the statistician cannot stop thinking of it, and so after some data is taken, he updates his belief about θ and produces the posterior. That’s where we came in: at the end.

This posterior will, given a sufficient sample and some other caveats, look like the frequentist point estimate and its confidence interval. Frequentists are not only big believers in infinity, they insist on it. No probability can be defined in frequentist theory unless infinite samples are available. Never mind. (Frequentism always fails in finite reality.)

You know what happens next? Nothing.

We have the posterior in hand, but so what? Does that say anything about (3)? No. (3) was what we wanted all along, but we forgot about it! In the rush to do the lovely (and it is) math about priors and posteriors we mislaid our question. Instead, we speak solely about the posterior (or point estimate). How embarrassing. (Come back Monday for more on this subject.)

Well, not all Bayesians forget. Some take the posterior and use it to produce the answer to (3), or what turns out to be the modification of (3), and what is called the posterior predictive distribution.

(5) Pr(1 | Inf; belief about θ) = some number.

Here is the funny part, at least for this problem. If we say, as many Bayesians do say, that θ is equally likely to be any number between 0 and 1 (a “flat” prior), then the posterior predictive distribution is exactly the same as the answer for (1).

That’s looking at it the wrong way around, though. What happens is that if you take (1) (take it mathematically, I mean) and let n go to infinity in a straightforward way, you get the posterior predictive distribution of (3) (but only with a “flat” prior).

So, at least in this case, we needn’t have gone to the bother of assuming an infinite urn, since we had the right answer before. Other problems are more complex, and insufficient attention has been paid to the finite math, so we don’t have answers in every problem. Besides, it’s easier to assume an infinite-parameter based model and work out that math.
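Here is a numerical check of that equivalence, using the draws above (18 1s and 12 0s) and assuming, purely for illustration, a finite urn of N = 1000 with a uniform prior over its compositions; the finite deduction lands exactly on the flat-prior predictive (k + 1)/(m + 2):

```r
N <- 1000; k <- 18; m <- 30          # urn size; 1s seen; total draws
n1 <- k:(N - (m - k))                # compositions consistent with the draws
like <- choose(n1, k) * choose(N - n1, m - k) / choose(N, m)  # hypergeometric
post <- like / sum(like)             # uniform prior over compositions
sum(post * (n1 - k) / (N - m))       # chance the next draw is a 1
(k + 1) / (m + 2)                    # flat-prior predictive: 19/32
```

Try other values of N; the answer does not budge, which is why the infinite approximation gets away with it here.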

Influence

Assuming there is an infinite population, not only of 1s and 0s, but for any statistical problem, is what leads to the false belief that “true” values of parameters exist. This is why people will say “X has a distribution”. Since folks believe true values of parameters exist, they want to be careful to guess what they might be. That’s where the frequentist-Bayesian interpretation wars enter. Even Bayesians joust with each other over their differing ad hoc priors.

It should be obvious that, just as assuming a model changes the probability of the observable of interest (like balls in urns), so does changing the prior for a fixed model change the probability of the observable. Of course it does! And should. Because all probability is conditional on the assumptions made; our golden rule. Change the assumptions, change the probability.

There is no sense whatsoever in which a “noninformative” prior can exist. All priors by design convey information. To say the influence of the prior should be unfelt is like saying there should be married bachelors. It makes no logical sense. There isn’t even any sense in which a prior can be “minimally” informative. To be minimally informative would be to keep utterly quiet and say nothing about the parameter.

If there is any sense in which a “correct” prior exists, or a “correct” model for that matter, it is in the finite-deducible sense. We start with an observable that has known finite and discrete measurement qualities, as all real observables do, and we deduce the probabilities from there. We then imagine we have an infinite population, as an augmentation of finite reality, and we let the sample go to infinity. This will give an implied prior and posterior and predictive distribution which we can compare against the correct finite sample answer.

But if we had the correct finite sample answer, why use the infinite approximation? Good question. The only answer is computational ease. Good answer, too.

Even though it might not look it, this little essay is in answer to Simpson. I’m answering the meta-question behind the details of the Bernstein-von Mises theorem, the math of which nobody disputes. As always, it’s the interpretation that matters. In this case, we can invert the BvM theorem and use it to show how far wrong frequentist point-estimates are. After all, frequentist theory can be seen as the infinite-approximation method to Bayesian problems—which themselves, when using parameters, are infinite-population approximations to finite reality. Frequentist methods are therefore a double approximation, which is another reason they tend to produce so much over-certainty.

What I haven’t talked about, and what there isn’t space for, are these so-called infinite dimensional models, where there are an infinity of parameters. I’ll just repeat: infinity is weird.

November 28, 2017 | 6 Comments

Free Data Science Class: Predictive Case Study 1, Part III

You must review: Part I, II. Not reviewing is like coming to class late and saying “What did I miss?” Note the New & Improved title!

Here are the main points thus far: All probability is conditional on the assumptions made; not all probability is quantifiable or must involve observables; all analysis must revolve on ultimate decisions; unless deduced, all models (AI, ML, statistics) are ad hoc; all continuum-based models are approximations; and the Deadly Sin of Reification lurks.

We are using the data from Uncertainty, so that those bright souls who own the book can follow along. We are interested in predicting the college grade point of certain individuals at the end of their first year. We spent two sessions defining what we mean by this. We spend more time now on this most crucial question.

This is part of the process most neglected in the headlong rush to get to the computer, a neglect responsible for vast over-certainties.

Now we learned that CGPA is a finite-precision number, a number that belongs to an identifiable set, such as 0, 0.0625, and so on, and we know this because we know the scoring system of grades and we know the possible numbers of classes taken. The finite precision of CGPA can be annoyingly precise. Last time we were out at six or eight decimal places, precision far beyond any decision (except ranking) I can think to make.

To concentrate on this decision I put myself in the mind of a Dean—and immediately began to wonder why all my professors aren’t producing overhead. Laying that aside (but still sharpening my ax) I want to predict the chance any given student will have a CGPA of 0, 1, 2, 3, or 4. These buckets are all I need for the decision at hand. Later, we’ll increase the precision.

Knowing nothing except the grade must be one of these 5 numbers, the probability of a 4 is 1/5. This is the model:

(1) Pr(CGPA = 4 | grading rules),

where “grading rules” is a proposition defining how CGPAs are calculated, and with information on the level of precision that is of interest to us, and possibly to nobody else; “grading rules” tells us CGPA will be in the buckets 0, 1, 2, 3, 4, for instance.

The numerical probability of 1/5 is deduced on the assumptions made; it is therefore the correct probability—given these assumptions. Notice this list of assumptions does not contain all the many things you may also know about GPAs. Many of these bytes of information will be non-quantified and unquantifiable, but if you take cognisance of any of them, they become part of a new model:

(2) Pr(CGPA = 4 | grading rules, E),

where E is a compound proposition containing all the semi-formal and informal things (evidence) you know about GPAs, like e.g. grade inflation. This model depends on E, and thus (2) will not likely give quantified or quantifiable answers. Just because our information doesn’t appear in the formal math does not make (2) not a model; or, said another way, our models are often much more than the formal math. If, say, E is only loose notions on the ubiquity of grade inflation, then (2) might equal “More than a 20% chance, I’ll tell you that much.”

To the data

We have made use of no observations so far, which proves, if it already wasn’t obvious, that observations are not needed to make probability judgments (which is why frequentism fails philosophically), and that our models are often more reliant upon intelligence not contained in (direct) observation.

But since this is a statistics-machine learning-artificial intelligence class, let’s bring some numbers in!

Let’s suppose that the only, the sole, the lone observation of past CGPAs was, say, 3. I mean, I have one old observation of CGPA = 3. I want now to compute

(3) Pr(CGPA = 4 | grading rules, old observation).

Intuitively, we expect (3) to decrease from 1/5 to indicate the increased chance of a new CGPA = 3, because if all we saw was an old 3, there might be something special about 3s. That means we actually have this model, and not (3):

(4) Pr(CGPA = 4 | grading rules, old observation, loose math notions).

There is nothing in the world wrong with model (4); it is the kind of mental model we all use all the time. Importantly, it is not necessarily inferior to this new model:

(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions),

where we move to formally define how all the parts on the right hand side mathematically relate to the left hand side.

How is this formality conducted?

Well, it can be deduced. Since CGPA can belong only to a fixed, finite set (as “grading rules” insists), we can deduce (5). In what sense? There are only so many future values we might want to predict; out of (say) 10 new students, how many As, Bs, etc. are we likely to see and with what chance? This is perfectly doable, but it is almost never done.

The beautious (you heard me: beautious) thing about this deduction is that no parameters are required in (5) (nor are any “hidden layers”, nor is any “training” needed). And since no parameters are required, no “priors” or arguments about priors crop up, and there is no need of hypothesis testing, parameter estimates, confidence intervals, or p-values. We simply produce the deduced probabilities. Which is what we wanted all along!

In Uncertainty, I show this deduction when the number of buckets is 2 (here it is 5). For modest n, the result is close to a well-known continuous-parameterized approximation (with “flat prior”), an approximation we’ll use later.

Here (see the book or this link for the derivation) (5) as an approximation works out to be

(5) Pr(CGPA = 4 | GR, n_3 = 1, fixed math) = (1 + n_4)/(n + 5),

where n_j is the number of js observed in the old data, and n is the number of old data points; thus the probability of a new CGPA = 4 is 1/6; for a new CGPA = 3 it is 2/6; also “fixed math” has a certain meaning we explore next time. Model (5), then, is the answer we have been looking for!
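A quick check of those numbers (the counts vector encodes the lone old observation, CGPA = 3):

```r
counts <- c(0, 0, 0, 1, 0)          # buckets 0,1,2,3,4; n_3 = 1, n = 1
(1 + counts) / (sum(counts) + 5)    # 1/6 everywhere except 2/6 at CGPA = 3
```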

Formally, this is the posterior predictive distribution for a multinomial model with a Dirichlet prior. It is an approximation, valid fully only at “the limit”. As an approximation, for small n, it will exaggerate probabilities, make them sharper than the exact result. (For that exact result for 2 buckets, see the book. If we used the exact result here the probabilities for future CGPAs would with n=1 remain closer to 1/5.)

Now since most extant code and practice revolves around continuous-parameterized approximations, and we can make do with them, we’ll also use them. But we must always keep in mind, and I’ll remind us often, that these are approximations, and that spats about priors and so forth are always distractions. However, as the approximation is part of our right-hand-side assumptions, the models we deduce are still valid models. How to test which models worked best in our decision is a separate problem we’ll come to.

Homework: think about the differences in the models above, and how all are legitimate. Ambitious students can crack open Uncertainty and use it to track down the deduced solution for more than 2 buckets; cf. page 143. Report back to me.