## Free Data Science Class: Predictive Case Study 1, Part III

You *must* review: Part I, II. Not reviewing is like coming to class late and saying “What did I miss?” Note the *New & Improved title!*

Here are the main points thus far: All probability is conditional on the assumptions made; not all probability is quantifiable or must involve observables; all analysis must revolve around ultimate decisions; unless deduced, all models (AI, ML, statistics) are *ad hoc*; all continuum-based models are approximations; and the Deadly Sin of Reification lurks.

We are using the data from *Uncertainty*, so that those bright souls who own the book can follow along. We are interested in predicting the college grade point average (CGPA) of certain individuals at the end of their first year. We spent two sessions defining what we mean by this. We spend more time now on this most crucial question.

This is part of the process most neglected in the headlong rush to get to the computer, a neglect responsible for vast over-certainties.

Now we learned that CGPA is a finite-precision number, a number that belongs to an identifiable set, such as 0, 0.0625, and so on, and we know this because we know the scoring system of grades and we know the possible numbers of classes taken. The finite precision of CGPA can be annoyingly precise. Last time we were out at six or eight decimal places, precision far beyond any decision (except ranking) I can think to make.

To concentrate on this decision I put myself in the mind of a Dean—and immediately began to wonder why all my professors aren’t producing overhead. Laying that aside (but still sharpening my ax) I want to predict the chance any given student will have a CGPA of 0, 1, 2, 3, or 4. These buckets are all I need for the decision at hand. Later, we’ll increase the precision.

Knowing *nothing* *except* the grade must be one of these 5 numbers, the probability of a 4 is 1/5. This is the model:

(1) Pr(CGPA = 4 | grading rules),

where “grading rules” is a proposition defining how CGPAs are calculated, including the level of precision that is of interest to us, and possibly to nobody else; “grading rules” tells us CGPA will be in the buckets 0, 1, 2, 3, 4, for instance.
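Model (1) is simple enough to write down directly. Here is a minimal sketch in Python, assuming only that "grading rules" fixes the five buckets; the names `BUCKETS` and `pr_grading_rules_only` are made up for illustration and come from nowhere but this example.

```python
from fractions import Fraction

# Assumption: "grading rules" says CGPA lands in exactly one of these buckets.
BUCKETS = (0, 1, 2, 3, 4)


def pr_grading_rules_only(bucket, buckets=BUCKETS):
    """Pr(CGPA = bucket | grading rules): nothing is known except the list
    of possible values, so each value gets the same probability."""
    if bucket not in buckets:
        raise ValueError(f"{bucket} is not a bucket allowed by the grading rules")
    return Fraction(1, len(buckets))


if __name__ == "__main__":
    for b in BUCKETS:
        print(b, pr_grading_rules_only(b))
```

Exact fractions are used rather than floats because the probability here is deduced, not estimated, so there is no reason to round it.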

The numerical probability of 1/5 is *deduced* on the assumptions made; it is therefore the correct probability—given these assumptions. Notice this list of assumptions does *not* contain *all* the many things you may also know about GPAs. Many of these bytes of information will be non-quantified and unquantifiable, but if you take cognisance of any of them, they become part of a *new* model:

(2) Pr(CGPA = 4 | grading rules, E),

where E is a compound proposition containing all the semi-formal and informal things (evidence) you know about GPAs, e.g. grade inflation. This model depends on E, and thus (2) will not likely give quantified or quantifiable answers. Just because our information doesn’t appear in the formal math does not make (2) not a model; or, said another way, our models are often *much* more than the formal math. If, say, E is *only* loose notions on the ubiquity of grade inflation, then (2) might equal “More than a 20% chance, I’ll tell you that much.”

**To the data**

We have made use of no observations so far, which proves, if it wasn’t already obvious, that observations are not needed to make probability judgments (which is why frequentism fails philosophically), and that our models are often more reliant upon intelligence not contained in (direct) observation.

But since this is a statistics-machine learning-artificial intelligence class, let’s bring some numbers in!

Let’s suppose that the only, the sole, the lone observation of past CGPAs was, say, 3. I mean, I have one old observation of CGPA = 3. I want now to compute

(3) Pr(CGPA = 4 | grading rules, old observation).

Intuitively, we expect (3) to decrease from 1/5 to indicate the increased chance of a new CGPA = 3, because if all we saw was an old 3, there might be something special about 3s. That means we actually have this model, and *not* (3):

(4) Pr(CGPA = 4 | grading rules, old observation, loose math notions).

There is nothing in the world wrong with model (4); it is the kind of mental model we all use all the time. Importantly, it is not necessarily inferior to this new model:

(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions),

where we move to formally define how all the parts on the right hand side mathematically relate to the left hand side.

How is this formality conducted?

Well, it can be *deduced*. Since CGPA can belong only to a fixed, finite set (as “grading rules” insists), we can deduce (5). In what sense? There will be so many future values we want to predict; out of (say) 10 new students, how many As, Bs, etc. are we likely to see and with what chance? This is perfectly doable, but it is almost never done.

The beautious (you heard me: *beautious*) thing about this deduction is that *no parameters are required* in (5) (nor are any “hidden layers”, nor is any “training” needed). And since no parameters are required, no “priors” or arguments about priors crop up, and there is no need of hypothesis testing, parameter estimates, confidence intervals, or p-values. We simply produce the deduced probabilities. Which is what we wanted all along!

In *Uncertainty*, I show this deduction when the number of buckets is 2 (here it is 5). For modest n, the result is close to a well-known continuous-parameterized approximation (with “flat prior”), an approximation we’ll use later.

Here (see the book or this link for the derivation) (5) as an approximation works out to be

(5) Pr(CGPA = 4 | GR, n_3 = 1, fixed math) = (1 + n_4)/(n + 5),

where n_j is the number of js observed in the old data, and n is the number of old data points; thus the probability of a new CGPA = 4 is 1/6; for a new CGPA = 3 it is 2/6; also “fixed math” has a certain meaning we explore next time. Model (5), then, is *the* answer we have been looking for!
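The arithmetic in (5) can be checked in a few lines. This is a sketch of the approximation only, not of the exact deduction in the book; the function name `predictive` and the list-of-old-grades input are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def predictive(old_cgpas, buckets=(0, 1, 2, 3, 4)):
    """Approximate Pr(new CGPA = j | grading rules, old data, fixed math)
    for every bucket j, using (1 + n_j) / (n + number of buckets)."""
    counts = Counter(old_cgpas)      # n_j: how many old observations fell in bucket j
    n = len(old_cgpas)               # n: total number of old observations
    k = len(buckets)                 # 5 buckets in the running example
    return {j: Fraction(1 + counts[j], n + k) for j in buckets}


if __name__ == "__main__":
    # The lone old observation from the text: one CGPA of 3.
    for bucket, prob in predictive([3]).items():
        print(bucket, prob)
```

With no old data at all, the same function falls back to 1/5 for every bucket, which is model (1).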

Formally, this is the posterior predictive distribution for a multinomial model with a Dirichlet prior. It *is* an approximation, valid fully only at “the limit”. As an approximation, for small n, it will exaggerate probabilities, make them sharper than the exact result. (For that exact result for 2 buckets, see the book. If we used the exact result here the probabilities for future CGPAs would with n=1 remain closer to 1/5.)
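For those who want to see where the (1 + n_j)/(n + 5) comes from in the parameterized approximation, here is the standard sketch. The symbols θ_j (the continuous "chance" of bucket j) and α_j (the Dirichlet prior weights) are not in the text above; they are the usual notation, and "flat prior" is taken to mean every α_j = 1.

$$
\begin{aligned}
(\theta_0,\dots,\theta_4) &\sim \mathrm{Dirichlet}(\alpha_0,\dots,\alpha_4), \qquad \alpha_j = 1 \ \text{(flat prior)},\\
(\theta_0,\dots,\theta_4) \mid \text{old data} &\sim \mathrm{Dirichlet}(\alpha_0 + n_0,\dots,\alpha_4 + n_4),\\
\Pr(\text{new CGPA} = j \mid \text{old data}) &= \mathrm{E}[\theta_j \mid \text{old data}]
 = \frac{\alpha_j + n_j}{\sum_{i=0}^{4} (\alpha_i + n_i)}
 = \frac{1 + n_j}{n + 5}.
\end{aligned}
$$

The θs are integrated out in the last line, which is why they never show up in the final answer.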

Now since most extant code and practice revolves around continuous-parameterized approximations, and we can make do with them, we’ll also use them. But we must always keep in mind, and I’ll remind us often, that these *are* approximations, and that spats about priors and so forth are always distractions. However, as the approximation is part of our right-hand-side assumptions, the models we deduce are still valid models. How to test which models worked best in our decision is a separate problem we’ll come to.
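The earlier question, how many 4s we are likely to see among (say) 10 new students, can also be answered under this approximation. The sketch below assumes the Dirichlet-multinomial form, under which the count of 4s among m new students follows a beta-binomial distribution with a = 1 + n_4 and b = (n + 5) - a. The helper names are hypothetical, and this is the approximation, not the exact finite deduction.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(a, b):
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def count_pmf(m, a, b):
    """Beta-binomial probabilities of seeing k = 0..m new observations
    in the bucket of interest, out of m new students."""
    return [comb(m, k) * beta_int(k + a, m - k + b) / beta_int(a, b)
            for k in range(m + 1)]


if __name__ == "__main__":
    n, n_4, buckets = 1, 0, 5    # one old observation (a 3), no old 4s, five buckets
    a = 1 + n_4                  # weight on "CGPA = 4"
    b = (n + buckets) - a        # weight on everything else
    for k, p in enumerate(count_pmf(10, a, b)):
        print(f"{k} fours out of 10: {float(p):.4f}")
```

For a single new student (m = 1) this reduces to the 1/6 from (5), which is a useful sanity check on the sketch.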

Homework: think about the differences in the models above, and how all are legitimate. Ambitious students can crack open *Uncertainty* and use it to track down the deduced solution for more than 2 buckets; cf. page 143. Report back to me.