You must review: Part I, II. Not reviewing is like coming to class late and saying “What did I miss?” Note the New & Improved title!
Here are the main points thus far: all probability is conditional on the assumptions made; not all probability is quantifiable, nor must it involve observables; all analysis must revolve around the ultimate decisions to be made; unless deduced, all models (AI, ML, statistics) are ad hoc; all continuum-based models are approximations; and the Deadly Sin of Reification lurks.
We are using the data from Uncertainty, so that those bright souls who own the book can follow along. We are interested in predicting the college grade point average (CGPA) of certain individuals at the end of their first year. We spent two sessions defining what we mean by this. We spend more time now on this most crucial question.
This is part of the process most neglected in the headlong rush to get to the computer, a neglect responsible for vast over-certainties.
Now we learned that CGPA is a finite-precision number, a number that belongs to an identifiable set, such as 0, 0.0625, and so on, and we know this because we know the scoring system of grades and we know the possible numbers of classes taken. Though of finite precision, CGPA can still be annoyingly precise. Last time we were out at six or eight decimal places, precision far beyond any decision (except ranking) I can think to make.
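As a toy sketch (my own illustration, not the book's data), here is how that identifiable set can be enumerated, assuming equally weighted classes and whole-number grade points 0 through 4; real grading rules with credits and plus/minus grades would change the set, but not its finiteness:

```python
from fractions import Fraction
from itertools import product

# Possible CGPAs for a student taking 1, 2, or 3 equally weighted classes,
# each graded with a whole-number grade point from 0 to 4 (an assumption).
possible = set()
for n_classes in (1, 2, 3):
    for grades in product(range(5), repeat=n_classes):
        possible.add(Fraction(sum(grades), n_classes))

print(sorted(possible))  # a finite set: 0, 1/3, 1/2, 2/3, 1, ..., 4
```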
To concentrate on this decision I put myself in the mind of a Dean—and immediately began to wonder why all my professors aren’t producing overhead. Laying that aside (but still sharpening my ax) I want to predict the chance any given student will have a CGPA of 0, 1, 2, 3, or 4. These buckets are all I need for the decision at hand. Later, we’ll increase the precision.
Knowing nothing except the grade must be one of these 5 numbers, the probability of a 4 is 1/5. This is the model:
(1) Pr(CGPA = 4 | grading rules),
where “grading rules” is a proposition defining how CGPAs are calculated, and which specifies the level of precision that is of interest to us, and possibly to nobody else; “grading rules” tells us CGPA will be in the buckets 0, 1, 2, 3, 4, for instance.
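As a minimal sketch (my code, not the book's), model (1) is nothing more than a uniform assignment over the buckets the grading rules allow:

```python
# Model (1): Pr(CGPA = 4 | grading rules).
# The grading rules, at the precision we care about, give these buckets.
buckets = [0, 1, 2, 3, 4]

def pr_cgpa(value, buckets=buckets):
    """Deduced probability knowing nothing but the list of buckets."""
    return 1 / len(buckets) if value in buckets else 0.0

print(pr_cgpa(4))  # 0.2, i.e. 1/5
```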
The numerical probability of 1/5 is deduced from the assumptions made; it is therefore the correct probability—given these assumptions. Notice this list of assumptions does not contain all the many things you may also know about GPAs. Many of these bytes of information will be non-quantified and unquantifiable, but if you take cognisance of any of them, they become part of a new model:
(2) Pr(CGPA = 4 | grading rules, E),
where E is a compound proposition containing all the semi-formal and informal things (evidence) you know about GPAs, e.g. grade inflation. This model depends on E, and thus (2) will not likely give quantified or quantifiable answers. Just because our information doesn’t appear in the formal math does not make (2) not a model; or, said another way, our models are often much more than the formal math. If, say, E is only loose notions about the ubiquity of grade inflation, then (2) might equal “More than a 20% chance, I’ll tell you that much.”
To the data
We have made use of no observations so far, which proves, if it wasn’t already obvious, that observations are not needed to make probability judgments (which is why frequentism fails philosophically), and that our models often rely more upon intelligence not contained in (direct) observation.
But since this is a statistics-machine learning-artificial intelligence class, let’s bring some numbers in!
Let’s suppose that the only, the sole, the lone observation of past CGPAs was, say, 3. I mean, I have one old observation of CGPA = 3. I want now to compute
(3) Pr(CGPA = 4 | grading rules, old observation).
Intuitively, we expect (3) to decrease from 1/5 to indicate the increased chance of a new CGPA = 3, because if all we saw was an old 3, there might be something special about 3s. That means we actually have this model, and not (3):
(4) Pr(CGPA = 4 | grading rules, old observation, loose math notions).
There is nothing in the world wrong with model (4); it is the kind of mental model we all use all the time. Importantly, it is not necessarily inferior to this new model:
(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions),
where we move to formally define how all the parts on the right hand side mathematically relate to the left hand side.
How is this formality conducted?
Well, it can be deduced. Since CGPA can belong only to a fixed, finite set (as “grading rules” insists), we can deduce (5). In what sense? There will be some number of future values we want to predict; out of (say) 10 new students, how many As, Bs, etc. are we likely to see, and with what chance? This is perfectly doable, but it is almost never done.
The beautious (you heard me: beautious) thing about this deduction is that no parameters are required in (5) (nor are any “hidden layers”, nor is any “training” needed). And since no parameters are required, no “priors” or arguments about priors crop up, and there is no need of hypothesis testing, parameter estimates, confidence intervals, or p-values. We simply produce the deduced probabilities. Which is what we wanted all along!
In Uncertainty, I show this deduction when the number of buckets is 2 (here it is 5). For modest n, the result is close to a well-known continuous-parameterized approximation (with “flat prior”), an approximation we’ll use later.
Here (see the book or this link for the derivation) (5) as an approximation works out to be
(5) Pr(CGPA = 4 | GR, n_3 = 1, fixed math) = (1 + n_4)/(n + 5),
where n_j is the number of js observed in the old data, and n is the number of old data points; thus the probability of a new CGPA = 4 is 1/6; for a new CGPA = 3 it is 2/6; also “fixed math” has a certain meaning we explore next time. Model (5), then, is the answer we have been looking for!
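Here is a minimal sketch of that approximation in Python (my code; the bucket labels and the lone old observation are as above):

```python
from collections import Counter

# Approximation to model (5): Pr(CGPA = j | GR, old data, fixed math)
#                             = (1 + n_j) / (n + 5).
buckets = [0, 1, 2, 3, 4]
old_data = [3]              # the single old observation of CGPA = 3

counts = Counter(old_data)
n = len(old_data)

for j in buckets:
    prob = (1 + counts[j]) / (n + len(buckets))
    print(f"Pr(CGPA = {j} | GR, old data, fixed math) = {prob:.4f}")

# Prints 1/6 (0.1667) for 0, 1, 2, 4 and 2/6 (0.3333) for 3, as in the text.
```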
Formally, this is the posterior predictive distribution for a multinomial model with a Dirichlet prior. It is an approximation, fully valid only at “the limit”. As an approximation, for small n, it will exaggerate probabilities, making them sharper than the exact result. (For that exact result for 2 buckets, see the book. If we used the exact result here, the probabilities for future CGPAs would, with n = 1, remain closer to 1/5.)
Now since most extant code and practice revolves around continuous-parameterized approximations, and we can make do with them, we’ll also use them. But we must always keep in mind, and I’ll remind us often, that these are approximations, and that spats about priors and so forth are always distractions. However, as the approximation is part of our right-hand-side assumptions, the models we deduce are still valid models. How to test which models worked best in our decision is a separate problem we’ll come to.
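To connect this back to the “out of 10 new students” question above, here is a hedged sketch of what the continuous-parameterized approximation gives for counts of future students: a Dirichlet-multinomial predictive with the flat prior mentioned earlier. The particular count vector queried is my own illustration, not a result from the book.

```python
from math import lgamma, exp

# Posterior Dirichlet parameters under the flat-prior approximation:
# alpha_j = 1 + n_j, with the lone old observation CGPA = 3.
old_counts = {0: 0, 1: 0, 2: 0, 3: 1, 4: 0}
alpha = {j: 1 + old_counts[j] for j in range(5)}

def log_pred_prob(x, alpha):
    """Log predictive probability of the count vector x, where x[j] is the
    number of new students landing in bucket j (Dirichlet-multinomial)."""
    m = sum(x.values())
    A = sum(alpha.values())
    out = lgamma(m + 1) + lgamma(A) - lgamma(A + m)
    for j in x:
        out += lgamma(alpha[j] + x[j]) - lgamma(alpha[j]) - lgamma(x[j] + 1)
    return out

# Chance that, of 10 new students, three land in bucket 2, five in 3, two in 4:
x = {0: 0, 1: 0, 2: 3, 3: 5, 4: 2}
print(exp(log_pred_prob(x, alpha)))
```

Summing such probabilities over every count vector meeting a criterion of interest (say, “at least two 4s among the 10”) answers the Dean’s decision question directly; that is the sense in which these predictive probabilities are what we wanted all along.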
Homework: think about the differences in the models above, and how all are legitimate. Ambitious students can crack open Uncertainty and use it to track down the deduced solution for more than 2 buckets; cf. page 143. Report back to me.
New and Improved often means repackaged in an effort to sell to new suckers. It’s not a term I would ever use.
Interesting lesson, though.
Bob Kurland made this remark/observation at the end of Part 1:
“…when people do call the researcher to task for misusing statistics, as was done with one AGW proponent, they get sued (as did Mark Steyn and National Review). The prospect of legal entanglement does inhibit calling fakirs out, I would think.”
It’s a good point: “legal entanglement” is all too real in our world.
Briggs says at the end, today: “…think about the differences in the models above, and how all are legitimate.”
There’s an implicit, potential flaw there: all models are “legitimate” only to some degree, with some limited validity.
Bob’s & Briggs’ points too often link, in the real world, in bizarre ways —
– When a decision violates some held belief or value (e.g. that global warming is NOT an issue of concern): on emotional themes, emotional thinking too often creeps in. An analysis that doesn’t support, but doesn’t refute either, some belief might cause a researcher all kinds of trouble, which is nonsense if the research is merely too limited to reach a determination… but that happens too.
– The “poster case” for such seems to come, of late, from Italy:
— Scientists prosecuted in court for not predicting an earthquake or its severe effects. No model of that phenomenon has progressed to a point where such an expectation is even remotely reasonable.
— Prosecutor Mignini’s rabid and nonsensical grounds (sans any DNA or other credible physical evidence) for prosecuting U.S. citizen Knox for a murder, AFTER the actual killer had been convicted and imprisoned. A conviction that was overturned (sense prevailed… but for how long will that persist?).
The point being made by illustration is that, in today’s topsy-turvy world, emotional reasoning (a type of mental defect, when it occurs) is gaining ground as a valid justification for legal policies and court decisions… to such an extent that it becomes a suitable subject for discussion in a topic area that should be as quantitatively emotionless as stats.
Why use the approximation derived via a PARAMETRIC model when you have a deductive method?
Old old school probabilists used the non-parametric empirical distribution to estimate the probability of an event or a set.
“Why use the approximation derived via a PARAMETRIC model when you have a deductive method?”
Excellent question. Because we don’t have the math worked out for the deductive method nearly as well as we do for parametric cases. See the link in the parameters article where we agree with Simpson that the math for finite discrete problems is harder than for continuous problems. Part of this is lack of experience. Once deductive approaches are common, the math won’t seem nearly so hard.
Update: Another reason to use, at least for now, parametric methods is software. Lots of it exists for parametric models, which we can repurpose. For deductive methods, we’re largely coding from scratch. This of course slows down adoption. I discuss this more in future lectures.
“Old old school probabilists used the non-parametric empirical distribution to estimate the probability of an event or a set.”
Old school ones, yes, but not Old Old School ones. As I write about in Uncertainty, classical approaches in probability theory were empirically biased, or logically blind. As an example in logic I gave Lewis Carroll’s French-speaking cat. There are no real French-speaking cats, but we can still examine the logic of propositions containing them. Logic is the study of the connections between propositions. Probability, too. Probability then can also apply to non-empirical examples. There are in these examples no "non-parametric empirical distributions". We must use deduction.
If there are observations, one can definitely use the empirical probability distribution to estimate the unknown probability. Empirically biased? Logically blind? Are we talking about the same "empirical probability distribution"? Yes, you might say it is biased because we don’t know the true answer to the probability of interest.
Some people say that the so-called principle of ignorance is classical probability. You might want to claim it to be logical, but some people argue it is simply a reasonable choice made for lack of a better one. And such a choice is needed for some reason.
Yes, probability can definitely apply to non-empirical examples. No one denies that. Yes, we can deduce the probability given well defined premises. In this case, why would you need DATA science? If one needs data for verification, then one admits the probability is not deduced.