## Free Data Science Class: Predictive Case Study 1, Part IV

*Code coming next week!*

Last time we decided to put ourselves in the mind of a dean and ask for the chance of CGPA falling into one of these buckets: 0,1,2,3,4. We started with an simple model to characterize our uncertainty in future CGPAs, which was this:

(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Now the “fixed math notions” means, in this case, that we uses a parameterized probability multinomial distribution (look it up anywhere). This model via the introduction of non-observable, non-real *parameters*, little bits of math necessary for the equations to work out, gives the probability of belonging to one of the buckets, which in this case are 5, 0-4.

The parameters themselves are the focus of traditional statistical practice, in both its frequentist and Bayesian flavors. This misplaced concentration came about for at least two reasons: (a) the false belief that probabilities are real and thus so are parameters, at least “at” infinity and (b) the mistaking of knowledge of the parameters for knowledge of observables. The math for parameters (at infinity) is also easier than looking at observables. Probability does not exist, and (of course) we now know knowledge of the parameters is not knowledge of observables. We’ll bypass all of this and keep our vision fixed on what is of real interest.

Machine learning (and AI etc.) have parameters for their models, too, for the most part, but these are usually hidden away and observables are primary. This is a good thing, except that the ML community (we’ll lump all non-statistical probability and “fuzzy” and AI modelers into the ML camp) created for themselves new errors in philosophy. We’ll start discussing these this time.

Our “fixed math notions”, as assumption we made, but with good reason, include selecting a “prior” on the parameters of the model. We chose the Dirichlet; many others are possible. Our notions also selected the model. Thus, as is made clear in the notation, (5) is dependent on the notions. Change them, change answer to (5). But so what? If we change the grading rules we also change the probability. Changing the old observations also changes the probability.

There is an enormous amount of hand-wringing about the priors portion of the notions. Some of the concern is with getting the math right, which is fine. But much is because it is felt there are “correct” priors somewhere out there, usually living at Infinity, and if there are right ones we can worry we might not have found the right ones. There are also many complaints that (5) is reliant on the prior. But (5) is always reliant on the model, too, though few are concerned with that. (5) is dependent on *everything* we stick on the right hand side, including non-quantifiable evidence, as we saw last time. That (5) changes when we change our notions is not a bug, it is a feature.

The thought among both the statistical and ML communities is that a “correct” model exists, if only we can find it. Yet this is almost never true, except in those rare cases where we *deduce* the model (as is done in *Uncertanity* for a simple case). Even deduced models begin with simpler knowns or assumptions. Any time we use a parameterized model (or any ML model) we are making more or less *ad hoc* assumptions. Parameters always imply lurking infinities, either in measurement clarity or numbers of observations, infinities which will *always* be lacking in real life.

Let’s be clear: every model is conditional on the assumptions we make. If we knew the causes of the observable (here CGPA; review Part I) we could deduce the model, which would supply extreme probabilities, i.e. 0s and 1s. But since we cannot know the causes of grade points, we can instead opt for correlation models, as statistical and ML models are (any really complex model may have causal elements, such as in physics etc., but these won’t be completely causal and thus will be correlational in output).

This does not mean that our models are wrong. A wrong model would always misclassify and never correctly classify, and it would do so intentionally, as it were. This wrong model would be a causal model, too, only it would purposely lie about the causes in at least some instances.

No, our models are correlational, almost always, and therefore can’t be wrong in the sense just mentioned; neither can they be right in the causal sense. They can, however, be useful.

The conceit by statistical modelers is that, once we have in hand correlates to our observables, which in this case will be SAT scores and high school GPAs, that if we increase our sample size large enough, we’ll know exactly how SAT and HGPA “influence” CGPA. This is false. At best, we’ll sharpen our predictive probabilities to some extent, but we’ll hit a wall and will go no further. This is because SAT scores do not *cause* CGPAs. We may know all there is to know about some non-interesting parameters inside some *ad hoc* model, but this intensity will not transfer to the observables, which may be as murky as ever. If this doesn’t make sense, the examples will clarify it.

The similar conceit of the ML crowd is that if only the proper quantity of correlates are measured, and (as with the statisticians) measured in sufficient number, all classification mistakes will disappear. This is false, too. Because unless we can measure *all* the causes of *each and every* person’s CGPA, the model will err in the sense that it will not produce extreme probabilities. Perfect classification is a chimera—and a sales pitch.

Just think: we already know we cannot know all the precise causes of many contingent events. For example, quantum measurements. No parameterized model nor the most sophisticated ML/AI/deep learning algorithm in the world, taking all known measurements as input, will classify better than the simple physics model. Perfection is not ours to have.

Next time we finally—finally!—get to the data. But remember we were in no hurry, and that we purposely emphasized the *meaning* and *interpretation* of models because there is so much misinformation here, and because these are, in the end, the only parts that matter.