
Free Data Science Class: Predictive Case Study 1, Part IV

Review!

Code coming next week!

Last time we decided to put ourselves in the mind of a dean and ask for the chance of CGPA falling into one of these buckets: 0, 1, 2, 3, 4. We started with a simple model to characterize our uncertainty in future CGPAs, which was this:

    (5) Pr(CGPA = 4 | grading rules, old observations, fixed math notions).

Now “fixed math notions” means, in this case, that we use a parameterized multinomial probability distribution (look it up anywhere). This model, via the introduction of non-observable, non-real parameters (little bits of math necessary for the equations to work out), gives the probability of belonging to each of the buckets, of which there are five here: 0 through 4.

The parameters themselves are the focus of traditional statistical practice, in both its frequentist and Bayesian flavors. This misplaced concentration came about for at least two reasons: (a) the false belief that probabilities are real, and thus that parameters are too, at least “at” infinity; and (b) the mistaking of knowledge of the parameters for knowledge of observables. The math for parameters (at infinity) is also easier than looking at observables. Probability does not exist, and (of course) we now know knowledge of the parameters is not knowledge of observables. We’ll bypass all of this and keep our vision fixed on what is of real interest.

Machine learning (and AI, etc.) models have parameters too, for the most part, but these are usually hidden away and observables are primary. This is a good thing, except that the ML community (we’ll lump all non-statistical probability, “fuzzy”, and AI modelers into the ML camp) created for itself new errors in philosophy. We’ll start discussing these this time.

Our “fixed math notions” are assumptions we made, but with good reason. They include selecting a “prior” on the parameters of the model; we chose the Dirichlet, though many others are possible. Our notions also selected the model itself. Thus, as is made clear in the notation, (5) is dependent on the notions. Change them, and you change the answer to (5). But so what? If we change the grading rules we also change the probability. Changing the old observations also changes the probability.

There is an enormous amount of hand-wringing about the priors portion of the notions. Some of the concern is with getting the math right, which is fine. But much of it arises because it is felt there are “correct” priors somewhere out there, usually living at Infinity, and if there are right ones we can worry we might not have found them. There are also many complaints that (5) is reliant on the prior. But (5) is always reliant on the model, too, though few are concerned with that. (5) is dependent on everything we stick on the right hand side, including non-quantifiable evidence, as we saw last time. That (5) changes when we change our notions is not a bug, it is a feature.
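To make this concrete, here is a minimal sketch, in Python, of the multinomial model with a symmetric Dirichlet prior. The old-observation counts are invented purely for illustration; the point is that changing either the counts or the prior changes the answer to (5), exactly as the notation says it must.

    import numpy as np

    # hypothetical counts of old CGPAs in buckets 0..4 (invented numbers)
    counts = np.array([10, 40, 120, 90, 25])

    def predictive(counts, alpha):
        # posterior predictive bucket probabilities for a multinomial
        # model under a symmetric Dirichlet(alpha) prior
        return (counts + alpha) / (counts.sum() + alpha * len(counts))

    for alpha in (0.5, 1.0, 10.0):
        p = predictive(counts, alpha)
        print(f"alpha = {alpha:4.1f}   Pr(CGPA = 4 | evidence) = {p[4]:.4f}")

No alpha is “the correct” one; each answer follows deductively from its premises.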

The thought among both the statistical and ML communities is that a “correct” model exists, if only we can find it. Yet this is almost never true, except in those rare cases where we deduce the model (as is done in Uncertainty for a simple case). Even deduced models begin with simpler knowns or assumptions. Any time we use a parameterized model (or any ML model) we are making more or less ad hoc assumptions. Parameters always imply lurking infinities, either in measurement gradation or in numbers of observations, infinities which will always be lacking in real life.

Let’s be clear: every model is conditional on the assumptions we make. If we knew the causes of the observable (here CGPA; review Part I) we could deduce the model, which would supply extreme probabilities, i.e. 0s and 1s. But since we cannot know the causes of grade points, we instead opt for correlational models, which is what statistical and ML models are (any really complex model may have causal elements, as in physics etc., but these won’t be completely causal and thus will be correlational in output).

This does not mean that our models are wrong. A wrong model would always misclassify and never correctly classify, and it would do so intentionally, as it were. This wrong model would be a causal model, too, only it would purposely lie about the causes in at least some instances.

No, our models are correlational, almost always, and therefore can’t be wrong in the sense just mentioned; neither can they be right in the causal sense. They can, however, be useful.

The conceit of statistical modelers is that, once we have in hand correlates of our observables, which in this case will be SAT scores and high school GPAs, then if we make our sample size large enough we’ll know exactly how SAT and HGPA “influence” CGPA. This is false. At best, we’ll sharpen our predictive probabilities to some extent, but then we’ll hit a wall and go no further. This is because SAT scores do not cause CGPAs. We may come to know all there is to know about some non-interesting parameters inside some ad hoc model, but this precision will not transfer to the observables, which may remain as murky as ever. If this doesn’t make sense, the examples will clarify it.
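Here is a toy simulation of that wall (every number in it invented). A correlate shifts the probability of the top bucket, but because it is not the cause, no sample size pushes the predictive probability to the extremes:

    import numpy as np

    rng = np.random.default_rng(42)

    def pr_bucket4_given_high_sat(n):
        # toy world: a high SAT raises the chance of CGPA bucket 4
        # but does not cause it; unmeasured causes remain
        high_sat = rng.random(n) < 0.5
        # chance of bucket 4 is 0.6 with a high SAT, 0.2 without (invented)
        bucket4 = rng.random(n) < np.where(high_sat, 0.6, 0.2)
        return bucket4[high_sat].mean()

    for n in (100, 10_000, 1_000_000):
        p = pr_bucket4_given_high_sat(n)
        print(f"n = {n:9d}   Pr(CGPA = 4 | high SAT, data) = {p:.3f}")

The probability sharpens toward 0.6 and then stops. Only knowledge of the actual causes could push it further.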

The similar conceit of the ML crowd is that if only the proper quantity of correlates is measured, and (as with the statisticians) measured in sufficient number, all classification mistakes will disappear. This is false, too, because unless we can measure all the causes of each and every person’s CGPA, the model will err in the sense that it will not produce extreme probabilities. Perfect classification is a chimera, and a sales pitch.

Just think: we already know we cannot know all the precise causes of many contingent events. For example, quantum measurements. No parameterized model nor the most sophisticated ML/AI/deep learning algorithm in the world, taking all known measurements as input, will classify better than the simple physics model. Perfection is not ours to have.

Next time we finally—finally!—get to the data. But remember we were in no hurry, and that we purposely emphasized the meaning and interpretation of models because there is so much misinformation here, and because these are, in the end, the only parts that matter.


  1. The math for parameters (at infinity) is also easier than looking at observables.

    Parameters (at infinity)? What is the math for parameters (at infinity)? Is it easier than looking at observables?

    A parametric statistical model assumes the pattern in the data takes a specific functional form with unknown parameters, e.g., the slope associated with a linear model.

    A nonparametric model does not use an analytic form for the pattern, simply a function. Theoretically, a well-behaved (analytic) function can be perfectly expressed by, e.g., a Taylor series expansion, as an infinite sum of terms: f(x) = f(a) + f'(a)(x-a) + f''(a)(x-a)^2/2! + …, with one coefficient per term. Therefore it is said that the parameter is infinite-dimensional in a nonparametric setting.

    Changing the old observations also changes the probability.

    How do you change the old observations? Using a marker?

    Given the same premises, the probability of a set or an event of interest does not change due to data or observations, old or new, small or big. However, the estimate of the probability might change because it depends on what information we have.

    No more comments, as I don’t even know if this will be posted.

  2. How do you change the old observations? Using a marker?

    I could be cheeky and suggest a review of the history of temperature “adjustments” in climate research, but that would be wrong. Merely omitting, for whatever reason, some data points from a set of old observations is changing the set. That’s probably what Briggs was implying.

  3. All,

    I don’t think we’re getting the point here. Read Parameters Aren’t What You Think. The point of that article is that parameters aren’t what you think; what they are, I show there.

    Now in classical stats there is a class of methods called “non-parametric” statistics, which is really a form of parametric statistics. Its purpose under frequentism is hypothesis testing, which we eschew. In Bayesian theory its purpose is to find or approximate “the” distribution an observable “has”. Observables don’t “have” distributions. (Repeat that.)

    We’re going Old Old School, where probability is a matter of deductions or inferences from a list of premises (which may be assumptions, guesses, facts, measurements, and so on).

    The model for an observable can in some cases (actually in many cases) be deduced when the observable is finite and discrete, as all measurements are. There are no parameters for these deductions. We have direct models. Deduced models.

    When limits (of measurement gradation or sample size) are taken, parameters emerge. Classical parameters are therefore always approximations, sometimes good ones, of our actual measurement capability.

    The best way to get it is to review the first three classes in detail.
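    To illustrate the deduction on a toy case (the population size and observed sequence here are invented): take a finite population of N discrete observations falling into m buckets, with every possible composition of the population judged equally probable, a premise, not a parameter. The predictive probability of the next observation then follows by brute-force counting, with no parameters anywhere, and it matches the closed form (n_k + 1)/(n + m).

        from itertools import product

        def predictive_by_counting(N, m, obs):
            # deduce Pr(next draw = k | obs) by enumeration: every
            # composition of N items into m buckets is equally probable
            # a priori, and draws are made without replacement
            comps = [c for c in product(range(N + 1), repeat=m) if sum(c) == N]
            num, den = [0.0] * m, 0.0
            for c in comps:
                w, left, total = 1.0, list(c), N
                for k in obs:  # weight = Pr(observed sequence | composition)
                    if left[k] == 0:
                        w = 0.0
                        break
                    w *= left[k] / total
                    left[k] -= 1
                    total -= 1
                if w == 0.0:
                    continue
                den += w
                for k in range(m):
                    num[k] += w * left[k] / total
            return [x / den for x in num]

        # N = 6 items, m = 3 buckets, observed sequence: buckets 2, 2, 0
        print(predictive_by_counting(6, 3, [2, 2, 0]))
        # closed form (n_k + 1)/(n + m) with counts (1, 0, 2), n = 3, m = 3
        print([(nk + 1) / 6 for nk in (1, 0, 2)])

    Push N to infinity and this counting argument smooths into the usual Dirichlet-multinomial machinery, which is where the “parameters” come from.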

  4. All models fit to real multi-dimensional data (not toy sets) are approximations, approximations due to ignorance. Oh… I’d love to claim I can deductively postulate a model for a data set, even though I have no doubt I can deductively derive the math required behind all the estimations.

    Interestingly, probabilists look for certainty in the limit, at infinity (as the sample size tends to infinity). They still do. It is part of what frequentists seek to do. (Kind of like how it’s hard to know exactly what the partial sum -1 + 1/2 - 1/3 + 1/4 - 1/5 + 1/6 + … + (-1)^n/n is for any given n, yet the limit, -ln 2, is known.)

    Let’s be real. If you analyze GPA as discrete, say with Excel storing GPA values to 4 decimal places, you will conclude that, for example, P(GPA = 3.999999 | whatever) is definitely 0.

    But if a computer stores the value to 16 decimal places (double precision), the answer will be different: P(GPA = 3.999999 | whatever) or P(GPA = 3 | whatever) may not be zero, though some might argue they can be approximated by 0. It is true we have a different premise, that the values are stored differently in a computer, and therefore a different answer. Nothing new. Just as in life, different information can lead to different decisions.

    So, is it reasonable to consider GPA as discrete?

    (One would consider the probability that GPA falls in an interval, not at a point value, because it is theoretically continuous.)
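    A quick sketch of the storage point (all values invented): once GPAs are recorded to a fixed gradation, only values on that grid can receive positive probability, no matter how big the sample.

        import numpy as np

        rng = np.random.default_rng(1)
        # invented GPAs, recorded to 2 decimal places; stored in hundredths
        # as integers to sidestep floating-point equality traps
        gpas = np.round(rng.uniform(0, 4, size=100_000) * 100).astype(int)

        print((gpas == 399).mean())       # 3.99 is on the grid: positive frequency
        print((gpas == 399.9999).mean())  # 3.999999 is off the grid: exactly 0.0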

  5. The model for an observable can in some cases (actually in many cases) be deduced when the observable is finite and discrete, as all measurements are. There are no parameters for these deductions. We have direct models. Deduced models.

    A real-life example, please. If you wish, I can provide you with a multi-dimensional data set with clear questions and decisions to be answered, not the one you want to answer. Data science. Real data, real questions with real consequences.

  6. Gary,

    Global temperatures of all sorts are statistics (numbers) estimated based on observations. Their values might change if different estimation/construction/adjustment methods are used. And yes, they may also be different if some observations are not used.

    Whether/how they are observed or estimated plays a role in how one models the uncertainty.

    Suppose, for some justifiable reason, I decide to estimate the mean income of statisticians in industry by omitting the top and the bottom 1% of the observations ascertained. Have I changed any observations? No. I use a method that omits those observations.

    I would change an observation’s value if there were a recording error, though.
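    For concreteness, a trimmed-mean sketch with invented incomes; scipy’s trim_mean omits the stated proportion from each end:

        import numpy as np
        from scipy import stats

        incomes = np.random.default_rng(7).lognormal(11, 0.5, 10_000)  # invented
        print(incomes.mean())                  # ordinary mean of all observations
        print(stats.trim_mean(incomes, 0.01))  # mean omitting top and bottom 1%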

    I could be cheeky and suggest a review of the history of temperature “adjustments” in climate research, but that would be wrong.

    LOL. This reminds me of someone who said, “I would never call him ‘short and fat’.”

    And Gary, what is the difference between a “real” parameter and a “non-real” parameter? Is the mean income (a parameter) of statisticians real or non-real? I am nosy and I want to know the average incomes of all professions. Does that mean I think the mean income is real? My brain hurts when it comes to this kind of weird question.

  7. I really like this thread, with its linkage between assumptions and probabilities, so often left unstated.
