In Part I we had the simplest kind of model. We complicated it in Part II, built more structure in Part III, and today finally come to the most used probability model: the normal distribution.

We also learned in Part III that math, while pleasant to look at, can often lead to absurdities, such as the deduction that the probability of actual observable events is 0. Let’s continue in that line with normal distributions, still thinking about Susy’s GPA. (See this comment about the finiteness of Susy’s GPA.)

Our M(eta information) = “The normal distribution quantifies our uncertainty in Susy’s GPA”. Given M, our normal model is true. But the normal, as most distributions do, comes equipped with baggage in the form of parameters. We can’t just say, “Given the normal distribution the probability of Susy’s GPA equaling e_{i} = 0 for any i in 1 to N.” We have to instead say, “Given the normal distribution with parameters equal to Q and W, the probability of Susy’s GPA equaling e_{i} = 0 for any i in 1 to N.” The first thing to note is the probability of seeing any real observable is 0, just as before. This is an inescapable part of modern statistical theory: real things can’t happen. It is the price we pay for having beautiful theorems.

But never mind that. Focus on the parameters, which for the normal are two. Both values must be set for the model to be specified fully. Here, I chose to set them at Q and W, which are just two numbers. It doesn’t matter to us what they are, except incidentally; knowing Q and W tells us nothing much about what we want to know. Which, lest we forget, is the probability that Susy’s GPA takes a certain value. Once Q and W are specified, we are in business—kind of.

We can’t answer our actual question (the answer is always 0), but we can answer questions like this: “Given our model, what is the probability that Susy’s GPA will be less than G?” where we can fill in the blank for G. Mathematics can at least give non-zero answers to these kinds of questions. The problem with the normal is that it gives probability > 0 to GPAs less than 0 and to GPAs greater than 4.^{1} That is, the normal model will tell us there is definite probability for GPAs we know are false, given our other information, which we are ignoring when we use the normal.
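A minimal sketch of the modified question, using Python's standard library. The values given to Q and W are made up, and reading Q as the mean and W as the standard deviation is an assumption; the post leaves both unspecified.

```python
from statistics import NormalDist

# Hypothetical parameter values; the post never says what Q and W are.
Q = 3.0  # assumed here to be the central parameter (mean)
W = 0.6  # assumed here to be the spread parameter (standard deviation)

model = NormalDist(mu=Q, sigma=W)

def prob_gpa_below(g: float) -> float:
    """Pr(GPA < g | normal model with parameters Q and W)."""
    return model.cdf(g)

# Fill in the blank for G with a few values, including an impossible one.
for g in (0.0, 2.0, 3.0, 3.5, 4.0):
    print(f"Pr(GPA < {g}) = {prob_gpa_below(g):.7f}")
```

With these made-up values the probability of a GPA below 0 is tiny but not zero, and the probability of a GPA below 4 falls short of 1.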

Let’s not pass too quickly over this. The probabilities the normal model gives are *true* given we assume the normal model is true. These probabilities are just as true as the probability George the Martian wears a hat. Can we however falsify the normal model? It says there is positive probability for events we know (given our other information) that we will never see. But in the case of GPAs less than 0 or greater than 4, the model does not say that certain events we know will occur cannot happen, i.e. that they have probability 0. So it can’t be falsified here.

To nauseate us by the repetition, the normal model does say that events we know we will see (given other information) have probability 0. Thus, as soon as we see a printed GPA, we have falsified the normal model. Like Nelson, however, we turn a blind eye to this argument and press on.

But we can’t pretend we don’t see the “probability leakage”, i.e. the probability for GPAs less than 0 or greater than 4. We just hope that this known error isn’t large. But because both frequentist and classical Bayesian analysis spend their time with distractions, like scrutinizing with great intensity the incidental values Q and W, the amount of probability leakage is usually unknown. That is, standard analysis is satisfied to tell us lots of information about Q (and maybe give us a word or two about W), but it forgets the original question, which again is what is the probability Susy’s GPA takes a certain value? Even if you don’t care about the probability-0-for-real-events problem and are satisfied with saying only that there is a probability that Susy’s GPA is less than some value, you can’t ignore the probability leakage (my experience shows it’s usually large and not-ignorable).
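The leakage can be put in numbers once Q and W are fixed. The sketch below assumes Q is the mean and W the standard deviation, and the three (Q, W) pairs are invented for illustration; they are not estimates from any real data.

```python
from statistics import NormalDist

def probability_leakage(q: float, w: float,
                        low: float = 0.0, high: float = 4.0) -> float:
    """Probability the normal model puts on GPAs outside [low, high]."""
    model = NormalDist(mu=q, sigma=w)
    below = model.cdf(low)          # Pr(GPA < 0 | model)
    above = 1.0 - model.cdf(high)   # Pr(GPA > 4 | model)
    return below + above

# Hypothetical parameter settings, chosen only to show the range of results.
for q, w in [(3.0, 0.4), (3.2, 0.6), (3.5, 0.5)]:
    leak = probability_leakage(q, w)
    print(f"Q={q}, W={w}: leakage = {leak:.4f}")
```

The closer Q sits to a boundary, and the larger W is, the more probability lands on GPAs that cannot occur.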

Now, as in Part III we can assume we have a probative sample which gives information about Susy’s GPA. Both frequentist and classical Bayesian techniques show how to incorporate this information, but the problem of those parameters creeps in. Turns out that allowing uncountably infinite values for the GPA causes all kinds of difficulties in how to best incorporate the probative sample into the model’s parameters. But never mind. It turns out that both camps give identical answers for the normal model (using an “improper flat prior” in Bayes). Even though this truce has been reached, both camps still forget the original question and fall to discussing Q (and rarely W).

The reason this is a problem is two-fold: (1) we can’t get an answer to our original question, or even its modification (when we ignore the probability-0-for-real-events problem); (2) we can’t fully know how well the model performed in terms of observables.

There is a way to progress beyond (ultimately unanswerable) questions about Q and W and to answer the original question, or rather its modification (Susy’s GPA less than some level). In Bayes, this is called giving the posterior predictive distribution. We actually did this in Parts II and III, though we didn’t name it. We’ll suppose that indeed we are working with probabilities that answer our modified question. We need these to demonstrate model performance, which is our last step.
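One way to sketch the posterior predictive distribution for the normal model is by simulation. The code assumes the prior p(mu, sigma^2) proportional to 1/sigma^2, which is one common reading of an “improper flat prior” (the post does not say exactly which prior it means), and it uses a made-up probative sample.

```python
import random
import statistics

def posterior_predictive_draws(sample, n_draws, rng):
    """Simulate new GPAs from the normal model's posterior predictive.

    Assumes the prior p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    n = len(sample)
    ybar = statistics.fmean(sample)
    s2 = statistics.variance(sample)  # sample variance, divisor n - 1
    draws = []
    for _ in range(n_draws):
        # sigma^2 given the data is (n - 1) * s2 / chi-square(n - 1);
        # a chi-square(k) draw is a gamma draw with shape k / 2, scale 2.
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu given sigma^2 and the data is Normal(ybar, sigma^2 / n).
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # A new observable given mu and sigma^2 is Normal(mu, sigma^2).
        draws.append(rng.gauss(mu, sigma2 ** 0.5))
    return draws

# Hypothetical probative sample of GPAs, invented for illustration.
sample = [2.9, 3.4, 3.1, 3.8, 2.5, 3.6, 3.3, 3.9, 2.8, 3.5]
rng = random.Random(42)
draws = posterior_predictive_draws(sample, 20_000, rng)

G = 3.0
below_g = sum(d < G for d in draws) / len(draws)
leaked = sum(d < 0.0 or d > 4.0 for d in draws) / len(draws)
print(f"Pr(GPA < {G} | sample, model) is roughly {below_g:.3f}")
print(f"Leakage outside [0, 4] is roughly {leaked:.3f}")
```

The parameters are integrated out along the way, so the answers are statements about the observable GPA only, and the leakage is visible directly.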

Who said this would be easy!

——————————————————————————————

^{1}Yes, other distributions besides the normal can be used, at least to fix the less than 0 problem. This isn’t the point, which is what happens when a normal is used, as it often is, in situations like this.

**Update** How this all relates to climate models is coming!

Briggs:

Four posts of a brilliant clarity.

I particularly like the points about the overwhelming importance of the assumptions one is making and the issue of ‘probability leakage’.

I wonder if you could do a post, giving an example or two, where ‘probability leakage’ is important (i.e., “my experience shows it’s usually large and not-ignorable”). I hasten to add that I don’t doubt your claim but examples don’t pop readily into my, admittedly pedestrian, mind.

Heartfelt thanks.

This is a bit OT but germane for presentation. My original background is in engineering and I was taught that the presentation should start by (briefly) outlining the problem to be solved before presenting the solution. Sometimes it’s necessary to show why the problem *is* a problem. To do this succinctly may mean making seemingly unfounded statements with the understanding they will be explained later.

You may think you have done this but to me it seems you’ve jumped with both feet into the solution before laying out the map of where you are going. It reminds me of trying to pry from my brother what his latest marvelous device is meant to accomplish. A typical exchange goes like: “What’s it do?” “Well, this wire goes from this pin to this pin …” “OK, but what’s it do?” “I’m trying to tell you that. This wire goes from this pin to this pin …” He acts genuinely puzzled by the question.

My brother’s approach is probably fine for anyone immersed in the problem as he is but the rest of us are left wondering what he’s up to. The net result is the glory of most of his presentation ends up sounding like so much “blah blah blah”. Many academic papers seem to have similar structure. We get to see all of the nuts and bolts arranged nicely before us before they are assembled into something more useful.

I’m not entirely sure how to apply this to the current series of posts. I think I know where you are going but can’t be certain. I do understand the limited space available and understand your desire to lay background first. Too often though this appears as “This wire goes from here to here”. Perhaps, if you are interested, we can pick this up at the end of the series. Maybe via e-mail.

A large dump on the last 4 posts.

Post I — The probability that George is wearing a hat collapses to 0 or 1 upon meeting George. In my meeting with George, he had a large slice of chunk cheese on his head. I said, so it is not true that all Martians wear hats. George said, No, we do. I saw a man from Wisconsin wearing this hat, and I made my own. I replied that a slice of cheese is not a hat.

Post II — balls and urns. In the “real world” our balls in urns problems are along the lines of there is an unknown number of blue balls and red balls in the urn. I can remove some number of balls from the urn and assume that my sample is representative of the balls that remain in the urn, but that still may give me very little insight into what remains in the urn. Suppose I remove 10 balls and count 4 blue. There may be 40% blue balls throughout, but I may have drawn a sample that skewed toward blue, thus diluting the blue concentration among the remaining balls.

Post III — First you say that there is a discrete number of possible GPAs that Susy could have, i.e. she will take 100 – 130 units for a grade over the course of her undergraduate career. GPA = p / q where p and q are positive integers between some bounds. You then say that the probability that her GPA = e1 is 0.

However, I am not sure I see the relevance. Suppose you flip a coin. If you flip 100 times your probability of getting exactly 50 heads is small, even if it is more likely than any other outcome. As the number of flips gets large, your probability of getting any specific number of heads gets arbitrarily small. Calling arbitrarily small “zero” doesn’t bother me.
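The coin-flip arithmetic can be checked exactly. This is a small sketch assuming a fair coin; the function name is chosen here for illustration.

```python
from math import comb

def prob_exact_heads(n_flips: int, n_heads: int) -> float:
    """Exact probability of n_heads heads in n_flips flips of a fair coin."""
    # Integer arithmetic throughout; Python's int / int division stays
    # accurate even when both numbers are astronomically large.
    return comb(n_flips, n_heads) / 2 ** n_flips

# The most likely single outcome (half heads) as the number of flips grows.
for n in (100, 1_000, 10_000):
    print(f"{n} flips, exactly {n // 2} heads: {prob_exact_heads(n, n // 2):.6f}")
```

The probability of the single most likely count shrinks as the flips increase, but for any finite number of flips it stays strictly above 0.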

Post IV — I wouldn’t expect a normal distribution for Susy’s GPA. Some kids are grinds, some are failures. I would expect to see more kids in the 0 – 1.5 range and more in the 3.7 – 4.0 range than the normal model would suggest is likely. The normal model rarely seems to describe any variables I am working with, but it is often close enough and the math is easy to work with.

I can’t see where you’re going, but no matter. I can wait for the punchline.

But I don’t understand the two main criticisms of the normal model: the probability 0 problem and probability leakage. Both arise because the normal model is being used to approximate a discrete mass function. Surely you understand this, so that can’t be the problem.

The normal approximation can be very useful (e.g., approximating a binomial with n large), but we should always ask “Is it an appropriate approximation for the system being studied?” The answer isn’t always yes.

Agreed that way too much attention is paid to Q and W and not enough to predicting the next value.

Charlie B,

Sure, the normal can often be, and is, useful: but it often is not, especially when in the hands of the less numerate. What I’m trying to show here is how and when it might be useful and, when it is, how it arose.

Let me give you another criticism, but of the finite discrete view. An advantage of distributions (like the normal) is the “smoothing” they provide. If we were to take the raw empirical frequencies of (say) GPA, these would be quite jumpy, and there is a sense that it is too jumpy. That sense has to be formally quantified, though; it becomes part of the model. It is like assuming linearity. Nothing wrong with that either.

The probability-0 problem can often be large, but this depends on the application. We are also speaking philosophically here. It may make “practical sense” to say “Close enough!”, but that’s far from a proof. We want to understand precisely what is happening before making approximations. Incidentally, these very criticisms began (as many things did) with R.A. Fisher, the great-grandfather of the field.

The probability leakage, however, is often non-ignorable. I’ve seen 5-10% probability “leaked” using normals, which is a huge error. We don’t notice it because we’re too busy looking at the parameters.

David,

I’ll try and give some of these soon.

Doug M,

I: Your probability collapses because your model changes. Pr(C|E3) = 1/2; Pr(C|E3 & C) = 1.

II: Not so. The balls you removed give quite a lot of information about what’s remaining: the formula worked out shows how. Missing in this example is when we don’t know about N, which we assume we know here. However, we can do the same thing (model building) when N is unknown.

III: See my comment about the discreteness in Part III, which answers this.

IV: No, a normal stinks here. But it is often, very often, used in situations like this. Pick up any psychology or sociology journal to see. Anyway, the particular parameterized distribution doesn’t matter. The normal model *is* easy to work with, but that, as I’m sure you agree, is not a complete reason to use it.

DAV,

Well, I’ll blame it on my sub-par abilities to communicate. I’m trying to show what a model is, how they sometimes are deduced and sometimes subjectively constructed, how simplifications (such as using normals) can give unexpected errors, and finally how to judge a model’s worth.

Doug M, JH,

JH had a good criticism in Part III. Susy’s GPA could be (say) 11/3, which requires an infinite number of digits to express. But no system is capable of expressing these digits. Her report card, for example, will probably read 3.67, which acknowledges our finite knowledge.

The thing here is to begin discretely and with finite numbers and only take the limit at the end, if needed to make life easier.

Let me give an example of a bad use of the normal model: In the course of Suzy’s four years in college, she’ll take about 40 courses. If each of these is graded with about 10 levels (A, A-, B+, …, F), then an exact calculation of Suzy’s GPA requires the 40-fold convolution of the 10 level probability mass function. This is hard (at least for hand calculation). But we can use the normal approximation and estimate the convolution easily and accurately. Such a use of the central limit theorem is taught in every text.

But Suzy’s GPA distribution might be way off from those calculated numbers. The normal approximation hides a serious logical flaw in my example.

(Answer: the convolution formula assumes the grades are independent, which is unlikely to be true.)
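The calculation described above can be sketched as follows. The grade points and per-course probabilities are invented for illustration, and both the exact convolution and the normal approximation assume the 40 grades are independent and identically distributed, which is exactly the assumption flagged as the flaw.

```python
from statistics import NormalDist

# Hypothetical grade points in tenths: A, A-, B+, B, B-, C+, C, C-, D, F.
GRADE_TENTHS = [40, 37, 33, 30, 27, 23, 20, 17, 10, 0]
# Hypothetical probability of each grade in a single course (sums to 1).
GRADE_PROBS = [0.20, 0.15, 0.15, 0.15, 0.10, 0.08, 0.07, 0.04, 0.03, 0.03]
N_COURSES = 40

def convolve(pmf_a, pmf_b):
    """Pmf of the sum of two independent integer-valued variables."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

def total_points_pmf(n_courses):
    """Pmf of total grade points (in tenths) over n_courses courses."""
    single = dict(zip(GRADE_TENTHS, GRADE_PROBS))
    total = {0: 1.0}
    for _ in range(n_courses):
        total = convolve(total, single)
    return total

def exact_prob_gpa_below(g, total, n_courses):
    """Pr(GPA < g) from the exact pmf; GPA = total tenths / (10 * n)."""
    return sum(p for t, p in total.items() if t < g * 10 * n_courses)

def normal_prob_gpa_below(g, n_courses):
    """Pr(GPA < g) from the central limit theorem approximation."""
    mean = sum(x * p for x, p in zip(GRADE_TENTHS, GRADE_PROBS)) / 10
    second = sum(x * x * p for x, p in zip(GRADE_TENTHS, GRADE_PROBS)) / 100
    var = second - mean ** 2
    return NormalDist(mean, (var / n_courses) ** 0.5).cdf(g)

total = total_points_pmf(N_COURSES)
for g in (2.5, 3.0, 3.5):
    exact = exact_prob_gpa_below(g, total, N_COURSES)
    approx = normal_prob_gpa_below(g, N_COURSES)
    print(f"Pr(GPA < {g}): exact {exact:.4f}, normal approx {approx:.4f}")
```

Close agreement between the two columns says only that the normal approximates the convolution well. It says nothing about whether either number describes a student whose grades move together from course to course.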