Thanks to everybody who sent in links, story tips, and suggestions. Because of my recent travel and pressures of work, I'm (again) way behind in answering these. I do appreciate your taking the effort to send these in, but sometimes it takes me quite a while to get to them. I need a secretary!
First, I enjoy your site. I have a technical education (engineering) but was never required to develop an in-depth understanding of statistics.
I tend to be a natural skeptic of almost all things. One of my "hobbies" is following "bad science". It seems this is more common than most people realize, especially in medicine, economics, psychology, and sociology. (All systems that are non-linear and controlled by large numbers of variables.) I think climate falls into this category.
I don't expect a "personal" response to this, but perhaps you could address it on your site someday.
I once read a story in which a noted hydrologist, being honored at MIT, was summarizing some of his research and mentioned this: "precipitation was not a normal distribution." It has fat tails. (That may be why we always complain that it is raining too much or too little…because rainfall is seldom average.)
My question is this: when a phenomenon is not normally distributed but is assumed to be so, how could this affect the analysis?
Precipitation does not "have" a normal distribution. Temperature does not "have" a normal distribution. No thing "has" a normal distribution. Thus it is always a mistake to say, for example, "precipitation is normally distributed," or to say "temperature is normally distributed." Just as it is always wrong to say "X is normally distributed," where X is some observable thing.
What we really have are actual values of precipitation, actual values of temperature, actual measurements of some X. Now, we can go back in time and collect these actual values, say for precip, and plot them. Some of these values will be low, more will be in some middle range, and a few will be high. A histogram of these values might even look vaguely "bell-shaped."
But no matter how close this histogram of actual values resembles the curve of a normal distribution, precipitation is still not normally distributed. Nothing is.
What is proper to say, and must be understood before we can tackle your main question, is that our uncertainty in precipitation is quantified by a normal distribution. Saying instead, and wrongly, that precipitation is normally distributed leads to the mortal sin of reification. This is when we substitute a model for reality, and come to believe in the unreality more than in the truth.
Normal distributions can be used to model our uncertainty in precipitation. To the extent these modeled predictions of a normal are accurate they can be useful. But in no sense is the model—this uncertainty model—the reality.
Now it will often be the case when quantifying our uncertainty in some X with a normal that the predictions are not useful, especially for large or small values of the X. For example, the normal model may say there is a 5% chance that X will be larger than Y, where Y is some large number that takes our fancy. But if we look back at these predictions we see that Y or larger occurs (for example) 10% of the time. This means the normal model is under-predicting the chance of large values.
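As a rough sketch of that kind of check (the "precipitation" is simulated from a skewed distribution, the threshold is invented, and with skewed data the observed frequency typically comes out larger than the model's nominal value):

```python
# Illustrative check: does a normal model's tail prediction match how often large
# values actually occur? The "precipitation" is simulated and the threshold invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
precip = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # skewed, precipitation-like values

mu, sigma = stats.norm.fit(precip)                      # normal model of our uncertainty
Y = stats.norm.ppf(0.99, loc=mu, scale=sigma)           # model says: 1% chance of exceeding Y

p_model = stats.norm.sf(Y, loc=mu, scale=sigma)         # the model's claimed probability
p_obs = np.mean(precip > Y)                             # how often the data actually exceed Y

print(f"model claims P(X > Y) = {p_model:.3f}; observed frequency = {p_obs:.3f}")
```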
There are other models of uncertainty for X we can use, perhaps an extreme value distribution (EVD). The EVD model may say that there is a 9% chance that X will be larger than Y. To the extent that these predictions matter to you (perhaps you are betting on stocks or making other decisions based on them), you'd rather go with the model which better represents the actual uncertainty. But it would be just as wrong to say that X is EV distributed.
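And a hedged sketch of that comparison, again on made-up data. (In practice extreme value distributions are usually fit to block maxima or threshold exceedances; here, purely to mirror the paragraph above, both models are fit to the same raw values.)

```python
# Hedged sketch: a normal model versus a generalized extreme value (GEV) model of
# uncertainty for the same skewed, simulated values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # made-up skewed values
Y = 8.0                                             # an arbitrary large value

mu, sigma = stats.norm.fit(x)
c, loc, scale = stats.genextreme.fit(x)

print("normal model P(X > Y):", round(stats.norm.sf(Y, mu, sigma), 4))
print("GEV model    P(X > Y):", round(stats.genextreme.sf(Y, c, loc, scale), 4))
print("observed frequency:   ", round(float(np.mean(x > Y)), 4))
```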
The central limit theorem (there are many versions) says that certain functions of X will in the long run "go" or converge to a normal distribution. Two things. One: it is still only our uncertainty in these functions which converges to normal. Two: we recall Keynes, who rightly said "in the long run we are all dead."
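A small simulation of the first point, with everything invented: the individual values are badly non-normal, yet the uncertainty in their mean is already roughly normal with a few dozen observations.

```python
# Simulation: individual values are far from normal, but uncertainty in their
# average is already roughly normal (and the agreement improves as n grows).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 30, 20000
draws = rng.exponential(scale=1.0, size=(reps, n))     # very skewed individual values

means = draws.mean(axis=1)
z = (means - 1.0) / (1.0 / np.sqrt(n))                 # exponential(1): mean 1, sd 1

print("empirical P(Z > 1.645):", round(float(np.mean(z > 1.645)), 4))
print("normal approximation:  ", round(float(stats.norm.sf(1.645)), 4))
```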
Your post gave me a much needed kick in the pants. I've been looking at things as distributions lately without listening to that nagging voice telling me the distribution is a parameter in a model.
Are there any methods you know of for modelling uncertainty for complex models? If I build a complex non-linear model using some wacky method (neural nets, support vectors, etc.) I can't make pretty curves regarding uncertainty. One idea I've toyed with is to build yet another model, using the residuals of the first, to help predict uncertainty, but of course this too would have flaws. What does the good Doctor prescribe?
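One crude way to sketch that residual idea (purely illustrative: a polynomial stands in for the "wacky" model, the data are made up, and the sketch inherits exactly the flaws the commenter mentions):

```python
# Toy sketch: fit a flexible model, then fit a second model to the size of its
# residuals to get a rough (admittedly imperfect) picture of prediction uncertainty.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 10.0, size=400))
y = np.sin(x) + rng.normal(scale=0.1 + 0.05 * x, size=x.size)   # noise grows with x

# First model: a high-degree polynomial standing in for a neural net, etc.
model = Polynomial.fit(x, y, deg=11)
pred = model(x)

# Second model: predict the typical absolute residual as a function of x.
spread_model = Polynomial.fit(x, np.abs(y - pred), deg=2)
spread = np.maximum(spread_model(x), 0.0)     # guard against a negative predicted spread

# Rough uncertainty band: prediction plus/minus twice the predicted residual size.
lower, upper = pred - 2.0 * spread, pred + 2.0 * spread
print("fraction of points inside the rough band:", np.mean((y >= lower) & (y <= upper)))
```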
Briggs, I get your point, but you failed to address the meat of the question…
Doug M,
Really? The meat, I thought, was what happens when you use a normal model to quantify uncertainty in some X when better models exist. The danger is that you under- or over-estimate the uncertainty for particular values of Y. To the extent you make use of the predictions, when the uncertainty in X is over- or under-estimated, you will suffer. And that's it.
We could do some numerical examples. If I had the time!
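A stab at the smallest possible numerical example anyway, with every number invented: suppose you sell a contract that pays $1 whenever X exceeds Y, and you price it using your model's exceedance probability.

```python
# Invented numbers only: the cost of acting on an under-estimated tail probability.
p_model = 0.05    # the normal model's claimed chance that X > Y
p_true = 0.10     # the frequency with which X > Y actually occurs
payout = 1.00     # the contract pays $1 whenever X > Y

premium = p_model * payout                  # the "fair" price according to the model
expected_loss = p_true * payout - premium   # what you lose, on average, per contract
print(f"expected loss per contract: ${expected_loss:.2f}")
```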
Worse than just reification is when the weather reporter says "the high today should be 75F but we only reached 65F." The idea is that deviation from *normal*, which is what it *should be*, somehow cheated us out of nicer, warmer weather.
You know it! All models are wrong… etc. G.E.P. Box: "every statistician knows [or should know!] that there never was such a thing as a normal distribution, or a straight line," etc.
Ok, so this is the one thing regarding statistics I could never understand from the textbooks…
Why does my textbook tell me to use a normal distribution if the sample size is "large enough"? Maybe said differently: if these distribution functions are a model of uncertainty (not of the actual data), how do we get them? And how do we choose which one to use?
To use the example of precipitation given here: the histogram of the data might look normally distributed, but that is not the point. Precipitation is not normally distributed; our uncertainty is. So how do we get that distribution of uncertainty? How do we get any of them? And how do we pick which distribution of uncertainty models the data correctly?
Every time I think I understand it, I get lost again.
In electronics, thermal noise is as close to Gaussian as any level of accuracy requires!
Meanwhile, the central limit theorem never comes close to working.
It’s obvious that precipitation can’t be distributed normally since it can never be negative. But we’re not going to use the normal distribution for that; we’re going to use it to model our uncertainty. Also, the model to use is something we choose. So couldn’t we choose to use the normal distribution to wonder about future rainfall and just ignore the bit that’s less than zero? Or should we start thinking about the real-world impact our choice of statistical model might have?
Jason: the distribution of uncertainty is a model. So you pick whichever one works best. You could use any curve as a model of uncertainty, but some will obviously "fit" better than others. If your model isn't perfect then any estimate of error is just that: an estimate.
Someone might say “the scores on the last exam were normally distributed” to mean just that the histogram looks approximately normal. I don’t see the leaving out of “approximately” as a problem, as it is generally understood. If I say “it’s 70 degrees outside” or “the wall meets the floor at a right angle” an “approximately” is similarly understood.
Rich,
Suppose you were to bin the data. That's what a histogram does. The height of each column represents the density in that bin. You could use the area of each bin, relative to the total area of all the bins, as a reasonable estimate of the probability of landing in that bin. That's what you're doing when you integrate between two limits of a density curve.
Now overlay the histogram with a normal distribution curve. Does it (more or less) match the heights of the histogram? You could use that part of the normal curve to estimate (presumably future or unseen) densities but if you truncate it at either end then the area will not sum to 1. If you use standard functions like pnorm then you get incorrect values. You would have to normalize to the area you actually want.
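A sketch of that renormalization, with invented numbers; scipy's truncnorm does the same bookkeeping, which gives a check on the hand calculation.

```python
# Sketch of the renormalization: a normal model restricted to non-negative values
# must have its probabilities rescaled so they sum to one. Numbers are invented.
import numpy as np
from scipy import stats

mu, sigma = 3.0, 2.5     # a normal uncertainty model for something that can't be negative
a, b = 4.0, 6.0          # an interval we care about

raw = stats.norm.cdf(b, mu, sigma) - stats.norm.cdf(a, mu, sigma)   # pnorm-style answer
kept = 1.0 - stats.norm.cdf(0.0, mu, sigma)                         # probability not truncated away
renormalized = raw / kept

# scipy's truncnorm does the same bookkeeping (its bounds are given in standard units).
tn = stats.truncnorm((0.0 - mu) / sigma, np.inf, loc=mu, scale=sigma)
print(round(raw, 4), round(renormalized, 4), round(tn.cdf(b) - tn.cdf(a), 4))
```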
You get into complications when doing this with off-the-shelf regression software, because most of it assumes the full distribution is being used to model uncertainty. Briggs referred to it as leaking probability, I think.
Yeah, that's it! Anyway, model misspecification can result in under- or over-estimation. Are there any other possibilities? Left- and right-estimation? East- and west-estimation? ^_^
Generally, a capital letter (X) is used to denote a random variable, and a lower case (x) a specific value of a random variable. A probability distribution is employed to model X, not x.
A normal model may be used to approximate the uncertainty in the underlying data generating process of X. It does not mean that X follows exactly and truly a normal. It seems impossible to prove that any generating mechanism would follow an exact known distribution. Hence all models are wrong, and nothing in nature follows a normal distribution exactly.
Being pedantic is fashionable?!
Central limit theorem (CLT) deals with the probability distribution of sample means. For example, suppose I want to compute the probability that a sample average (mean), e.g., the 25-year average monthly rainfall, is no more than 2 inches. How can I calculate this probability without knowing the underlying distribution of (the uncertainty in the generating process of… too many words to type) monthly rainfall? CLT says that, if the sample size is large enough, the probability can be closely approximated by using a normal probability distribution. Presumably, the approximation performs better as the sample size becomes larger.
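A hedged sketch of that calculation (the rainfall mean and spread are invented, and "25-year" is assumed to mean 25 years of monthly values):

```python
# CLT-style approximation: P(average monthly rainfall over 25 years <= 2 inches),
# using only an estimated mean and standard deviation (both invented here).
import numpy as np
from scipy import stats

n = 25 * 12       # assumed: 25 years of monthly values
m_hat = 2.3       # invented estimate of mean monthly rainfall, inches
s_hat = 1.8       # invented estimate of its standard deviation, inches

se = s_hat / np.sqrt(n)                        # standard deviation of the sample mean
p = stats.norm.cdf(2.0, loc=m_hat, scale=se)   # normal approximation via the CLT
print(f"approximate P(sample mean <= 2 inches) = {p:.4f}")
```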
[H]ow do we pick which distribution of uncertainty models the data correctly?
In the simplest case, when there is only one variable, a histogram is a good start! Your knowledge about the variable should also come into play. If you cannot ignore the impossibility of negative values, you may choose a Gamma distribution, which is a rich class of distributions. (Or a Tobit model can be employed in the context of simple linear regression.) Is a distribution employed correctly? I don't know; however, there are criteria for comparing models.
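As one such criterion, here is a sketch using AIC to compare a Gamma model and a normal model on simulated non-negative data (since the data are drawn from a Gamma, the Gamma model should win, unsurprisingly):

```python
# Sketch: compare a Gamma model and a normal model of uncertainty for non-negative
# data using AIC, one of the criteria for comparing models. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=1.5, size=300)    # stand-in for rainfall-like values

a, loc, scale = stats.gamma.fit(x, floc=0)       # Gamma fit with location pinned at zero
mu, sigma = stats.norm.fit(x)

ll_gamma = stats.gamma.logpdf(x, a, loc=loc, scale=scale).sum()
ll_norm = stats.norm.logpdf(x, mu, sigma).sum()

aic_gamma = 2 * 2 - 2 * ll_gamma   # two free parameters: shape and scale
aic_norm = 2 * 2 - 2 * ll_norm     # two free parameters: mean and sd
print("AIC, Gamma model: ", round(aic_gamma, 1))
print("AIC, normal model:", round(aic_norm, 1), "(lower is better)")
```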
I have a PDF from Price-Waterhouse-Coopers entitled, “Uncertainty and Risk Analysis” which says, in the section, “Which distribution is appropriate”, “Triangular distribution: This is the most commonly used distribution. It has no theoretical justification; however, it is a very simple and clear distribution to use”. It’s my uncertainty and I’ll model it how I like, see.
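For what it's worth, the triangular distribution is trivial to use as an uncertainty model: supply a minimum, most-likely, and maximum value (all invented below) and read off whatever probabilities you like.

```python
# Sketch of the triangular distribution as a quick uncertainty model; the
# minimum, most-likely, and maximum values below are invented.
from scipy import stats

low, mode, high = 10.0, 25.0, 60.0    # e.g., invented cost estimates
tri = stats.triang(c=(mode - low) / (high - low), loc=low, scale=high - low)

print("P(value > 40):  ", round(tri.sf(40.0), 3))
print("90th percentile:", round(tri.ppf(0.90), 1))
```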
[I]f I want to compute the probability that a sample average (mean), e.g., the 25-year average monthly rainfall, is no more than 2 inches? How can I calculate this probability without knowing the underlying distribution of (the uncertainty in the generating process of… too many words to type) monthly rainfall?
Normal distributions work best as distributions of statistics. If you have 25 years of annual averages for some value, you can test the hypothesis that the next value observed came from a distribution with the same mean. You can look at averages of temperatures from two periods and test the hypothesis that the two periods share the same mean. These are well known statistical tests.
It doesn't matter how far the distribution of temperatures deviates from normal, the average temperature will have an approximately normal distribution. And it doesn't take a great many observations for the normal approximation to be quite good. Sums of binomial random variables come quite close to normal with as few as a few dozen observations.
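A minimal version of that two-period comparison, with the annual averages simulated rather than real:

```python
# Minimal version of the two-period comparison; the annual averages are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
period1 = rng.normal(loc=14.0, scale=1.2, size=25)   # invented annual averages, period 1
period2 = rng.normal(loc=14.6, scale=1.2, size=25)   # invented annual averages, period 2

res = stats.ttest_ind(period1, period2, equal_var=False)   # Welch's two-sample t-test
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```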
I would add that sums of triangular distributions become approximately normal much quicker than sums of binomials.
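A sketch comparing the last two claims: standardized sums of skewed Bernoulli (binomial) summands versus sums of symmetric triangular summands, measured by distance from the standard normal. (Part of the binomial gap is simple discreteness; everything is simulated.)

```python
# Sketch: standardized sums of skewed Bernoulli summands versus symmetric
# triangular summands, measured by Kolmogorov-Smirnov distance from the
# standard normal (smaller means closer).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 30, 20000   # "a few dozen" summands, many repetitions

bern_sums = rng.binomial(1, 0.1, size=(reps, n)).sum(axis=1)
z_bern = (bern_sums - n * 0.1) / np.sqrt(n * 0.1 * 0.9)

tri_sums = rng.triangular(0.0, 0.5, 1.0, size=(reps, n)).sum(axis=1)
z_tri = (tri_sums - n * 0.5) / np.sqrt(n / 24.0)     # triangular(0, 0.5, 1) variance is 1/24

print("KS distance, Bernoulli sums: ", round(stats.kstest(z_bern, "norm").statistic, 3))
print("KS distance, triangular sums:", round(stats.kstest(z_tri, "norm").statistic, 3))
```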