This is where it starts to get complicated; this is where old school statistics and new school start diverging. And I haven't even started on the new new school.
Parameters are defined and then heavily deemphasized. Nearly the entire purpose of old and new school statistics is devoted to unobservable parameters. This is very unfortunate, because people come away from a parameter analysis far, far too certain about what is of real interest, which is to say, observable data. New new school statistics acknowledges this, but not until Chapter 9.
Confidence intervals are introduced and fully disparaged. Few people can remember that a confidence interval has no meaning, which is a polite way of saying they are meaningless. In finite samples of data, that is, which are the only samples I know about. The key bit of fun is summarized. You can only make one statement about your confidence interval, i.e. the interval you created using your observed data, and it is this: this interval either contains the true value of the parameter or it does not. Isn't that exciting?
Some, or all, of the Greek letters below might not show up on your screen. Sorry about that. I haven't the time to make the blog posting look as pretty as the PDF file. Consider this, as always, a teaser.
For more fun, read the chapter: Here is the link.
CHAPTER 8
Estimating
1. Background
Let's go back to the petanque example, where we wanted to quantify our uncertainty in the distance x the boule landed from the cochonette. We approximated this using a normal distribution with parameters m = 0 cm and s = 10 cm. With these parameters in hand, we could easily quantify uncertainty in questions like X = "The boule will land at least 17 cm away" with the formula Pr(X|m = 0 cm, s = 10 cm, EN) = Pr(x > 17 cm|m = 0 cm, s = 10 cm, EN). R even gave us the number with 1-pnorm(17,0,10) (about 4.5%). But where did the values of m = 0 cm and s = 10 cm come from?
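If you want to check that number yourself, this is all it takes in R (pnorm() gives the normal distribution function, so one minus it gives the chance of landing beyond 17 cm):

1 - pnorm(17, mean = 0, sd = 10)   # about 0.045, i.e. 4.5%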
I made them up.
It was easy to compute the probability of statements like X when we knew the probability distribution quantifying its uncertainty and the value of that distribution's parameters. In the petanque example, this meant knowing that EN was true and also knowing the values of m and s. Here, knowing means just what it says: knowing for certain. But most of the time we do not know EN is true, nor do we know the values of m and s. In this Chapter, we will assume we do in fact know EN is true. We won't question that assumption until a few Chapters down the road. But, even given EN is true, we still have to discern the values of its parameters somehow.
So how do we learn what these values are? There are some situations where we are able to deduce some or all of the parameters' values, but these situations are shockingly few in number. Nearly all the time, we are forced to guess. Now, if we do guess (and there is nothing wrong with guessing when you do not know), it should be clear that we will not be certain that the values we guessed are the correct ones. That is to say, we will be uncertain, and when we are uncertain what do we do? We quantify our uncertainty using probability.
At least, that is what we do nowadays. But then-a-days, people did not quantify their uncertainty in the guesses they made. They just made the guesses, said some odd things, and then stopped. We will not stop. We will quantify our uncertainty in the parameters and then go back to what is of main interest, questions like what is the probability that X is true? X is called an observable, in the sense that it is a statement about an observable number x, in this case an actual, measurable distance. We do not care about the parameter values per se. We need to make a guess at them, yes, otherwise we could not get the probability of X. But the fact that a parameter has a particular value is usually not of great interest.
It isn't of tremendous interest nowadays, but again, then-a-days, it was the only interest. Like I said, people developed a method to guess the parameter values, made the guess, then stopped. This has led people to be far too certain of themselves, because it's easy to get confused about the values of the parameters and the values of the observables. And when I tell you that then-a-days was only as far away as yesterday, you might start to be concerned.
Nearly all of classical statistics, and most of Bayesian statistics, is concerned with parameters. The advantage the latter method has over the former is that Bayesian statistics acknowledges the uncertainty in the parameter guesses and quantifies that uncertainty using probability. Classical statistics, still the dominant method in use by non-statisticians¹, makes some bizarre statements in order to avoid directly mentioning uncertainty. Since classical statistics is ubiquitous, you will have to learn these methods so you can understand the claims people (attempt to) make.
So we start with making guesses about parameters in both the old and new ways. After we finish with that, we will return to reality and talk about observables.
2. Parameters and Observables
Here is the situation: you have never heard of petanque before and do not know a boule from a bowl from a hole in the ground. You know that you have to quantify x, which is some kind of distance. You are assuming that EN is true, and so you know you have to specify m and s before you can make a guess about any value of x.
Before we get too far, let's set up the problem. When we know the values of the parameters, like we have so far, we write them in Latin letters, like m and s for the normal, or p for the binomial. We always write unknown and unobservable parameters as Greek letters, usually μ and σ for the normal and θ for the binomial. Here is the normal distribution (density function) written with unknown parameters:
(see the book)
where μ is the central parameter, and σ² is the variance parameter, and where the equation is written as a function of the two unknowns, N(μ, σ). This emphasizes that we have a different uncertainty in x for every possible value of μ and σ (it makes no difference if we talk of σ or σ²; one is just the square root of the other).
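For reference, the density displayed in the book is the standard normal form (this is the usual expression, written here in LaTeX, not a reproduction of the book's typesetting):

N(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)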
You may have wondered what was meant by the phrase "unobservable parameters" in the last paragraph (if not, you should have wondered). Here is a key fact that you must always remember: not you, not me, not anybody, can ever measure the value of a parameter (of a probability distribution). They simply cannot be seen. We cannot even see the parameters when we know their values. Parameters do not exist in nature as physical, measurable entities. If you like, you can think of them as guides for helping us understand the uncertainty of observables. We can, for example, observe the distance the boule lands from the cochonette. We cannot, however, observe m even if we know its value, and we cannot observe μ either. Observables, the reason for creating the probability distributions in the first place, must always be of primary interest for this reason.
So how do we learn about the parameters if we cannot observe them? Usually, we have some past data, past values of x, that we can use to tell us something about that distribution's parameters. The information we gather about the parameters then tells us something about data we have not yet seen, which is usually future data. For example, suppose we have gathered the results of hundreds, say 200, of past throws of boules. What can we say about this past data? We can calculate the arithmetic mean of it, the median, the various quantiles, and so on. We can say this many throws were greater than 20 cm, this many less. We can calculate any function of the observed data we want (means and medians etc. are just functions of the data), and we can make all these calculations never knowing, or even needing to know, what the parameter values are. Let me be clear: we can make just about any statement we want about the past observed data and we never need to know the parameter values! What possible good are they if all we wanted to know was about the past data?
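A quick sketch in R, using rnorm() only to manufacture 200 stand-in throws (your actual past data would be whatever you recorded):

x.old <- rnorm(200, 0, 10)     # pretend past data; no parameters needed below
mean(x.old)                    # the arithmetic mean
median(x.old)                  # the median
quantile(x.old, c(.25, .75))   # any quantiles you like
sum(x.old > 20)                # how many throws were greater than 20 cm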
There is only one reason to learn anything about the parameters. This is to make statements about future data (or to make statements about data that we have not yet seen, though that data may be old; we just haven't seen it yet; say archaeological data; all that matters is that the data is unknown to you; and what does "unknown" mean?). That is it. Take your time to understand this. We have, in hand, a collection of data x_old, and we know we can compute any function (mean etc.) we want of it, but we know we will, at some time, see new data x_new (data we have not yet seen), and we now want to say something about this x_new. We want to quantify our uncertainty in x_new, and to do that we need a probability distribution, and a probability distribution needs parameters.
The main point again: we use old data to make statements about data we have not yet seen.
3. Classical guess
We need to find some way to map our evidence E and the past values of x into information about the parameters. There are lots of different ways to guess at parameter values, some easy and some hard, and they all fall into two broad classifications: yes, a classical and a modern.
We have past values of x and we want to know about future, or at least other, unknown values of x. Our evidence is E, which at least means that we know the probability distribution (normal, say) of the observables. In this book we will also assume that E means that knowledge of each individual observation is irrelevant to knowing what each other observation will be. We have to find a way to guess, or estimate, these unknown and unobservable parameters given E and the old data x_old.
The classical way to do this is to pick an ad hoc function of the old data and label it f(x_old) = θ̂, where that "hat" indicates that the value θ̂ is only a guess. Most classical estimates have the goal that the estimate is "unbiased", or E_x(θ − θ̂) = E_x(θ − f(x_old)) = 0, meaning that the expected distance between the actual value of θ and the guess θ̂ is 0. Unbiasedness sounds like a nice thing to have, and it surely isn't a bad idea, but it turns out to cause a lot of problems, most of which I cannot tell you about without introducing a lot of math. The criterion is also not compelling because of that expected-value business. Expected value with respect to what? Well, with respect to an infinite number of future (not yet observed) data x…which is just the data whose uncertainty we are trying to quantify. Anyway, in R, estimating the parameters of a normal distribution classically is easy, and you already know how to do it! If x is our old, previously observed data, x_1, x_2, …, x_n, then
μ̂ = mean(x)
σ̂ = sd(x)
The mean you already know how to calculate. It is often written x̄, and called "x bar". When you see a data value with a bar over it, you know it is a mean. The observed variance of old data is (see the book), and the observed standard deviation of old data is the square root of that. Look at the formula and notice that the standard deviation is a measure of how far, on average, the old data values are away from the observed mean. The square is taken, (x_i − x̄)², so that data values that were lower than the observed mean are treated the same as data values that were higher. (If you have missing data in x, recall Chapter 7, where we had to modify the function like this: mean(x, na.rm=T); the same goes for the sd function.)
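In R, assuming x holds your old data, the guesses and the formula hiding inside sd() look like this (a sketch; sd() does the work for you):

mu.hat    <- mean(x)                       # the guess for mu, i.e. x bar
sigma.hat <- sd(x)                         # the guess for sigma
n <- length(x)
sqrt( sum( (x - mean(x))^2 ) / (n - 1) )   # the same number as sd(x)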
We'll never calculate the observed standard deviation by hand. But it's pretty convenient to have the observed mean stand in for our guess of μ. Unfortunately, because μ̂ = mean(x), a lot of people have taken to calling μ (without the hat) the mean, which it most assuredly is not. μ is an unobservable parameter, while the mean is just the weighted sum of a bunch of data we have already observed. This is a subject that I'll return to later.
4. Confidence intervals
OK, it might have been hard to understand all this so far, but it's about to get weird, so be steady. The value μ̂ we got before was precise; it is a known, observed number (it is the mean). But do we really believe, given the data and other evidence, that the exact, all-time, incorruptible, immutable value of μ is, to as many decimal places as you like, equal to μ̂? You may have guessed, by the subtle way I've asked that question, that the answer is "no." And you'd be right! Suppose μ̂ = 5.41. Maybe μ is 5.41, but it might also be, say, 5.40, or 5.39, or other values close by, mightn't it? This is a fancy way to state that we are uncertain what the value of μ is. How do we express this uncertainty? Use probability? No. It is forbidden to use probability to quantify the uncertainty of parameter values in classical statistics.
Instead, classical statisticians use something called a confidence interval, which is an interval on the order of μ̂ ± c(n), where c(n) is some number that usually depends on the number n of your data points and on the old data itself. Bigger c(n) leads to wider intervals; smaller c(n) leads to narrower ones. So you might expect that when you say "I think μ is 5.41 plus or minus 4" you have a better chance of being right than when you say "I think μ is 5.41 plus or minus 1", because the former interval allows you greater scope of covering the actual (unobservable) value of μ. And, classically, you'd be dead wrong.
Which is why confidence intervals are one of the screwiest things to come out of the classical tradition, in that they fail utterly to do what they set out to do. But their use is so ubiquitous (not to say iniquitous) that I'm afraid you are going to have to learn to interpret them. And this is one of the most important things you must learn in this book, because you will see confidence intervals everywhere; thus it is imperative you learn what they are and what they are not.
Part of the problem is that you simply cannot learn what a confidence interval is by reading most introductory statistics books. Take, for example, the very typical book Statistics: Informed Decisions Using Data by Sullivan (2007, pp. 448-449), often used in Stats 101 courses. He officially defines a confidence interval for an unknown parameter as “an interval of numbers” (p. 449), which is as pure a tautology as you’re ever likely to meet, and being a tautology, it is therefore, of course, true, but of no help (it says the confidence interval is an interval). But a page earlier, we find Sullivan implying that smaller intervals give us less confidence in the value of the parameter than larger intervals. This implication is, as I said above, false, and is no part of the actual, mathematical definition of a confidence interval.
Maybe something like this is more accurate:
[A] 95% level of confidence…implies that, if 100 different confidence intervals are constructed…we will expect 95 of the intervals to include the parameter and 5 to not include the parameter [p. 449].
Actually, we can expect nothing like this. And though this definition is closer to the truth, it is still false (to find out why, keep reading). Incidentally, classical theory lets you calculate confidence intervals at any level you want, but the only one you ever really see is the 95% interval, so that one is all I will talk about.
Here's the actual definition. Suppose you gather some data and construct a confidence interval using the formula C1 = {μ̂ ± c(n)} (the actual formula is not of much interest to us; the software will give us the interval automatically). That is, C1 is the interval calculated using the data we collected. Now imagine (incidentally, this is all you can do) that you re-collect your data in exactly the same way, where every physical thing is exactly the same as it was when you collected it the first time. That is, the state of the universe has to be identical to where it was when you first collected your data. Except that it must be "randomly" different, or different in ways that you know nothing about. Very well, you now have a second data set equal in every way to the first, except that it is "randomly" different, whatever that means. You then construct a new confidence interval C2 using the exact same formula on this second set of data (which is also the same size, n). Now do it all again and construct C3, and again for C4, and again and again an infinite number of times. When you are done, 95% of those intervals will cover the actual value of μ.
(see the book)
This is shown in the picture for the first eight confidence intervals (this is all simulated data). The true value of μ is indicated by the solid line. Some of the intervals "cover", i.e. contain, the true value of μ, and some do not. More than that, we cannot say. Our confidence interval, the bottom bold one, is the only confidence interval we'll actually see; the others are hypothesized entities that must be conjured into existence if confidence intervals are to be properly interpreted.
I only showed the first 8 (out of the infinite number of) confidence intervals that must exist for every problem you ever do. If you only repeat your experiment a finite number of times, and therefore only have a finite number of confidence intervals, say, 1,000,000, then it is false that we expect any particular number of them to cover the true value of μ: stopping the construction of confidence intervals at any finite value invalidates the interpretation that 95% of the intervals will cover the actual value of μ.
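You can watch the long-run story play out with a simulation sketch; everything here (the seed, the 10,000 repetitions, the t-based interval formula) is my choice for illustration, and we know μ and σ only because we are doing the simulating:

set.seed(42)                          # any seed will do; this is for repeatability
mu <- 0; sigma <- 10; n <- 200
covered <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)           # "re-collect" the data
  ci <- mean(x) + qt(c(.025, .975), n - 1) * sd(x) / sqrt(n)
  ci[1] < mu & mu < ci[2]             # does this interval cover mu?
})
mean(covered)                         # close to 0.95

That 0.95 is a statement about the collection of intervals, not about any single one of them, which is exactly the problem.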
Yes, this is the actual definition, but saying it this way leaves a bad taste in one's mouth, especially because of that bit about "infinite" numbers of repetitions. Statisticians, feeling uneasy about infinities, and their physical impossibility, usually resort to the euphemism "long run" to describe the number of repetitions needed. They know very well that, mathematically, long run equals infinite, but saying "in the long run" gives the comfortable impression that all you need is a lot, and not an infinite number, of repetitions.
By now you are thinking, “OK, I get it. So what? What you’re saying is just a quibble. Who cares about infinities or long runs, anyway. Give me some information I can use! What do you do with your confidence interval, the one you just constructed? What does it mean?”
Nothing. Not a thing. It certainly does not mean that you are 95% sure that your interval contains the actual value of μ. That is, you cannot, under any circumstances, say that "There is a 95% chance that the true value of μ lies in the 95% confidence interval I have constructed." This statement is a direct probabilistic statement about the interval you have just created. Recall our key rule: it is forbidden in classical statistics to make direct probability statements about unobservable parameters. Memorize this. Your confidence interval only has an interpretation as part of an infinite set of other confidence intervals.
We have just hit upon the dirtiest open secret of classical statistics. There is no interpretation of your confidence interval other than this: the best you can say is that your interval either contains the actual value of μ or it does not, a statement which is a tautology, and, again therefore, true, but of no help (incidentally, Sullivan (2007) finally acknowledges this on p. 500). So what do you do with the interval you have just created? Why even bother, since it has no direct relation to the problem at hand? It's even worse. Pick any two different numbers, say, 12 and 42. It is a true statement to say that this interval either contains μ or it does not, for any statistical problem done by anybody with any data any time whatsoever (make sure you understand that before reading further).
The guy who invented confidence intervals, Dzerzij (Jerzy) Neyman, a statistician, knew about the interpretational problems of confidence intervals, and was concerned. But he was even more concerned about something called inductive arguments. An example due to Stove (1986): All the flames I have observed before have been hot (the premise); therefore, this flame will be hot (the conclusion). Neyman, and many other influential 20th century statisticians, rejected inductive arguments as a basis for probability. They felt arguments like these were "groundless" or that inductive arguments were fallible because of the true statement that, for the flames example, there is nothing in the universe guaranteeing that this flame will be hot². Inductive arguments are needed to make direct probabilistic statements about things like confidence intervals. If you reject them, then you cannot use probability. So Neyman, and those who followed him (which was nearly everybody), tried to take refuge in arguments like this: "Well, you cannot say that there is a 95% chance that the actual value of the parameter is in your interval; but if statisticians everywhere were to use confidence intervals, then in the long run, 95% of their intervals will contain their actual values." Thirty-two extra credit points to those who can show the obvious flaw in this argument (see the homework).
The flaw in that argument was so obvious that it was evident to Neyman himself. And so, with nowhere else to turn, Neyman recommended a dodge and said this: "The statistician…may be recommended…to state that the value of the parameter θ is within (the just calculated interval)" merely by an act of will (Neyman (1937), quoted in Franklin (2001a)).
What you would like to be able to say is that "I have 95% (or whatever) confidence that this interval covers the true value of μ." But you can never do this in classical statistics.
In R, getting the confidence interval of a normal distribution classically is a little more work than just getting the estimates, but it isn't really that hard. This is for the appendicitis data, the White.Blood.Count (don't forget to read the data in and attach it):
confint(glm(White.Blood.Count ~ 1))
The function confint calculates 95% confidence intervals. The inside function glm, with that funny argument ~1, basically says, "The uncertainty in the variable should be quantified by a normal distribution." Just take my word for it now; we'll see this function later and this notation will become clear then. Anyway, after you run the command you will see something like this:
2.5 % 97.5 %
9.991874 10.818126
Ignore the word (Intercept); it is actually White.Blood.Count (this is because the function works for any variable name you care to enter). The 2.5 % and 97.5 % are like quantiles; subtract 2.5% from 97.5% and you get the level of the interval: 97.5% − 2.5% = 95%.
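If you are curious, you should get nearly the same numbers "by hand", since for this simple model the interval is essentially the familiar mean plus or minus a multiple of the standard error (a sketch, not the exact internal computation confint performs; missing values would need the na.rm treatment from Chapter 7):

m  <- mean(White.Blood.Count)
se <- sd(White.Blood.Count) / sqrt(length(White.Blood.Count))
m + c(-1, 1) * qnorm(0.975) * se    # roughly the confint output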
We could use another R function and compute the confidence interval for σ, but it is not of great interest because later we'll see how to do all these things more or less automatically. Besides, we want to concentrate on what these intervals mean. If you've already forgotten, then go back and read this section from the beginning. One thing that is certain is that confidence intervals say nothing about the observables, the data x. If they say anything, they say something about the unobservable parameters. But what? The interval we computed for white blood count was about [10, 11]. This is an interval about the estimated central parameter μ and not about the mean. We know the mean (find it in R). The confidence interval is an attempt to put a measure of precision on the guess μ̂. It says nothing about the mean, and nothing about actual values of white blood count. Never forget this.
5. Bayesian way
The idea behind modern statistics is that you quantify any and all uncertainty you have in anything using probability. We've already seen how to quantify uncertainty using probability for observables; that is, for actual data. That turns out to be done the same way classically and Bayesianly. This is what we did in the first few Chapters, was it not? We wrote down some probability distribution, with known parameters, and made probability statements about observable data. Classical and Bayesian statistics begin to diverge when we start to talk about unknown parameters and how to make guesses about these parameters.
We made guesses classically by specifying some ad hoc function of the data, giving us μ̂ and σ̂; afterwards, we created a confidence interval for this guess. I stressed, heavily, that this confidence interval is not designed to express any actual uncertainty in μ, because that goes against the classical philosophy, which is that you cannot directly express uncertainty in unobservable parameters using probability.
In Bayesian statistics, you can, and must, express uncertainty in unobservable parameters using probability. How this works might sound complicated, and some of it is, but once you get how it works for, say, normal distributions, you will then know how it works for every other statistics problem in the world. This is not so for classical statistics, where you have to memorize a new set of ad hoc functions for every problem. In this way, Bayesian statistics is a vast simplification; however, before you can reach this simplification plateau, you initially have to climb a steeper hill than you do classically. The good news is that there is only one hill to climb.
Let's recall the normal probability distribution (density function):
(see the book)
written here as a function of x, or p(x|μ, σ, EN) (we could have used N() as before; the actual letter does not matter). Do you remember probability rule number 4, or Bayes's rule? If not, go back and re-read Chapter 2. Pay special attention to equation (6). I'll wait here until you're done.
Back? OK, let’s write equation (6) using different letters, so that
(see the book)
becomes
(see the book)
where B is now (μ, σ) and A is x. Remember, (μ, σ) is shorthand for the statement "The value of the central parameter is μ and the value of the variance parameter is σ", and x is shorthand for the statement X = "The value of the observed data is x." We already know how to write p(x|μ, σ, EN) mathematically. Our goal is to discover how to write the left-hand side, which is the probability distribution of (μ, σ) given the data and EN. This quantifies our uncertainty in (μ, σ) given what we learned in the data (and considering the evidence EN). In order to calculate the left-hand side, we then also need to know p(μ, σ|EN). We also need p(x|EN), but once we know p(μ, σ|EN), it automatically pops out because of some math that need not concern us here.
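Written out with these letters (this is just equation (6) with the substitutions above, in LaTeX):

p(\mu, \sigma \mid x, E_N) = \frac{ p(x \mid \mu, \sigma, E_N)\, p(\mu, \sigma \mid E_N) }{ p(x \mid E_N) }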
What is p(μ, σ|EN)? Well, it quantifies our uncertainty in (μ, σ) before seeing any data; that is, it is conditional only on EN. p(μ, σ|EN) is a probability distribution that you have to specify before you can get to the probability p(μ, σ|x, EN). It also has an official name, which is the prior, because it's what you know about (μ, σ) prior to adding in the information in the data. Not surprisingly, then, p(μ, σ|x, EN) is called the posterior, which is the probability distribution expressing everything we know, all our uncertainty, about (μ, σ) after having seen some data x.
How about the value of p(μ, σ|EN)? Well, it turns out to be a complicated situation, but the gist of it is that p(μ, σ|EN) gives the probability of each possible value of (μ, σ), and since we initially know very little about (μ, σ), every possible value of (μ, σ) is more or less equally probable. This situation is called assigning a flat prior, the "flat" describing the shape of the probability distribution's picture (i.e., a flat line). (There is more than one prior you can use besides this flat one, but the difference it makes in the posterior is minimal. The parameters are also usually assumed to be continuous numbers, which, if you recall the discussion from Chapter 4, can be a problem. We will ignore these difficulties in this book.) Once you have the prior and p(x|μ, σ, EN), you can then calculate the posterior using equation (17). Technically, since we are saying (μ, σ) has a certain probability distribution, this is also information that we should keep note of, but we'll append it to EN so that it now means "The uncertainty in the observable is quantified by a normal distribution and the prior on the parameters is flat." If we need to be careful about this, and sometimes we do (not in this book), we can expand the notation to indicate the exact kind of prior we use.
Now here is another little secret: for very simple situations, the Bayesian results are the same as the classical results! No new calculations have to be learned or done!
After we take some old data, we can calculate our full uncertainty in (μ, σ) by drawing pictures of the probability distributions (we'll do this later). If we are forced to pick just one "best" value, we would pick the arithmetic mean and standard deviation, exactly as in classical statistics. If we wanted to express our uncertainty a little more fully than just using one number (for each parameter), we could give the best number and an interval, some plus/minus bound on how certain we are that the best value actually matches the true value of (μ, σ). Here is the best part: the confidence interval, which was meaningless before, is this interval, and is now called a credible interval. It has the natural interpretation that there is a 95% chance that the true value of the parameter lies in this interval. Isn't that wild?
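A sketch of that claim in R: under this flat-prior setup the posterior for μ works out to a (shifted, scaled) t distribution, so the 95% credible interval is computed with the very same arithmetic as the classical interval (assuming x holds your old data):

n <- length(x)
mean(x) + qt(c(.025, .975), n - 1) * sd(x) / sqrt(n)   # 95% credible interval for mu

Same numbers as before; the difference is that now you are allowed to say there is a 95% chance μ is in there.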
By now you might be thinking, "Hey, if the results are the same, why did you go on and on and on about how confidence intervals are meaningless? All you did was give them a new name! Big deal. You are wasting my time and trying to confuse me." Hold on a minute, though. The Bayesian results are the same as the classical ones, but only for simple situations. The good news for you is that, in Stats 101, you hardly move beyond these very simple situations. Once you do move into the great statistical beyond, like using binomial instead of normal distributions, the Bayesian methods really come into their own, and then you cannot assume the classical computations give you the correct answer. I'll talk about these techniques as we move along.
¹ I mean those people who were not formally trained in the mathematical subjects of probability and statistics. The vast numbers of people who compute statistics have not had this training beyond, say, a class given in a Psychology department by a professor who was himself not so trained, and so on.
² To which you can argue: OK, if you doubt it, stick your hand into this flame.