Category: Philosophy

The philosophy of science, empiricism, a priori reasoning, epistemology, and so on.

June 24, 2010

What Do You Really Want: Part II

Class is, thank God, rapidly coming to an end. I am sure we are all grateful. Here’s an addendum to yesterday; but only briefly explained. Because of the crush of the end of the class, I have no idea what’s going on in the world. All emails will be answered starting over the weekend.

We talked earlier of the widespread misuse of normal distributions. They are used with wild abandon in instances where they have no place. Normals should only be used to represent the uncertainty in numbers, the range and diversity of which warrant a reasonable approximation. Here’s what I mean.

Modeling College Grade Point Average (CGPA) makes an excellent illustration: we want to predict an incoming freshman’s CGPA at the end of the term. Since we do not know, in advance, what that CGPA will be, we can express our uncertainty in it using some probability model. Like everybody else, we’ll use a normal.

To help us predict CGPA, we also have beginning freshmen SAT scores. We might expect that students coming in with higher SATs will have higher CGPA at the end of the first year. Make sense? (In this example, we can, as I often do, also use high school GPA and other pertinent measures; however, the point is easily made with just one variable.)

We start by collecting data on historical freshmen. That is, we look at a batch of CGPAs from students in prior years; we also collect their SATs. Like all good statisticians, we display the data in a plot. Doing so, we discover that there is a rough trend in the expected direction: higher SATs are associated with higher CGPA. The relationship is, of course, not perfect. Not every kid with a high SAT has a high CGPA; and not every kid with a low SAT has a low CGPA.

The usual next step is to use a regression model, in which CGPA is predicted by a straight-line function of SAT plus some “random” error. An out-of-the-box regression model uses normal distributions to characterize that “random” error.

Now, pretty much every implementation of software will fit this regression model and then spit out the coefficients of the model: these coefficients are also called parameters, or “betas”. It’s all one. Classically, the thing to do is to form the “null” hypothesis that the parameter associated with SAT in the regression model is precisely equal to 0. The p-value is glanced at, and if—we hope!—it is less than the magic value of 0.05, the classical statistician will announce, “Higher SATs associated with higher CGPAs.” They are not entitled to say that, but never mind. Close enough.

In data that I have, I run this exact test and have received a very publishable p-value of about 0.001. “Highly significant!” it would be said. But let’s be Bayesian and examine the posterior distribution of the parameter. We can compute the probability that the parameter associated with SAT is larger than zero (given the information in our sample, and given the truth of the normal-regression model we used).

OK; that posterior probability is over 99.9%, which means we can be pretty darn sure that the parameter associated with SAT is larger than zero. We are now on firmer ground when we say that “Higher SATs associated with higher CGPAs.”

But just think: we already knew that higher SATs were associated with higher CGPAs. That’s what our old data told us! In our old data, we were 100% sure that, roughly, higher SATs were associated with higher CGPAs. What we really want to know, however, is the CGPAs of future freshmen. We already know all about the freshmen in our dataset.

Now suppose we know that an incoming freshman next year will have an SAT score of 1000. Modern predictive/objective Bayesian analysis allows us to compute the uncertainty we have in the CGPA of this freshman (and of every other freshman who has an SAT of 1000). That is, we can compute the probability that the observable, actual CGPA will take any particular value. This is not the same thing as saying the parameter in the model takes any value. This tells us about what we can actually see and measure.

Here’s the problem. Doing this (for this data) shows us the probability of future CGPAs all right; but because we used normal distributions we have a significant, real probability of seeing CGPAs larger than 4.0. And we also see a significant probability of seeing CGPAs smaller than 0. Both situations are, of course, impossibilities. But because we used a normal distribution, we have about a 10% chance of the impossible!

Which merely means the normal model stinks. But we never—not ever—would have had a clue of its rottenness if we just examined the p-value or the posterior of the parameter. Parameters are not observable!
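To see how a normal regression hands probability to impossible CGPAs, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are simulated (they are not the data discussed above), the variable names are invented, and the predictive distribution is a simple plug-in normal that ignores uncertainty in the fitted coefficients. A full predictive analysis would widen the tails further.

```python
import random
from statistics import NormalDist, mean

random.seed(1)

# Hypothetical data: SAT scores and CGPAs invented for illustration only.
sat = [random.uniform(700, 1500) for _ in range(200)]
cgpa = [min(4.0, max(0.0, 0.5 + 0.0025 * s + random.gauss(0, 0.7))) for s in sat]

# Ordinary least squares for CGPA = b0 + b1 * SAT + "random" error.
sx, sy = mean(sat), mean(cgpa)
b1 = sum((x - sx) * (y - sy) for x, y in zip(sat, cgpa)) / sum((x - sx) ** 2 for x in sat)
b0 = sy - b1 * sx

# Residual standard deviation (n - 2 degrees of freedom).
resid = [y - (b0 + b1 * x) for x, y in zip(sat, cgpa)]
sigma = (sum(r * r for r in resid) / (len(sat) - 2)) ** 0.5

# Plug-in normal predictive distribution for a new student with SAT = 1000.
pred = NormalDist(mu=b0 + b1 * 1000, sigma=sigma)
p_below = pred.cdf(0.0)        # probability of a CGPA below 0
p_above = 1.0 - pred.cdf(4.0)  # probability of a CGPA above 4.0

print(f"P(CGPA < 0) = {p_below:.6f}, P(CGPA > 4) = {p_above:.6f}")
```

Both probabilities come out strictly positive, however small, and neither shows up anywhere in the slope `b1` or its p-value. The question is asked of the observable, not the parameter.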

This also shows you that even predictive/objective Bayesian analysis fails when you start with a bad model. A bad model stinks in any philosophy. I hope you realize that using the old ways will never give you a hint that your model is bad: you will never know you are making impossible predictions just by looking at parameters.

June 22, 2010

Lesson n+1: Measurement & Observables

Just a very crude sketch today: it is not complete by any stretch. Naturally, the students in the summer class don’t receive this level of information.

Best we can tell, the universe is, at base, discrete. That is, space comes to us in packets, chunks of a definite size, roughly 10^-35 meters on a side. You may think of quantum mechanics; quantum, after all, means discrete.

Now, even if this isn’t so; that is, even if the universe proves to exist as an infinitely divisible continuum, it will still be true that we cannot measure it except discretely.

Take, for example, a physician reading blood pressure with an ordinary sphygmomanometer, the cuff with a pump and the small analog dial. At best, a physician can reliably, at a glance, gauge blood pressure to within 1 millimeter of mercury. Even digital versions of this instrument fare little better.

But, of course, these instruments can improve. The readout can continue to add decimal places as the apparatus better discerns the amount of mercury forced through a tube, even to the point of counting individual molecules, but no further. Fractional or continuous molecules aren’t in it.

Further, every measurement is also constrained by certain bounds, which are a function of the instrument itself and the milieu in which it is employed. That is, actual measurements do not, and can not, shoot off to infinity (in either direction).

Every measurement we take is the same. This means that when we are interested in some observable, particularly in quantifying the uncertainty of this observable, we know that it can take only one value out of a set of values. That is, the observable can only take one value at a time.

I am considering what is called a “univariate” observable; also called a point measurement. It doesn’t matter if the observable is “multivariate”, also called a vector measurement. If a vector, then each element in the vector can take only one out of a set of values at any one time.

We also know that any set of measurements we take is finite. Finite can be very large, of course, but large is always short of infinite. We might not know, and often do not know, how many measurements we can take of any observable, but we always know that this count will be finite.

The situation of measuring any observable at discrete levels a finite number of times is exactly like the following situation: a bag contains N objects, some of which may be labeled 1 and the others something else. That is, any object may be a 1 or it may not be. That statement is a tautology; and based on the very limited information in it, all we can tell is that an object with a 1 on it is possible.

In this bag, then, there can be no objects with a 1 on them, 1 such object, 2 such objects, and so on up to all N objects. We want the probability that no objects have a 1, just one does, and so on. Through the theorem of the symmetry of individual constants (which we can prove another day), it is easy to show that the probability of any particular outcome is 1 / (N + 1), because there are N + 1 possible outcomes.

This is, of course, the uniform distribution, in line with what people usually call an “ignorance” or “flat” prior. But it is not a prior in the usual sense. It is different because there are no parameters here, only observables. This small fact becomes the fundamental basis of the marriage of finite measurement with probability.

Suppose we take a few (something less than N) objects from the bag and note their labels. Some, none, or all of these objects will have a 1. Importantly, the number of 1s we saw in our sample gives us some information about the possible values of the rest of the objects left in the bag.

No matter the value of N, we can work out the probability that no remaining objects are labeled 1, that just one is, and so on. Again, no parameters are needed. We are still talking about observables and observables only.

We can continue this process by removing more, but not yet all, objects from the bag. This gives us updated information, which we can use to update the probability that no objects remaining are labeled 1, that just one is, and so on. (For those who know, this is a hypergeometric distribution.)
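The bag calculation can be sketched in a few lines of Python. This is one way to carry it out under the stated assumptions, not a definitive implementation: the function name `remaining_ones` and the example numbers are invented for illustration. It starts from the 1 / (N + 1) probabilities over the total count of 1s, weights each total by the chance of the sample actually seen, and reports the probability of each possible count of 1s among the objects still in the bag. Exact fractions are used so nothing is hidden by rounding.

```python
from fractions import Fraction
from math import comb

def remaining_ones(N, n, k):
    """P(r of the N - n unseen objects are labeled 1 | k ones among n drawn).

    Assumes the 1 / (N + 1) starting probabilities over the total number M
    of 1s in the bag; that constant cancels when the weights are normalized.
    """
    weights = {
        M - k: comb(M, k) * comb(N - M, n - k)   # ways to see this sample given M
        for M in range(k, N - (n - k) + 1)       # totals consistent with the sample
    }
    total = sum(weights.values())
    return {r: Fraction(w, total) for r, w in weights.items()}

# Before any draw, every count from 0 to N is equally probable.
print(remaining_ones(10, 0, 0))

# Hypothetical sample: N = 10 objects, 4 drawn, 3 of them labeled 1.
for r, p in remaining_ones(10, 4, 3).items():
    print(r, p)
```

No parameter appears anywhere: the inputs are counts of objects seen, and the output is a probability for counts of objects not yet seen.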

Once more, we still have no need of parameters; we still talk of observables. This assumed we knew N, and that N was finite. But if we do not know N, but do know it is “large”, we can take it to the limit, and then use the resulting probabilities as approximations to the true ones. (This limit is the binomial.) The limiting distribution then speaks of parameters; it is important to understand that they only arise because of the limiting (approximating) operation.
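A rough numerical illustration of that limit, again as a sketch with invented names and numbers: hold the fraction of 1s in the bag near a fixed value `theta`, let N grow, and compare the finite-bag (hypergeometric) probabilities for a sample of 10 with the binomial probabilities. The parameter `theta` exists only on the binomial side of the comparison.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(k ones in n draws without replacement from N objects, M of them 1s)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def binom_pmf(k, n, theta):
    """P(k ones in n draws) under the limiting binomial with parameter theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def max_gap(N, n=10, theta=0.3):
    """Largest absolute difference between the two sets of probabilities."""
    M = round(theta * N)  # number of 1s in a bag of size N
    return max(
        abs(hypergeom_pmf(k, N, M, n) - binom_pmf(k, n, theta))
        for k in range(n + 1)
    )

for N in (20, 200, 2000, 20000):
    print(N, max_gap(N))
```

The gap shrinks as N grows, which is the sense in which the binomial is an approximation to the finite, parameter-free calculation.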

Well, you might have the idea. If we do not know N, and cannot say it is “large”, we can apply the same logic to its value as we did to the labels. Point is, all of probability can fit into a scheme where no parameters are ever needed, where everything starts with the simplest assumptions, and ends quantifying uncertainty in only what can be measured.

June 21, 2010

Lesson Somethingorother: Against the P-value

I’ve lost count of the lesson numbers.

The definition of a p-value, here phrased in the incorrectly named “test of difference in means”, is:

Given the truth of a probability model used to represent the uncertainty in the observables in group A and in group B, and given that one or more of the parameters of those probability models are equal, and given that the “experiment” from which the data in A and B were collected were to be repeated a number of times approaching the limit, the probability of seeing a statistic calculated from each of these repetitions larger (in absolute value) than the one we actually found.

Incidentally, there is no need to ever “test” for the difference in means, because means can be computed; that is, observed. You can tell at a glance whether they are different. The actual hypothesis test says something indirect about parameters. Anyway, if the parameters are equal, the models for the two groups are the same. In this case, it is called the “null” model (statisticians are rarely clever in naming their creations).

There are a number of premises in the p-value definition, some more controversial than others. Begin with the truth of the model.

It is rare to have deduced the model that represents the uncertainty in some observable. To deduce a model requires outside evidence or premises. Usually, this evidence is such that we can only infer the truth of a model. And in even more cases, we know, based on stated evidence, that a model is false.

Now, if this is so, that is, if the model is known to be false, then the p-value cannot be computed. Oh, the mechanics of the calculation can still be performed. But the truth of the output in conjunction with the truth of the model is false. If the model, again based on stated evidence, is only probably true, then the p-value can be calculated, but its truth and the truth of the model must be accounted for, and almost never are.

Equating the parameters is relatively uncontroversial, but small p-values are taken to mean the parameters are not equal; yet the p-value offers no help about how different they are. In any case, this is one of the exceedingly well-known standard objections to p-values, which I won’t rehearse here. Next is a better argument unknown to most.

Fisher often said something like this (you can find a modified version of this statement in any introductory book):

Belief in the null model as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null model is false, or the p-value has attained by chance an exceptionally low value.

Many have found this argument convincing. They should not have. First, this “logical disjunction” is evidently not one, because the first part of the sentence makes a statement about the unobservable null model, and the second part makes a statement about the observable p-value. But it is clear that there are implied missing pieces, and his quote can be fixed easily like this:

Either the null model is false and we see a small p-value, or the null model is true and we see a small p-value.

And this is just

Either the null model is true or it is false and we see a small p-value.

But since “Either the null model is true or it is false” is a tautology (it is always true), and since any tautology can be conjoined with any statement and not change its truth value, what we have left is

We see a small p-value.

Which is of no help at all. The p-value casts no direct light on the truth or falsity of the null model. This result should not be surprising, because remember that Fisher argued that the p-value could not deduce whether the null was true; but if it cannot deduce whether the null is true, it cannot, logically, deduce whether it is false; that is, the p-value cannot falsify the null model (which was his main hope in creating them).
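The reduction above is ordinary propositional logic, so it can be checked mechanically. Here is a small truth-table sketch in Python; the variable names are invented, and `fixed_quote` encodes the repaired version of the disjunction as written above.

```python
from itertools import product

rows = []
for null_true, small_p in product([True, False], repeat=2):
    # "Either the null is false and we see a small p-value,
    #  or the null is true and we see a small p-value."
    fixed_quote = (not null_true and small_p) or (null_true and small_p)
    # "Either the null is true or it is false" is a tautology.
    tautology = null_true or not null_true
    rows.append((null_true, small_p, fixed_quote, tautology))
    print(null_true, small_p, fixed_quote, tautology)

# The repaired statement has the same truth value as "we see a small p-value"
# in every row, whatever the status of the null model.
assert all(fixed == small for _, small, fixed, _ in rows)
assert all(taut for _, _, _, taut in rows)
```

In every row the statement is true exactly when a small p-value is seen, so it carries no information about the null beyond that observation.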

Recall that making probabilistic statements about the truth value of parameters or models is forbidden in classical statistics. An argument pulled from classical theory illustrates this.

If the null model is true, then the p-value will be between 0 and 1.
We see a small p-value.
——————————————————————————————————————-
The null model is false.

Under the null, the p-value is uniformly distributed (the first premise); which is another way of saying, “If the null is true, we will see any p-value whatsoever.” That we see any value thus gives no evidence for the conclusion.

Importantly, the first premise is not that “If the null model is true, then we expect a ‘large’ p-value,” because we clearly do not.
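The uniformity of the p-value under the null can be checked by simulation. The sketch below makes several assumptions for the sake of a short example: both groups are drawn from the same normal model (so the null is true by construction), the standard deviation is treated as known, and a two-sided z-test stands in for whatever test is actually used. The function name `p_value` is invented here.

```python
import random
from statistics import NormalDist, mean

def p_value(a, b, sigma=1.0):
    """Two-sided z-test p-value for equal means, with known common sigma."""
    z = (mean(a) - mean(b)) / (sigma * (1 / len(a) + 1 / len(b)) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(0)
pvals = []
for _ in range(5000):
    # Both groups come from the same model: the null is true by construction.
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    pvals.append(p_value(a, b))

# Share of p-values landing in each tenth of the interval from 0 to 1.
bins = [0] * 10
for p in pvals:
    bins[min(int(p * 10), 9)] += 1
shares = [count / len(pvals) for count in bins]
print(shares)
```

Each tenth of the interval collects roughly a tenth of the p-values, the smallest values included. A "small" p-value is exactly as available under a true null as any other value.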

Since p-values—by design!—give no evidence about the truth or the falsity of the null model, it’s a wonder that their use ever caught on. But there is a good reason why they did. That’s for next time.

June 18, 2010

Lesson Five or Six: Abnormality

Say! What happened to lessons three through four or five? Who knows. This morning, I’m dreadfully rushed, so just a sketch. I do not expect anybody to be convinced this fine day.

Where were we?

Suppose I’m interested in the ages (in whole years) of my blog readers. Now, with about three or four exceptions, I don’t know these ages, do I? Which means I’m uncertain, and thus I’ll use some kind of probability model to quantify my uncertainty in these numbers.

In some cases, I can supply premises (evidence, information) that allow me to deduce the probability model that represents my uncertainty in some observable. This applies to most casino games of chance.

But most times I cannot find such evidence. That is, there do not exist plausible premises that allow me to say that a certain probability model is the probability model that should be used. What to do? Why, just assume for the sake of argument that I do know which probability model should be used! Problem solved.

Most times, for anything that resembles a number (like ages), a normal distribution is used. This is usually done through laziness, custom, or because other choices are unknown. Before I can describe just what assuming a probability model does, we should understand what a normal distribution is.

It is the bell-shaped curve you’ve heard of, and it gives the probability of every number. And every number is just that: every number. How many are every? Well, from all the way out to negative infinity, progressing through zero, and shooting off towards positive infinity. And in between these infinities are infinitely many other numbers. Why, even in the interval between 0 and 1 there are an infinite number of numbers.

Because of this quirk of mathematics, when using the normal to quantify probability, the probability of any number is zero in all problems (not just ages). That is, given we accept a normal distribution, the probability of seeing an age of (say) 40 is precisely zero. The probability of seeing 41 is zero, as is the probability of seeing 42, 43, and so on.

As said, this isn’t just for ages: the probability of any number anywhere in any situation is zero when using normals. But even though the probability of anything happening is zero, we can still (bizarrely) calculate the probability of intervals of numbers. For example, we can say that, given a normal, the chance of seeing ages between 40 and 45 is some percent; even though each of the numbers in that interval can’t happen.

Somewhat abnormal, no? It’s still worse because, as said, normals give probability to every interval, including the interval from negative infinity to zero. Which in our case translates to a definite probability of ages less than 0. It also means that we have positive probability to ages greater than, say, 130. An example later will make this all clearer.
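To put rough numbers on this, here is a small sketch using Python's standard library. The mean of 40 and standard deviation of 15 are invented for illustration; nothing above says what a fitted normal for reader ages would actually look like.

```python
from statistics import NormalDist

# Hypothetical normal model for reader ages (mean and sd are assumptions).
ages = NormalDist(mu=40, sigma=15)

# Intervals get probability.
p_40_to_45 = ages.cdf(45) - ages.cdf(40)

# So do impossible intervals: ages below 0 and ages above 130.
p_negative = ages.cdf(0)
p_over_130 = 1 - ages.cdf(130)

# The probability of an ever-narrower interval around 40 shrinks toward zero,
# which is the sense in which the single number 40 has probability zero.
widths = (1, 0.1, 0.001)
shrinking = [ages.cdf(40 + w / 2) - ages.cdf(40 - w / 2) for w in widths]

print(f"P(40 <= age <= 45) = {p_40_to_45:.4f}")
print(f"P(age < 0)         = {p_negative:.6f}")
print(f"P(age > 130)       = {p_over_130:.2e}")
for w, p in zip(widths, shrinking):
    print(f"P(age within {w} of a year around 40) = {p:.2e}")
```

With these assumed numbers the chance of a negative age is small but definitely not zero, while the chance of any exact age is zero. The model has things the wrong way round.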

The main point: the normal stinks as a vehicle to describe uncertainty. So why is it used? Because mathematicians love mathematics, and because of a misunderstanding of what statisticians call the central limit theorem. That theorem says that, for independent draws from a distribution with finite variance, their suitably standardized averages converge in distribution to a normal as the sample size grows to infinity.

This theorem is correct; it’s mathematics precise and true. But not all mathematical constructions have any real-life applicability. Anyway, the central limit theorem is a theorem about averages, not actual observations.

Plus we have the problem that we’re not interested in averages of the ages, but of the ages themselves. Another problem: I don’t (sad to say) have infinite numbers of readers.

Yet it is inescapable that normal distributions are used all the time everywhere and that it is said that they can sometimes give reasonable approximations. Both statements are true. They are ubiquitous (I almost wrote iniquitous). And they can give reasonable approximations. It’s just that they often do not.

We have to understand what is meant by “approximation”. This is tricky; almost as tricky as viewing probability as logic for the first time.

Now, based on my knowledge that ages are in whole years, and that nobody can be less than 0, and that nobody can be of Methuselahian age, the probability that any pronouncement I make using a normal distribution about ages is true is exactly 0; which is to say, it is false. This means I know with certainty that I will be talking gibberish when I use a normal.

Unless I add a premise which goes something like, “All pronouncements will be roughly correct; but none will be exactly correct.” And what does that imply? Well, we shall see.

(Fisher, incidentally, knew of the problems of normals and warned users to be cautious. But like his warning about over-reliance on p-values, the warning was quickly forgotten.)