February 18, 2008 | 37 Comments

## Statistics’ dirtiest secret

The old saying that “You can prove anything using statistics” isn’t true. It is a lie, and a damned lie, at that. It is an ugly, vicious, scurrilous distortion, undoubtedly promulgated by the legion of college graduates who had to suffer, sitting mystified, through poorly taught Statistics 101 classes, and never understood or trusted what they were told.

But, you might be happy to hear, the statement is almost true and is false only because of a technicality having to do with the logical word prove. I will explain this later.¹

Now, most statistics texts, even advanced ones, if they talk about this subject at all, tend to cover it in vague or embarrassed passages, preferring to quickly return to more familiar ground. So if you haven’t heard about most of what I’m going to tell you, it isn’t your fault.

Before we can get too far, we need some notation to help us out. We call the data we want to predict `y`, and if we have some ancillary data that can help us predict `y`, we call it `x`. These are just letters that we use as place-holders so we don’t have to write out the full names of the variables each time. Do not let yourself be confused by the use of letters as place-holders!

An example. Suppose we wanted to predict a person’s income. Then “a person’s income” becomes `y`. Every time you see `y` you should think “a person’s income”: clearly, `y` is easier to write. To help us predict income, we might have the sex of the person, their highest level of education, their field of study, and so on. All these predictor variables we call `x`: when you see `x`, think “sex”, “education”, etc.

The business of statistics is to find a relationship between the `y` and the `x`: this relationship is called a model, which is just a function (a mathematical grouping) of the data `y` and `x`. We write this as `y = f(x)`, and it means, “The thing we want to know (`y`) is best represented as a combination, a function, of the data (`x`).” So, with more shorthand, we write a mathematical combination, a function of `x`, as `f(x)`. Every time you see a statistic quoted, there is an explicit or implicit “`f(x)`”, a model, lurking somewhere in the background. Whenever you hear the term “Our results are statistically significant”, there is again some model that has been computed. Even just taking the mean implies a model of the data.

The problem is that usually the function `f(x)` is not known and must be estimated, guessed at in some manner, or logically deduced. But that is a very difficult thing to do, so nearly all of the time the mathematical skeleton, the framework, of `f(x)` is written down as if it were known. The `f(x)` is often chosen by custom or habit or because alternatives are unknown. Different people, with the same `x` and `y`, may choose different `f(x)`. At most one of them can be right; they cannot all be.

It is important to understand that all results (like saying “statistically significant”, computing p-values, confidence or credible intervals) are conditional on the chosen model being true. Since it is rarely certain that the model used is true, the eventual results are stated with a certainty that is too strong. As an example, suppose your statistical model allowed you to say that a certain proposition was true “at the 90% level.” But if you are only, say, 50% sure that the model you used is the correct one, then your proposition is only true “at the 45% level” (0.90 × 0.50 = 0.45), not at the 90% level, which is, of course, an entirely different conclusion. And if you have no idea how certain your model is, then it follows that you have no idea how certain your proposition is. To emphasize: the uncertainty in choosing the model is almost never taken into consideration.
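The arithmetic behind that 45% can be sketched in a few lines. This is only an illustration, and it assumes (as the example implicitly does) that the proposition gets no support at all when the model is false; the function name and the `p_prop_if_model_false` argument are my own labels, not standard notation.

```python
def proposition_probability(p_prop_given_model, p_model, p_prop_if_model_false=0.0):
    """Law of total probability over 'model is true' and 'model is false'."""
    return (p_prop_given_model * p_model
            + p_prop_if_model_false * (1.0 - p_model))

# The example from the text: 90% given the model, but only 50% sure of the model.
print(round(proposition_probability(0.90, 0.50), 2))  # 0.45
```

If you think the proposition could still be true under some other model, pass a nonzero `p_prop_if_model_false`; either way, the 90% figure holds only when you are certain of the model.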

However, even if the framework, the `f(x)`, is known (or assumed known), certain numerical constants, called parameters, are still needed to flesh out the model skeleton (if you’re fitting a normal distribution, these are the μ and σ^2 you might have heard of). These must be guessed, too. Generally, however, everybody knows that the model’s parameters must be estimated. What you might not know is that the uncertainty in guessing the parameter values also has to carry through to statements of certainty about data propositions. Unfortunately, this is also rarely done: most statistical procedures focus on making statements about the parameters and virtually ignore actual, observable data. This again means that people come away from these procedures with an inflated sense of certainty.

If you don’t understand all this, especially the last part about parameters, don’t worry: just try to keep in mind that two things happen: a function `f(x)` is guessed at, and the parameters, the numerical constants, that make this equation complete must also be guessed at. The uncertainty of performing both of these operations must be carried through to any conclusions you make, though, again, this is almost never done.
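Here is a small simulation of the second kind of over-certainty, the one that comes from the parameters. It is a sketch under stated assumptions: the model form (a normal distribution) really is correct, only five observations are available, and the estimated mean and standard deviation are plugged in as if they were the true values. The sample size, the number of trials, and the function name are choices I made for illustration.

```python
import random
import statistics

def plug_in_coverage(n=5, trials=20000, z=1.645, seed=1):
    """Fraction of NEW observations that land inside the plug-in '90%' interval."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = statistics.fmean(sample)   # estimated mean, treated as known
        s = statistics.stdev(sample)   # estimated sd, treated as known
        new_y = rng.gauss(0.0, 1.0)    # a fresh, observable data point
        if m - z * s <= new_y <= m + z * s:
            hits += 1
    return hits / trials

coverage = plug_in_coverage()
print(f"nominal 0.90, actual {coverage:.3f}")
```

The interval that claims to hold 90% of new data holds noticeably less than that, even though the form of the model is exactly right. The shortfall is the ignored parameter uncertainty.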

These facts have enormous and rarely considered consequences. For one, it means that nearly all statistical results that you see published are overly boastful. This is especially true in certain academic fields where the models are almost always picked as the result of habit, even enforced habit, as editors of peer-reviewed journals are suspicious of anything new. This is why, to take medical journals as an example, one day you will see a headline that touts “Eating Broccoli Reduces Risk of Breast Cancer,” only to later read, “The Broccolis; They Do Nothing!” It’s just too easy to find results that are “statistically significant” if you ignore the model and parameter uncertainties.

These facts, shocking as they might be, are not quite the revelation we’re after. You might suppose that there is some data-driven procedure out there, known only to statisticians, that would let you find both the right model and the right way to characterize its parameters. It can’t be that hard to search for the overall best model!

It’s not only hard, but impossible, a fact which leads us to the dirty secret: For any set of `y` and `x`, there is no unconditionally unique model, nor is there any unconditionally unique way to represent uncertainty in the model’s parameters.

Let’s illustrate this with respect to a time series. Our data is still `y`, but there is no specific `x`, or explanatory data, except for the index, or time points (`x` = time 1, time 2, etc.), which of course are important in time series. All we have is the data and the time points (understand that these don’t have to be clock-on-the-wall “time” points, just numbers in a sequence).

Suppose we observe this sequence of numbers (a time series)

`y = 2, 4, 6, 8`; with index `x = 1, 2, 3, 4`

Our task is to estimate a model `y = f(x)`. One possibility is Model A

`f(x) = 2x`

which fits the data perfectly, because `x = 1, 2, 3, 4` and `2x = 2, 4, 6, 8` which is exactly what `y` equals. The “2” is the parameter of the model, which here we’ll assume we know with certainty.

But Model B is

`f(x) = 2x |sin[(2x+1)π/2]|`

which also fits the data perfectly (don’t worry if you can’t see this—trust me, it’s an exact fit; the “2”s, the “1” and the “π” are all known-for-certain parameters).
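If you would rather check than trust me, the fit can be verified directly. For whole-number `x`, the quantity `(2x+1)π/2` is an odd multiple of π/2, where the sine is either +1 or -1, so its absolute value is 1 and Model B collapses to `2x`. The function names below are mine.

```python
import math

def model_a(x):
    return 2 * x

def model_b(x):
    return 2 * x * abs(math.sin((2 * x + 1) * math.pi / 2))

x_obs = [1, 2, 3, 4]
y_obs = [2, 4, 6, 8]

# Columns: index, observed y, Model A, Model B (rounded to hide floating-point dust)
for x, y in zip(x_obs, y_obs):
    print(x, y, model_a(x), round(model_b(x), 10))
```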

Which of these two models should we use? Obviously, the better one; we just have to define what we mean by better. Which model is better? Well, using any (and I mean any) of the statistical model goodness-of-fit measures that have ever been, or ever will be, invented, both are identically good. Both models explain all the data we have seen without error, after all.

There is a Model C, Model D, Model E, and so on and on forever, all of which will fit the observed data perfectly and so, in this sense, will be indistinguishable from one another.
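One way to see that the supply of perfect-fitting models never runs out is to build a whole family of them. The construction below is my own illustration, not the specific Models C, D, and E: it adds to `2x` a term that vanishes at every observed index, so each value of the made-up constant `k` gives another exact fit.

```python
def make_model(k):
    """Return f(x) = 2x + k(x-1)(x-2)(x-3)(x-4), exact at x = 1, 2, 3, 4 for any k."""
    def f(x):
        return 2 * x + k * (x - 1) * (x - 2) * (x - 3) * (x - 4)
    return f

x_obs = [1, 2, 3, 4]
y_obs = [2, 4, 6, 8]

for k in (0, 1, -3, 0.5):
    f = make_model(k)
    fits = all(f(x) == y for x, y in zip(x_obs, y_obs))
    print(f"k={k}: fits observed data: {fits}, value at x=5: {f(5)}")
```

Every member of the family reproduces the four observations exactly, yet each says something different about `x = 5`, since the added term equals `24k` there.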

What to do? You could, and even should, wait for more data to come in, data you did not use in any way to fit your models, and see how well your models predict these new data. Most times, this will soon tell you which model is superior, or if you are only considering one model, it will tell you if it is reasonable. This eminently common-sense procedure, sadly, is almost never done outside the “hard” sciences (and not all the time inside these areas; witness climate models). Since there are an infinite number of models that will predict your data perfectly, it is no great trick to find one of them (or to find one that fits well according to some conventional standard). We again find that published results will be too sure of themselves.

Suppose in our example the new data is `y = 10, 12, 14`: both Models A and B still fit perfectly. By now, you might be getting a little suspicious, and say to yourself, “Since both of these models flawlessly guess the observed data, it doesn’t matter which one we pick! They are equally good.” If your goal was solely prediction of new data, then I would agree with you. However, the purpose of models is rarely just raw prediction. Usually, we want to explain the data we have, too.
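A quick check of the new data, taking the new points to sit at index `x = 5, 6, 7`, and a peek at what the two models say between the index points, where nothing has been observed. The halfway point `x = 1.5` is my own choice for illustration.

```python
import math

def model_a(x):
    return 2 * x

def model_b(x):
    return 2 * x * abs(math.sin((2 * x + 1) * math.pi / 2))

x_new = [5, 6, 7]
y_new = [10, 12, 14]

# Both models reproduce the new observations
for x, y in zip(x_new, y_new):
    print(x, y, model_a(x), round(model_b(x), 10))

# Halfway between two index points the models tell very different stories
print(model_a(1.5), round(model_b(1.5), 10))
```

At `x = 1.5` Model A gives 3 while Model B gives essentially 0, because the sine term passes through zero there. The two are identical as predictors of the observed sequence and completely different as explanations.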

Models A and B have dramatically different explanations of the data: A has a simple story (“time times 2!”) and B a complex one. Models C, D, E, and so on, all have different stories too. You cannot just pick A via some “Occam’s razor”² argument, meaning A is best because it is “simpler”: there is no guarantee that the simpler model is always the better model.

The mystery of the secret lies in the word “unconditional”, which was a necessary word in describing the secret. We can now see that there is no unconditionally unique model. But there might very well be a conditionally correct one. That is, the model that is unique, and therefore best, might be logically deducible given some set of premises that must be fulfilled. If those premises were “The model must be linear and contain only one positive parameter,” then Model B is out and can no longer be considered. Model A is then our only choice: we do not, given these premises, even need to examine Models C, D, and so on, because Model A is the only function that fills the bill; we have logically deduced the form of Model A given these premises.

It is these necessary external premises that help us with the explanatory portion of the model. They are usually such that they demand the current model be consonant with other known models, or that the current model meet certain physical, biological, or mathematical expectations. Regardless, the premises are entirely external to the data at hand, and may themselves be the result of other logical arguments. Knowing the premises, and assuming they are sound and true, gives us our model.

The most common premise, unspoken of course, is loosely “The data must be described by a straight line and a normal distribution”, which, when invoked, describes the vast majority of classical statistical procedures (regression, correlation, ANOVA, and on and on). Which brings us full circle: the model and the statements you make based on it are correct given that the “straight line” premise is true; it is just that the “straight line” premise might be, and usually is, false.³

Because there are no unconditional criteria which can judge which statistical model is best, you often hear people making the most outrageous statistical claims, usually based upon some model that happened to “fit the data well.” Only, these claims are not proved, because to be “proved” means to be deduced with certainty given premises that are true, and conclusions based on statistical models can only ever be probable (less than certain and more than false). Therefore, when you read somebody’s results, pay less attention to the model they used and more to the list of premises (or reasons) given as to why that model is the best one, so that you can estimate how likely it is that the model used is true.

Since that is a difficult task, at least demand that the model be able to predict new data well: data that was not used, in any way, in developing the model. Unfortunately, if you added that criterion to the list of things required before a paper could be published, you would cause a drastic reduction in scholarly output in many fields (and we can’t have that, can we?).

##### ³ Why these “false” models sometimes “work” will be the discussion of another article; but, basically, it has to do with people changing the definition of what the model is mid-stream.

February 17, 2008 | 5 Comments

## 800 gram balls: Key words in my log files

Every now and then I have a glance at my log files to see what kinds of keywords people type into sites like Google before being directed to my site. It won’t surprise you that I see things like `briggs` and `bad statistics examples`. But there is a class of keywords that I can only describe as odd, even, at times, worrying. Here are those keywords (all spellings are as they were found), split into rough categories. My comments, if any, appear in parentheses. Each of these keywords is real.

### Statistics

• `don't forget about us model` (I could never)
• `great statisticians` (flatterer)
• `how to exaggerate` (think big, think big)
• `i need to be statician` (it can be a powerful force, it’s true; learning to spell it correctly will help)
• `some pictures of statistician` (here’s somebody with a lot of time on their hands)
• `statisticians aviod doing things because other people are doing it` (I think he has us confused with accountants)
• `statistician god exists` (His name is Stochastikos)
• `virginity statistics` (score: 0 to 0)
• `lifelong virginity statistics` (score still tied)
• `what to look for in a statician` (get one of the tall ones; we have a sense of humor)
• `why do statisticians love tables?` (because we can’t help ourselves)
• `you cannot be a scientist if you are not a good mathematician` (I have the feeling that this person desired a negative answer)

### Zombies

• `factors that cause zombism` (blogging…)
• `recorded zombie outbreaks`
• `what year will zombies take over the earth?` (has to be soon)
• `wild zombies` (as opposed to domesticated?)
• `will zombie attacks happen`
• `zombies can happen` (he might have been trying to answer the other guy)
• `zombies in nature`
• `zombies true or false`

### Miscellaneous

• `800g balls` (mine are only 760g–in petanque, of course!)
• `anything` (I can see Google knows where to go…)
• `beer does not have enough alcohol` (which is why I tend to stick with rum)
• `home is where the heart is william briggs` (somebody’s trying to give me a lesson)
• `horizontal alcoholic` (is there any other kind?)
• `how does pseudoscience effect the mind` (badly)
• `lee majors george bush` (you can’t go wrong aligning yourself with the six-million dollar man)
• `man's got his limits briggs` (true enough; must be same advice giver as before)
• `purposely causing someone to get cancer` (oh my…no murder tips here)
• `sentence with the word, "impossibility"` (shouldn’t be hard to come by)
• `what can we do not to be poor` (get a job)

February 15, 2008 | 58 Comments

## Consensus in science

In 1914, there was a consensus among geologists that the earth under our feet was permanently fixed, and that it was absurd to think it could be otherwise. But in 1915, Alfred Wegener began an enormous battle to convince them of continental drift, the forerunner of plate tectonics.

In 1904, there was a consensus among physicists that Newtonian mechanics was, at last, the final word in explaining the workings of the world. All that was left to do was to mop up the details. But in 1905, Einstein and a few others soon convinced them that this view was false.

In 1544, there was a consensus among mathematicians that it was impossible to calculate the square root of negative one, and that to even consider the operation was absurd. But in 1545, Cardano proved that, if you wanted to solve polynomial equations, then complex numbers were a necessity.

In 1972, there was a consensus among psychiatrists that homosexuality was a psychological, treatable, sickness. But in 1973, the American Psychiatric Association held court and voted for a new consensus to say that it was not.

In 1979, there was a consensus among paleontologists that the dinosaurs’ demise was a long, drawn out affair, lasting millions of years. But in 1980, Alvarez, father and son, introduced evidence of a cataclysmic asteroid impact some 65 million years before.

In 1858, there was a consensus among biologists that the animal species that surround us were put there as God designed them. But in 1859, the book On the Origin of Species appeared.

In 1928, there was a consensus among astronomers that the heavens were static, the boundaries of the universe constant. But in 1929, Hubble observed his red shift among the galaxies.

In 1834, there was a consensus among physicians that human disease was spontaneously occurring, due to imbalanced humours. But in 1835, Bassi, and later Pasteur, introduced doctors to the germ theory.

All these are, obviously, but a small fraction of the historical examples of consensus in science, though I have tried to pick the events that were the most jarring and radical upsets. Here are two modern cases.

In 2008, there is a consensus among climatologists that mankind has caused, and will continue to cause, irrevocable and dangerous changes to the Earth’s temperature.

In 2008, there is a consensus among physicists that most of nature’s physical dimensions are hidden away and can only be discovered mathematically, by the mechanisms of string theory.

In addition to the historical list, there are, just as obviously, equally many examples of consensus that turned out to be true. And, to be sure, even when the consensus view was false, it was often rational to believe it.

So I use these specimens only to show two things: (1) from the existence of a consensus, it does not follow that the claims of the consensus are true; and (2) the chance that the consensus view turns out to be false is much larger than you would have thought.

These are not news, but they are facts that are often forgotten.

February 14, 2008 | 14 Comments

## Do not calculate correlations after smoothing data

This subject comes up so often and in so many places, and so many people ask me about it, that I thought a short explanation would be appropriate. You may also search for “running mean” (on this site) for more examples.

Specifically, several readers asked me to comment on this post at Climate Audit, in which appears an analysis whereby, loosely, two time series were smoothed and the correlation between them was computed. It was found that this correlation was large and, it was thought, significant.

I want to give you what I hope is a simple explanation of why you should not apply smoothing before taking correlation. What I don’t want to discuss in detail is that, if you do smooth first, you face the burden of carrying through the uncertainty of that smoothing to the estimated correlations, which will be far less certain than when computed for unsmoothed data. That is, any classical statistical test you do on the smoothed correlations will give you p-values that are too small, confidence intervals that are too narrow, and so on. In short, you can be easily misled.

Here is an easy way to think of it: Suppose you take 100 made-up numbers; the knowledge of any of them is irrelevant towards knowing the value of any of the others. The only thing we do know about these numbers is that we can describe our uncertainty in their values by using the standard normal distribution (the classical way to say this is “generate 100 random normals”). Call these numbers `C`. Take another set of “random normals” and call them `T`.

I hope everybody can see that the correlation between `T` and `C` will be close to 0. The theoretical value is 0, because, of course, the numbers are just made up. (I won’t talk about what correlation is or how to compute it here: but higher correlations mean that `T` and `C` are more related.)

The following explanation holds for any smoother and not just running means. Now let’s apply an “eight-year running mean” smoothing filter to both `T` and `C`. This means, roughly, take the 15th number in the `T` series and replace it by an average of the 8th and 9th and 10th and … and 15th. The idea is, that observation number 15 is “noisy” by itself, but we can “see it better” if we average out some of the noise. We obviously smooth each of the numbers and not just the 15th.

Don’t forget that we made these numbers up: if we take the mean of all the numbers in `T` and `C` we should get numbers close to 0 for both series; again, theoretically, the means are 0. Since each of the numbers, in either series, is independent of its neighbors, the smoothing will tend to bring the numbers closer to their actual mean. And the more “years” we take in our running mean, the closer each of the numbers will be to the overall mean of `T` and `C`.
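The pull toward the mean is easy to see numerically. The sketch below uses a trailing running mean as described above (each value replaced by the average of itself and the seven before it) and reports the spread of the smoothed series for a few window lengths. I use a much longer made-up series than 100 here, purely so the numbers are stable; the function name, the seed, and the window lengths are my own choices.

```python
import random
import statistics

def running_mean(series, window=8):
    """Trailing running mean: average of each value and the window-1 values before it."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

rng = random.Random(0)
T = [rng.gauss(0.0, 1.0) for _ in range(10000)]  # "random normals"

# Spread (standard deviation) of the series after smoothing with each window
spread = {w: statistics.pstdev(running_mean(T, w)) for w in (1, 8, 32)}
for w, sd in spread.items():
    print(w, round(sd, 3))
```

A window of 1 leaves the series alone, with a spread near 1. For independent values the spread shrinks roughly like one over the square root of the window length, so the longer the window, the closer every value sits to 0.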

Now let `T' = 0,0,0,...,0` and `C' = 0,0,0,...,0`. What can we say about each of these series? They are identical, of course, and so are perfectly correlated. So any process which tends to take the original series `T` and `C` and make them look like `T'` and `C'` will tend to increase the correlation between them.

In other words, smoothing induces spurious correlations.
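You can watch this happen in a simulation. The sketch below repeatedly makes up two unrelated series of 100 random normals, computes their correlation with and without an eight-point trailing running mean, and averages the size (absolute value) of the correlation over many repetitions. The number of repetitions, the seed, and the helper names are assumptions of mine for illustration.

```python
import random
import statistics

def running_mean(series, window=8):
    """Trailing running mean: average of each value and the window-1 values before it."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def correlation(a, b):
    """Ordinary (Pearson) correlation coefficient."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def mean_abs_correlation(window, n=100, trials=500, seed=0):
    """Average size of the correlation between two unrelated, smoothed series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        T = [rng.gauss(0.0, 1.0) for _ in range(n)]
        C = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += abs(correlation(running_mean(T, window), running_mean(C, window)))
    return total / trials

raw = mean_abs_correlation(window=1)     # window of 1 means no smoothing
smooth = mean_abs_correlation(window=8)
print(f"typical |correlation| unsmoothed: {raw:.3f}, smoothed: {smooth:.3f}")
```

The two series are unrelated by construction, so across many repetitions the correlations still average out near 0 in both cases. What smoothing changes is how far any single pair can stray from 0: the typical size of the correlation is considerably larger after smoothing, and a single smoothed pair can easily look "significant" when nothing is there.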

Technical notes: in classical statistics any attempt to calculate the ordinary correlation between `T'` and `C'` fails because the estimated standard deviation of each constant series is zero, and the correlation formula divides by it. Again, any smoothing method will work this magic, not just running means. In order to “carry through” the uncertainty, you need a carefully described model of the smoother and the original series, fixing distributions for all parameters, and so on. The whole argument also works if `T` and `C` are time series; i.e. the individual values of each series are not independent. I’m sure I’ve forgotten something, but I’m sure that many polite readers will supply a list of my faults.