Skip to content

Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

February 27, 2008 | 4 Comments

Today’s excuses for not posting


The Real Academia de Ciencias Exactas, F?sicas y Naturales of Spain and the Fundaci?n Ram?n Areces are sponsoring that conference in Madrid on 2-3 April. The symposium will begin with the IPCC assessment and then move to “Critical assessment taking into consideration past (Pleistocene to historical) climate change and the nature of chemical forcing, as well as the characteristics of physical and numerical models used, their potentials, limitations and uncertainties.”

I’ve been asked to speak on the “Robustness and uncertainties of climate change predictions.” I have a deadline of this Saturday to hand in my abstract, and a couple of weeks to hand in a paper and presentation. I’m trying to walk a line between showing too much statistics and too little. By that I mean math. I keep going back and forth on this, trying to decide the best way to present. I haven’t decided, but, hey, I have three days left, right? Whatever I come up with will eventually be posted here. In a day or two–after my abstract is finished–I’ll post the conference program.

Reporters at the New York Times

A copy of Monday’s Times was given to me. I don’t subscribe, by the way, since I have learned from that august publication that because I am a veteran of the U.S. Armed Forces, I might explode at any moment and start murdering anybody in sight. I’m not disagreeing with this, of course; I just don’t want to be reminded of it.

Anyway, a lead story in the Business section, written by somebody called Noam Cohen, started thusly, “Of the many landmarks along a journalist’s career, two are among those that stand out: winning an award and making the government back down” (emphasis mine).

And people wonder where journalist’s cynicism comes from?

Stuff White People Like

This site, written by someone called Clander, has been making the rounds and is hilarious. Some examples. “#75 Threatening to Move to Canada” (a friend of mine did this after Bush “stole” his first election), “#65 Co-Ed Sports” (one of my favorites), and “#62 Knowing what?s best for poor people“.

That post tells us

White people spend a lot of time of worrying about poor people….They feel guilty and sad that poor people shop at Wal*Mart instead of Whole Foods, that they vote Republican instead of Democratic, that they go to Community College/get a job instead of studying art at a University…It is a poorly guarded secret that, deep down, white people believe if given money and education that all poor people would be EXACTLY like them. In fact, the only reason that poor people make the choices they do is because they have not been given the means to make the right choices and care about the right things…But it is ESSENTIAL that you reassert that poor people do not make decisions based on free will. That news could crush white people and their hope for the future.

The accompanying picture is priceless. What’s even better are the reader’s comments, particularly those, presumably white, people who take exception to Clander’s observations.

February 25, 2008 | 16 Comments

Example of how easy it is to mislead yourself: stepwise regression

I am, of course, a statistician. So perhaps it will seem unusual to you when I say I wish there were fewer statistics done. And by that I mean that I’d like to see less statistical modeling done. I am happy to have more data collected, but am far less sanguine about the proliferation of studies based on statistical methods.

There are lots of reasons for this, which I will detail from time to time, but one of the main ones is how easy it is to mislead yourself, particularly if you use statistical procedures in a cookbook fashion. It takes more than a recipe to make an eatable cake.

Among the worst offenders are methods like data mining, sometimes called knowledge discovery, neural networks, and other methods that “automatically” find “significant” relationships between sets of data. In theory, there is nothing wrong with any of these methods. They are not, by themselves, evil. But they become pernicious when used without a true understanding of the data and the possible causal relationships that exist.

However, these methods are in continuous use and are highly touted. An oft-quoted success of data mining was the time a grocery store noticed that unaccompanied men who bought diapers also bought beer. A relationship between data which, we are told, would have gone unnoticed were it not for “powerful computer models.”

I don’t want to appear too negative: these methods can work and they are often used wisely. They can uncover previously unsuspected relationships that can be confirmed or disconfirmed upon collecting new data. Things only go sour when this second step, verifying the relationships with independent data, is ignored. Unfortunately, the temptation to forgo the all-important second step is usually overwhelming. Pressures such as cost of collecting new data, the desire to publish quickly, an inflated sense of certainty, and so on, all contribute to this prematurity.


Stepwise regression is a procedure to find the “best” model to predict y given a set of x’s. The y might be the item most likely bought (like beer) given a set of possible explanatory variables x, like x1 sex, x2 total amount spent, x3 diapers purchased or not, and on and on. The y might instead be total amount spent at a mall, or the probability of defaulting on a loan, or any other response you want to predict. The possibilities for the explanatory variables, the x’s, are limited only to your imagination and ability to collect data.

A regression takes the y and tried to find a multi-dimensional straight line fit between itself and the x’s (e.g., a two-dimensional straight line is a plane). Not all of the x’s will be “statistically significant1“; those that are not are eliminated from the final equation. We only want to keep those x’s that are helpful in explaining y. In order to do that, we need to have some measure of model “goodness”. The best measure of model goodness is one which measures how well that model does predicting independent data, which is data that in no way was used to fit the model. But obviously, we do not always have such data at hand, so we need another measure. One that is often picked is the Akaike Information Criterion (AIC), which measures how well the model fits the data that was used to fit the model.

Confusing? You don’t actually need to know anything about the AIC other than that lower numbers are better. Besides, the computer does the work for you, so you never have to actually learn about the AIC. What happens is that many combinations of x’s are tried, one by one, an AIC is computed for that combination, and the combination that has the lowest AIC becomes the “best” model. For example, combination 1 might contain (x2, x17, x22), while combination 2 might contain (x1, x3). When the number of x’s is large, the number of possible combinations is huge, so some sort of automatic process is needed to find the best model.

A summary: all your data is fed into a computer, and you want to model a response based on a large number of possible explanatory variables. The computer sorts through all the possible combinations of these explanatory variables, rates them by a model goodness criterion, and picks the one that is best. What could go wrong?

To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations for y’s and 50 x’s (each of 100 observations of course). All of the observations were just made up numbers, each giving no information about the other. There are no relationships between the x’s and the y2. The computer, then, should tell me that the best model is no model at all.

But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usually criterion for being kept in a model (explained below), so my final model is this one:

explan. p-value Pr(beta x| data)>0
x7 0.0053 0.991
x21 0.046 0.976
x27 0.00045 0.996
x43 0.0063 0.996

In classical statistics, an explanatory variable is kept in the model if it has a p-value< 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don't worry if you don't understand what any of that means---just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R2 of 0.26, which is considered excellent in many fields (like marketing or sociology; R2 is a number between 0 and 1, higher numbers are better).

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7=”married or not”, x21=”number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and travel model spent.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.

1I will explain this unfortunate term later.
2I first did a “univariate analysis” and only fed into the stepwise routine those x’s which singly had p-values < 0.1. This is done to ease the computational burden of checking all models by first eliminating those x's which are unlikely to be "important." This is also a distressingly common procedure.

Continue reading “Example of how easy it is to mislead yourself: stepwise regression”

February 14, 2008 | 14 Comments

Do not calculate correlations after smoothing data

This subject comes up so often and in so many places, and so many people ask me about it, that I thought a short explanation would be appropriate. You may also search for “running mean” (on this site) for more examples.

Specifically, several readers asked me to comment on this post at Climate Audit, in which appears an analysis whereby, loosely, two time series were smoothed and the correlation between them was computed. It was found that this correlation was large and, it was thought, significant.

I want to give you, what I hope is, a simple explanation of why you should not apply smoothing before taking correlation. What I don’t want to discuss is that if you do smooth first, you face the burden of carrying through the uncertainty of that smoothing to the estimated correlations, which will be far less certain than when computed for unsmoothed data. I mean, any classical statistical test you do on the smoothed correlations will give you p-values that are too small, confidence intervals too narrow, etc. In short, you can be easily misled.

Here is an easy way to think of it: Suppose you take 100 made-up numbers; the knowledge of any of them is irrelevant towards knowing the value of any of the others. The only thing we do know about these numbers is that we can describe our uncertainty in their values by using the standard normal distribution (the classical way to say this is “generate 100 random normals”). Call these numbers C. Take another set of “random normals” and call them T.

I hope everybody can see that the correlation between T and C will be close to 0. The theoretical value is 0, because, of course, the numbers are just made up. (I won’t talk about what correlation is or how to compute it here: but higher correlations mean that T and C are more related.)

The following explanation holds for any smoother and not just running means. Now let’s apply an “eight-year running mean” smoothing filter to both T and C. This means, roughly, take the 15th number in the T series and replace it by an average of the 8th and 9th and 10th and … and 15th. The idea is, that observation number 15 is “noisy” by itself, but we can “see it better” if we average out some of the noise. We obviously smooth each of the numbers and not just the 15th.

Don’t forget that we made these numbers up: if we take the mean of all the numbers in T and C we should get numbers close to 0 for both series; again, theoretically, the means are 0. Since each of the numbers, in either series, is independent of its neighbors, the smoothing will tend to bring the numbers closer to their actual mean. And the more “years” we take in our running mean, the closer each of the numbers will be to the overall mean of T and C.

Now let T' = 0,0,0,...,0 and C' = 0,0,0,...,0. What can we say about each of these series? They are identical, of course, and so are perfectly correlated. So any process which tends to take the original series T and C and make them look like T' and C' will tend to increase the correlation between them.

In other words, smoothing induces spurious correlations.

Technical notes: in classical statistics any attempt to calculate the ordinary correlation between T' and C' fails because that philosophy cannot compute an estimate of the standard deviation of each series. Again, any smoothing method will work this magic, not just running means. In order to “carry through” the uncertainty, you need a carefully described model of the smoother and the original series, fixing distributions for all parameters, etc. etc. The whole also works if T and C are time series; i.e. the individual values of each series are not independent. I’m sure I’ve forgotten something, but I’m sure that many polite readers will supply a list of my faults.

February 9, 2008 | 14 Comments

How to look at the RSS satellite-derived temperature data

It’s already well known that the Remote Sensing Systems satellite-derived temperature data has released the January figures: the finding is that it’s colder this January than it has been for some time. I wanted to look more carefully at this data, mostly to show how to avoid some common pitfalls when analyzing time series data, but also to show you that temperatures are not linearly increasing. (Readers Steve Hempell and Joe Daleo helped me get the data.)

First, the global average. The RSS satellite actually divides up the earth in swaths, or transects, which are bands across the earth whose widths vary as a function of the instrument that remotely senses the temperature. The temperature measured at any transect is, of course, subject to many kinds of errors, which must be corrected for. Although this is not the main point of this article, it is important to keep in mind that the number you see released by RSS is only an estimate of the true temperature. It’s a good one, but it does have error (usually depending on the location of the transect), which most of us never see or few actually use. That error, however, is extremely important to take into account when making statements like “The RSS data shows there’s a 90% chance it’s getting warmer.” Well, it might be 90% before taking into account the temperature error: afterwards, the probability might go down to, say, 75% (this is just an illustration; but no matter what, the original probability estimate will always go down).

Most people show the global average data, which is interesting, but taking an average is, of course, assuming a certain kind of statistical model is valid: one that says averaging transects gives an unbiased, low-variance estimate of the global temperature. Does that model hold? Maybe; I actually don’t know, but I have my suspicions it does not, which I’ll outline below.

So let’s look at the transects themselves. The ones I used are not perfect, but they are reasonable. They are: “Antarctica” (-60 to -70 degrees of latitude; obviously not the whole south pole), “Southern Hemisphere Extratropics” (-70 to -20 degrees; there is a 10 degree overlap with “Antarctica”), “Tropics” (-20 to 20 degrees), “Northern Hemisphere Extratropics” (20 to 82.5 degrees, a slightly wider transect than in the SH), and “Arctic” (60 to 82.5 degrees; there is a 22.5 degree overlap with NH Extratropics). Ideally, there would be no overlap between transects, global coverage would have been complete, and I would have preferred more instead of fewer transects, which would have allowed us to see greater detail. But we’ll work with what we have.

Here is the thumbnail of the transects. Click it (preferably open it in a new window so you can follow the discussion) to open the full-sized version.
RSS transects
All the transects are in one place, making it easy to do some comparisons. The scale for each is identical, each has only been shifted up or down so that they all fit on one picture. This is not a very sexy or colorful graph, but it is useful. First, each transect is shown with respect to its mean (the small, dashed line). Vertical lines have been placed at the maximum temperature for each. The peak for NH-Extratropics and Tropics was back in 1998 (a strong El Nino year). For the Arctic, the peak was in 1995. For the Antarctic, it was 1990. Finally, for the SH-Extratropics it was 1981.

You also often see, what I have drawn on the plot, a simple regression line (dash-dotted line), whose intent is usually to show a trend. Here, it appears that there were upward trends for the Tropics to the north pole, no sort of trend for the SH-Extratropics, and a downward trend for the Antarctic (recall there is overlap between the last two transects). Supposing these trends are real, they have to be explained. The obvious story is to say man-made increases due to CO2, etc. But it might also be that the northern hemisphere is measured differently (more coverage), or because there is obviously more land mass in the NH, and—don’t pooh-pooh this—the change of the tilt of the earth: the north pole tipped closer to the sun and so was warmer, the south pole tipped farther and so was cooler. Well, it’s true that the earth’s tilt has changed, and will always do so no matter what political party holds office, but effects due to its change are thought to be trivial at this time scale. Of course, there are other possibilities such as natural variation (which is sort of a cop out; what does “natural” mean anyway?).

To the eye, for example, the trend-regression for the Arctic looks good: there is an increase. Some people showing this data night calculate a classical test of significance (don’t get me started on these), but this is where most analysis usually stops. It shouldn’t. We need to ask, what we always need to ask when we fit a statistical model: how well does it fit? The first thing we can do is to collect the residuals, which are the distances between the model’s predictions and the actual data. What we’d like to see is that there is no “signal” or structure in these residuals, meaning that the model did its job of finding all the signal that there was. The only thing left after the model should be noise. A cursory glance at the classical model diagnostics would even show you, in this case, that the model is doing OK. But let’s do more. Below is a thumb-nail picture of two diagnostics that should always be examined for time series models (click for larger).
RSS transects

The bottom plot is a time-series plot of the residuals (the regression line minus the observed temperatures). Something called a non-parametric (loess) smoothing line is over-plotted. It is showing that there is some kind of cyclicity, or semi-periodic signal, left in the residuals. This is backed up by examining the top plot: which is the auto-correlation function. Each time-series residual is correlated with the one before it (lag 1), with the one two before it (lag 2), and so on. The lag-one correlation is almost 40%, again meaning that the residuals are certainly correlated, and that some signal is left in the residuals that the model didn’t capture. (The “lag 0” is always 1; the horizontal-dashed lines indicated classical 95% significance; the correlations have to reach above these lines to be significant.)

The gist is that the ordinary regression line is inadequate and we have to search for something better. We might try the non-parametric smoothing line for each series, which would be OK, but it is still difficult to ask whether trends exist in the data. Some kind of smoothing would be good, however, to avoid the visual distraction of the noise. We could, as many do, use a running mean, but I hate them and here is why.
Running mean
Show in black is a pseudo-temperature series with noise: the actual temperature is dashed blue. Suppose you wanted to get rid of the noise using a “9-year ” running mean: the result is the orange line, which you can see does poorly, and shifts the actual peaks and troughs to the right. Well, that is only the start of the troubles, but I don’t go over any more here except to say that this technique is often misused, especially in hurricanes (two weeks ago a paper in Nature did just this sort of thing).

So what do we use? Another thing to try is something called Fourier, or spectral analysis, which is perfect for periodic data. This would be just the thing if the periodicities in the data were regular. They do not appear to be. We can take one step higher and use something called wavelet analysis, which is like spectral analysis (which I realize I did not explain), but instead of analyzing the time series globally like Fourier analysis, it does so locally. Which means it tends to under-smooth the data, and even allows some of the noise to “sneak through” in spots. This will be clearer when we look at this picture (again, just a thumb-nail: click for larger).
RSS wavelets

You can see what I mean by some of the original noise “sneaking through”: these are the spikes left over after the smoothing; however, you can also see that the spikes line up with the data, so we are not introducing any noise. The somewhat jaggy nature of the “smoothed” series has to do with the technicalities of using wavelets (I’ll have to explain this a latter time: but for technical completeness, I used a Daubechies orthonormal compactly supported wavelet, with soft probability thresholding by level). Anyway, some things that were hidden before are now clearer.

It looks like there was an increasing trend for most of the series starting in 1996 to 1998, but ending in late 2004, after which the data begin trending down: for the tropics to north pole, anyway. The signal in the southern hemisphere is weaker, or even non-existent at Antarctica.

This analysis is much stronger than the regression shown earlier; nevertheless, it is still not wonderful. The residuals don’t look much better, and are even worse in some places (e.g. early on in the Tropics), than the regression. But wavelet analysis is tricky: there are lots of choices of the so-called wavelet basis (the “Daubechies” thing above) and choices for thresholding. (I used the, more or less, defaults in the R wavethresh package.)

But the smoothing is only a first start. We need to model this data all at once, and not transect by transect, taking into account the different relationships between each transect (I’m still working on a multivariate Bayesian hierarchical time-series model: it ain’t easy!). Not surprisingly, these relationships are not constant (shown below for fun). The main point is that modeling data of this type is difficult, and it is far too tempting to make claims that do not hold up upon closer analysis. One thing is certain: the hypothesis that the temperature is linearly increasing everywhere across the globe is just not true.

APPENDIX: Just for fun, here is a scatter-plot matrix of the data (click for larger): You can see that there is little correlation between the two poles, and more, but less than you would have thought, between bordering transects.
RSS wavelets

How to read this plot: it’s a scatter plot of each variable (transect), with each other. Pick a variable name. In that row, that variable is the y-axis. Pick another variable. In that column, that variable is the x-axis. This is a convenient way to look at all the data at the same time.