December 30, 2007 | 1 Comment

Hurricanes have not increased: misuse of running means

Most statistics purporting to show that there has been an increase in hurricanes do not use the best statistical methods. I want to highlight one particular method that is often misused, and which can lead one to falsely conclude that trends (increasing or decreasing) are present when they actually are not. Read my original post to learn more about this.

That technique is the running mean. As you can see in the rather dramatic graphic from Science Daily, a 9-year running mean has been plotted over the actual hurricane numbers (up to 2005 only) in the North Atlantic. It looks like, in later years, a dramatic upswing is taking place, doesn’t it? This type of plot has shown up in many scientific, peer-reviewed papers.

Science Daily hurricane running mean

Don’t be turned off by the equations! Something very surprising is coming at the end of this article and you will be rewarded if you read to the end.

What is a running mean? A p-year running mean converts a series of actual, observed numbers into a statistical estimate, or model, of what a supposed underlying trend of those numbers might actually be. Because it is a model, its use must first be justified. In math, a 5-year running mean looks like

$$y^{*}_{t} = \frac{1}{5}\left(y_{t} + y_{t-1} + y_{t-2} + y_{t-3} + y_{t-4}\right)$$

where the symbol y indicates the hurricane numbers, and the subscripts t, t-1 and so on indicate the time period: time t is now, time t-1 was last year (now minus one year) and so on. The superscript on the symbol y to the left of the equal sign indicates that this is the modified data value plotted, and is not the actual number. Even if you’re afraid of math, this equation should be fairly easy to understand: the current, modified, number is just the mean of the last 4 observations and the most current one.
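If you want to play with this yourself, here is a minimal sketch in base R; the vector y below is made up purely for illustration.

y = c(8, 12, 9, 15, 11, 10, 14, 7, 13, 9)   # ten made-up annual counts

# 5-year running mean: each value is the average of the current year and
# the previous four; the first four entries are NA because a full 5-year
# window is not yet available (this is stats::filter in base R)
y.star = filter(y, rep(1/5, 5), sides=1)
y.star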

Additionally, in the mathematics of time series models, an auto-regressive series of order 5 is written like this

$$y_{t} = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \phi_{3}y_{t-3} + \phi_{4}y_{t-4} + \phi_{5}y_{t-5} + \varepsilon_{t}$$

which shows how the current data point is predicted by a weighted sum of past values (plus a noise term $\varepsilon_t$), where the weights are the coefficients $\phi_i$. Just let all the $\phi_i = 1/5$ and you have a structure very similar to the running mean above. The point is this: using a running mean implies an underlying statistical time series model. Which is OK, as long as the data support such a model.
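To see the connection concretely, here is a tiny sketch using the same made-up numbers as above: set every weight to 1/5 and the weighted sum collapses to an ordinary average, which is the running-mean structure.

y = c(8, 12, 9, 15, 11, 10, 14, 7, 13, 9)   # same made-up counts as above
phi = rep(1/5, 5)                           # all five weights equal to 1/5

# a weighted sum with equal weights of 1/5 is just a plain mean
sum(phi * y[6:10])   # 10.6
mean(y[6:10])        # 10.6, identical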

Do they for hurricanes? No.

In order to justify using auto-regressive time series models, you start by looking at something called an auto-correlation plot, which is a plot of how each year's number of hurricanes is correlated with the previous year's number, how this year's number is correlated with the number of hurricanes from two years ago, and so on: the number of previous years is called the lag. If any of these correlations at the various lags are significant, then you can use an auto-regressive time series model for this data. If none of them are significant, then you cannot.

Here is a picture of the auto-correlation of hurricane numbers (number of storms, s) for the North Atlantic, using data from 1966 to 2006.

Auto-correlation plot of North Atlantic hurricane counts, 1966 to 2006

None of the correlations reach above the horizontal dashed lines, which means that none are significant, and so a simple running mean should not be used to represent North Atlantic hurricane numbers.
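Checking this yourself takes one line of R. The sketch below uses simulated stand-in data, since the point is the acf() call and its dashed significance bounds, not the particular numbers; substitute the real hurricane counts to reproduce the plot above.

# stand-in data: 41 simulated annual counts for 1966-2006; replace this
# with the actual North Atlantic hurricane numbers
storms = rpois(41, 6)

# auto-correlation plot; the dashed horizontal lines that acf() draws are
# the approximate 95% significance bounds referred to above
acf(storms, main="Auto-correlation of annual hurricane counts")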

So far, so good, right? Now let's look at some made-up, fictional data. Take a look at the following pictures, which all show simulated hurricane numbers; one of them looks pretty close to what the real data look like. The running mean even shows a healthy upward trend, no doubt due to global warming. But what do these pictures really show?

Simulated data with 9-year running mean

To get this data (the R code to make it yourself is pasted below), I simulated hurricane numbers (Poisson with mean 10) for pretend years 1966 to 2005, four separate times. Each year's number is absolutely independent of every other year's: to emphasize, these are totally random numbers with no relationship through time. I also over-plotted a 9-year running mean (red line). Because all of these numbers are independent of one another, what we should see is a flat line (with a mean of 10). The reason we do not is natural variation.

I only had to run this simulation once: look at the lower right-hand panel, where I got something that looks like the actual North Atlantic hurricane numbers. The 9-year running mean is over-emphasizing, to the eye, a trend that is not there! Actually, this happens in two of the other simulated series as well. Only one shows what we would expect: a (sort of) straight line.

Like I said, I am including the code I used to make these plots so that, if you are curious, you will see how exceptionally easy this is to do.

Good statistical models are hard to do. See some of these posts for more discussion and to find some papers to download.

R code to make the plots. You must first have installed the package gregmisc.

library(gregmisc)   # provides the running() function
par(mfrow=c(2,2))   # put the four simulated series on one page
for (i in 1:4){
  x = rpois(40, 10)   # 40 independent Poisson(10) counts, one per "year" 1966-2005
  plot(1966:2005, x, type='l', xlab="", ylab="", axes=F)
  axis(1)   # draw only the year axis
  lines(1966:2005, running(x, width=9, pad=TRUE, fun=mean), lwd=2, col="#CC7711")   # overlay the 9-year running mean
}
December 28, 2007 | No comments

Were the cannonballs on or off the road first?

There’s something of a controversy over whether photographer Roger Fenton placed cannon balls in a road and then took pictures of them. He also took a picture of the same road cleared of cannon balls. Apparently, there is a question whether the cannon balls were ON the road when he got there, or whether they were OFF and he placed them on the road to get a more dramatic photo. This drama unfolds at Errol Morris’s New York Times blog.

The question of whether they were first ON or OFF (Morris uses the capital letters, so I will, too) excited considerable interest, with hundreds of people commenting one way or the other, each commenter offering some evidence to support his position.

Some people used the number (Morris uses the ‘#’ symbol) and position of the balls, others argued from sun shadows, some had words about gravity, and so on. Morris compiled the evidence used by both sides, ON (cannon balls on the road first) and OFF (cannon balls placed there by Fenton), and he presented this summary picture (go to his blog to see the full-sized image):

Morris cannonball evidence pic

This is an awful graph: the order of evidence types is arbitrary (it would have been better to list them in order of importance); the use of color is overwhelming and difficult to follow; and, worst of all, the two graphs are on an absolute scale. 288 people supported ON, and 153 OFF, so counting the absolute numbers and comparing them, as this picture does, is not fair. Of course the ON side, with almost twice as many people, will have higher counts in most of the bins. What’s needed is a percentage comparison.
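To illustrate the kind of comparison I mean, here is a rough sketch in R: the evidence categories and per-category counts are invented for the example, and only the totals (288 for ON, 153 for OFF) come from Morris’s summary.

# invented counts for three evidence categories; only the totals
# (288 ON, 153 OFF) are taken from Morris's post
on  = c(shadows=120, position=100, other=68)   # sums to 288
off = c(shadows= 50, position= 60, other=43)   # sums to 153

# express each side's counts as a percentage of that side's total,
# so the two groups can be compared fairly
round(100 * on / sum(on), 1)
round(100 * off / sum(off), 1)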


December 27, 2007 | No comments

Will Smith on reprogramming Hitler

Roger Kimball, in his blog, has an entry on the actor Will Smith’s “Reprogramming Hitler” comments. The subject is benevolence. It is well worth reading.

A quote: “The Australian philosopher David Stove got to the heart of the problem when he pointed out that it is precisely this combination of universal benevolence fired by uncompromising moralism that underwrites the cult of political correctness.” He goes on to quote Stove at length (go to the original site to read).

I thought it would be helpful to extend Stove’s quote. To those who suppose that “Ought not wrongs to be righted?” is a rhetorical question, Stove writes:

It does not follow, from something’s being morally wrong, that it ought to be removed. It does not follow that it would be morally preferable if that thing did not exist. It does not even follow that we have any moral obligations to try to remove it. X might be wrong, yet every alternative to X be as wrong as X is, or more wrong. It might be that even any attempt to remove X is as wrong as X is, or more so. It might be that every alternative to X, and any attempt to remove X, though not itself wrong, inevitably has effects which are as wrong as X, or worse. The inference fails yet again if (as most philosophers believe) “ought” implies “can.” For in that case there are at least some evils, namely the necessary evils, which no one can have any obligation to remove.

These are purely logical truths. But they are also truths which, at most periods of history, common experience of life has brought home to everyone of even moderate intelligence. That almost every decision is a choice among evils; that the best is the inveterate enemy of the good; that the road to hell is paved with good intentions; such proverbial dicta are among the most certain, as well as the most widely known, lessons of experience. But somehow or other, complete immunity to them is at once conferred upon anyone who attends a modern university.

David Stove, On Enlightenment, Transaction Publishers, New Brunswick, New Jersey, p. 174
December 26, 2007 | 8 Comments

How many false studies in medicine are published every year?

Many, even most, studies that contain a statistical component use frequentist, also called classical, techniques. The gist of those methods is this: data is collected, a probability model for that data is proposed, a function of the observed data—a statistic—is calculated, and then a thing called the p-value is calculated.

If the p-value is less than the magic number of 0.05, the results are said to be “statistically significant” and we are asked to believe that the study’s results are true.

I’ll not talk here in detail about p-values; but briefly, to calculate it, a belief about certain mathematical parameters (or indexes) of the probability models is stated. It is usually that these parameters equal 0. If the parameters truly are equal to 0, then the study is said to have no result. Roughly, the p-value is the probability of seeing another statistic (in infinite repetitions of the experiment) larger than the statistic the researcher got in this study, assuming that the parameters in fact equal 0.

For example, suppose we are testing the difference between a drug and a placebo. If there truly is no difference in effect between the two, i.e. the parameters are actually equal to 0, then 1 out of 20 times we did this experiment, we would expect to see a p-value less than 0.05, and so falsely conclude that there is a statistically significant difference between the drug and placebo. We would be making a mistake, and the published study would be false.
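If you want to see that 1-in-20 rate for yourself, here is a small simulation sketch in R; the sample size of 50 per arm and the normal distribution are illustrative assumptions, not taken from any real trial.

# simulate 10,000 studies in which drug and placebo truly do not differ
set.seed(1)
p.values = replicate(10000, {
  drug    = rnorm(50)   # 50 pretend patients on the drug
  placebo = rnorm(50)   # 50 pretend patients on placebo, same distribution
  t.test(drug, placebo)$p.value
})

# fraction of studies falsely declared "statistically significant"
mean(p.values < 0.05)   # close to 1/20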

Is 1 out of 20 a lot?

Suppose, as is true, that about 10,000 issues of medical journals are published in the world each year. This is about right to within an order of magnitude. The number may seem surprisingly large, but there are an enormous number of specialty journals, in many languages, hundreds coming out monthly or quarterly, so a total of 10,000 over the course of the year is not too far wrong.

Estimate that each journal has about 10 studies it is reporting on. That’s about right, too: some journals report dozens, others only one or two; the average is around 10.

So that’s 10,000 x 10 = 100,000 studies that come out each year, in medicine alone.

If all of these used the p-value method to decide significance, then about 1 out of 20 studies will be falsely reported as true, thus about 5000 studies will be reported as true but will actually be false. And these will be in the best journals, done by the best people, and taking place at the best universities.

It’s actually worse than this. Most published studies do not have just one result which is reported on (and found by p-value methods). Typically, if the main effect the researchers were hoping to find is insignificant, the search for other interesting effects in the data is commenced. Other studies look for more than one effect by design. Plus, for all papers, there are usually many subsidiary questions that are asked of the data. It is no exaggeration, then, to estimate that 10 (or even more) questions are asked of each study.

Let’s imagine that a paper will report a “success” if at least one of the 10 questions gives a p-value less than the magic number. Suppose for fun that every question in every study in every paper is false. We can then calculate the chance that a given paper falsely reports success: it is just over 40%.
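The arithmetic behind that figure: if each of the 10 questions independently has a 1-in-20 chance of a false “success”, the chance that at least one of them succeeds is one minus the chance that all ten fail.

# chance that at least one of 10 independent null questions
# slips under the 0.05 threshold
1 - 0.95^10   # about 0.40, i.e. just over 40%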

This would mean that about 40,000 out of the 100,000 studies each year would falsely claim success!

That’s too high a rate for actual papers—after all, many research questions are asked which have a high prior probability of being true—but the 5000 out of 100,000 is also too low because the temptation to go fishing in the data is too high. It is far too easy to make these kinds of mistakes using classical statistics.

The lesson, however, is clear: read all reports, especially in medicine, with a skeptical eye.