January 3, 2008 | No comments

Oil prices through time: a well-done graph

Wall Street Journal Oil prices
Source: Wall Street Journal

This image, shrunk 50% from its original size, is a very well-done statistical graph. It shows, in the pale green line, the inflation-adjusted price of a barrel of oil from 1975 to 2007. The blue line is the nominal price in dollars (which are “dollars at the time,” and obviously not adjusted for inflation).

This is exactly the way to present price data through time. The recent three-year surge in oil prices is remarkable, and more shocking using only the nominal data, but it is also more misleading. The inflation-adjusted price is more honest and shows that we have reached these levels before.

The one major difference between the two peaks of today and of 1980 is the time it took to reach the peak. It took us three to four years of steady increase to reach the current levels. Back then, the rise was truly dramatic, taking place in just a little over a year. Of course, we do not know whether or not we are at the peak now.

The other good feature of this graph is the bullet points at various interesting places. If you mouse over the original you can see explanations of the dates.

There are four other plots that accompany the oil-price chart. These are not so well done, especially the “Use vs. Population” charts. The idea is to show the per-capita use of oil per country. But nowhere are the actual per-capita figures printed, though the population size and oil use per day numbers are given (in separate locations). A map which attempts to show these figures is presented, but the legend is confusing and the map hard to read.

The graphs “The China Factor” and “Global Oil Picture” suffer from the common flaws associated with bar and pie charts, so I won’t go over them. The last chart, “The Biggest Users,” does present the per-capita demand in a useful way, and also shows the change in this measure from the late 1990s to the current time. This part of the graph could use some help, and I’ll suggest some changes for it later.

January 2, 2008 | No comments

Frank Furedi on the global warming apocalypse

Frank Furedi, of Spiked Online, has a delightful article on the daily barrage of panic we hear, in which he says, “In the past year, the threat of doom - from weather, terror or disease - became an everyday, even banal issue. It’s time to inject a dose of humanism into public debate.”

The thing that I have noticed, in talking about global warming to civilians, is not just the readiness of people to believe that doom is just around the corner (this, after all, is what Frank Furedi shows us is the relentless message of the day). What is strange to me is people’s eagerness to believe the worst pronouncements. Even after you show people that the most apocalyptic claims are nonsense and are politically motivated (e.g. Al Gore’s imminent flooding of Manhattan), they still retain an ardent desire for the worst to be true.

It is this desire that must be investigated. The following quote from the Furedi article might help:

In response to the growing influence of misanthropy, Pope Benedict XVI, in his message for World Peace Day on 1 January 2008, felt the need to remind his audience that “respecting the environment does not mean considering material or animal nature more important than man”. That the Pope felt it was necessary to remind people of the unique status of the human species is telling indeed; it shows that we really do live in an era when most leaders find it difficult to believe in anything other than a scary future, and where it takes a Pope to remind them that humans are actually quite special.

January 1, 2008 | 1 Comment

Calculated Risks: How to know when numbers deceive you: Gerd Gigerenzer

Gerd Gigerenzer, Simon and Schuster, New York, 310 pp., ISBN 0-7432-0556-1, $25.00

Should healthy women get regular mammograms to screen for breast cancer?

The surprising answer, according to this wonderful new book by psychology professor Gerd Gigerenzer, is, at least for most women, probably not.

Deciding whether to have a mammogram or other medical screening (the book examines several) requires people to calculate the risk that is inherent in taking these tests. This risk is usually poorly known or communicated and, because of this, people can make the wrong decisions and suffer unnecessarily.

What risk, you might ask, is there for an asymptomatic woman in having a mammogram? To answer that, look at what could happen.

The mammogram could correctly indicate no cancer, in which case the woman goes away happy. It could also correctly indicate true cancer, in which case the woman goes away sad and must consider treatment.

Are these all the possibilities? Not quite. The test could also indicate that no cancer is present when it is really there: the test could miss the cancer. This gives false hope and causes a delay in treatment.

But also scary and far more likely is that the test could indicate that cancer is present when it is not. This outcome is called a false positive, and it is Gigerenzer’s contention that the presence of these false positives is ignored or minimized by both the medical profession and by interest groups whose existence is predicated on advocating frequent mammograms (or other disease screenings, such as for prostate cancer or AIDS).

Doctors like to provide an “illusion of certainty” when, in fact, there is always uncertainty in any test. Doctors and test advocates seem to be unaware of this uncertainty; they have different goals than do the patients who will receive the tests, and they ignore the costs of false positives.

How is the uncertainty of a test calculated? Here is the standard example, given in every introductory statistics book, that does the job. This example, using numbers from Gigerenzer, might look confusing, but read through it because its complexity is central to understanding his thesis.

If the base rate probability of breast cancer is 0.8% (the rate of cancer in women in the entire country), and the sensitivity (ability to diagnose the cancer when it is truly there) and specificity (ability to diagnose no cancer when it is truly not there) of the examination for cancer are 90% and 93%, respectively, then given that someone tests positive for cancer, what is the true probability that this person actually has cancer?

To answer the question requires a tool called Bayes’ Rule. Gigerenzer has shown here, and in other research, that this tool is unnatural and difficult to use and that people consistently estimate the answer poorly. Can you guess what the answer is?

Most people incorrectly guess 90% or higher, but the correct answer is only 9%, that is, only 1 woman out of every 11 who tests positive for breast cancer actually has the disease, while the remaining 10 do not.
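For readers who want to check the arithmetic, here is a minimal sketch of the Bayes' Rule calculation in R. It uses only the three numbers quoted above; the variable names are made up for illustration and do not come from the book.

```r
# Numbers quoted in the review (from Gigerenzer)
base_rate   <- 0.008  # P(cancer): 0.8% of women
sensitivity <- 0.90   # P(test positive | cancer)
specificity <- 0.93   # P(test negative | no cancer)

# Bayes' Rule:
#   P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
true_pos  <- sensitivity * base_rate              # has cancer and tests positive
false_pos <- (1 - specificity) * (1 - base_rate)  # no cancer but tests positive
posterior <- true_pos / (true_pos + false_pos)

# Prints a probability of roughly 0.09, i.e. about 1 woman in 11
print(round(posterior, 3))
```

Note how the false positives dominate the denominator: the disease is rare, so even a modest false-positive rate applied to the many healthy women swamps the true positives.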

If people instead get the same question with the background information in the form of frequencies instead of probabilities, they do much better. The same example with frequencies is this: If out of every 1000 women 77 test positive for breast cancer, and 7 of these 77 actually have the disease, then given that someone tests positive for cancer, what is the true probability that this person actually has cancer?

The answer now jumps out (7 out of 77) and is even obvious, which is Gigerenzer’s point. Providing diagnostic information in the form of frequencies benefits both patient and doctor because both will have a better understanding of the true risk.
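As a rough check, the frequencies can be rebuilt from the probabilities given earlier (0.8% base rate, 90% sensitivity, 93% specificity) by following 1000 imaginary women through the test. This is only a sketch; the variable names are hypothetical, and rounding to whole women is my simplification.

```r
women <- 1000

with_cancer    <- women * 0.008                       # about 8 women
true_positive  <- with_cancer * 0.90                  # about 7 of them test positive
false_positive <- (women - with_cancer) * (1 - 0.93)  # about 70 healthy women test positive
all_positive   <- true_positive + false_positive      # about 77 positives in total

# Rounded to whole women, this is the "7 out of 77" from the text
print(c(round(true_positive), round(all_positive)))
print(round(true_positive) / round(all_positive))
```

The ratio comes out near 0.09, matching the Bayes' Rule answer, but here it can be read straight off the counts.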

What are the costs of false positives? For breast cancer, there are several. Emotional turmoil is the most obvious: testing positive for a dread disease can be debilitating and the increased stress can influence the health of the patient negatively. There is also the pain of undergoing unnecessary treatment, such as mastectomies and lumpectomies. Obviously, there is also a monetary cost.

Mammograms can show a noninvasive cancer called ductal carcinoma in situ, which is predominantly nonfatal and needs no treatment, but is initially flagged as a possible cancer. There is also evidence that the radiation from the mammogram increases the risk of true breast cancer!

These costs are typically ignored and doctors and advocates usually do not acknowledge the fact that false positives are possible. Doctors suggest many tests to be on the safe side, but what is the safe side for them is not necessarily the safe side for you. Better for the doctor to have asked for a test and found nothing than to have not asked for the test and missed a tumor, thus risking malpractice.

This asymmetry shows that the goals of patients and doctors are not the same. The same is true for advocacy groups. Gigerenzer studied brochures from these (breast cancer awareness) groups in Germany and the U.S. and found that most do not mention the possibility of a false positive, nor the costs associated with one.

Ignoring the negative costs of testing makes it easier to frighten women into having mammograms, and he stresses that, “exaggerated fears of breast cancer may serve certain interest groups, but not the interests of women.”

Mammograms are only one topic explored in this book. Others include prostate screenings (“where there is no evidence that screening reduces mortality”), AIDS counseling, wife battering, and DNA fingerprinting.

Studies of AIDS advocacy groups’ brochures revealed the same as in the breast cancer case: the possibility of false positives for screenings and the costs associated with these mistakes were ignored or minimized.

Gigerenzer even shows how attorney Alan Dershowitz made fundamental mistakes calculating the probable guilt of O.J. Simpson, mistakes that would have been obvious had Dershowitz used frequencies instead of probabilities.

The book closes with tongue-in-cheek examples of how to cheat people by exploiting their probabilistic innumeracy, and includes several fun problems.

Gigerenzer stresses that students have a high motivation to learn statistics but that it is typically poorly taught. He shows that people’s difficulties with numbers can be overcome and that it is in our best interest to become numerate.

December 30, 2007 | 1 Comment

Hurricanes have not increased: misuse of running means

Most statistics purporting to show that there has been an increase in hurricanes do not use the best statistical methods. I want to highlight one particular method that is often misused, and which can lead one to falsely conclude that trends (increasing or decreasing) are present when they actually are not. Read my original post to learn more about this.

That technique is the running mean. As you can see in the rather dramatic graphic from Science Daily, a 9-year running mean has been plotted over the actual hurricane numbers (up to 2005 only) in the North Atlantic. It looks like, in later years, a dramatic upswing is taking place, doesn’t it? This type of plot has shown up in many scientific, peer-reviewed papers.

Science Daily hurricane running mean

Don’t be turned off by the equations! Something very surprising is coming at the end of this article and you will be rewarded if you read to the end.

What is a running mean? A p-year running mean converts a series of actual, observed numbers into a statistical estimate, or model, of what a supposed underlying trend of those numbers might actually be. Because it is a model, its use must first be justified. In math, a 5-year running mean looks like

Running mean equation

where the symbol y indicates the hurricane numbers, and the subscripts t, t-1 and so on indicate the time period: time t is now, time t-1 was last year (now minus one year) and so on. The superscript on the symbol y to the left of the equal sign indicates that this is the modified data value plotted, and is not the actual number. Even if you’re afraid of math, this equation should be fairly easy to understand: the current, modified, number is just the mean of the last 4 observations and the most current one.
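Written out from the description above, a 5-year running mean would look something like the following. The asterisk is only my assumed stand-in for the superscript that marks the modified value; the original notation may differ.

```latex
y_t^{*} = \frac{y_t + y_{t-1} + y_{t-2} + y_{t-3} + y_{t-4}}{5}
```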

Additionally, in the mathematics of time series models, an auto-regressive series of order 5 is written like this

Autoregressive formula

which shows how the current data point is predicted by a weighted sum of past values, where the weights are the coefficients. Just set every coefficient to 1/5 and you have a running mean structure similar to that above. The point is this: using a running mean implies an underlying statistical time series model. Which is OK, as long as the data support such a model.
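In the usual textbook notation, an auto-regressive series of order 5 can be sketched as below. The symbols for the coefficients and the noise term (phi and epsilon) are the conventional choices and are assumptions on my part, not necessarily the ones in the original formula.

```latex
y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \phi_5 y_{t-5} + \varepsilon_t
```

Setting every coefficient to 1/5 turns the weighted sum into a plain mean of five past values, which is why the two structures are so alike.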

Do they for hurricanes? No.

In order to justify using auto-regressive time series models, you start by looking at something called an auto-correlation plot, which is a plot of how each year’s number of hurricanes is correlated with the previous year’s number, and how this year’s number of hurricanes is correlated with the number of hurricanes from two years ago and so on: the number of previous years is called the lag. If any of these correlation lags are significant, then you can use an auto-regressive time series model for this data. If none of these correlations are significant, then you cannot.

Here is a picture of the auto-correlation of hurricane number (number of storms s) for the North Atlantic using data from 1966 to 2006.

None of the correlations reach above the horizontal dashed lines, which means that none are significant, and so a simple running mean should not be used to represent North Atlantic hurricane numbers.
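The check described above can be sketched in R with the built-in acf() function. The real hurricane counts are not reproduced here, so the series below is a made-up placeholder (independent Poisson counts); swap in the actual yearly numbers to repeat the check on real data. The names storms, bound and lags_over are hypothetical.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Placeholder data: 41 independent yearly counts standing in for 1966-2006
storms <- rpois(41, lambda = 10)

# Auto-correlations at lags 0 through 10, without drawing the plot
ac <- acf(storms, lag.max = 10, plot = FALSE)

# Approximate 95% bound; acf() draws its dashed lines at +/- this value
bound <- qnorm(0.975) / sqrt(length(storms))

# Lags (1 to 10) whose correlation falls outside the dashed lines
lags_over <- which(abs(ac$acf[-1]) > bound)
print(lags_over)
```

Calling acf(storms) with the default plot = TRUE draws the picture with the horizontal dashed lines directly.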

So far, so good, right? Now let’s look at some made up, fictional data. Take a look at the following pictures, which are all simulated hurricane numbers; one of them looks pretty close to what the real data looks like. The running-mean even shows a healthy upward trend, no doubt due to global warming. But what do these pictures really show?

Simulated data with 9-year running mean

To get this data (the R code to make it yourself is pasted below), I simulated hurricane numbers (Poisson with mean 10) for pretend years 1966 to 2005, four separate times. Each year’s number is absolutely independent of each other year: to emphasize, these are totally random numbers with no relationship through time. I also over-plotted a 9-year running mean (red line). Because all of these numbers are independent of one another, what we should see is a flat line (with a mean of 10). The reason we do not is natural variation.

I only had to run this simulation once, but pay attention to the lower-right hand numbers: I got something that looks like the actual North Atlantic hurricane numbers. The 9-year running mean is over-emphasizing, to the eye, a trend that is not there! Actually, this happens in two of the other simulated series as well. Only one shows what we would expect: a (sort of) straight line.
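If you do not have the gregmisc package handy, the experiment can be roughly reproduced in base R. This is a sketch and not the code used for the plots above: it swaps in stats::filter for the running mean, picks an arbitrary seed, and the helper running_mean9 is a hypothetical name, so the four panels will not match the pictures exactly.

```r
set.seed(2007)  # arbitrary seed, chosen only for reproducibility
years <- 1966:2005

# Trailing 9-year running mean: the current year plus the 8 before it.
# The first 8 values are NA because a full window is not yet available.
running_mean9 <- function(x) {
  as.numeric(stats::filter(x, rep(1 / 9, 9), sides = 1))
}

op <- par(mfrow = c(2, 2))  # four panels on one page
sims <- vector("list", 4)
for (i in 1:4) {
  # Independent Poisson counts with mean 10: no trend by construction
  x <- rpois(length(years), lambda = 10)
  sims[[i]] <- x
  plot(years, x, type = "h", xlab = "Year", ylab = "Simulated hurricanes")
  lines(years, running_mean9(x), lwd = 2, col = "red")
}
par(op)
```

Rerun it with different seeds and you will often find at least one panel where the red line appears to climb or fall, even though nothing in the simulation changes through time.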

Like I said, I am including the code I used to make these plots so that, if you are curious, you will see how exceptionally easy this is to do.

Good statistical models are hard to do. See some of these posts for more discussion and to find some papers to download.

R code to make the plots. You must first have installed the package gregmisc.

for (i in 1:4){
lines(1966:2005,running(x, width=9, pad=TRUE, fun=mean),lwd=2,col="#CC7711")