Time series are the most abused statistics in the physical sciences. (It’s an endless, raucous, peer-reviewed contest for the worst in the “soft sciences.”)
The Mann problem (is that a typo?) is that time series are too easy to plot, analyze, and make pontifical projections about. The fault is mine, and my brother and sister statisticians’. In our joy of numbers we have set the bar of entry far too low. With our free and mindless software anybody can play with numbers.
How many times must we shout, remonstrate, warn, and caution not to wantonly and without rock-solid justification smooth time series? I’ll tell you how many times: forever.
By way of example, here is a well done time series that, despite valiant efforts, falls far short of perfection. The data are from our friend Harold Brooks, who is at the Severe Storms Laboratory in Norman, Oklahoma, and knows all about tornadoes. He collected direct deaths from tornadoes in the United States from 1875 to 2012. The data “represent our best understanding at this time”, meaning there is error in the numbers. (I retrieved US population from this table and the Census Bureau.)
A “direct” death is caused by the tornado itself, such as being knocked on the head. Indirect deaths are excluded. Brooks says “Examples of indirect deaths that have occurred include a heart attack upon seeing damage to a neighbor’s house, falls when going to shelter, and a fire caused by a candle lit when the power went out after a tornado.”
Brooks has his own picture of his data (larger version here), which shows the deaths per million citizens in purple dots, a smoother in red, a couple of linear regressions in green, and some projections from the latter regression in cyan. Notice the logarithmic scale.
To prove that the way of plotting can make an enormous difference, here is my version of the same data.
The first problem comes in plotting: it shouldn’t have been done. Making a time-series plot tells viewers these data are the same, that they are caused by the same thing. But here the data are not the same and are not caused by the same thing.
Consider. These are reported deaths, and there is some ambiguity between “direct” and “indirect” deaths. Given our media is obsessed with all things environmental, it’s unlikely the counts for the past few decades are in error, but some indirect deaths may have been mistakenly classified as direct.
Historically, the counts are probably too low. Every death, especially in rural areas, might not have been reported, and the difference between kinds of deaths is more tangled.
The idea of normalizing by population makes some sense, but the entire US population? How many tornadoes are found in Alaska, Wyoming, or even California where population change was largest? It would have been better to examine the population density in the places tornadoes hit.
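To make the denominator question concrete, here is a minimal Python sketch. The death count, both populations, and the very notion of an “exposed” population are invented for illustration:

```python
# A sketch of the normalization at issue; every number here is invented.
deaths = 550                        # hypothetical direct deaths in one year
us_population = 315_000_000         # hypothetical national population
exposed_population = 45_000_000     # hypothetical tornado-prone population

national_rate = deaths / us_population * 1_000_000
regional_rate = deaths / exposed_population * 1_000_000

print(f"Deaths per million, national denominator: {national_rate:.2f}")
print(f"Deaths per million, exposed denominator:  {regional_rate:.2f}")
```

The same counts give very different rates depending on who is assumed to be at risk, which is the whole point of questioning the denominator.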
Medicine, particularly emergency medicine, has improved immensely over the past fifty years. This would tend to lower deaths.
Housing construction both improved and degraded. Normal, “stick built” houses got better, but they tend latterly to be built in clusters, and when a tornado hits a cluster, well, you know what happens (see that bump in 2011?). In 1875 there were no trailer parks (Brooks and Doswell have a nice sub-analysis of trailer park deaths). The overall effect of housing changes can only be a crude guess unless we examine each death and each non-death in detail, a gargantuan task.
Our friend David Legates reminds us that meteorology was barely a science a century ago. Warnings now, especially daylight warnings and in tornado-prone areas, are pretty darn good.
Because of all these and a few more considerations, it’s clear that the data is not the same through time, even though it’s been given the same name through time. There is therefore no justification for any kind of statistical model, especially a smoother.
Smoothers replace the data with guesses of the data, a screwy thing to do. Why substitute uncertainty for certainty? And here the data have measurement error, in the counts themselves and not in why the counts have changed. Why they changed isn’t justifiably quantifiable.
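To see the substitution in the plainest terms, a minimal sketch (the counts, the Poisson draw, and the window length are all my inventions, not Brooks’s data):

```python
import numpy as np

# A sketch of the objection: a smoother replaces each observed count with
# a weighted guess built from its neighbors. Counts here are invented.
rng = np.random.default_rng(42)
years = np.arange(1990, 2010)
deaths = rng.poisson(lam=60, size=years.size)   # hypothetical annual counts

window = 5                                      # 5-year moving average (assumed)
smoothed = np.convolve(deaths, np.ones(window) / window, mode="same")

for y, obs, fit in zip(years, deaths, smoothed):
    # No year keeps its own count; each gets a blend of its neighbors.
    print(f"{y}: observed {obs:3d}, smoothed {fit:6.1f}")
```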
The regression is also misplaced. We don’t need it to “estimate” counts (or percentiles of counts) which we already know. There is a case to be made for a measurement-error model, but to implement it we’d have to know the characteristics of missing data, which we don’t have.
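A toy calculation shows why. The observed count and the candidate reporting rates below are invented; the point is only that the “correction” is driven entirely by an assumption:

```python
# A sketch of why a measurement-error model needs what we lack: the
# "corrected" historical count depends entirely on an assumed reporting
# rate, which we do not know.
observed_deaths_1900 = 80            # hypothetical reported count
for reporting_rate in (0.6, 0.8, 0.95):
    corrected = observed_deaths_1900 / reporting_rate
    print(f"assumed reporting rate {reporting_rate:.2f} -> corrected count {corrected:.0f}")
```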
Finally, as repeatedly emphasized, the nature of the cause of “direct” deaths has changed in ways that no model which isn’t mostly a fiction can quantify. No: the fairest thing to do is to present the data in a tabular or descriptive way and avoid unnecessary quantification, which only serves to boost over-certainty.
————————————————–
Thanks to our friend Willie Soon for alerting us to this topic.
The indirect death issue for tornadoes is small, based on detailed studies of a number of events. As examples: 3 indirect with the 36 direct for the 3 May 1999 OKC tornado, 3 indirect with the 158 direct for Joplin, and 1 indirect with the 24 direct for the tornado of 20 May 2013. 10% is likely to be an upper bound on the fraction of indirect deaths. This is small, particularly in comparison to hurricanes, where the indirect deaths can be as many as or more than the direct deaths (e.g., Andrew in 1992).
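For reference, the fractions work out as follows (a quick sketch; the labels are shorthand for the events above):

```python
# Indirect fraction for the events cited above: indirect / (direct + indirect).
events = {
    "OKC, 3 May 1999": (3, 36),
    "Joplin, 2011": (3, 158),
    "Moore, 20 May 2013": (1, 24),
}
for name, (indirect, direct) in events.items():
    print(f"{name}: {indirect / (direct + indirect):.1%} indirect")
```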
It is unlikely that a large number of deaths were missed back as far as 1875. Rural newspapers were a primary source for the Signal Corps, the US Weather Bureau, and, later, Tom Grazulis’s work, and they tended to do a good job of covering such information. There is a noticeable change in the data prior to 1875.
You are incorrect to state that the “data is (sic) not the same through time.” The issues you bring up (construction, medicine, etc.) do not change what a death is. They may affect why the rate of death has changed over the years, which is why we discuss in the paper that we do not have an explanation for why the death rate declined after 1925. We bring up communication, forecasting, construction, education, etc., and the only one that has a clear impact is the rate of growth of the mobile home population. It’s amusing that you accuse us of boosting over-certainty when one of the main conclusions of the paper is “We cannot determine the importance of the various factors that have helped the decrease, so we cannot isolate the importance of forecasting or any other particular activity contributing to the overall reduction.” Confusing cause and occurrence of death is an interpretative error. I am very confident that the rate of death in tornadoes dropped by an order of magnitude between the early part of the 20th century and the latter part, and has since stopped decreasing. Why the rate has decreased is unknown.
Hi Harold!
You know I still love you, right? Everybody should, as you suggest, read your paper for all the caveats.
The data aren’t (this time I get it right!) the same throughout, at least not in the sense needed for a time-series-only model applied to them to have much validity. This is where the over-certainty comes in. Not in your excellent collection or explanations. But in the modeling.
Assume no measurement error. Why run a smoother? The smoother replaces the data with made-up data. This replaces certainty with uncertainty. It makes it look like some kind of “driver” is smoothly changing the data. The data are (not is!) the data.
Same thing for the regression. Now if you were to use this regression model to project future direct deaths, then you might be getting somewhere. But it’s unlikely that this model would be skillful. Do we really expect, on a log-10 scale, that deaths will continue a linear decrease? No, sir, we do not. Especially when deaths are already low.
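A toy version of the worry, with invented rates standing in for the real ones:

```python
import numpy as np

# A sketch of the extrapolation problem: a straight line fit to log10
# rates implies rates that shrink geometrically forever. These rates
# are invented, not the paper's values.
years = np.arange(1925, 2015, 10)     # 1925, 1935, ..., 2005
log10_rate = np.array([0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.2, -0.25, -0.25])

slope, intercept = np.polyfit(years, log10_rate, 1)
for future in (2050, 2100):
    projected = 10 ** (slope * future + intercept)
    print(f"{future}: projected {projected:.3f} deaths per million")  # implausibly small
```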
And then we don’t need the regression estimates of percentiles when we could just pull the exact percentiles from the data. Again, why replace the real data with estimates?
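For instance, a minimal sketch with invented counts:

```python
import numpy as np

# A sketch of the alternative: take percentiles straight from the
# observed counts rather than from a regression. Counts are invented.
deaths = np.array([12, 25, 33, 38, 41, 45, 60, 70, 90, 550])
for q in (10, 50, 90):
    print(f"{q}th percentile: {np.percentile(deaths, q):.0f}")
```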
Overlaying the smoother and regression lines produces over-certainty. It says to the viewer, despite all your correct and genuine efforts at cautioning the reader, that more is going on than we really know. It ties and mixes together the changes, if any, in the climate of tornadoes with the changes, which we acknowledge, in the causes of death.
Readers, especially in the media, are not likely to understand these differences, despite your being clear about them. They’re likely to say, “Look! Tornadoes are growing less deadly!” or something. But that’s a confused statement. Tornadoes, for all we know based only on this plot, may be just as numerous, widespread, and powerful as ever. It’s only medicine and housing that have changed. Or again, tornadoes may have changed in several ways. But we can’t get that off this plot. Over-certainty is produced.
And then the plot is normalized by the wrong thing, or at least a curious thing, the population, as I discussed. This also adds to over-certainty.
Like I said, this was a well-done time series and your efforts were valiant. But I stick by my conclusion that no plot would have been better.
As a user of time series, I find your title confusing. You are not talking about problems with time series but about minor data issues. Harold Brooks has clarified the issues in his response.
Don,
And in my rebuttal I have clarified why there are troubles. See also my first link in the post (to many time series discussions).
I was looking for some movement indicating the Doppler radar, net arrays, advanced warning systems, and local emergency preparedness planning that came more broadly into use during the 1980s.
Great article. I saw much of the same fussing in the HHS study of deaths from smoking (which included deaths from cervical cancer and house fires not necessarily caused by lit cigarettes) and in the study on asthma from coal power plants.
“The data are (not is!) the data.”
Sorry, you are wrong on this one. As it is used in the English language, data is considered a mass noun (measured, not counted, like dirt) and as such it is singular and has no plural form.
Which of the following is correct?
The dirt are dry,
The dirt is dry.
The term data is used in the same way.
Wrong approach. Try survival analysis. Much better.
Briggs:
I know this log plot from elsewhere. Ah yes, Prof. Rabett: http://rabett.blogspot.com/2014/04/compare-and-contrast.html
The middle graph belongs to Pielke, Jr. courtesy of his recent WSJ op-ed, also on the subject of twisters, which in terms of the lack of shouting was a welcome read: http://online.wsj.com/news/articles/SB10001424052702303603904579495581998804074
I expect the third graph in Eli’s post to raise your oil pressure. (No numbers were smoothed in the making of this prediction.)
“Why substitute uncertainty for certainty?”
I note the rhetorical nature of the question as I answer. We don’t like uncertainty. Why write cosmological arguments?
Setting aside Briggs’ more substantive criticisms, why would anyone plot this data semi-log when the range is 0.6 to 8? Most casual observers would guess that “deaths” are going down linearly.
Actually, Wyoming averages around 11 or 12 tornadoes per year (depending on the source). This is more than most of the states to the north and west of Wyoming. Alaska rarely has tornadoes; the average is less than one per year. Of course, in Wyoming there’s not much for the tornadoes to hit, so most go unnoticed (except for the one the Storm Watchers filmed for the Weather Channel, I believe). If you look at a map of averages per state, it becomes very apparent that tornadoes are usually east of the Rocky Mountains. Combine that with the lower population in much of this area and the number of deaths is going to be much lower than in the eastern US. An average gives a really skewed idea of the danger of tornadoes across the US.
Michael: Or less casual observers might see the semi-log plot and infer an exponential relationship with respect to time that might not really be there. If the graph were radioactive decay, a semi-log plot would make perfect sense. Stock market data sometimes makes sense in a semi-log plot, but using trends over time, log-plotting or not, will bust the portfolio eventually.
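For anyone who wants to see the effect, a minimal matplotlib sketch with an invented series:

```python
import numpy as np
import matplotlib.pyplot as plt

# A sketch of the two presentations: the same invented, roughly
# exponentially decaying series on a linear axis and a semi-log axis.
years = np.arange(1900, 2013)
rate = 8 * np.exp(-0.02 * (years - 1900)) + 0.6   # invented series

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_lin.plot(years, rate)
ax_lin.set_title("Linear scale")
ax_log.semilogy(years, rate)
ax_log.set_title("Semi-log scale")
for ax in (ax_lin, ax_log):
    ax.set_xlabel("Year")
    ax.set_ylabel("Deaths per million (invented)")
plt.tight_layout()
plt.show()
```

The eye reads a straight-ish line off the semi-log panel and a leveling curve off the linear one, from identical numbers.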
And that’s his main beef. He makes good points, but they’re not actually substantive to his meta-argument against climate *science*. How that science gets communicated to policymakers and the public is an entirely different story.
Sheri: The Right Way to do it is to ignore state boundaries and use a grid. For each grid cell, guesstimate population, structure types, forecasting and EWS effectiveness, etc., over time. Then plot ground tracks, damage assessments, and loss of life for each cell. Which would be fiendishly expensive and have diminishing returns the further back in time you go.
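Something like this minimal sketch, in which every location, count, and population is invented:

```python
import numpy as np

# A sketch of the gridding idea: bin (invented) tornado-death locations
# into lat/lon cells and rate them against guessed per-cell populations.
rng = np.random.default_rng(0)
n = 200
lats = rng.uniform(30, 45, size=n)        # hypothetical event latitudes
lons = rng.uniform(-100, -80, size=n)     # hypothetical event longitudes
deaths = rng.poisson(2, size=n)           # hypothetical deaths per event

lat_edges = np.linspace(30, 45, 6)
lon_edges = np.linspace(-100, -80, 6)
death_grid, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges],
                                  weights=deaths)

pop_grid = rng.uniform(0.5e6, 5e6, size=death_grid.shape)  # guessed populations
print(np.round(death_grid / pop_grid * 1e6, 2))  # deaths per million, per cell
```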
Doing so might give results that are closer to reality, but a better focus might be on data gathering going forward, the aim being to improve storm survivability, i.e., keeping people not-dead. Not research as grist for the “is AGW real or not” debate. There are far better large-scale indicators for that than chasing after tornadoes.
Whenever I see a chart I ask what question the chart is answering, what the definition of the measurement is, and in what context the question is being asked. Questions are usually not isolated one-offs but part of a series of critical thinking, where the answer to one question leads to more questions. I have no context for the chart on tornadoes and therefore cannot draw any useful conclusions.
Willie Soon? Still doing his junk science, published in fringe journals?