
The Data Is The Data, Not The Model: With Climatology Time Series Example

How not to plot

The following plot was sent to me yesterday for comment. I cannot disclose the sender, nor the nature of the data, but neither of these is in the least essential to our understanding of this picture and what has gone horribly, but typically, wrong.

[Figure: "How not to think about a time series": the 18 annual values (circles), with a fitted regression line (green) and 95% confidence band (gray)]

There is one data point per year, measured unambiguously, with the item taking values in the range from the mid 20s to the high 50s. Let's suppose, to avoid tortured language, that the little round circles represent temperatures at a location, measured, as I say, unambiguously, without error, and such that the manner of measurement was identical each year.

What we are about to discuss applies to any—as in any—plot of data which is measured in this fashion. It could be money instead of temperature, or counts of people, or numbers of an item manufactured, etc. Do not fixate on temperature, though it’s handy to use for illustration, the abuses of which we’ll speak being common there.

The little circles, to emphasize, are the data and are accurate. There is nothing wrong with them. As the box to the right tells us, there are 18 values. The green line represents a regression (values as a linear function of year); as the legend notes, the gray area shows the 95% confidence limits. Let's not argue why frequentism is wrong and that, if anything, we should have produced credible intervals. Just imagine these are credible intervals. The legend also has an entry for “95% Prediction Limits”, but this isn't plotted. Ignore this for now. The box to the right gives details on the “goodness” of fit of this model, R-Square, MSE, and the like.

Questions of the data

Now let me ask a simple question of this data: did the temperature go down?

Did you say yes? Then you’re right. And wrong. Did you instead say no? Then you too are right. And just as wrong.

The question is not simple and is ill phrased: as it is written, it is ambiguous. Let me ask a better question: did the temperature go down from 1993 to 2010? The only answer is yes. What instead if I asked: what is the probability the temperature went down from 1993 to 2010? The only answer (given this evidence) is 1. It is 100% certain the temperature decreased from 1993 to 2010.

How about this one? Did the temperature go down from 1993 to 2007? The answer is no; it is 100% certain the temperature increased. And so forth for other pairs of dates (or other precise questions). The data is the data.

Did the temperature go down in general? This seems to make sense; the eye goes to the green line, and we’re tempted to say yes. But “in general” is ambiguous. Does it mean that, from year to year, there were more decreases than increases? There were half of each: so, no. Does the question mean that temps in 2001 or 2002 were lower than in 1993 but higher than in 2010? Then yes, but barely. Does it mean the mean temp from 1993 to 2001 was higher than the mean from 2002 to 2010? Then maybe (I didn’t do the math).
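
Every one of these precise questions has an answer computable, with certainty, from the data alone. A minimal sketch in Python; since the real 18 values are not published, the numbers below are invented stand-ins chosen only to match the description (range from the mid 20s to the high 50s, 2001 at 25):

```python
# Invented stand-in values for the 18 observed temperatures; the real
# series is not published. Every precise question of observed data has
# a certain answer -- no model required.
temps = {year: t for year, t in zip(
    range(1993, 2011),
    [41, 38, 44, 47, 39, 42, 45, 25, 25, 27, 26, 40, 43, 46, 57, 41, 27, 26],
)}

# Did the temperature go down from 1993 to 2010? Yes or no, with certainty.
print(temps[2010] < temps[1993])            # True for these stand-in values

# From year to year, how many decreases versus increases?
years = sorted(temps)
diffs = [temps[b] - temps[a] for a, b in zip(years, years[1:])]
print(sum(d < 0 for d in diffs), sum(d > 0 for d in diffs))

# Mean of the first nine years versus mean of the last nine.
first, last = [temps[y] for y in years[:9]], [temps[y] for y in years[9:]]
print(sum(first) / 9, sum(last) / 9)
```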

Asking an ambiguous question lets the user “fill in the blank”; different opinions can be had merely because nobody is being precise. What we should do is just plot the data and leave it at that. Any question we can ask can be answered with 100% certainty. The data is the data. That green line—which is not the data—and particularly that gray envelope is an enormous distraction. So why plot it?

What is a trend?

It appears as if somebody asked: was there a trend? Again, this is ambiguous. What’s a “trend”? This person thought it meant the straight line one could draw with a regression. That means this person said it was 100% certain that this regression model was the correct one; that no other model could represent the uncertainty in the observed data. But there are many, many, many other meanings of “trend”, and many other models which are possibilities.

No matter which model is chosen, no matter what, the data trumps the model. The green line is not the data. The data is the data. It makes no sense to abandon the data and speak only of the model (or its parameters). You cannot say “temperatures decreased,” for we have already seen this is false or true depending on the years chosen. You can say “there was a negative trend,” but only conditional on the model being true. And even then a negative trend in the model does not always correspond to a negative change in the data.
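
A toy illustration of that last sentence, with numbers invented purely for this purpose: the fitted slope can be negative even though the series ends higher than it begins.

```python
import numpy as np

# Invented toy numbers, not the plotted series.
x = np.arange(8)
y = np.array([50.0, 30, 28, 27, 26, 25, 24, 51])

slope, intercept = np.polyfit(x, y, 1)
print(slope)            # negative: the model's "trend" points down...
print(y[-1] > y[0])     # ...yet True: the data ended higher than it began
```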

Assume the regression is the best model of uncertainty. Is the “trend” causal? Does that regression line (or its parameters) cause the temperatures to go down? Surely not. Something physical causes the data to change: the model does not. There are no hidden, underlying forces which the model captures. The model is only of the data’s uncertainty, quantifying the chance the data takes certain values.

But NOT the observed data. Just look at the line: it goes through only one data point. The gray envelope contains only half or fewer of the data points, not 95% of them. In fact, the model is SILENT on the actual, already observed data, which is why it makes no sense to plot a model on top of the data, when the data does not need this assistance. Since the model quantifies uncertainty, and there is no uncertainty in the observed values, the model is of no use to us. It can even be harmful if we, like many do, substitute the model for the data.

We cannot, for instance, say “The mean temperature in 2001, according to the model, was 38.” This is nonsensical. The actual temperature in 2001 was 25, miles away from 38. What does that 38 mean? Not a thing. It quite literally carries no meaning, unless we consider this another way to say “false.” It was 100% certain the temperature in 2001 was 25, so there is no plus or minus to consider, either.

What’s a model for?

Again I say, the data is the data, and the model something else. What, exactly?

Well, since we are supposing this model is the best way to represent our uncertainty in values the data will take, we apply it to new data, yet unseen. We could ask questions like, “Given the data observed and assuming the model true, what is the probability that temperatures in 2011 are greater than 40?” or “Given etc., what is the probability that temps in 1992 were between 10 and 20?” or whatever other years or numbers which tickle our fancy. It is senseless, though, to ask questions of the model about data we have already seen. We should just ask the data itself.
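
As a sketch of how such a question might be answered, assuming the straight-line regression true and reusing the invented stand-in values from above: the probability that the new, unseen year exceeds a threshold comes from the model's predictive distribution (here the usual t form for a new observation, not for the fitted line).

```python
import numpy as np
from scipy import stats

years = np.arange(1993, 2011, dtype=float)
temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                  26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

# Fit on centered years for numerical stability.
xc = years - years.mean()
X = np.column_stack([np.ones_like(xc), xc])
beta, ss_res, *_ = np.linalg.lstsq(X, temps, rcond=None)
n, p = X.shape
s2 = ss_res[0] / (n - p)                     # residual variance estimate

x0 = np.array([1.0, 2011 - years.mean()])    # the new, unseen year 2011
mean0 = x0 @ beta
# Standard error for a NEW observation, not for the fitted mean line:
se0 = np.sqrt(s2 * (1 + x0 @ np.linalg.inv(X.T @ X) @ x0))
print(stats.t.sf((40 - mean0) / se0, df=n - p))  # Pr(T_2011 > 40 | model, data)
```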

Then we must wait, and this is painful, for waiting takes time. A whole year must pass before we can even begin to see whether our model is any good. Even then, it might be that the model “got lucky” (itself ambiguous), so we’d want to wait several years so we can quantify the uncertainty that our model is good.

This pain is so acute in many that they propose abandoning the wait and substituting for it measures of model fit (the R-Squared, etc.). These being declared satisfactory, the deadly process of reification begins and the green line becomes reality, the circles fade to insignificance (right, Gav?). “My God! The temperatures are decreasing out of control!” Sure enough, by 2030, the world looks doomed—if the model is right.

Measures of model fit are of very little value, though, because we could always find a model which recreates the observed data perfectly (fun fact). That is, we can always find a better fitting model. And then we’d still have to wait for new observations to check it.
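
The fun fact is easy to demonstrate: a degree-17 polynomial passes exactly through 18 points, so the "fit" is perfect, and the model is still worthless for new data. A sketch on the same stand-in values:

```python
import numpy as np
from numpy.polynomial import Chebyshev

years = np.arange(1993, 2011, dtype=float)
temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                  26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

# Degree 17 through 18 points: an exact interpolation, by construction.
# (Chebyshev basis keeps the high-degree fit numerically sane.)
fit = Chebyshev.fit(years, temps, deg=len(temps) - 1)
fitted = fit(years)

ss_res = np.sum((temps - fitted) ** 2)
ss_tot = np.sum((temps - temps.mean()) ** 2)
print(1 - ss_res / ss_tot)   # R-squared of 1, up to rounding error
```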

Lastly, if we were to plot future values, then we’d want to use the (unseen) prediction limits, and not the far-far-far-too-narrow confidence limits. The confidence limits have nothing to say about actual observable data and are of no real use.
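
A sketch of the difference, using statsmodels on the stand-in values: the confidence band speaks only of the regression line, the prediction band of observable data, and only the latter is wide enough to cover actual observations.

```python
import numpy as np
import statsmodels.api as sm

years = np.arange(1993, 2011, dtype=float)
temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                  26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

X = sm.add_constant(years)
fit = sm.OLS(temps, X).fit()
frame = fit.get_prediction(X).summary_frame(alpha=0.05)

in_ci = ((temps >= frame["mean_ci_lower"]) & (temps <= frame["mean_ci_upper"])).mean()
in_pi = ((temps >= frame["obs_ci_lower"]) & (temps <= frame["obs_ci_upper"])).mean()
print(f"inside 95% confidence band: {in_ci:.0%}")   # usually far fewer than 95%
print(f"inside 95% prediction band: {in_pi:.0%}")   # typically all of them
```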

Today’s lesson

The data is the data. When desiring to discuss the data, discuss the data, do not talk about the model. The model is always suspect until it can be checked. That always takes more time than people are willing to give.

Categories: Philosophy, Statistics

26 replies

  1. As an engineer looking at that data I would describe the linear regression as mostly useless and therefore not include it on a chart for general presentation. Other engineers might be interested in the green line and I might leave it displayed for them but I can assure you that most of them would share my opinion. I haven’t often been guilty of confusing the map with the territory. And I agree with you that many people could easily be misled by a presentation like this.

  2. So why plot it?

    Identification of an appropriate model for predictions is usually the starting point. The plot shown here can give me hints as to why R-square is small, whether there are outliers, and why it may not be good to use the resulting estimated linear model for predictions. Based on the plot, possible problems that need further investigation include (1) fitting a linear model to this data set may not be appropriate and (2) the error may have an AR(1) or AR(2) structure because a negative (positive) residual appears to be more likely to be followed by another negative (positive) one.
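
    A sketch of the residual diagnostics this comment suggests, run on invented stand-in values (the real series is unpublished): the Durbin-Watson statistic and the lag-1 residual correlation hint at whether an AR(1) structure is plausible.

    ```python
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    years = np.arange(1993, 2011, dtype=float)
    temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                      26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

    resid = sm.OLS(temps, sm.add_constant(years)).fit().resid
    print(durbin_watson(resid))   # values well below 2 hint at positive AR(1)
    print(np.corrcoef(resid[:-1], resid[1:])[0, 1])   # lag-1 residual correlation
    ```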

  3. I have wondered about this many times. One can find all kinds of graphs and charts showing data, but rarely does one find just the raw data. In your example, you at least left the data points visible. That’s not often the case.

  4. I understand your point, that the data is the data.

    But, I am curious about the analysis that was done to this data. Someone has built a model where half the observed data is outside of the 95% confidence interval. Sounds like a really bad model to me!

    One more question on the model provided: is there any reason to believe that the slope of this trend line is “significantly” different from zero?

    As I eyeball the data, I see that there are 6 observations below 30 degrees, and 12 observations above 35, and nothing in the middle. What is happening in the cold years? Until I could explain that, any other patterns in the data, such as trends, are irrelevant.

    What inferences do you get from the data?

  5. One of my professors would jokingly warn us students that the only thing more dangerous than extrapolation was predicting the future. The curve fit tells you something about the existing data but it tells you nothing about future data. Foretelling the future is best left to experts like Carnac the Magnificent.

  6. The data tell me there’s much variation over the 18 year period and a suggestion that there may be some influence on a point by the value of the previous point. There’s also a hint of a cyclical period of 8 or 9 years. In other words, I see patterns — something the mind is very good at doing, even if they aren’t really there.

    There, three observations about the data and nary a word about models.

  7. Hmmm. If you simply plot the data, about all you can say is the data is the data and it is quite variable. Now that won’t make the news any more than Chicken Little saying: “I was hit on the head by an acorn while I passed the oak tree.” Saying “The sky is falling!” is much more newsworthy and likely to hit the 11PM newscast.

    Unfortunately, Chicken Little’s story is nowhere near what actually happened, any more than the least squares linear fit is the data. Yet saying one end of the line is higher than the other is news, and projecting the line into the future spells a story of disaster which is even more newsworthy.

    Now, what could one conclude about the news from the above?

    1. It must be true because it was on the 11PM News.
    2. It is a cracked crock that can’t hold water.

  8. Yes, the data is the data in this case, but this is not always the case. Very often the “data” points on a figure are themselves statistical functions. If these numbers represent average temperatures, then that is just what they are – statistical functions, not data – and some level of uncertainty exists in the data points themselves, which should be reflected in the figure.

    Of course, none of this makes the regression line any more valid, and how the heck can they plot a 95% confidence limit where half the observations fall outside? What kind of function were they using to get that?

  9. BRIGGS,

    You can re-do this essay from a type of manufacturing process perspective (a converse of the emphasis presented here): Consider each point the parameter of significance for a given manufactured part. Over time, the process invariably “drifts” prompting tuning adjustments in the manufacturing process to keep the product within satisfactory tolerances.

    Those tolerances are what is portrayed on the chart and as the data points start getting too close to some pre-determined margin, someone makes a decision to stop manufacturing for a while to tune the equipment.

    While that analogy doesn’t exactly align with the premise that the measurements reflect some process being studied, it does show that to craft a meaningful model one really needs to understand why given measurements occur as they do & trend in whatever direction. If one is talking about climate & temperature & weather & so forth…it’s clear nobody has a sufficiently refined understanding to model much of anything.

    At any rate, understanding a pragmatic model from the reference perspective emphasizing allowable tolerance bands rather than the data points helps novices understand the concepts.

  10. Doug M,

    Yes, I know. I was being “comedic.”

    All,

    It was just pointed out to me that the “prediction bounds” are there, that at least one person can see them clearly on his browser. My eyes are not good enough to see them, however. But it doesn’t matter. There is nothing to predict: we already know what the data was.

  11. Gaah. I would conclude that a linear regression is inappropriate for the process. Although the second half splits lower (about 2:7) than the first half (about 7:2), there are not enough data points to make the bet worthwhile. A closer look shows no evidence of a trend (a single cause operating continually throughout the time period, moving the data generally in the same direction). It is more like a mixture pattern, which means either multiple shifts (causes operating at points in time to move the central tendency of the process up or down) or a mixture (the data points actually come from two distinct processes, e.g., two heads, two inspectors, two instruments, or something of the sort).

    The reason for this judgment – and at 18 data points it is only a judgment – is that data points #8-11 and #17-18 seem to come from a consistent stationary series running around 27, while data points #1-7 and #12-16 come from a different series averaging around 40. Either there is one process that is sometimes apples and sometimes oranges; or there are two processes and sometimes we pull from the apples and sometimes from the oranges.

    I have seen similar patterns in defective glass phials inspected by two different inspection crews on different days, and again in print-to-perf registration of blister packs measured with two different rulers.
    + + +

    It’s the “made without error” part that gets my Irish up. Never seen one of those before.

  12. OK. I’ll step in it.
    Do statisticians ever try Fourier analysis?
    I see at least 4 sine waves in that data.
    (But I also see an insufficient set of data from which to draw any conclusions).
    Guess what – that is because I am normally biased to look for sinusoids in my data.
    Somebody has to tell me their hypothesis is for a linear function in this case.
    Then I would laugh.
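
    A sketch of what trying Fourier analysis might look like, on invented stand-in values: a crude periodogram of the detrended series. With only 18 points, any peak is as likely noise as signal.

    ```python
    import numpy as np

    temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                      26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins
    n = len(temps)

    # Remove the straight-line trend first, then look at the spectrum.
    t = np.arange(n)
    detrended = temps - np.polyval(np.polyfit(t, temps, 1), t)

    power = np.abs(np.fft.rfft(detrended)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0)          # cycles per year
    for f, p in zip(freqs[1:], power[1:]):     # skip the zero frequency
        print(f"period {1/f:5.1f} yr   power {p:9.1f}")
    ```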

  13. As an engineer, I often do trend plots. In the case of the 18-point data plot being discussed, I could see myself telling Excel to display the equation for the line, then subtracting it out. The reason for doing this is that everything drifts, and separating out individual contributors of the drift is usually unproductive, so just remove the trend and throw it away.

    Looking at the data, I’m drawn to the year 2000. Was there a process change in 2000 that caused a shift? Did other process parameters shift in 2000? Was the process changed back again in 2004?

    Aha! When I pull up the building humidity historical record, I can see a shift from 2000 to 2003, then back again. And I can come up with a plausible explanation linking the process shift to the ambient humidity. Something was definitely different with the HVAC system from 2000 through 2003.

    At this point you are no doubt laughing at me, squeezing all this out of a piddly 18 points of data. But it’s costing us thousands of dollars a day and my boss is breathing down my neck and I haven’t a clue what else it could be. And it points to an experiment that can be performed.

    So, what, you want me to go to my boss and tell him “sorry, boss, the data is the data, and data says we’re screwed”? Or am I going to instead present him with my half-baked theory and suggest a course of action?

    This behavior is lauded as “a bias for action”. Hmmm, I wonder if this has any relevance to the AGW issue?

  14. @ Bill S.
    “Do statisticians ever try Fourier analysis?”
    I have a least squares curve fitting program that fits orthogonal polynomials (like Legendre polynomials) to the data exactly the same way a Fourier series fits sines and cosines to the data. Sines and cosines are orthogonal functions that are fitted to the data by a process of integration. Ditto for orthogonal polynomials. You can show mathematically that an orthogonal polynomial series is the best approximation to data in the sense of minimum squared error. I see at least one orthogonal polynomial in that data.
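
    A sketch of such a fit using numpy's built-in Legendre class, on invented stand-in values; Legendre.fit does the least-squares fit after mapping the years onto the polynomials' natural domain [-1, 1].

    ```python
    import numpy as np
    from numpy.polynomial import Legendre

    years = np.arange(1993, 2011, dtype=float)
    temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                      26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

    # Least-squares fit of a low-order Legendre series.
    fit = Legendre.fit(years, temps, deg=4)
    print(fit.coef)                            # coefficients (mapped domain)
    print(np.sum((temps - fit(years)) ** 2))   # residual sum of squares
    ```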

  15. Ray,
    We use associated Legendre polynomials to characterize nonlinearities in hardware that we designed to exhibit a linear ramp. But when my boss tells me to look at some data without making assumptions, I look for DC offset, check out the FFT, do a histogram, and anything else I can think of before curve fitting with a polynomial.
    All,
    I admit to an unwarranted bias for polynomials. You don’t need any bloody theory to fit data to a polynomial. Dissenters kindly respond!

  16. A conscientious Frequentist would look at the summary statistics and perform a t test on the slope coefficient, comparing it to zero, and get a p-value of 0.07. So even a hack statistician, if he’s honest, should conclude there’s nothing to see here, so let’s move along.

    Data points outside the confidence interval? Happens all the time; notice they DO fall inside the prediction interval, the one generated from a model that is NOT “statistically significant”.

    The data does look autocorrelated, but 18 points is thin gruel for estimation; the same t test used for the slope gives a p-value of 0.11 for the first-order autocorrelation.

    Polynomials are great fun for curvilinear data, but without some (bloody) phenomenological theory, they’re just a statistician’s version of Connect the Dots. Hell, it’s a TIME SERIES; you can fit it perfectly with the right polynomial, and get an R-squared of 1. Or, you can get results that are just as pretty with a nonparametric loess fit, be honest about not making assumptions, and have a graphic to hit the engineer/climatologist/anesthetist up the side of the head with to jolt a theory out. Then unleash the (low-order) polynomials.

    I gotta agree with Briggsy, most of the chart is one Big Fat Lie. Better it should have just shown the data and left well enough alone. I’m stealing the data (it’s already been suitably anonymized) and giving it to my undergrads next semester as a critical thinking exercise.

    Mike Anderson
    University of Texas at San Antonio
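
    A sketch of the checks described above, on invented stand-in values: the t test on the slope, and a lowess smooth as an assumption-light look at the data.

    ```python
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.nonparametric.smoothers_lowess import lowess

    years = np.arange(1993, 2011, dtype=float)
    temps = np.array([41.0, 38, 44, 47, 39, 42, 45, 25, 25, 27,
                      26, 40, 43, 46, 57, 41, 27, 26])  # invented stand-ins

    fit = sm.OLS(temps, sm.add_constant(years)).fit()
    print(fit.pvalues[1])     # the slope's p-value against a zero slope

    # An assumption-light smooth; returns sorted (x, smoothed y) pairs.
    smooth = lowess(temps, years, frac=0.5)
    print(smooth[:3])
    ```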

  17. I agree with your point.

    However, the graph does in fact display the 95% Prediction Limits that you say are not there. Look for the subtle light-green dashed lines. All of the data points fall inside of them.

  18. If anyone wishes to ask questions about my paper “Planetary Surface Temperatures”, or if you believe you have an alternative explanation for the Venus surface temperature, please post your question or response below this post as I wish to keep all discussion on the one thread. There is also discussion there regarding today’s article on PSI which I did not write myself, by the way.

  19. Dr. Briggs,

    This line stood out to me:

    > Something physical causes the data to change: the model does not.

    Why wouldn’t a geophysical model, like a probabilistic dynamical climate model which spins up from initial conditions, be more useful than a regression model for predicting the next data point? With climate dynamics, at least there are physical signals that can be observed and taken into consideration when prognosing the next observed data point in the series, such as ENSO, sunspots, the Pacific Decadal Oscillation (PDO), soil moisture, etc. Nate Silver’s new book, The Signal and the Noise, has quite a bit of content about meteorology and climatology, including the physical modeling. It’s amazing – a miracle of modern humanity – that brainiacs have designed and authored complex numerical weather and climate models which are sometimes useful (useful enough that humanity uses them day-in, day-out)! I don’t think most statisticians (other than yourself – you seem to be rare) totally grok how these dynamical models work (it’s almost as if statisticians, when they meet a geophysical weather model, act as if they’re seeing a new alien species visiting the planet for the first time)!

    Dr. Briggs, as a followup to my previous comment, I should have clarified that I have not read Nate Silver’s book in full (someone gave it to me as a Christmas present), so I’ve only paged through it. An excerpt from his book was also published a few or so months back (I think in the NY Times) with a cutesy title like “the weatherman is not an idiot”. Anyway, I later found your post from a few months ago about Nate Silver, and I noticed you “bet” on Romney (a bet you hadn’t changed since last December, hence a year ago). I didn’t read your year-ago post (hence I don’t know what data and/or model you used to place your bet), but the net result is that Nate Silver comes out smelling like a rose. Paging through his book, even as the layman that I am (but one with an interest in geophysics), I think he’s done a nice job of demonstrating just how amazing many full-fledged meteorologists and climatologists are, the complexity they have to deal with on a day-in, day-out basis, and how they have been able to effectively save lives. There seems to be a mindset among some people in business these past several years who think they know something about weather and climate, but their overconfidence about a subject they don’t inherently grok, their hubris, ends up getting the best of them.
