## Lovely Example of Statistics Gone Bad

The graph above (biggified version here) was touted by Simon Kuestenmacher (who posts many beautiful maps). He said “This plot shows the objects that were found to be ‘the most distant object’ by astronomers over time. With ever improving tools we can peak further and further into space.”

The original is from Reddit’s r/dataisbeautiful, a forum where I am happy to say many of the commenters noticed the graph’s many flaws.

Don’t click over and don’t read below. Take a minute to first stare at the pic and see if you can see its problems.

Don’t cheat…

Try it first…

*Problem #1:* The Deadly Sin of Reification! The mortal sin of statistics. The blue line *did not happen*. The gray envelope *did not happen*. What happened were those teeny tiny too-small black dots, dots which fade into obscurity next to the majesty of the statistical model. Reality is lost, reality is replaced. The model becomes realer than reality.

You cannot help but be drawn to the continuous sweet blue line, with its guardian gray helpers, and think to yourself “What smooth progress!” The black dots become a distraction, an impediment, even. They soon disappear.

Problem #1 leads to Rule #1: If you want to show what happened, *show what happened*. The model did not happen. Reality happened. Show reality. Don’t show the model.

It’s not that models should never be examined. Of course they should. We want good model fits over past data. But good model fits over past data are trivial to obtain (they are even more readily available than student excuses for missing homework), so showing your audience the model fit when you want to show them what happened misses the point.
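To see how cheap a good fit over past data is, here is a minimal sketch. The six data points are invented for illustration and have nothing to do with the graph. A polynomial with as many coefficients as there are points passes through every one of them, so the fit over the past is perfect, and yet its guess for the very next step lands far outside anything seen.

```python
# Hypothetical (time, value) pairs, invented for illustration only.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
values = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]


def lagrange(x, xs, ys):
    """Evaluate the unique polynomial of degree len(xs) - 1 through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Largest miss over the past data: the "fit" is perfect.
max_fit_error = max(abs(lagrange(t, times, values) - v) for t, v in zip(times, values))

# The same model's guess for the next, unseen time step.
one_step_ahead = lagrange(6.0, times, values)

print("worst error on past data:", max_fit_error)
print("largest value ever observed:", max(values))
print("model's guess one step ahead:", one_step_ahead)
```

A flawless fit to the past tells you nothing by itself about whether the model is any good at what it has not yet seen.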

Of course, it’s well to *separately* show model fit when you want to honestly admit to model flaws. That leads to—

*Problem #2:* Probability Leakage! What’s the y-axis? “Distance of furthest object (parsecs).” Now I ask you: can the distance of the furthest object in parsecs be *less than 0*? No, sir, it cannot. But do both the blue line and gray guardian drop well below 0? Yes, sir, they do. And does that imply the impossible happened? Yes: yes, it does.

The model has given real and substantial probability to events which could not have happened. The model is a bust, a tremendous failure. The model stinks and should be tossed.

Probability leakage is when a model gives positive probability to events we know are impossible. It is more common than you think. Much more common. Why? Because people choose the parametric over the predictive, when they should choose predictive over parametric. They show the plus-or-minus uncertainty in some who-cares model parameters and do not show, or even calculate, the uncertainty in the actual observable.
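Leakage is easy to measure once you look for it. The sketch below assumes, purely for illustration, that a model's prediction for some quantity that cannot be negative is a normal distribution. The mean and spread are made-up numbers, not anything read off the graph. The leakage is simply the probability the model puts below zero.

```python
from statistics import NormalDist

# Hypothetical predictive distribution for a quantity that cannot be negative.
predicted_mean = 2.0  # assumed central guess, arbitrary units
predicted_sd = 1.5    # assumed spread, arbitrary units

model = NormalDist(mu=predicted_mean, sigma=predicted_sd)

# Probability the model assigns to impossible (negative) values.
leakage = model.cdf(0.0)

print(f"Probability assigned to impossible values: {leakage:.3f}")
```

With these assumed numbers roughly 9% of the model's probability sits on values that cannot occur. Any model with unbounded support applied to a bounded observable leaks at least a little; the question is always how much.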

I suspect that’s the case here, too. The gray guardians are, I think, the uncertainty in the parameters of the model, perhaps some sort of smoother or spline fit. They do not show the uncertainty in the *actual distance*. I suspect this because the gray guardian shrinks to near nothing at the end of the graph. But, of course, there must still be some healthy uncertainty in the most distant objects astronomers will find.
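The difference between the two kinds of uncertainty can be seen in the simplest textbook case. The sketch below uses ordinary straight-line regression with normal errors, which is an assumption made for illustration and almost certainly not the smoother behind the graph, and the data are invented. It computes the standard error of the fitted line (the parametric band) and the standard error of a new observation (the predictive band) at the same point.

```python
import math

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 8.2, 8.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Ordinary least squares estimates for a straight line.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard deviation, with n - 2 degrees of freedom.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))


def standard_errors(x0):
    """Return (standard error of the fitted line, standard error of a new observation) at x0."""
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    se_line = s * math.sqrt(leverage)       # parametric: where is the line?
    se_new = s * math.sqrt(1.0 + leverage)  # predictive: where will the next dot be?
    return se_line, se_new


for x0 in (4.5, 8.0, 12.0):
    se_line, se_new = standard_errors(x0)
    print(f"x0={x0:5.1f}  line: {se_line:.3f}  new observation: {se_new:.3f}")
```

The band for the line can be made as thin as you like by piling on data. The band for a new observation can never be thinner than the scatter of the dots themselves, and that is the one the audience cares about.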

Parametric uncertainty, and indeed even parameter estimation, are largely of no value to man or beast. Problem #2 leads to Rule #2: You made a model to talk about uncertainty in some observable, so talk about uncertainty in the observable and not about some unobservable non-existent parameters inside your *ad hoc* model. That leads to—

*Problem #3:* We don’t know what will happen! The whole purpose of the model should have been to quantify uncertainty in the future. By (say) the year 2020, what is the most likely distance for the furthest object? And what uncertainty is there in that guess? We have no idea from this graph.

We should, too. Because *every* statistical model has an implicit predictive sense. It’s just that most people are so used to handling models in their past-fit parametric sense that they always forget the reason they created the model in the first place. And that was because they were interested in the now-forgotten observable.

Problem #3 leads to Rule #3: Always show predictions for observables never seen before (in any way). If that were done here, the gray guardians would take on an entirely different role. They would be “more vertical”: up-and-down bands centered on dots in future years. There is no uncertainty in the year, only in the value of the most distant object. And we’d imagine that that uncertainty would grow as the year does. We also know that the low point of this uncertainty can *never* fall below the already known most distant object.
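Here is a toy sketch of what such vertical bands might look like. Everything in it is an assumption made for illustration: the current record, the lognormal model for each year's farthest new find, and the horizons. None of it comes from the graph or from astronomy. The one structural fact it does respect is that a record is a running maximum, so it can only stay put or go up.

```python
import random

random.seed(0)

current_record = 100.0   # hypothetical current record distance, arbitrary units
horizons = [1, 5, 10, 20]  # years ahead at which to report a band
n_paths = 2000


def simulate_record_path(years):
    """Simulate the record year by year; it changes only when a new find beats it."""
    record = current_record
    path = []
    for _ in range(years):
        # Assumed toy model for the farthest new find in a given year.
        find = current_record * random.lognormvariate(-0.5, 0.5)
        record = max(record, find)
        path.append(record)
    return path


paths = [simulate_record_path(max(horizons)) for _ in range(n_paths)]


def quantile(sorted_values, q):
    """Crude empirical quantile, adequate for a sketch."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


bands = {}
for h in horizons:
    records = sorted(path[h - 1] for path in paths)
    bands[h] = (quantile(records, 0.05), quantile(records, 0.95))
    print(f"{h:2d} years ahead: 90% band {bands[h][0]:.1f} to {bands[h][1]:.1f}")
```

Each band is a statement about the observable itself at a fixed future year, its floor is the record we already have, and its ceiling climbs as we look further out. A real analysis would need a defensible model of discovery in place of the invented one used here.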

Conclusion: the graph is a dismal failure. But its failures are very, very, very common. See *Uncertainty: The Soul of Probability, Modeling & Statistics* for more of this type of analysis, including instruction on how to do it right.

*Homework* Find examples of time series graphs that commit at least one of these errors. Post a link to each below so that others can see.