
Category: Class – Applied Statistics

June 5, 2018 | 7 Comments

Lovely Example of Statistics Gone Bad

The graph above (biggified version here) was touted by Simon Kuestenmacher (who posts many beautiful maps). He said “This plot shows the objects that were found to be ‘the most distant object’ by astronomers over time. With ever improving tools we can peak further and further into space.”

The original is from Reddit’s r/dataisbeautiful, a forum where I am happy to say many of the commenters noticed the graph’s many flaws.

Don’t click over and don’t read below. Take a minute to first stare at the pic and see if you can see its problems.

Don’t cheat…

Try it first…

Problem #1: The Deadly Sin of Reification! The mortal sin of statistics. The blue line did not happen. The gray envelope did not happen. What happened were those teeny tiny black dots, dots which fade into obscurity next to the majesty of the statistical model. Reality is lost, reality is replaced. The model becomes realer than reality.

You cannot help but be drawn to the continuous sweet blue line, with its guardian gray helpers, and think to yourself “What smooth progress!” The black dots become a distraction, an impediment, even. They soon disappear.

Problem #1 leads to Rule #1: If you want to show what happened, show what happened. The model did not happen. Reality happened. Show reality. Don’t show the model.
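To make Rule #1 concrete, here is a minimal ggplot2 sketch. The year/distance values are placeholders invented for illustration, not the actual record-holding objects; geom_point() shows reality, geom_smooth() shows the sort of model that replaced it.

    ## Hypothetical data standing in for the record-holders; not the real values
    library(ggplot2)
    d <- data.frame(
      year    = c(1960, 1970, 1980, 1990, 2000, 2010, 2016),
      parsecs = c(1.0e9, 1.8e9, 2.5e9, 3.4e9, 3.9e9, 4.0e9, 4.1e9)
    )

    # Rule #1: show what happened -- the observations themselves
    ggplot(d, aes(year, parsecs)) + geom_point(size = 3)

    # What the criticized graph does instead: a fitted line with its
    # parametric confidence band drawn over (and dwarfing) the data
    ggplot(d, aes(year, parsecs)) + geom_point() + geom_smooth(method = "lm")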

It’s not that models should never be examined. Of course they should. We want good model fits over past data. But since good model fits over past data are trivial to obtain—they are even more readily available than student excuses for missing homework—showing your audience the model fit when you want to show them what happened misses the point.

Of course, it’s well to separately show model fit when you want to honestly admit to model flaws. That leads to—

Problem #2: Probability Leakage! What’s the y-axis? “Distance of furthest object (parsecs).” Now I ask you: can the distance of the furthest object in parsecs be less than 0? No, sir, it cannot. But do both the blue line and the gray guardian drop well below 0? Yes, sir, they do. And does that imply the impossible happened? Yes: yes, it does.

The model has given real and substantial probability to events which could not have happened. The model is a bust, a tremendous failure. The model stinks and should be tossed.

Probability leakage is when a model gives positive probability to events we know are impossible. It is more common than you think. Much more common. Why? Because people choose the parametric over the predictive, when they should choose predictive over parametric. They show the plus-or-minus uncertainty in some who-cares model parameters and do not show, or even calculate, the uncertainty in the actual observable.

I suspect that’s the case here, too. The gray guardians are, I think, the uncertainty in the parameter of the model, perhaps some sort of smoother or spline fit. They do not show the uncertainty in the actual distance. I suspect this because the gray guardian shrinks to near nothing at the end of the graph. But, of course, there must still be some healthy uncertainty in the most distant objects astronomers will find.

Parametric uncertainty, and indeed even parameter estimation, are largely of no value to man or beast. Problem #2 leads to Rule #2: You made a model to talk about uncertainty in some observable, so talk about uncertainty in the observable and not about some unobservable non-existent parameters inside your ad hoc model. That leads to—

Problem #3: We don’t know what will happen! The whole purpose of the model should have been to quantify uncertainty in the future. By (say) the year 2020, what is the most likely distance for the furthest object? And what uncertainty is there in that guess? We have no idea from this graph.

We should, too. Because every statistical model has an implicit predictive sense. It’s just that most people are so used to handling models in their past-fit parametric sense that they always forget the reason they created the model in the first place. And that was because they were interested in the now-forgotten observable.

Problem #3 leads to Rule #3: always show predictions for observables never seen before (in any way). If that was done here, the gray guardians would take on an entirely different role. They would be “more vertical”—up-and-down bands centered on dots in future years. There is no uncertainty in the year, only in the value of most distant object. And we’d imagine that that uncertainty would grow as the year does. We also know that the low point of this uncertainty can never fall below the already known most distant object.
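Here is a hedged sketch of what Rule #3 asks for, reusing the same kind of invented placeholder data as above: predictive intervals at future years, with the lower bound truncated at the largest distance already observed so no probability leaks below it. The straight-line lm model is only a stand-in, not the smoother behind the original graph.

    ## Hypothetical data again; not the real record-holders
    d <- data.frame(
      year    = c(1960, 1970, 1980, 1990, 2000, 2010, 2016),
      parsecs = c(1.0e9, 1.8e9, 2.5e9, 3.4e9, 3.9e9, 4.0e9, 4.1e9)
    )

    fit <- lm(parsecs ~ year, data = d)   # stand-in model, not the original's smoother

    future <- data.frame(year = c(2020, 2025, 2030))
    bounds <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
    bounds[, "lwr"] <- pmax(bounds[, "lwr"], max(d$parsecs))  # the record can't shrink
    cbind(future, round(bounds / 1e9, 2))  # predictive bounds in billions of parsecs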

Conclusion: the graph is a dismal failure. But its failures are very, very, very common. See Uncertainty: The Soul of Modeling, Probability & Statistics for more of this type of analysis, including instruction on how to do it right.

Homework: Find examples of time series graphs that commit at least one of these errors. Post a link to it below so that others can see.

May 21, 2018 | 11 Comments

Choose Predictive Over Parametric Every Time

Gaze and wonder at the picture which heads this article, which I lifted from John Haman’s nifty R package ciTools.

The numbers in the plot are made up out of whole cloth to demonstrate the difference between parameter-centered versus predictive-centered analysis. The code for doing everything is listed under “Poisson Example”.

The black dots are the made-up data, and the central dark line is the point estimate from a Poisson regression of the fictional x and y. The darker “ribbon” (from ggplot2) is the frequentist confidence interval around that point estimate. Before warning against confidence intervals, which every frequentist alive interprets in a Bayesian sense every time, because frequentism fails as a philosophy of probability (see this), look at the wider, lighter ribbon, which is the 95% frequentist prediction interval, which again every frequentist interprets in the Bayesian sense every time.

The Bayesian interpretation is that, for the confidence (called “credible” in Bayesian theory) interval, there is a 95% chance the parameter falls inside the ribbon—given the data, the model, and, in this case, the tacit “flat” priors around the parameters. It’s a reasonable interpretation, and written in plain English.

The frequentist interpretation is that, for any confidence interval anywhere and anytime all that you can say is that the Platonic “true” value is in the interval or it is not. You may not assign any probability or real-life confidence that the true value is in the interval. It’s all or nothing—always. Same interpretation for the prediction interval.

It’s because of that utter uselessness of the frequentist interpretation that everybody switches to Bayesian mode when confronted by any confidence (credible) or prediction interval. And so we shall too.

The next and most important thing to note is that, as you might expect, the prediction bounds are very much greater than the parametric bounds. The parametric bounds represent uncertainty of a parameter inside the model. The prediction bounds represent uncertainty in the observables; i.e. what will happen in real life.

Now almost every report of results which uses statistics uses parametric bounds to convey uncertainty in those results. But people who read statistical results think in terms of observables (which they should). They therefore wrongly assume that the narrow uncertainty in the report applies to real life. It does not.

You can see from Haman’s toy example that, even when everything is exactly specified and known, the predictive uncertainty is three to four times the parametric uncertainty. The more realistic Quasi-Poisson example of Haman’s (which immediately follows) even better represents actual uncertainty. (The best example is a model which uses predictive probabilities and which is verified against actual observables never ever seen before.)

The predictive approach, as I often say, answers the questions people have. If my x is this, what is the probability y is that? That is what people want to know. They do not care about how a parameter inside an ad hoc model behaves. Any decisions made using the parametric uncertainty will therefore be too certain. (Unless in the rare case one is investigating parameters.)

So why doesn’t everybody use predictive uncertainty instead of parametric? If it’s so much better in every way, why stick with a method that necessarily gives too-certain results?

Habit, I think.

Do a search for (something like) “R generalized linear models prediction interval” (this assumes a frequentist stance). You won’t find much, except the admission that such things are not readily available. One blogger even wonders “what a prediction ‘interval’ for a GLM might mean.”

What they mean (in the Bayesian sense) is that, given the model and observations (and the likely tacit assumption of flat priors), if x is this, the probability y is that is p. Simplicity itself.

Even in the Bayesian world, with JAGS and so forth, there is not an automatic response to thinking about predictions. The vast, vast majority of software is written under the assumption one is keen on parameters and not on real observables.

The ciTools package can be used for a limited range of generalized linear models. What’s neat about it is that the coding requirements are almost nil. Create the model, create the scenarios (the new x), then ask for the prediction bounds. Haman even supplies lots of examples of slick plots.
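Here is roughly what that workflow looks like, as a sketch and not Haman’s exact “Poisson Example”; the data are invented, and add_ci() and add_pi() are the package’s functions for attaching confidence and prediction bounds to a data frame.

    ## A sketch with invented data, not the package's own example
    library(ciTools)
    library(ggplot2)

    set.seed(20180521)
    dat <- data.frame(x = runif(100, 0, 10))
    dat$y <- rpois(100, lambda = exp(0.3 * dat$x))   # made-up Poisson data

    fit <- glm(y ~ x, family = poisson, data = dat)  # create the model

    dat <- add_ci(dat, fit, names = c("lcb", "ucb")) # parametric (confidence) bounds
    dat <- add_pi(dat, fit, names = c("lpb", "upb")) # predictive bounds

    ggplot(dat, aes(x, y)) +
      geom_point() +
      geom_line(aes(y = pred)) +
      geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2) +  # wide: the observable
      geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4)    # narrow: the parameter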

Homework: The obvious. Try it out. And then try it on data where you only did ordinary parametric analysis and contrast it with the predictive analysis. I promise you will be amazed.

April 24, 2018 | 12 Comments

Correlation of Non-Procreative Sex & Lack of Traditional Religion

Gallup has published two new polls. The first estimates the percent of those desiring non-procreative sex in each state. The second guesses the percent of non-affiliation with traditional religion (Christianity). We can learn some simple statistics by examining both together.

The first poll is “Vermont Leads States in LGBT Identification”, which is slightly misleading. Vermont comes in at 5.3% sexually non-procreative, but Washington DC is a whopping (and unsurprising) 8.6%. South Dakota is the most procreative (relatively speaking) state at only 2%.

This assumes, as all these polls do, that everybody tells the truth. That’s a suspect assumption here—in both directions. People in more traditional places might be reluctant to admit desiring non-procreative sex, while those in hipper locales might be too anxious. So, there is a healthy plus-or-minus attached to official numbers. Gallup puts this at +/- 0.2 to 1.6 percent, depending on the sample from each state. But that’s only the mathematical uncertainty, strictly dependent on model assumptions. It does not include lying, which must bump up the numbers. By how much nobody knows.

Poll number two is “The Religious Regions of the U.S.”, which is “based on how important people say religion is to them and how often they attend religious services.” Make that traditional religious services. The official religion of the State is practiced by many, though they usually don’t admit to that religion being a religion, and those who say they don’t attend services may still dabble in yoga, equality, and so forth. This makes the best interpretation of “not religious” as used in the poll “not traditionally religious”, which is to say, not Christian (for most of the country). The official +/- are 3-6%, depending on the state.

Here is what statisticians call a correlation:

A glance suggests that as traditional irreligion (henceforth just irreligion) increases, so too does non-procreative sex. But there is no notion of direction of cause. It’s plausible, and even confirmed in some cases, that lack of religion drives people to identify as sexually non-procreative. But it’s also possible, and also confirmed by observation, that an increase in the number of the sexually non-procreative causes others to abandon traditional religion.

Now “cause” here is used in a loose sense, as one cause of many, but a notable one. It takes more than just non-procreative sex for a person to abandon Christianity, and it takes more than abandoning Christianity to become sexually non-procreative. And, indeed, the lack of cause is also possible. Some sexually non-procreative remain religious, and most atheists are not sexually non-procreative (but see this).

All this means is that imputing cause from this plot cannot be done directly. It has to be done indirectly, with great caution, and by using evidence beyond the data of the plot. Here, the causes, if confirmed, are weak in the sense that they are only one of many. Obviously some thing or things cause a person to abandon traditional religion (assuming they held it!), and some thing or things cause a person to become sexually non-procreative. Lack of religion and the presence of non-procreative sex are each only one of these causes, and in some cases not causes at all.

The best that we can therefore do is correlation. We can use the data to predict uncertainty. But in what? All 50 states plus DC have already been sampled. We don’t need to predict a state. We do not need any statistical model or technique—including hypothesis testing or wee p-values—if our interest is in states. Any hypothesis test would be badly, badly misplaced. We already know we cannot identify cause, so what would a hypothesis test tell us? Nothing.

Now states are not homogeneous. New York, for instance, is one tiny but well-populated progressive enclave appended on a massive but scarcely populated traditionalist mass (with some exceptions in the interior). If we assume the data will be relevant and valid for intra-state regions, then we can use it to predict uncertainty.

For instance, counties. If we knew a county’s percent of irreligion, we could predict the uncertainty in the percent of sexually non-procreative. Like this:

That envelope says, given all the assumptions, the old data, and assuming a regression is a reasonable approximation (with “flat priors”), there is an 80% chance a county’s percent sexually non-procreative would lie between the two lines, given a fixed percent irreligion. This also assumes the data are perfectly measured, which we know they are not. But since we do not know how this would add formally to the uncertainty, we have to do this informally, mentally widening the distance between the two lines by at least a couple of percent. Or by reducing that 80%.

Example: if percent irreligion is 20%, there is less than an 80% chance the percent sexually non-procreative is 2.1-4.2%. And if percent irreligion is 40%, there is less than an 80% chance the percent sexually non-procreative is 3.1-5.2%.
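For the curious, here is a sketch of how such an 80% predictive interval can be computed in R. The state-level numbers below are placeholders, not the Gallup figures, and a plain regression (with the tacit flat priors mentioned above) is assumed.

    ## Placeholder state data; NOT the Gallup numbers
    states <- data.frame(
      irreligion = c(15, 22, 28, 33, 38, 44, 51),        # percent not traditionally religious
      lgbt       = c(2.2, 2.8, 3.1, 3.6, 4.0, 4.3, 4.9)  # percent sexually non-procreative
    )

    fit <- lm(lgbt ~ irreligion, data = states)

    # Predictive (not merely parametric) 80% intervals at 20% and 40% irreligion
    predict(fit,
            newdata  = data.frame(irreligion = c(20, 40)),
            interval = "prediction",
            level    = 0.80)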

These probabilities are exact given we accept the premises. We can already see, however, the model is weak; it does not explain places like DC. How would it work in San Francisco? Or Grand Rapids, Michigan?

April 10, 2018 | 16 Comments

A Beats B Beats C Beats A

Thanks to Bruce Foutch who found the video above. Transitivity is familiar from ordinary numbers. If B > A and C > B and D > C, then D > A. But only if the numbers A, B, C and D behave themselves. They don’t always, as the video shows.

What’s nice about this demonstration is that the ordering is by probability and not by expected value. Hence the “10 gazillion” joke. “Expected” is not exactly a misnomer, but it does have two meanings. The plain English definition tells you an expected value is a value you’re probably going to see sometime or another. The probability definition doesn’t match that, or matches only sometimes.

Expected value is purely a mathematical formalism. You multiply the—conditional: all probability is conditional—probability of a possible outcome by the value of that possible outcome, and then sum them up. For an ordinary die, this is 1/6 x 1 + 1/6 x 2 + etc. which equals 3.5, a number nobody will ever see on a die, hence you cannot plain-English “expect” it.

It’s good homework to calculate the expected values for the dice in the video. It’s better homework to calculate the probabilities B > A, C > B, D > C, and D > A.

It’s not that expected values don’t have uses, but that they are sometimes put to the wrong use. The intransitive dice example illustrates this. If you’re in a game rolling against another player and what counts is winning, then you’ll want the probability ordering. If you’re in a game and what counts is some score based on the face of the dice, then you might want to use the expected value ordering, especially if you’re going to have a chance of winning 10 gazillion dollars. If you use the expected value ordering and what counts is winning, you will in general lose if you pick one die and your opponent is allowed to pick any of the remaining three.
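To see the two orderings side by side in code, here is a sketch using Efron’s dice, a standard intransitive set (not necessarily the dice in the video); each die beats the next in the cycle with probability 2/3.

    ## Efron's dice: a standard intransitive set, used here for illustration
    A <- c(4, 4, 4, 4, 0, 0)
    B <- c(3, 3, 3, 3, 3, 3)
    C <- c(6, 6, 2, 2, 2, 2)
    D <- c(5, 5, 5, 1, 1, 1)

    # Probability the first die shows a higher face (all 36 pairs equally likely)
    p_beats <- function(x, y) mean(outer(x, y, ">"))

    sapply(list(A = A, B = B, C = C, D = D), mean)  # expected value ordering
    c(A_beats_B = p_beats(A, B),                    # probability ordering: a cycle,
      B_beats_C = p_beats(B, C),                    # each probability is 2/3
      C_beats_D = p_beats(C, D),
      D_beats_A = p_beats(D, A))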

Homework three: can you find a single change to the last die such that it’s now more likely to beat the first die?

There are some technical instances using “estimators” for parameters inside probability models which produce intransitivity and which I won’t discuss. As regular readers know, I advocate eschewing parameter estimates altogether and moving to a strictly predictive approach in probability models (see other posts in this class category for why).

Intransitivity shows up a lot when decisions must be made. Take the game rock-paper-scissors. What counts is winning. You can think of it in this sense: each “face” of this “three-sided die” has the same value. Rock beats scissors which beats paper which beats rock. There is no single best object in the trio.

Homework four: what is the probability of one R-P-S die beating another R-P-S die? Given that, why is it that some people are champions of this game?

R-P-S dice in effect are everywhere, and of course can have more than three sides. Voting provides prime cases. Even simple votes, like where to go to lunch. If you and your workmates are presented choices as comparisons, then you could end up with a suboptimal choice.

It can even lead to indecision. Suppose it’s you alone and you rated restaurants with “weights” given by the probabilities of the dice in the video (the weights aren’t necessary; it’s the ordering that counts). Which do you choose? You’d pick B over A, C over B, and D over C. But you’d also pick A over D. So you have to pick A. But then you’d have to pick B, because B is better than A. And so on.

People “break free” of these vicious circles by adding additional decision elements, which have the effect of changing the preference ordering (adding negative elements is possible, too). “Oh, just forget it. C is closest. Let’s go.” Tastiness and price, which might have been the drivers of the ordering before, are jettisoned in favor of distance, which for true distances provides a transitive ordering.

That maneuver is important. Without a change in premises, indecision results. Since a decision was made, the premises must have changed, too.

Voting is too large a topic to handle in one small post, so we’ll come back to it. It’s far from a simple subject. It can also be a depressing one, as we’ll see.