Category: Class – Applied Statistics

May 21, 2018 | 11 Comments

Choose Predictive Over Parametric Every Time

Gaze and wonder at picture which heads this article, which I lifted from John Haman’s nifty R package ciTools.

The numbers in the plot are made up whole cloth to demonstrate the difference between parameter-centered versus predictive-centered analysis. The code for doing everything is listed under “Poisson Example”.

The black dots are the made up data, the central dark line the result of the point estimate of a Poisson regression of the fictional x and y. The darker “ribbon” (from ggplot2) is the frequentist confidence interval around that point estimate. Before warning against confidence intervals, which every frequentist alive interprets in a Bayesian sense every time, because frequentism fails as a philosophy of probability (see this), look at the wider lighter ribbon, which is the 95% frequentist prediction interval, which again every frequentist interprets in the Bayesian sense every time.

The Bayesian interpretation is that, for the confidence (called “credible” in Bayesian theory) interval, there is a 95% the point estimate will fall inside the ribbon—given the data, the model, and, in this case, the tacit “flat” priors around the parameters. It’s a reasonable interpretation, and written in plain English.

The frequentist interpretation is that, for any confidence interval anywhere and anytime all that you can say is that the Platonic “true” value is in the interval or it is not. You may not assign any probability or real-life confidence that the true value is in the interval. It’s all or nothing—always. Same interpretation for the prediction interval.

It’s that utter uselessness of the frequentist interpretation that everybody switches to Bayesian mode when confronted by any confidence (credible) or prediction interval. And so we shall too.

The next and most important thing to note is that, as you might expect, the prediction bounds are very much greater than the parametric bounds. The parametric bounds represent uncertainty of a parameter inside the model. The prediction bounds represent uncertainty in the observables; i.e. what will happen in real life.

Now almost every report of results which use statistics use parametric bounds to convey uncertainty in those results. But people who read statistical results think in terms of observables (which they should). They therefore wrongly assume that the narrow uncertainty in the report applies to real life. It does not.

You can see from Haman’s toy example that, even when everything is exactly specified and known, the predictive uncertainty is three to four times the parametric uncertainty. The more realistic Quasi-Poisson example of Haman’s (which immediately follows) even better represents actual uncertainty. (The best example is a model which uses predictive probabilities and which is verified against actual observables never ever seen before.)

The predictive approach, as I often say, answers the questions people have. If my x is this, what is the probability y is that? That is what people want to know. They do not care about how a parameter inside an ad hoc model behaves. Any decisions made using the parametric uncertainty will therefore be too certain. (Unless in the rare case one is investigating parameters.)

So why doesn’t everybody use predictive uncertainty instead of parametric? If it’s so much better in every way, why stick with a method that necessarily gives too-certain results.

Habit, I think.

Do a search for (something like) “R generalized linear models prediction interval” (this assumes a frequentist stance). You won’t find much, except the admission that such things are not readily available. One blogger even wonders “what a prediction ‘interval’ for a GLM might mean.”

What they mean (in the Bayesian sense) is that, given the model and observations (and the likely tacit assumption of flat priors), if x is this, the probability y is that is p. Simplicity itself.

Even in the Bayesian world, with JAGS and so forth, there is not an automatic response to thinking about predictions. The vast, vast majority of software is written under the assumption one is keen on parameters and not on real observables.

The ciTools can be used for a limited range of generalized linear models. What’s neat about it is the coding requirements are almost none. Create the model, create the scenarios (the new x), then ask for the prediction bounds. Haman even supplies lots of examples of slick plots.

Homework: The obvious. Try it out. And then try it on data where you only did ordinary parametric analysis and contrast it with the predictive analysis. I promise you will be amazed.

April 24, 2018 | 12 Comments

Correlation of Non-Procreative Sex & Lack of Traditional Religion

Gallup has published two new polls. The first estimates the percent of those desiring non-procreative sex in each state. The second guesses the percent of non-affiliation with traditional religion (Christianity). We can learn some simple statistics by examining both together.

The first poll is “Vermont Leads States in LGBT Identification“, which is slightly misleading. Vermont comes in at 5.3% sexually non-procreative, but Washington DC is a whopping (and unsurprising) 8.6%. South Dakota is the most procreative (relatively speaking) state at only 2%.

This assumes, as all these polls do, that everybody tells the truth. That’s a suspect assumption here—in both directions. People in more traditional places might be reluctant to admit desiring non-procreative sex, while those in hipper locales might be too anxious. So, there is a healthy plus-or-minus attached to official numbers. Gallup puts this at +/- 0.2 to 1.6 percent, depending on the sample from each state. But that’s only the mathematical uncertainty, strictly dependent on model assumptions. It does not include lying, which must bump up the numbers. By how much nobody knows.

Poll number two is “The Religious Regions of the U.S.“, which is “based on how important people say religion is to them and how often they attend religious services.” Make that traditional religious services. The official religion of the State is practiced by many, though they usually don’t admit to that religion being a religion, and those who say they don’t attend services may still dabble in yoga, equality, and so forth. This makes the best interpretation of “not religious” as used in the poll as “not traditionally religious”, which is to say, not Christian (for most the country). The official +/- are 3-6%, depending on the state.

Here is what statisticians call a correlation:

A glance suggests that as traditional irreligion (henceforth just irreligion) increases, so too does non-procreative sex. But there is no notion of direction of cause. It’s plausible, and even confirmed in some cases, that lack of religion drives people to identify as sexually non-procreative. But it’s also possible, and also confirmed by observation, that an increase in numbers of sexually non-procreative causes others to abandon traditional religion.

Now “cause” here is used in a loose sense, as one cause of many, but a notable one. It takes more than just non-procreative sex for a person to abandon Christianity, and it takes more than abandoning Christianity to become sexually non-procreative. And, indeed, the lack of cause is also possible. Some sexually non-procreative remain religious, and most atheists are not sexually non-procreative (but see this).

All this means is that imputing cause from this plot cannot be done directly. It has to be done indirectly, with great caution, and by using evidence beyond the data of the plot. Here, the causes, if confirmed, are weak in the sense that they are only one of many. Obviously some thing or things cause a person to abandon traditional (assuming they held it!), and some thing or things cause a person to become sexually non-procreative. Religion and the presence of non-procreative sex are only one of these causes, and even not causes at all in some cases.

The best that we can therefore do is correlation. We can use the data to predict uncertainty. But in what? All 50 states plus DC have already been sampled. We don’t need to predict a state. We do not need any statistical model or technique—including hypothesis testing or wee p-values—if our interest is in states. Any hypothesis test would be badly, badly misplaced. We already know we cannot identify cause, so what would a hypothesis test tell us? Nothing.

Now states are not homogeneous. New York, for instance, is one tiny but well-populated progressive enclave appended on a massive but scarcely populated traditionalist mass (with some exceptions in the interior). If we assume the data will be relevant and valid for intra-state regions, then we can use it to predict uncertainty.

For instance, counties. If we knew a county’s percent of irreligion, we could predict the uncertainty in the percent of sexually non-procreative. Like this:

That envelope says, given all the assumptions, the old data, and assuming a regression is a reasonable approximation (with “flat priors”), there is an 80% a county’s percent sexual non-procreative would lie between the two lines, given a fixed percent irreligion. This also assumes the data are perfectly measured, which we know they are not. But since we do not know how this would add formally to the uncertainty, we have to do this informally, mentally widening the distance between the two lines by at least a couple of percent. Or by reducing that 80%.

Example: if percent irreligion is 20%, there is less than an 80% chance percent non-procreative sexually is 2.1-4.2%. And percent irreligion is 40%, there is less than an 80% chance percent non-procreative sexually is 3.1-5.2%.

These probabilities are exact given we accept the premises. We can already see, however, the model is weak; it does not explain places like DC. How would it work in San Francisco? Or Grand Rapids, Michigan?

April 10, 2018 | 16 Comments

A Beats B Beats C Beats A

Thanks to Bruce Foutch who found the video above. Transitivity is familiar with ordinary numbers. If B > A and C > B and D > C, then D > A. But only if the numbers A, B, C and D behave themselves. They don’t always, as the video shows.

What’s nice about this demonstration is the probability and not expected value ordering. Hence the “10 gazillion” joke. “Expected” is not exactly a misnomer, but it does have two meanings. The plain English definition tells you an expected value is a value you’re probably going to see sometime or another. The probability definition doesn’t match that, or matches only sometimes.

Expected value is purely a mathematical formalism. You multiply the—conditional: all probability is conditional—probability of a possible outcome by the value of that possible outcome, and then sum them up. For an ordinary die, this is 1/6 x 1 + 1/6 x 2 + etc. which equals 3.5, a number nobody will ever see on a die, hence you cannot plain-English “expect” it.

It’s good homework to calculate the probability expected value for the dice in the video. It’s better homework to calculate the probabilities B > A and C > B and D > C, and D > A.

It’s not that expected values don’t have uses, but that they are sometimes put to the wrong use. The intransitive dice example illustrates this. If you’re in a game rolling against another playing and what counts is winning then you’ll want the probability ordering. If you’re in a game and what counts is some score based on the face of the dice, then you might want to use the expected value ordering, especially if you’re going to have a chance of winning 10 gazillion dollars. If you use the expected value ordering and what counts is winning, you will in general lose if you pick one die and your opponent is allowed to pick any of the remaining three.

Homework three: can you find a single change to the last die such that it’s now more likely to beat the first die?

There are some technical instances using “estimators” for parameters inside probability models which produce intransitivity and which I won’t discuss. As regular readers know I advocate eschewing parameter estimates altogether and moving to a strictly predictive approach in probability models (see other other posts in this class category for why).

Intransitivity shows up a lot when decisions must be made. Take the game rock-paper-scissors. What counts is winning. You can think of it in this sense: each “face” of this “three-sided die” has the same value. Rock beats scissors which beats paper which beats rock. There is no single best object in the trio.

Homework four: what is the probability of one R-P-S die beating another R-P-S die? Given that, why is it that some people are champions of this game?

R-P-S dice in effect are everywhere, and of course can have more than three sides. Voting provides prime cases. Even simple votes, like where to go to lunch. If you and your workmates are presented choices as comparisons, then you could end up with a suboptimal choice.

It can even lead to indecision. Suppose it’s you alone and you rated restaurants with “weights” the probability of the dice in the video (the weights aren’t necessary; it’s the ordering that counts). Which do you choose? You’d pick B over A, C over B, and D over C. But you’d also pick A over D. So you have to pick A. But then you’d have to pick B, because B is better than A. And so on.

People “break free” of these vicious circles by adding additional decision elements, which have the effect of changing the preference ordering (adding negative elements is possible, too). “Oh, just forget it. C is closest. Let’s go.” Tastiness and price, which might have been the drivers of the ordering before, are jettisoned in favor of distance, which for true distances provides a transitive ordering.

That maneuver is important. Without a change in premises, indecision results. Since a decision was made, the premises must have changed, too.

Voting is too large a topic to handle in one small post, so we’ll come back to it. It’s far from a simple subject. It’s also can be a depressing one, as we’ll see.

April 5, 2018 | 23 Comments

The Gremlins Of MCMC: Or, Computer Simulations Are Not What You Think

I don’t think we’re clear on what simulation is NOT. RANDOMNESS IS NOT NECESSARY, for the simple reason randomness is merely a state of knowledge. Hence this classic post from 12 June 2017.

“Let me get this straight. You said what makes your car go?”

“You heard me. Gremlins.”

“Look, it’s obvious. The cars runs, doesn’t it? It has to run for some reason, right? Everybody says that reason is gremlins. So it’s gremlins. No, wait. I know what you’re going to say. You’re going to say I don’t know why gremlins make it go, and you’re right, I don’t. Nobody does. But it’s gremlins.”

“And if I told you instead your car runs by a purely mechanical process, the result of internal combustion causing movement through a complex but straightforward process, would that interest you at all?”

“No. Look, I don’t care. It runs and that it’s gremlins is enough explanation for me. I get where I want to go, don’t I? What’s the difference if it’s gremlins or whatever it is you said?”

MCMC

That form of reasoning is used by defenders of simulations, a.k.a. Monte Carlo or MCMC methods (the other MC is for Markov Chain), in which gremlins are replaced by “randomness” and “draws from distributions.” Like the car run by gremlins, MCMC methods get you where you want to go, so why bother looking under the hood for more complicated explanations? Besides, doesn’t everybody agree simulations work by gremlins—I mean, “randomness” and “draws”?

Here is an abbreviated example from Uncertainty which proves it’s a mechanical process and not gremlins or randomness that accounts for the succeess of MCMC methods.

First let’s use gremlin language to describe a simple MCMC example. Z, I say, is “distributed” as a standard normal, and I want to know the probability Z is less than -1. Now the normal distribution is not an analytic equation, meaning I cannot just plug in numbers and calculate an answer. There are, however, many excellent approximations to do the job near enough, meaning I can with ease calculate this probability to reasonable accuracy. The R software does so by typing pnorm(-1), and which gives -0.1586553. This gives us something to compare our simulations to.

I could also get at the answer using MCMC. To do so I randomly—recall we’re using gremlin language—simulate a large number of draws from a standard normal, and count how many of these simulations are less than -1. Divide that number by the total number of simulations, and there is my approximation to the probability. Look into the literature and you will discover all kinds of niceties to this procedure (such as computing how accurate the approximation is, etc.), but this is close enough for us here. Use the following self-explanatory R code:

 n = 10000 z = rnorm(n) sum(z < -1)/n 

I get 0.158, which is for applications not requiring accuracy beyond the third digit peachy keen. Play around with the size of n: e.g., with n = 10, I get for one simulation 0.2, which is not so hot. In gremlin language, the larger the number of draws the closer will the approximation "converge" to the right answer.

All MCMC methods are the same as this one in spirit. Some can grow to enormous complexity, of course, but the base idea, the philosophy, is all right here. The approximation is seen as legitimate not just because we can match it against an near-analytic answer, because we can't do that for any situation of real interest (if we could, we wouldn't need simulations!). It is seen as legitimate because of the way the answer was produced. Random draws imbued the structure of the MCMC "process" with a kind of mystical life. If the draws weren't random---and never mind defining what random really means---the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.

Of course, nobody speaks in this way. Few speak of the process at all, except to say it was gremlins; or rather, "randomness" and "draws". It's stranger still because the "randomness" is all computer-generated, and it is known computer-generated numbers aren't "truly" random. But, somehow, the whole thing still works, like the randomness candle has been swapped for a (safer!) electric version, and whatever entities were watching over the ceremony were satisfied the form has been met.

Mechanics

Now let's do the whole thing over in mechanical language and see what the differences are. By assumption, we want to quantify our uncertainty in Z using a standard normal distribution. We seek Pr(Z < -1 | assumption). We do not say Z "is normally distributed", which is gremlin talk. We say our uncertainty in Z is represented using this equation by assumption.

One popular way of "generating normals" (in gremlin language) is to use what's called a Box-Muller transformation. Any algorithm which needs "normals" can use this procedure. It starts by "generating" two "random independent uniform" numbers U_1 and U_2 and then calculating this creature:

Z = \sqrt{-2 \ln U_1} \cos(2 \pi U_2),

where Z is now said to be "standard normally distributed." We don't need to worry about the math, except to notice that it is written as a causal, or rather determinative, proposition: If U_1 is this and U_2 is that, Z is this with certainty." No uncertainty enters here; U_1 and U_2 determine Z. There is no life to this equation; it is (in effect) just an equation which translates a two-dimensional straight line on the interval 0 to 1 (in 2-D) to a line with a certain shape which runs from negative infinity to positive infinity.

To get the transformation, we simply write down all the numbers in the paired sequence (0.01, 0.01), (0.01, 0.02), ..., (0.99, 0.99). The decision to use two-digit accuracy was mine, just as I had to decide n above. This results in a sequence of pairs of numbers (U_1, U_2) of length 9801. For each pair, we apply the determinative mapping of (U_1, U_2) to produce Z as above, which gives (3.028866, 3.010924, ..., 1.414971e-01). Here is the R code (not written for efficiency, but transparency):

 ep = 0.01 # the (st)ep u1 = seq(ep, 1-ep, by = ep) # gives 0.01, 0.02, ..., 0.99 u2 = u1

 

z = NA # start with an empty vector k = 0 # just a counter for (i in u1){ for (j in u2){ k = k + 1 z[k] = sqrt(-2*log(i))*cos(2*pi*j) # the transformation } } z[1:10] # shows the first 10 numbers of z 

The first 10 numbers of Z map to the pairs (0.01, 0.01), (0.02, 0.01), (0.03, 0.01), ..., (0.10, 0.01). There is nothing at all special about the order in which the (U_1, U_2) pairs are input. In the end, as long as the "grid" of numbers implied by the loop are fed into the formula, we'll have our Z. We do not say U_1 and U_2 are "independent". That's gremlin talk. We speak of Z is purely causal terms. If you like, try this:

plot(z)

We have not "drawn" from any distribution here, neither uniform or normal. All that has happened is some perfectly simple math. And there is nothing "random". Everything is determined, as shown. The mechanical approximation is got the same way:

sum(z < -1)/length(z) # the denominator counts the size of z

which gives 0.1608677, which is a tad high. Try lowering ep, which is to say, try increasing the step resolution and see what that does. It is important to recognize the mechanical method will always give the same answer (with same inputs) regardless of how many times we compute it. Whereas the MCMC method above gives different numbers. Why?

Gremlins slain

Here is the gremlin R code, which first "draws" from "uniforms", and then applies the transformation. The ".s" are to indicate simulation.

n = 10000 u1.s = runif(n) u2.s = runif(n) z.s = sqrt(-2*log(u1.s))*cos(2*pi*u2.s) sum(z.s < -1)/n 

The first time I ran this, I got 0.1623, which is much worse than the mechanical, but the second I got 0.1589 which is good. Even in the gremlin approach, though, there is no "draw" from a normal. Our Z is still absolutely determined from the values of (u1.s, u2.s). That is, even in the gremlin approach, there is at least one mechanical process: calculating Z. So what can we say about (u1.s, u2.s)?

Here is where it gets interesting. Here is a plot of the empirical cumulative distribution of U_1 values from the mechanical procedure, overlaid with the ECDF of u1.s in red. It should be obvious the plots for U_2 and u2.s will be similar (but try!). Generate this yourself with the following code:

 plot(ecdf(u1),xlab="U_1 values", ylab="Probability of U1 < value", xlim=c(0,1),pch='.') lines(ecdf(u1.s), col=2) abline(0,1,lty=2) 

The values of U_1 are a rough step function; after all, there are only 99 values, while u1.s is of length n = 10000.

Do you see it yet? The gremlins have almost disappeared! If you don't see it---and do try and figure it out before reading further---try this code:

sort(u1.s)[1:20]

This gives the first 20 values of the "random" u1.s sorted from low to high. The values of U_1 were 0.01, 0.02, ... automatically sorted from low to high.

Do you see it yet? All u1.s is is a series of ordered numbers on the interval from 1-e6 to 1 - 1e-6. And the same for u2.s. (The 1e-6 is R's native display resolution for this problem; this can be adjusted.) And the same for U_1 and U_2, except the interval is a mite shorter! What we have are nothing but ordinary sequences of numbers from (roughly) 0 to 1! Do you have it?

The answer is: The gremlin procedure is identical to the mechanical!

Everything in the MCMC method was just as fixed and determined as the other mechanical method. There was nothing random, there were no draws. Everything was simple calculation, relying on an analytic formula somebody found that mapped two straight lines to one crooked one. But the MCMC method hides what's under the hood. Look at this plot (with the plot screen maximized; again, this is for transparency not efficiency):

plot(u1.s,u2.s, col=2, xlab='U 1 values',ylab='U 2 values') u1.v = NA; u2.v = NA k = 0 for (i in u1){ for (j in u2){ k = k + 1 u1.v[k] = i u2.v[k] = j } } points(u1.v,u2.v,pch=20) # these are (U_1, U_2) as one long vector of each

The black dots are the (U_1, U_2) pairs and the red the (u1.s, u2.s) pairs fed into the Z calculation. The mechanical is a regular gird and the MCMC-mechanical is also a (rougher) grid. So it's no wonder they give the same (or similar) answers: they are doing the same things.

The key is that the u1.s and u2.s themselves were produced by a purely mechanical process as well. R uses a formula no different in spirit for Z above, which if fed the same numbers always produces the same output (stick in known W which determines u1.s, etc.). The formula is called a "pseudorandom number generator", whereby "pseudorandom" they mean not random; purely mechanical. Everybody knows this, and everybody knows this, too: there is no point at which "randomness" or "draws" ever comes into the picture. There are no gremlins anywhere.

Now I do not and in no way claim that this grunt-mechanical, rigorous-grid approach is the way to handle all problems or that it is the most efficient. And I do not say the MCMC car doesn't get us where we are going. I am saying, and it is true, there are no gremlins. Everything is a determinate, mechanical process.

So what does that mean? I'm glad you asked. Let's let the late-great ET Jaynes give the answer. "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought."

We can believe in gremlins if we like, but we can do better if we understand how the engine really works.

There's lots more details, like the error of approximation and so forth, which I'll leave to Uncertainty (which does not have any code).

Bonus code

The value of -1 was nothing special. We can see the mechanical and MCMC procedures produce normal distributions which match almost everywhere. To see that, try this code:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal") s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

This is the (e)cdf of the distributions: mechanical Z (black solid), gremlin (green dot-dashed), analytic approximation (red dashed). The step in the middle is from the crude step in the mechanical. Play with the limits of the axis to "blow up" certain sections of the picture, like this:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal", xlim=c(-1,1)) s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

Try xlim=c(-4,-3) too.

Homework

Find the values of U_1 and U_2 that correspond to Z = -1. Using the modern language, what can you say about these values in relation to the (conditional!) probability Z < -1? Think about the probabilities of the Us.

What other simple transforms can you find that correspond to other common distributions? Try out your own code for these transforms.