The Gremlins Of MCMC: Or, Computer Simulations Are Not What You Think

“Let me get this straight. You said what makes your car go?”

“You heard me. Gremlins.”

“Grelims make your car go.”

“Look, it’s obvious. The cars runs, doesn’t it? It has to run for some reason, right? Everybody says that reason is gremlins. So it’s gremlins. No, wait. I know what you’re going to say. You’re going to say I don’t know why gremlins make it go, and you’re right, I don’t. Nobody does. But it’s gremlins.”

“And if I told you instead your car runs by a purely mechanical process, the result of internal combustion causing movement through a complex but straightforward process, would that interest you at all?”

“No. Look, I don’t care. It runs and that it’s gremlins is enough explanation for me. I get where I want to go, don’t I? What’s the difference if it’s gremlins or whatever it is you said?”

MCMC

That form of reasoning is used by defenders of simulations, a.k.a. Monte Carlo or MCMC methods (the other MC is for Markov Chain), in which gremlins are replaced by “randomness” and “draws from distributions.” Like the car run by gremlins, MCMC methods get you where you want to go, so why bother looking under the hood for more complicated explanations? Besides, doesn’t everybody agree simulations work by gremlins—I mean, “randomness” and “draws”?

Here is an abbreviated example from Uncertainty which proves it’s a mechanical process and not gremlins or randomness that accounts for the succeess of MCMC methods.

First let’s use gremlin language to describe a simple MCMC example. Z, I say, is “distributed” as a standard normal, and I want to know the probability Z is less than -1. Now the normal distribution is not an analytic equation, meaning I cannot just plug in numbers and calculate an answer. There are, however, many excellent approximations to do the job near enough, meaning I can with ease calculate this probability to reasonable accuracy. The R software does so by typing pnorm(-1), and which gives -0.1586553. This gives us something to compare our simulations to.

I could also get at the answer using MCMC. To do so I randomly—recall we’re using gremlin language—simulate a large number of draws from a standard normal, and count how many of these simulations are less than -1. Divide that number by the total number of simulations, and there is my approximation to the probability. Look into the literature and you will discover all kinds of niceties to this procedure (such as computing how accurate the approximation is, etc.), but this is close enough for us here. Use the following self-explanatory R code:


n = 10000
z = rnorm(n)
sum(z < -1)/n

I get 0.158, which is for applications not requiring accuracy beyond the third digit peachy keen. Play around with the size of n: e.g., with n = 10, I get for one simulation 0.2, which is not so hot. In gremlin language, the larger the number of draws the closer will the approximation "converge" to the right answer.

All MCMC methods are the same as this one in spirit. Some can grow to enormous complexity, of course, but the base idea, the philosophy, is all right here. The approximation is seen as legitimate not just because we can match it against an near-analytic answer, because we can't do that for any situation of real interest (if we could, we wouldn't need simulations!). It is seen as legitimate because of the way the answer was produced. Random draws imbued the structure of the MCMC "process" with a kind of mystical life. If the draws weren't random---and never mind defining what random really means---the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.

Of course, nobody speaks in this way. Few speak of the process at all, except to say it was gremlins; or rather, "randomness" and "draws". It's stranger still because the "randomness" is all computer-generated, and it is known computer-generated numbers aren't "truly" random. But, somehow, the whole thing still works, like the randomness candle has been swapped for a (safer!) electric version, and whatever entities were watching over the ceremony were satisfied the form has been met.

Mechanics

Now let's do the whole thing over in mechanical language and see what the differences are. By assumption, we want to quantify our uncertainty in Z using a standard normal distribution. We seek Pr(Z < -1 | assumption). We do not say Z "is normally distributed", which is gremlin talk. We say our uncertainty in Z is represented using this equation by assumption.

One popular way of "generating normals" (in gremlin language) is to use what's called a Box-Muller transformation. Any algorithm which needs "normals" can use this procedure. It starts by "generating" two "random independent uniform" numbers U_1 and U_2 and then calculating this creature:

Z = \sqrt{-2 \ln U_1} \cos(2 \pi U_2),

where Z is now said to be "standard normally distributed." We don't need to worry about the math, except to notice that it is written as a causal, or rather determinative, proposition: ``If U_1 is this and U_2 is that, Z is this with certainty." No uncertainty enters here; U_1 and U_2 determine Z. There is no life to this equation; it is (in effect) just an equation which translates a two-dimensional straight line on the interval 0 to 1 (in 2-D) to a line with a certain shape which runs from negative infinity to positive infinity.

To get the transformation, we simply write down all the numbers in the paired sequence (0.01, 0.01), (0.01, 0.02), ..., (0.99, 0.99). The decision to use two-digit accuracy was mine, just as I had to decide n above. This results in a sequence of pairs of numbers (U_1, U_2) of length 9801. For each pair, we apply the determinative mapping of (U_1, U_2) to produce Z as above, which gives (3.028866, 3.010924, ..., 1.414971e-01). Here is the R code (not written for efficiency, but transparency):


ep = 0.01 # the (st)ep
u1 = seq(ep, 1-ep, by = ep) # gives 0.01, 0.02, ..., 0.99
u2 = u1

z = NA # start with an empty vector
k = 0 # just a counter
for (i in u1){
for (j in u2){
k = k + 1
z[k] = sqrt(-2*log(i))*cos(2*pi*j) # the transformation
}
}
z[1:10] # shows the first 10 numbers of z

The first 10 numbers of Z map to the pairs (0.01, 0.01), (0.02, 0.01), (0.03, 0.01), ..., (0.10, 0.01). There is nothing at all special about the order in which the (U_1, U_2) pairs are input. In the end, as long as the "grid" of numbers implied by the loop are fed into the formula, we'll have our Z. We do not say U_1 and U_2 are "independent". That's gremlin talk. We speak of Z is purely causal terms. If you like, try this:

plot(z)

We have not "drawn" from any distribution here, neither uniform or normal. All that has happened is some perfectly simple math. And there is nothing "random". Everything is determined, as shown. The mechanical approximation is got the same way:

sum(z < -1)/length(z) # the denominator counts the size of z

which gives 0.1608677, which is a tad high. Try lowering ep, which is to say, try increasing the step resolution and see what that does. It is important to recognize the mechanical method will always give the same answer (with same inputs) regardless of how many times we compute it. Whereas the MCMC method above gives different numbers. Why?

Gremlins slain

Here is the gremlin R code, which first "draws" from "uniforms", and then applies the transformation. The ".s" are to indicate simulation.

n = 10000
u1.s = runif(n)
u2.s = runif(n)
z.s = sqrt(-2*log(u1.s))*cos(2*pi*u2.s)
sum(z.s < -1)/n

The first time I ran this, I got 0.1623, which is much worse than the mechanical, but the second I got 0.1589 which is good. Even in the gremlin approach, though, there is no "draw" from a normal. Our Z is still absolutely determined from the values of (u1.s, u2.s). That is, even in the gremlin approach, there is at least one mechanical process: calculating Z. So what can we say about (u1.s, u2.s)?

Here is where it gets interesting. Here is a plot of the empirical cumulative distribution of U_1 values from the mechanical procedure, overlaid with the ECDF of u1.s in red. It should be obvious the plots for U_2 and u2.s will be similar (but try!). Generate this yourself with the following code:


plot(ecdf(u1),xlab="U_1 values", ylab="Probability of U1 < value", xlim=c(0,1),pch='.') lines(ecdf(u1.s), col=2) abline(0,1,lty=2)

The values of U_1 are a rough step function; after all, there are only 99 values, while u1.s is of length n = 10000.

Do you see it yet? The gremlins have almost disappeared! If you don't see it---and do try and figure it out before reading further---try this code:

sort(u1.s)[1:20]

This gives the first 20 values of the "random" u1.s sorted from low to high. The values of U_1 were 0.01, 0.02, ... automatically sorted from low to high.

Do you see it yet? All u1.s is is a series of ordered numbers on the interval from 1-e6 to 1 - 1e-6. And the same for u2.s. (The 1e-6 is R's native display resolution for this problem; this can be adjusted.) And the same for U_1 and U_2, except the interval is a mite shorter! What we have are nothing but ordinary sequences of numbers from (roughly) 0 to 1! Do you have it?

The answer is: The gremlin procedure is identical to the mechanical!

Everything in the MCMC method was just as fixed and determined as the other mechanical method. There was nothing random, there were no draws. Everything was simple calculation, relying on an analytic formula somebody found that mapped two straight lines to one crooked one. But the MCMC method hides what's under the hood. Look at this plot (with the plot screen maximized; again, this is for transparency not efficiency):

plot(u1.s,u2.s, col=2, xlab='U 1 values',ylab='U 2 values')
u1.v = NA; u2.v = NA
k = 0
for (i in u1){
for (j in u2){
k = k + 1
u1.v[k] = i
u2.v[k] = j
}
}
points(u1.v,u2.v,pch=20) # these are (U_1, U_2) as one long vector of each

The black dots are the (U_1, U_2) pairs and the red the (u1.s, u2.s) pairs fed into the Z calculation. The mechanical is a regular gird and the MCMC-mechanical is also a (rougher) grid. So it's no wonder they give the same (or similar) answers: they are doing the same things.

The key is that the u1.s and u2.s themselves were produced by a purely mechanical process as well. R uses a formula no different in spirit for Z above, which if fed the same numbers always produces the same output (stick in known W which determines u1.s, etc.). The formula is called a "pseudorandom number generator", whereby "pseudorandom" they mean not random; purely mechanical. Everybody knows this, and everybody knows this, too: there is no point at which "randomness" or "draws" ever comes into the picture. There are no gremlins anywhere.

Now I do not and in no way claim that this grunt-mechanical, rigorous-grid approach is the way to handle all problems or that it is the most efficient. And I do not say the MCMC car doesn't get us where we are going. I am saying, and it is true, there are no gremlins. Everything is a determinate, mechanical process.

So what does that mean? I'm glad you asked. Let's let the late-great ET Jaynes give the answer. "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought."

We can believe in gremlins if we like, but we can do better if we understand how the engine really works.

There's lots more details, like the error of approximation and so forth, which I'll leave to Uncertainty (which does not have any code).

Bonus code

The value of -1 was nothing special. We can see the mechanical and MCMC procedures produce normal distributions which match almost everywhere. To see that, try this code:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal") s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

This is the (e)cdf of the distributions: mechanical Z (black solid), gremlin (green dot-dashed), analytic approximation (red dashed). The step in the middle is from the crude step in the mechanical. Play with the limits of the axis to "blow up" certain sections of the picture, like this:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal", xlim=c(-1,1)) s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

Try xlim=c(-4,-3) too.

Homework

Find the values of U_1 and U_2 that correspond to Z = -1. Using the modern language, what can you say about these values in relation to the (conditional!) probability Z < -1? Think about the probabilities of the Us.

What other simple transforms can you find that correspond to other common distributions? Try out your own code for these transforms.

19 Comments

  1. DAV :: You beat me too it!

    My brothers both drove AMC Gremlins (ant-Markov Chains?)
    Later one drove an AMC Spirit? or Eagle?

  2. Nice, Dr. Briggs.

    Anyway, we can use a similar setup using thrown dice. Everything about thrown dice is mechanical, though we may not know all of the conditions in the environment with sufficient resolution and accuracy to be able to determine the outcome with certainty. Still, it is certain that the outcome is strictly due to mechanics. Randomness simply means that we lack the knowledge needed to make statements that are fully certain, given all of the necessary and sufficient conditions that are operating in our system.

  3. [It looks like some Gremlins removed your “close italics” HTML tag somewhere…]

    Indeed. And it cascades all the way to the end of the web page.

  4. Does William M. Briggs (Statistician to the Stars!) have the Superpowers necessary to turn off the italics? Stay tuned for the next thrilling episode.

  5. All,

    Good grief, what a dumb typo on my part. I made it not once, but twice! Fixed. Thanks!

  6. A Monte Carlo simulation is essentially a way to measure a value that can’t be found by analytic means. But, too many people thing that the Monte Carlo simulation somehow does more than this. In short, it’s a good tool to impress the naive. The focus should be on the assumptions going into the model, not on the Monte Carlo method used to evaluate the output.

  7. Well, I say this as a devout Christian, a sinner by nature but forgiven, but:

    “Look, it’s obvious. The cars runs, doesn’t it? It has to run for some reason, right? Everybody says that reason is gremlins. So it’s gremlins. No, wait. I know what you’re going to say. You’re going to say I don’t know why gremlins make it go, and you’re right, I don’t. Nobody does. But it’s gremlins.”

    “And if I told you instead your car runs by a purely mechanical process, the result of internal combustion causing movement through a complex but straightforward process, would that interest you at all?”

    A lot of atheists would posit that this is nearly the argument that they’d make against the existence of God, albeit the details of the purely mechanical process are not as well known.

  8. STEP 1: Convert your unverified hypothesis to a computer coded and still unverified hypothesis
    STEP 2: Show people the coded hypothesis and results the code was purpose-built, and often meticulously tuned, to give and hope nobody realizes (especially yourself) that it’s still the same, unverified hypothesis. Forget that nobody’s tested it in the real world and that it may in fact be entirely unknowable…like trying to determine if man killed the mammoths or not.

  9. Seems to me that, when a population is too large to examine every individual, you have to sample it. But you have to do that in a way that doesn’t add spurious information.
    The idea of a PRNG is to use a method having no obvious structure; of a uniform grid that its structure matches nothing in the population you’re studying.
    So apart from a different sampling strategy the two methods are identical.

  10. >So what does that mean? I’m glad you asked. Let’s let the late-great ET Jaynes give the answer. “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.”

    I’m not sure what you’re trying to say here. Are you suggesting MC methods deliver “better performance” *because* the MC output is “nonrandom” (i.e. pseudorandom) rather than truly random? Or maybe you are suggesting MC methods can be surpassed by some (thoroughly) nonrandomized method?

    It’s difficult to figure out what exactly you take yourself to have shown. You seem to be suggesting that “MC methods use randomness” corresponds to “it wuz dem gremlins”, but… how? If you have a role in a decision process for a number selected randomly from some distribution, and the decision process will fail if that role is filled by a function that is skewed relative to the target distribution, then there is no way to describe how the process achieves its results other than “by randomness”. (It shouldn’t matter whether the decision in question is being made by a computer simulation or a human being.)

    >If the draws weren’t random—and never mind defining what random really means—the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.

    In the broadest possible terms, randomness in an important part of various processes because you can draw conclusions about what you want to do from a single case and then you want to cross-check against lots of other cases to make sure that your conclusions about Case Zero weren’t driven by an anomalous feature of that case or of some small region of the possibility-space. If the draws aren’t random, you’re optimizing your solution for whatever bias happens to built into your draw-function, not for the domain where the solution will actually be put into practice.

  11. qtlatham,

    There is no such thing as randomness. Late day, so more tomorrow. But use classic post page to find links about randomness.

Leave a Comment

Your email address will not be published. Required fields are marked *