
The Mysticism Of Simulations: Markov Chain Monte Carlo, Sampling, And Their Alternatives

Not a simulation.

Introit

Ever heard of somebody “simulating” normal “random” or “stochastic” variables, or perhaps “drawing” from a normal or some other distribution? Such things form the backbone of many statistical methods, including bootstrapping, Gibbs sampling, Markov Chain Monte Carlo (MCMC), and several others.

Well, it’s both right and wrong—but more wrong than right. It’s wrong in the sense that it encourages magical thinking, confuses causality, and is an inefficient use of time. It’s right in the sense that, if assiduously applied, these algorithms can give reasonably accurate answers.

Way it’s said to work is that “random” or “stochastic” numbers are input into some algorithm and out pop answers to some statistical question which is not analytic, that is, which cannot be solved by pencil and paper (or could be, but only at seemingly too great a difficulty).

For example, one popular way of “generating normals” is to use what’s called a Box-Muller transformation. It starts by “generating” two “random” “independent” “uniform” numbers U1 and U2 and then calculating this creature:

Z = R \cos(\Theta) =\sqrt{-2 \ln U_1} \cos(2 \pi U_2) ,

where Z is now said to be “standard normally distributed.” Don’t worry if you don’t follow the math, though try because we need it for later. Point is that any algorithm which needs “normals” can use this procedure.
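To make the mapping concrete, here is a minimal R sketch of the Box-Muller calculation (the function name box_muller is mine, for illustration only; it is not code from the original post):

    # Box-Muller: map a pair of numbers in (0,1) to a value which, taken over
    # many such pairs, traces out the standard normal shape.
    box_muller <- function(u1, u2) {
      sqrt(-2 * log(u1)) * cos(2 * pi * u2)
    }

    box_muller(0.01, 0.01)  # about 3.029; the same inputs always give the same output
    box_muller(0.50, 0.50)  # about -1.177

The mapping is a plain deterministic function: nothing in it is “random.”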

Look at all those scare quotes! Yet each of them is proper and indicates an instance of magical thinking, a legacy of our (frequentist) past which imagined aleatory ghosts in the machines of nature, ghosts which even haunt modern Bayesians.

Scare quotes

First, random or stochastic means unknown, and nothing more. The outcome of a coin flip is random, i.e. unknown, because you don’t know all the causes at work upon the spinning object. It is not “random” because “chance” somehow grabs the coin, has its way with it, and then deposits the coin into your hand. Randomness and chance are not causes. They are not real objects. The outcome is determined by physical forces and that’s it.

Second, there is the unfortunate, spooky tendency in probability and statistics to assume that “randomness” somehow blesses results. Nobody knows how it works; that’s why it’s magic. Yet how can unknowingness influence anything if it isn’t an ontological cause? It can’t. Yet it is felt that if the data being input to algorithms aren’t “random” then the results aren’t legitimate. This is false, but it accounts for why simulations are so often sought.

Third, since randomness is not a cause, we cannot “generate” “random” numbers in the mystical sense implied above. We can, of course, make up numbers which are unknown to some people. I’m thinking of a number between 32 and 1400: to you, the number is random, “generated”, i.e. caused, by my feverish brain. (The number is hidden in the source code of this page, incidentally.)

Fourth, there are no such things as “uniforms”, “normals”, or any other distribution-entities. No thing in the world is “distributed uniformly” or “distributed normally” or distributed anything. Distributed-as talk is more magical thinking. To say “X is normal” is to ascribe to X a hidden power to be “normal” (or “uniform” or whatever). It is to say that magical random occult forces exist which cause X to be “normal,” that X somehow knows the values it can take and with what frequency.

This is false. The only thing we are privileged to say is something like this: “Given this-and-such set of premises, the probability X takes this value equals that”, where “that” is calculated via some distribution implied by the premises. (Ignore that the probability X takes any particular value for continuous distributions is always 0.) Probability is a matter of ascribable or quantifiable uncertainty, a logical relation between accepted premises and some specified proposition, and nothing more.

Practicum

Fifth, since this is what probability is, computers cannot “generate” “random” numbers. What happens, in the context of our math above, is that programmers have created algorithms which produce numbers in the interval (0,1) (notice this does not include the end points); not in any obvious order, but according to some complex formula. This formula, if run long enough, will produce all the numbers in (0,1) at the resolution of the computer.

Say this is every 0.01; that is, our resolution is to the nearest hundredth. Then all the numbers 0.01, 0.02, …, 0.99 will eventually show up (many will be repeated, of course). Because they do not show up in sequence, many fool themselves into thinking the numbers are “random”, and others, wanting to hold onto the mysticism but understanding the math, call the numbers “pseudo random”, an oxymoron.
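As a sketch of the idea (this toy linear congruential generator is my own illustration, not the algorithm R actually uses), a strict formula is all it takes to churn out numbers in (0,1) that merely look disordered:

    # A toy linear congruential generator: nothing but a formula and a starting point.
    # (Illustrative constants; production generators use more elaborate schemes.)
    lcg <- function(n, seed = 1) {
      m <- 2^31 - 1
      a <- 48271
      x <- numeric(n)
      for (i in seq_len(n)) {
        seed <- (a * seed) %% m
        x[i] <- seed / m   # scale the integer state into (0,1)
      }
      x
    }

    lcg(5)  # the same five numbers every time the same seed is used

Run it twice with the same seed and you get the same “random” numbers, which is the whole point.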

But we can sidestep all this and simply write down all the numbers in the sequence, i.e. all the numbers in (0,1)^2 (since we need U1 and U2) at whatever resolution we have; this might be (0.01, 0.01), (0.01, 0.02), …, (0.99, 0.99) (this is a sequence of pairs of numbers, of length 9801). We then apply the mapping of (U1, U2) to Z as given above, which produces (3.028866, 3.010924, …, 1.414971e-01).
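In R, the whole mechanical mapping at the 0.01 resolution takes a handful of lines (a sketch under the setup just described; the object names are mine):

    # Grid of (U1, U2) pairs at resolution 0.01: 99 x 99 = 9801 pairs.
    res  <- 0.01
    u    <- seq(res, 1 - res, by = res)
    grid <- expand.grid(U2 = u, U1 = u)   # U2 varies fastest, matching the pairing above

    # Apply the Box-Muller mapping to every pair.
    Z <- sqrt(-2 * log(grid$U1)) * cos(2 * pi * grid$U2)

    head(Z, 2)   # 3.028866, 3.010924, ...
    length(Z)    # 9801

    # The bottom plot: ECDF of the mapped Z against the analytic normal CDF.
    plot(ecdf(Z))
    curve(pnorm(x), add = TRUE, col = "red")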

What it looks like is shown in the picture up top.

The upper plot shows the mappings of (U1, U2) to Z, along the index of the number pairs. If you’ve understood the math above, the oscillation, size, and sign changes are obvious. Spend a few moments with this. The bottom plot shows the empirical cumulative distribution of the mapped Z (black), overlaid by the (approximate) analytic standard normal distribution (red), i.e. the true distribution to high precision.

There is tight overlap between the two, except for a slight bump or step in the ECDF at 0, owing to the crude discretization of (U1, U2). Computers can do better than the nearest hundredth. Still, the error even at this crude level is trivial. I won’t show it, but even a resolution 5 times worse (nearest 0.05; number sequence length of 361) is more than good enough for most applications (a resolution of 0.1 is pushing it).

This picture gives a straightforward, calculate-this-function analysis, with no mysticism. But it works. If what we were after was, say, “What is the probability that Z is less than -1?”, all we have to do is ask. Simple as that. There are no epistemological difficulties with the interpretation.

The built-in analytic approximation is 0.159 (this is our comparator). With the resolution of 0.01, the direct method shows 0.160, which is close enough for most practical applications. A resolution of 0.05 gives 0.166, and 0.1 gives 0.172 (I’m ignoring that we could have shifted U1 or U2 to different start points; but you get the idea).
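Those numbers can be reproduced directly; here is a small helper function (mine, for illustration) that redoes the grid at any resolution:

    # P(Z < -1) by the direct grid method at a given resolution.
    p_below <- function(res, cutoff = -1) {
      u    <- seq(res, 1 - res, by = res)
      grid <- expand.grid(U2 = u, U1 = u)
      z    <- sqrt(-2 * log(grid$U1)) * cos(2 * pi * grid$U2)
      mean(z < cutoff)
    }

    pnorm(-1)       # 0.1586553, the analytic comparator
    p_below(0.01)   # about 0.160
    p_below(0.05)   # about 0.166
    p_below(0.10)   # about 0.172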

None of these comes with a plus or minus, though. Given our setup (starting points for U1 and U2, the mapping function), these are the answers. There is no probability attached. But we would like to have some idea of the error of the approximation. We’re cheating here, in a way, because we know the right answer (to a high degree), which won’t always be the case. In order to get some notion of how far off that 0.160 is we’d have to do more pen-and-paper work, engaging in what might be a fair amount of numerical analysis. Of course, for many standard problems, just as in MCMC approaches, this could be worked out in advance.

MCMC etc.

Contrast this to the mystical approach. Just as before, we have to specify something like a resolution: here, the number of times we must “simulate” “normals” from a standard normal, which we then collect and use to estimate the probability of a value less than -1. To make it fair, pick 9801, which is the length of the 0.01-resolution series.

I ran this “simulation” once and got 0.162; a second time 0.164; a third showed 0.152. There’s the first problem. Each run of the “simulation” gives a different answer. Which is the right one? They all are: an unsatisfying but true answer. So what will happen if the “simulation” itself is iterated, say 5000 times, where each time we “simulate” 9801 “normals” and each time estimate the probability, keeping track of all 5000 estimates? Let’s see, because that is the usual procedure.
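A sketch of that usual procedure in R (rnorm() is R’s built-in “normal” generator; the counts are those used in the text):

    # One "simulation": draw 9801 "normals" and estimate P(Z < -1).
    one_run <- function(n = 9801) mean(rnorm(n) < -1)

    one_run()  # about 0.16, but a different answer on every call

    # Iterate the simulation 5000 times, keeping all 5000 estimates.
    estimates <- replicate(5000, one_run())
    quantile(estimates, c(0.05, 0.95))  # roughly 0.153 to 0.165
    mean(estimates)                     # roughly 0.159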

Turns out 90% of the results are between 0.153 and 0.165, with a median and mean of 0.159, which equals the right answer (to the thousandth). It’s then said there’s a 90% chance the answer we’re after is between 0.153 and 0.165. This or similar intervals are used as error bounds, which are “simulated” here but (should be) calculated mechanically above. Notice that the uncertainty in the mystical approach feels greater, because the whole process is opaque and purposely vague. The numbers seem like they’re coming out of nowhere. The uncertainty is couched probabilistically, which is distracting.

It took 19 million calculations to get us this answer, incidentally, rather than the 9801 the mechanical approach required. But if we increase the resolution to 0.005 in the mechanical approach, we also get 0.159 at a cost of just under 40,000 calculations. Of course, MCMC fans will discover shortcuts and other optimizations to implement.

Why does the “simulation” approach work, though? It does (at some expense) give reasonable answers. Well, if we remove the mysticism about randomness and all that, we get this picture:

Mystical versus mechanical.

The upper two plots are the results of the “simulation”, while the bottom two are the mechanical mapping. The bottom two show the empirical cumulative distribution of U1 (U2 is identical) and the subsequent ECDF of the mapped normal distribution, as before. The bump at 0 is there, but is small.

Surprise ending!

The top left ECDF shows all the “uniforms” spit out by R’s runif() function. The only real difference between this and the ECDF of the mechanical approach is that the “simulation” is at a finer resolution (the first U happened to be 0.01031144, 6 orders of magnitude finer; the U’s here are not truly plain-English uniform as they are in the mechanical approach). The subsequent ECDF of Z is also finer. The red lines are the approximate truth, as before.

But don’t forget, the “simulation” just is the mechanical approach done more often. After all, the same Box-Muller equation is used to map the “uniforms” to the “normals”. The two approaches are therefore equivalent!

Which is now no surprise: of course they should be equivalent. We could have taken the (sorted) Us from the “simulation” as if they were the mechanical grid (U1, U2) and applied the mapping, or we could have pretended the Us from the “simulation” were “random” and then applied the mapping. Either way, same answer.
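One way to see this for yourself (a sketch with a smaller n so the full grid stays manageable; the two estimates agree up to the usual discretization and sampling wiggle, not to the last digit):

    n  <- 1000
    u1 <- runif(n)
    u2 <- runif(n)

    # (a) Treat the Us as "random" pairs.
    z_pairs <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)

    # (b) Treat the same (sorted) Us as a mechanical grid.
    g <- expand.grid(U2 = sort(u2), U1 = sort(u1))
    z_grid <- sqrt(-2 * log(g$U1)) * cos(2 * pi * g$U2)

    mean(z_pairs < -1)  # near 0.16, give or take the sampling wiggle
    mean(z_grid < -1)   # likewise; same mapping, effectively the same answer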

The only difference (and advantage) seems to be in the built-in error guess from the “simulation”, with its consequent fuzzy interpretation. But we could have a guess of error from the mechanical algorithm, too, either by numerical analysis means as mentioned, or even by computer approximation (one way: estimate quantities using a coarse, then fine, then finest grid and measure the rate of change of the estimates; with a little analysis thrown in, this makes a fine solution).
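A crude version of that grid-refinement check, using the hypothetical p_below() helper sketched above:

    # Estimate at successively finer resolutions and watch the changes shrink.
    resolutions <- c(0.04, 0.02, 0.01, 0.005)
    ests <- sapply(resolutions, p_below)
    ests
    diff(ests)  # how quickly the estimate settles down gives a rough error guess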

The benefit of the mechanical approach is the demystification of the process. It focuses the mind on the math and reminds us that probability is nothing but a numerical measure of uncertainty, not a live thing which imbues “variables” with life and which by some sorcery gives meaning and authority to results.

Categories: Philosophy, Statistics

33 replies »

  1. Either your enemies are messing with the source code, or the number is a very unusual choice! I would have guessed a whole number, but then again, that’s just my model of your thought processes!

    In bootstrapping, the object seems to be to create “random” data sets from the actual data set and then run hundreds of thousands of simulations. This is supposed to be more “accurate” than other methods and can be used to claim 5 sigma significance (or very close). I really can’t see how that can work. I suppose it goes back to creating “randomness”. Anyway, thank you for the posting. I really want to understand how these statistics are being used. Global warming has fallen in love with them —presumably because of the accuracy level they can claim. I look forward to reading comments and learning.

  2. Your enemies grow bolder. They have crept in on little cat feet and stolen the latter half of the sentence:
    “Even the uncertainty is probabilistic, which reinforces the “

  3. Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number — there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.
    John Von Neumann

  4. Ray,

    Von Neumann, God bless him, was right, and also wrong. There are no “true random” numbers in the mystical sense he implies. There are only numbers, caused to be by something, which are not known with certainty.

  5. Briggs:
    Through use of the measure theoretic definition of “measure” one arrives at the conclusion that a “probability” is a measure of an event having the property that the numerical value of this measure is 1 when this event is an example of a “unit event.” As “event” and “uncertainty” reference different concepts to say that “probability is nothing but a numerical measure of uncertainty” is incorrect.

  6. 32.32. That doesn’t seem very random! What are the odds that the number would be so close to one of the boundaries: 32 and 1400? I feel as tho’ you have misled your readers here!!!

  7. What makes MC (and MCMC) simulations so appealing is that they’re easy to use. Sometimes ergonomics trumps an extra decimal place of precision.

  8. Will,

    Ah, ease. They do seem easier, yes. Because why? Because much effort has been expended to make them so. Whereas little work has gone into other methods, which makes them seem difficult.

    But even if the ease of use of all methods were similar, the main point is to understand why. Simulation methods aren’t working by the means with which they advertise themselves. They are instead, kind of (think about the error bounds), doing the right thing using the wrong words.

    Nick,

    That’s randomness for you.

  9. This is such a great post, and I hope you follow along this theme a bit more. I had been contemplating the differences between MCMC approaches and more direct discrete approaches to Bayesian inference involving continuous distributions recently. I had thought through the lower computations part, but I was wondering why more people don’t seem to do it. Although this doesn’t address that issue directly, it does make me feel better about the discrete approach.

  10. I like the simplicity of the mechanical approach here, but, for problems with multiple unknowns to simulate, this looks like a good way to run out of memory really quickly, especially if you’re trying to do everything with apply-like functions in R. And MC algorithms are being applied on high-dimensional problems a lot these days.

  11. For some problems, MCMC may be the only tractable method. For large Bayes nets, Murphy’s Bayes Net Toolbox (or the C++ PNL version) software has Gibbs sampling (a type of MCMC) to provide estimates to queries rather than attempt exact solutions (using, say, message passing), which could take many (for some problems many, many) times longer. Not so magical, just the best that can be done under the circumstances.

  12. I really like this quote by Jaynes:

    “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.” E.T. Jaynes

    I also really like MCMC methods. It would be very instructive (in the sense that this blogpost is very instructive) to see clearly deterministic versions of some common “randomized” procedures. Any links to papers/blogposts out there? While I easily could code up a standard Metropolis-Hastings MCMC procedure I wouldn’t know where to start if I wasn’t allowed to use any functions producing “random” samples…

  13. Rasmus,

    Point is that the two methods are already identical, as they must be if probability and randomness are measures of information. As far as other papers (besides this post), I don’t know of any. We’re after changing the philosophy, which is always difficult.

    Thanks for the Jaynes quote! Spot on.

  14. Briggs: It’s much easier to simulate than it is to calculate. With a simulation you can borrow software/code/formulas from a domain expert, and then permutate the smithereens out of it. With a calculation you have to find someone who not only knows how to use the statistical tools, but also understands something about the domain they’re working in.

    It gets even stranger though at ‘big data’ scales. With enough data samples there is a point where you can treat even continuous values as categorical (10,000,000+ categories each with its own frequency of occurrence). Keep in mind that all numbers in a computer’s memory are discrete encodings of a particular state anyway.

    When everything being measured can be represented as a discrete, but extremely detailed, histogram, life becomes easier for everyone involved. Sure, it’s horrible overkill a lot of the time, but eventually it won’t matter. Once our computers are fast enough to crunch several gigabytes of data / second, nobody will even notice.

    That is my prediction for the future. 🙂

  15. So I just found out about Quasi-Monte Carlo, sort of an attempt to “derandomize” classical Monte Carlo (http://en.wikipedia.org/wiki/Quasi-Monte_Carlo_method). I was thinking of how to characterize what a function generating random samples does (say rnorm) without referring to randomness. One characterization that sort of makes sense to me is that it is a method of generating points, spread out as well as possible according to some density, with the constraints that the method is not allowed to have any memory of what points (and how many) have been generated nor know how many points should be generated in total.

  16. You argue that a coin flip is unknown because of insufficient information, and not random, but what about a photon going through a double slit? I’d always thought that where the photon lands was truly random — unpredictable, no matter how much information about the present world you had.

  17. Robert Zeh,

    True, we don’t know and can’t predict. But something is causing the photon to do what it does. Just because we can’t or don’t know a cause doesn’t mean the cause doesn’t exist.

    The outcome is “random” to us, i.e. unknown or uncertain. There is no mystery or magic.

  18. Terry: No, Einstein wasn’t proven wrong. It is true that we believe there is uncertainty and randomness at a quantum level, but that does not mean the belief is true. Currently, we cannot ascertain a pattern, though one could most certainly exist. Also, since virtually all of quantum mechanics is inferred and largely mathematical, it’s possible that we are totally wrong about what goes on at said level (and it may well be forever unknowable). Our inferences and models do allow us to predict some outcomes, but may be no closer to the true state of things than ancient beliefs about the macro universe were. As Briggs notes, just because we explain something does not make it true, nor does our inability to explain it make it false. It’s completely possible there is order and causality out there.

  19. Scotian: Good link. One never has to wonder what Motl thinks! I like that!

    YOS: This is giving me a headache…….Quantum worlds, higher mathematics, arguments over opposing theories. Wait, that’s just science. ARGGGH! Science is giving me a headache! 🙂

  20. True, we don’t know and can’t predict. But something is causing the photon to do what it does. Just because we can’t or don’t know a cause doesn’t mean the cause doesn’t exist.

    It’s more complicated than that. I’m a physicist who dabbles a bit in statistics, probably the converse of you. I like a lot of what you post, but your insistence on strict and complete causality, unfortunately, is not in keeping with actual measurements.

    Let’s use the example of radioactive decay. A radioactive nucleus has some probability of decaying in any given time interval. That probability is constant. There is no way to predict ahead of time when the nucleus will decay. You can give probabilities that it will decay before some time, but you cannot know when the decay will occur.

    “But wait,” you say. “We don’t know when the nucleus will decay because we don’t have complete information! If we had a more complete description of the system, we could predict the exact time of its decay.”

    Nope. Sorry. The above is a variant of what is known as a “hidden variables” theory, and Bell’s inequality, which has been experimentally verified, says that one cannot have a hidden-variables theory behind quantum mechanics and preserve causality. You get a choice: either you can know the state of the system perfectly and predict when it will decay, but not have causality, or you can have causality and the exact time of the decay must remain unknowable.

    As a result, the “true randomness” that you insist cannot exist does, all around us. Doing MCMC with truly random numbers gives you answers that behave much the same as those obtained with pseudo-random numbers. Which means that MCMC and the like really are useful.

  21. On quantum physics, here are two examples of why the certainty of what we do and do not know is very frequently overstated:

    My husband had a teacher who would put four objects in one of those small match boxes (the truly tiny ones) and seal it. The students were then to describe what was in the box without unsealing it. One could weigh it, measure the sound level the box could make being shaken, all kinds of things. One could predict probable outcomes, but never, ever know what was in the box without unsealing said box.

    Another example is the invisible critter: You find packages of potato chips chewed open, tiny footprints, food stuffed in drawers. You have, however, never ever seen the creature (nor, for this example, have you nor will you ever see a mouse or similar creature). You can determine foot size and shape, appetite perhaps, frequency of visits, etc. Even if you trap the critter, you cannot know, because he is invisible. With luck, you can trap him, maybe accidentally kill him, bump into the carcass, and actually get some accurate measurements of the ONE creature you actually know about, but you cannot extrapolate from there. If another creature comes, you will know nothing about that one.

    Quantum physics is all math, with circumstantial evidence from experiments. You can never see an atom. You can never witness what happens in one. You can only see the chewing on the bag, or measure the loudness of the matchbox when shaken. Quantum physics is not reality, but the best description we have at the moment. Newton was great for centuries, then Einstein came along and now we KNOW how things work. Sorry, not buying it.

    You stated, I believe, that current theory “can have causality and the exact time of the decay must remain unknowable.” This is an absolute truth only if we know that we can have causality and that the exact time of the decay must remain unknowable. Otherwise, it is our current level of understanding, not the absolute truth or correct theory.

    I am curious—physicists seem to believe that math is reality. Is that what you are saying? Also, don’t statistics determine probability and yet statistics are not a large part of quantum physics? You have me very confused. (I’m also not sure your conclusion that MCMC and the like are really useful is valid—it does not seem to follow from your comment.)

  22. Sheri,

    Quantum mechanics is not all math. There are plenty of experiments you can do that will demonstrate quantum effects; here is a good example: http://teachspin.com/instruments/two_slit/index.shtml

    I was lucky enough to do these types of labs in high school physics.

    And yes, QM is the best description of reality we have going right now, and like all science it is provisional. But better to go with the best description available than hold out for an as-yet-unknown future description.

  23. I noted: “with circumstantial evidence from experiments”. The basis for the experiments is mathematical calculations. How can one separate quantum mechanics from advanced mathematics?

    Why are the “best description” and “unknown future descriptions” mutually exclusive? Many physicists go with what is known while working on new theories. That’s what advances science—we keep learning.
