Skip to content

Category: Class – Applied Statistics

April 10, 2018 | 16 Comments

A Beats B Beats C Beats A

Thanks to Bruce Foutch who found the video above. Transitivity is familiar with ordinary numbers. If B > A and C > B and D > C, then D > A. But only if the numbers A, B, C and D behave themselves. They don’t always, as the video shows.

What’s nice about this demonstration is the probability and not expected value ordering. Hence the “10 gazillion” joke. “Expected” is not exactly a misnomer, but it does have two meanings. The plain English definition tells you an expected value is a value you’re probably going to see sometime or another. The probability definition doesn’t match that, or matches only sometimes.

Expected value is purely a mathematical formalism. You multiply the—conditional: all probability is conditional—probability of a possible outcome by the value of that possible outcome, and then sum them up. For an ordinary die, this is 1/6 x 1 + 1/6 x 2 + etc. which equals 3.5, a number nobody will ever see on a die, hence you cannot plain-English “expect” it.

It’s good homework to calculate the probability expected value for the dice in the video. It’s better homework to calculate the probabilities B > A and C > B and D > C, and D > A.

It’s not that expected values don’t have uses, but that they are sometimes put to the wrong use. The intransitive dice example illustrates this. If you’re in a game rolling against another playing and what counts is winning then you’ll want the probability ordering. If you’re in a game and what counts is some score based on the face of the dice, then you might want to use the expected value ordering, especially if you’re going to have a chance of winning 10 gazillion dollars. If you use the expected value ordering and what counts is winning, you will in general lose if you pick one die and your opponent is allowed to pick any of the remaining three.

Homework three: can you find a single change to the last die such that it’s now more likely to beat the first die?

There are some technical instances using “estimators” for parameters inside probability models which produce intransitivity and which I won’t discuss. As regular readers know I advocate eschewing parameter estimates altogether and moving to a strictly predictive approach in probability models (see other other posts in this class category for why).

Intransitivity shows up a lot when decisions must be made. Take the game rock-paper-scissors. What counts is winning. You can think of it in this sense: each “face” of this “three-sided die” has the same value. Rock beats scissors which beats paper which beats rock. There is no single best object in the trio.

Homework four: what is the probability of one R-P-S die beating another R-P-S die? Given that, why is it that some people are champions of this game?

R-P-S dice in effect are everywhere, and of course can have more than three sides. Voting provides prime cases. Even simple votes, like where to go to lunch. If you and your workmates are presented choices as comparisons, then you could end up with a suboptimal choice.

It can even lead to indecision. Suppose it’s you alone and you rated restaurants with “weights” the probability of the dice in the video (the weights aren’t necessary; it’s the ordering that counts). Which do you choose? You’d pick B over A, C over B, and D over C. But you’d also pick A over D. So you have to pick A. But then you’d have to pick B, because B is better than A. And so on.

People “break free” of these vicious circles by adding additional decision elements, which have the effect of changing the preference ordering (adding negative elements is possible, too). “Oh, just forget it. C is closest. Let’s go.” Tastiness and price, which might have been the drivers of the ordering before, are jettisoned in favor of distance, which for true distances provides a transitive ordering.

That maneuver is important. Without a change in premises, indecision results. Since a decision was made, the premises must have changed, too.

Voting is too large a topic to handle in one small post, so we’ll come back to it. It’s far from a simple subject. It’s also can be a depressing one, as we’ll see.

April 5, 2018 | 20 Comments

The Gremlins Of MCMC: Or, Computer Simulations Are Not What You Think

I don’t think we’re clear on what simulation is NOT. RANDOMNESS IS NOT NECESSARY, for the simple reason randomness is merely a state of knowledge. Hence this classic post from 12 June 2017.

“Let me get this straight. You said what makes your car go?”

“You heard me. Gremlins.”

“Grelims make your car go.”

“Look, it’s obvious. The cars runs, doesn’t it? It has to run for some reason, right? Everybody says that reason is gremlins. So it’s gremlins. No, wait. I know what you’re going to say. You’re going to say I don’t know why gremlins make it go, and you’re right, I don’t. Nobody does. But it’s gremlins.”

“And if I told you instead your car runs by a purely mechanical process, the result of internal combustion causing movement through a complex but straightforward process, would that interest you at all?”

“No. Look, I don’t care. It runs and that it’s gremlins is enough explanation for me. I get where I want to go, don’t I? What’s the difference if it’s gremlins or whatever it is you said?”


That form of reasoning is used by defenders of simulations, a.k.a. Monte Carlo or MCMC methods (the other MC is for Markov Chain), in which gremlins are replaced by “randomness” and “draws from distributions.” Like the car run by gremlins, MCMC methods get you where you want to go, so why bother looking under the hood for more complicated explanations? Besides, doesn’t everybody agree simulations work by gremlins—I mean, “randomness” and “draws”?

Here is an abbreviated example from Uncertainty which proves it’s a mechanical process and not gremlins or randomness that accounts for the succeess of MCMC methods.

First let’s use gremlin language to describe a simple MCMC example. Z, I say, is “distributed” as a standard normal, and I want to know the probability Z is less than -1. Now the normal distribution is not an analytic equation, meaning I cannot just plug in numbers and calculate an answer. There are, however, many excellent approximations to do the job near enough, meaning I can with ease calculate this probability to reasonable accuracy. The R software does so by typing pnorm(-1), and which gives -0.1586553. This gives us something to compare our simulations to.

I could also get at the answer using MCMC. To do so I randomly—recall we’re using gremlin language—simulate a large number of draws from a standard normal, and count how many of these simulations are less than -1. Divide that number by the total number of simulations, and there is my approximation to the probability. Look into the literature and you will discover all kinds of niceties to this procedure (such as computing how accurate the approximation is, etc.), but this is close enough for us here. Use the following self-explanatory R code:

n = 10000
z = rnorm(n)
sum(z < -1)/n

I get 0.158, which is for applications not requiring accuracy beyond the third digit peachy keen. Play around with the size of n: e.g., with n = 10, I get for one simulation 0.2, which is not so hot. In gremlin language, the larger the number of draws the closer will the approximation "converge" to the right answer.

All MCMC methods are the same as this one in spirit. Some can grow to enormous complexity, of course, but the base idea, the philosophy, is all right here. The approximation is seen as legitimate not just because we can match it against an near-analytic answer, because we can't do that for any situation of real interest (if we could, we wouldn't need simulations!). It is seen as legitimate because of the way the answer was produced. Random draws imbued the structure of the MCMC "process" with a kind of mystical life. If the draws weren't random---and never mind defining what random really means---the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.

Of course, nobody speaks in this way. Few speak of the process at all, except to say it was gremlins; or rather, "randomness" and "draws". It's stranger still because the "randomness" is all computer-generated, and it is known computer-generated numbers aren't "truly" random. But, somehow, the whole thing still works, like the randomness candle has been swapped for a (safer!) electric version, and whatever entities were watching over the ceremony were satisfied the form has been met.


Now let's do the whole thing over in mechanical language and see what the differences are. By assumption, we want to quantify our uncertainty in Z using a standard normal distribution. We seek Pr(Z < -1 | assumption). We do not say Z "is normally distributed", which is gremlin talk. We say our uncertainty in Z is represented using this equation by assumption.

One popular way of "generating normals" (in gremlin language) is to use what's called a Box-Muller transformation. Any algorithm which needs "normals" can use this procedure. It starts by "generating" two "random independent uniform" numbers U_1 and U_2 and then calculating this creature:

Z = \sqrt{-2 \ln U_1} \cos(2 \pi U_2),

where Z is now said to be "standard normally distributed." We don't need to worry about the math, except to notice that it is written as a causal, or rather determinative, proposition: ``If U_1 is this and U_2 is that, Z is this with certainty." No uncertainty enters here; U_1 and U_2 determine Z. There is no life to this equation; it is (in effect) just an equation which translates a two-dimensional straight line on the interval 0 to 1 (in 2-D) to a line with a certain shape which runs from negative infinity to positive infinity.

To get the transformation, we simply write down all the numbers in the paired sequence (0.01, 0.01), (0.01, 0.02), ..., (0.99, 0.99). The decision to use two-digit accuracy was mine, just as I had to decide n above. This results in a sequence of pairs of numbers (U_1, U_2) of length 9801. For each pair, we apply the determinative mapping of (U_1, U_2) to produce Z as above, which gives (3.028866, 3.010924, ..., 1.414971e-01). Here is the R code (not written for efficiency, but transparency):

ep = 0.01 # the (st)ep
u1 = seq(ep, 1-ep, by = ep) # gives 0.01, 0.02, ..., 0.99
u2 = u1

z = NA # start with an empty vector
k = 0 # just a counter
for (i in u1){
for (j in u2){
k = k + 1
z[k] = sqrt(-2*log(i))*cos(2*pi*j) # the transformation
z[1:10] # shows the first 10 numbers of z

The first 10 numbers of Z map to the pairs (0.01, 0.01), (0.02, 0.01), (0.03, 0.01), ..., (0.10, 0.01). There is nothing at all special about the order in which the (U_1, U_2) pairs are input. In the end, as long as the "grid" of numbers implied by the loop are fed into the formula, we'll have our Z. We do not say U_1 and U_2 are "independent". That's gremlin talk. We speak of Z is purely causal terms. If you like, try this:


We have not "drawn" from any distribution here, neither uniform or normal. All that has happened is some perfectly simple math. And there is nothing "random". Everything is determined, as shown. The mechanical approximation is got the same way:

sum(z < -1)/length(z) # the denominator counts the size of z

which gives 0.1608677, which is a tad high. Try lowering ep, which is to say, try increasing the step resolution and see what that does. It is important to recognize the mechanical method will always give the same answer (with same inputs) regardless of how many times we compute it. Whereas the MCMC method above gives different numbers. Why?

Gremlins slain

Here is the gremlin R code, which first "draws" from "uniforms", and then applies the transformation. The ".s" are to indicate simulation.

n = 10000
u1.s = runif(n)
u2.s = runif(n)
z.s = sqrt(-2*log(u1.s))*cos(2*pi*u2.s)
sum(z.s < -1)/n

The first time I ran this, I got 0.1623, which is much worse than the mechanical, but the second I got 0.1589 which is good. Even in the gremlin approach, though, there is no "draw" from a normal. Our Z is still absolutely determined from the values of (u1.s, u2.s). That is, even in the gremlin approach, there is at least one mechanical process: calculating Z. So what can we say about (u1.s, u2.s)?

Here is where it gets interesting. Here is a plot of the empirical cumulative distribution of U_1 values from the mechanical procedure, overlaid with the ECDF of u1.s in red. It should be obvious the plots for U_2 and u2.s will be similar (but try!). Generate this yourself with the following code:

plot(ecdf(u1),xlab="U_1 values", ylab="Probability of U1 < value", xlim=c(0,1),pch='.') lines(ecdf(u1.s), col=2) abline(0,1,lty=2)

The values of U_1 are a rough step function; after all, there are only 99 values, while u1.s is of length n = 10000.

Do you see it yet? The gremlins have almost disappeared! If you don't see it---and do try and figure it out before reading further---try this code:


This gives the first 20 values of the "random" u1.s sorted from low to high. The values of U_1 were 0.01, 0.02, ... automatically sorted from low to high.

Do you see it yet? All u1.s is is a series of ordered numbers on the interval from 1-e6 to 1 - 1e-6. And the same for u2.s. (The 1e-6 is R's native display resolution for this problem; this can be adjusted.) And the same for U_1 and U_2, except the interval is a mite shorter! What we have are nothing but ordinary sequences of numbers from (roughly) 0 to 1! Do you have it?

The answer is: The gremlin procedure is identical to the mechanical!

Everything in the MCMC method was just as fixed and determined as the other mechanical method. There was nothing random, there were no draws. Everything was simple calculation, relying on an analytic formula somebody found that mapped two straight lines to one crooked one. But the MCMC method hides what's under the hood. Look at this plot (with the plot screen maximized; again, this is for transparency not efficiency):

plot(u1.s,u2.s, col=2, xlab='U 1 values',ylab='U 2 values')
u1.v = NA; u2.v = NA
k = 0
for (i in u1){
for (j in u2){
k = k + 1
u1.v[k] = i
u2.v[k] = j
points(u1.v,u2.v,pch=20) # these are (U_1, U_2) as one long vector of each

The black dots are the (U_1, U_2) pairs and the red the (u1.s, u2.s) pairs fed into the Z calculation. The mechanical is a regular gird and the MCMC-mechanical is also a (rougher) grid. So it's no wonder they give the same (or similar) answers: they are doing the same things.

The key is that the u1.s and u2.s themselves were produced by a purely mechanical process as well. R uses a formula no different in spirit for Z above, which if fed the same numbers always produces the same output (stick in known W which determines u1.s, etc.). The formula is called a "pseudorandom number generator", whereby "pseudorandom" they mean not random; purely mechanical. Everybody knows this, and everybody knows this, too: there is no point at which "randomness" or "draws" ever comes into the picture. There are no gremlins anywhere.

Now I do not and in no way claim that this grunt-mechanical, rigorous-grid approach is the way to handle all problems or that it is the most efficient. And I do not say the MCMC car doesn't get us where we are going. I am saying, and it is true, there are no gremlins. Everything is a determinate, mechanical process.

So what does that mean? I'm glad you asked. Let's let the late-great ET Jaynes give the answer. "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought."

We can believe in gremlins if we like, but we can do better if we understand how the engine really works.

There's lots more details, like the error of approximation and so forth, which I'll leave to Uncertainty (which does not have any code).

Bonus code

The value of -1 was nothing special. We can see the mechanical and MCMC procedures produce normal distributions which match almost everywhere. To see that, try this code:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal") s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

This is the (e)cdf of the distributions: mechanical Z (black solid), gremlin (green dot-dashed), analytic approximation (red dashed). The step in the middle is from the crude step in the mechanical. Play with the limits of the axis to "blow up" certain sections of the picture, like this:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal", xlim=c(-1,1)) s = seq(-4,4,by=ep) lines(s,pnorm(s),lty=2,col=2) lines(ecdf(z.s),lty=3,col=3)

Try xlim=c(-4,-3) too.


Find the values of U_1 and U_2 that correspond to Z = -1. Using the modern language, what can you say about these values in relation to the (conditional!) probability Z < -1? Think about the probabilities of the Us.

What other simple transforms can you find that correspond to other common distributions? Try out your own code for these transforms.

March 6, 2018 | 35 Comments

The Solution To The Doomsday Argument

Today, a classic re-post which deserves to be far better known.

The Doomsday Argument! No, not global warming. The one that predicts the total number of humans who will ever live. It’s also called the Carter catastrophe; the same Carter famous for the anthropic principle. Here’s the Wiki article (HT to reader Nate West).

To solve this problem, the only rule we need is this: All probability is conditional—and conditional only on the information provided. The idea is that you’re born, you notice your birth, and you reason that your place in the order of all human births is nothing special. From that, can we conclude how many more of us we expect? This situation is analogous, at first, to balls in a bag.

Our evidence is X = “There are N balls labeled 1 through N in a bag, from which only one will be removed.” The probability of Y = “The ball will have label j, where j is from 1 to N inclusive” is 1/N, via the statistical syllogism. We deduce via the language used that N is finite (no bag can hold an infinite amount of any real thing).

Reach into the bag and pull out the ball B. It will have a label; call it B = j. Our evidence is now augmented: we have in toto X’ = “X and The ball has label j”. What can we say about N? Well, given X’, the probability N is less than j is 0, and the probability N is at least j is 1, both of which are obvious. But what about these interesting and relevant probabilities (both given X’, naturally): “N equals j”, “N is greater than j”?

We do not know.

Why? Because there is no information in X or X’ about the possible values of N, except that N must be at least equal to j (given X and not X’), information which is deduced. Now mentally you might add information that is not provided, by, say, thinking to yourself, “This j is awfully low and that’s such a big bag; therefore, surely N is large.” Or “I know this Briggs, who is a trickster. He made the bag big on purpose. N is small.” Or anything, endlessly. None of these additions are part of the problem (the stated evidence), however, and all such moves are “illegal” in probability. You cannot use information not provided. It is against the law!

Now suppose we legally augment our X and, for fun, say that N is some number in the set S. We don’t need to know much about S, except that it exists, is finite, and contains only natural numbers. Thus, X now equals “There are N balls labeled 1 through N in a bag, from which only one will be removed; and N is a number in the set S.” Given X, the probability “N is s_i (one of the set S)” is 1/#S, where “#S” stands for the number of elements in S (its cardinality, if you like); thus, the probability “N = s_i” is 1/#S, where I’ll assume the s_i are increasing in i. What about the probability that the ball withdrawn has label j? Here it gets tricky, so let’s be careful.

The key lies in realizing the bounds of j are between 1 and the largest value of S. First suppose N = s_1. We want:

Pr(B = j | N = s_1, X).

This is 1/s_1 for j = 1 to s_1, and 0 for all those j up to s_I (the largest value of S). Now

Pr(B = j | N = s_2, X)

equals 1/s_2 for j = 1 to s_2, and 0 for all values up to s_I. From this, we notice we have to be careful about specifying j precisely. From total probability we know

Pr(B = j | X ) = Pr(B = j | N = s_1, X) * Pr(N=s_1|X) + … + Pr(B = j | N = s_I, X) * Pr(N=s_I|X)

and where knowledge of j is relevant to the probability. If j = 1, then

Pr(B = 1 | X ) = [(1/s_1) + … + (1/s_I)] * (1/#S)

but if j a number larger than, say, s_1 but smaller than s_2, then (call this j’)

Pr(B = j’ | X ) = [0 + (1/s_2) + … + (1/s_I)] * (1/#S)

and so forth for other j (don’t forget S is known).

The ball is withdrawn and B = j. Can we now say anything more about N? As before, there is 0 probability N is less than j, and so if j is greater than some s_i, there is 0 probability N equals those s_i. We can do more, using the good reverend’s rule, but it’s still tricky:

Pr(N = s_i | B = j, X) = Pr( B = j | N = s_i, X) * Pr( N = s_i | X) / Pr( B = j | X).

First suppose j = 1, then

Pr(N = s_i | B = 1, X) = [(1/s_i) * (1/#S)] / ([(1/s_1) + … + (1/s_I)] * (1/#S))

     = (1/s_i) / [ 1/s_1 + 1/s_2 + … + 1/s_I]

If you stare at that fraction for a moment, and recalling that the s_i are given in increasing number, you realize that values of smaller N are more probable than larger values. As a for-instance, suppose S = {20,21,…,40}, which has cardinality 21. Given X, the probability “B = 1″ is (1/20 + 1/21 + … + 1/40) * (1/21) = 0.02761295. Thus Pr(N = 20 | B = 1, X) = 0.04416451, Pr(N = 21 | B = 1, X) = 0.04206144, etc. out to Pr(N = 40 | B = 1, X) = 0.01472150. Notice that these probabilities do not change for j between 1 and 20.

In this same example, next let j = 21, then

Pr(N = s_i | B = 21, X) = Pr( B = 21 | N = s_i, X) * Pr( N = s_i | X) / Pr( B = 21 | X).

For “N = 20″, the first term on the right equals 0, and so Pr(N = s_i | B = 21, X) = 0, as desired. For “N = 21″, we have

Pr(N = 21 | B = 21, X) = Pr( B = 21 | N = 21, X) * Pr( N = 21 | X) / Pr( B = 21 | X).

     = [ (1/21) * (1/21) ] / ([0 + 1/21 + 1/22 + … + 1/40] * (1/21))

     = (1/21) / [0 + 1/21 + 1/22 + … + 1/40] = 0.06994537,

and for Pr(N = 22 | B = 21, X) = 0.06676604, out to Pr(N = 40 | B = 21, X) = 0.03672132.

Collecting all these tidbits leads to the conclusion that smaller (but not impossible) values of N are always more likely than larger, regardless of the value of j. Why? That’s easy. Before we see B, the possible values of N are s_1, s_2, and so on up to S_I, each equally likely. After we see B, some values of N (from S) might now be impossible, but since j will always be less than any remaining possible larger members of S, smaller values of N are closer to j than larger, thus smaller values are more likely. Simple as that.

What does this have to do with Doomsday? Everything. The crucial step was in conjuring the set S. Where did that come from? I made it up. S was known throughout second part of the calculations and unknown through the first part. When S was unknown, N was unknown, and there was nothing we could say about N except that it had to be as large as j. I mean nothing in its literal, logical sense. In that case, given only that you witness your birth order, your B = j that is, we are blind about the future of humanity.

When S was known, we had a rough idea of what N was, which we tightened slightly by learning where N might not be (by removing the ball). But for an S with large cardinality, we aren’t learning much by viewing B. S is what we started with, and something very like S is what we ended with. But this is cheating because I made the S up. We wanted N, of which we are ignorant, and then we pretend we know an S that tells us something but not everything about N! All the other solutions to the Doomsday argument I have seen also make up S, but then they add an extra layer of cheating. We posited a discrete finite S, from which deduced that N might equal any of its members with equal probability (before seeing B). But those who conjure up more creative S often fix the set so that smaller values of S are more likely (hence smaller values of N are more likely, even before we see B). Some form of exponential “distribution” for S is popular. Some even use non-probability arguments (called “improper priors”), which is triply cheating.

Once S is fixed, however it is fixed, the calculations flow in the same manner as above, but it’s easy to see that smaller values of N are always going to be more likely than larger, and that’s because the j will always be smaller (or no greater) than the maximum value of S. And given that some let S toodle out to infinity, it’s no shock at all to discover that N is not expected to be big.

Thus the Doomsday Argument is really a non-problem which includes its own answer in its formulation, which is cheating. Of course, it makes perfect sense to ask the question of how many of us there will be left, but trying to discover the answer using only your birth order is doomed to failure (beyond proving that N must be at least as large as j). Since all probability is conditional on only the information supplied, many different answers for our future numbers are possible. It’s easy to think of probative information: demographics, politics, epidemics, apocalypses (rocks from the sky, Christ’s return, etc.), and on and on. (Of course, some of these sets of information may lead to the guesses people have made about S.) I do not (now) have a good answer how to use these to put uncertainty on (the real) N.

Update Bayes’s theorem isn’t all that.

The difficulty lies in misunderstanding Bayes’s theorem, which some mistakenly write like this:

Pr(N = s_i | B = j) = Pr( B = j | N = s_i) * Pr( N = s_i ) / Pr( B = j ),

where the evidence about N in X is left off (finding the denominator is no problem because Pr( B = j ) = SUM_i Pr( B = j | N = s_i) * Pr( N = s_i )). Pr( N = s_i ) is thus “naked” (and violates the rule that all probability is conditional), yet users of Bayes’s theorem are trained to posit “priors” like this, and so posit one they do. It seems, say critics of the theory, that these priors are pulled from thin air. The critics are right. It’s completely arbitrary to conjure a Pr( N = s_i ), and so the resulting Pr(N = s_i | B = j) cannot be trusted. (I have much more about this kind of thing in my forthcoming book.)

Of course, I made up my own “prior”, but referenced as being a deduction from X. The probability Pr(N = s_i | B = j, X) is thus true. The attention then focuses on X, where it belongs. Why this X? No reason at all. If we’re after the best information about N, that is what should go into X. But it has to be information that is not N itself, like my S was. My S was merely a presumption that I already knew a lot about N; it was N by proxy, but a fuzzy proxy. Cheating, like I said.

It’s not Bayes’s theorem that’s the problem. It works just fine when we supplied information in X about S. But it also worked dandy when X was just “There are N balls labeled 1 through N in a bag, from which only one will be removed.” I didn’t display the equation at the time, but it’s there. I’ll leave it as homework for you to show.

Update I’m graduating a comment I made in reply to Steve Brookline to the main post, because it highlights what I think is the central error people make in the DA. SB’s comments should be examined for orientation. I’m repeating them here in concise form.

A standard application of the DA starts by asking for this:Pr(N < 20j) (the 20 comes from the magic number in statistics). Note the missing conditions. Accepting the bare notation, then Pr(N < 20j) = Pr(N/20 < j) = Pr(j > N/20) = 1 – Pr(j <= N/20) = 1 – 0.05 = 0.95. It is said Pr(j <= N/20) = 0.05 because j is “uniform” or is “uniformly distributed”, as if probability has life. The fatal error has been made, because we notice that this result appears to hold regardless what value N or j has. But there just is no such thing as “Pr(N < 20j)”.

We have to be careful with the notation. There is no such thing as unconditional probability, and when you drop the conditions, which often makes manipulating the equations easier, you run the risk of introducing error, which is what happens in the standard doomsday argument. Here’s what we want.

Pr(N < 20*j | B = j, X) = Pr(B = j | N < 20*j, X) * P(N < 20*j | X) / Pr(B = j | X).

(For why we want this, see SB’s comments.) Now X can be anything relevant; it as least says there are balls 1 through N, but it must also say something about N (directly or implied).

Suppose X contains information that N is in the set {1, 2, …, 19}. Then Pr(N < 20*j |X) = 1 for any j. Never forget j runs from 1 to N, which is where things go awry: j is (in the classical language) dependent on N; in the new (and proper) language, knowledge of N is relevant to knowledge of j.

This is it: it appears, because of loose notation, many forget that j and N are related. Steve used the notion of cutting a string; but of course, that can only be done quantumly (i.e. discretely), so the example is the same. Knowledge of the place j where you cut depends on knowledge of the length of the string N, and vicesy versey.

You can work it out, but the result is the right-hand-side is 1/1, and thus Pr(N < 20*j | B = j, X) = 1, as expected. So right here is all the proof I need to show that at least one “prior” on N ruins that 95% finding.

Here’s another one. Suppose X says N = 20. Then Pr(N < 20*j |X) = 0 for j = 1, and Pr(N < 20*j |X) = 1 for j > 1. Again, you can work it out, but it amounts to the same thing, that Pr(N < 20*j | B = j, X) = 0 when j = 1, else it equals 1 for all other j.

Again, suppose X says N is in set {20, 21, …, 40}. Starts to get interesting. I leave this one as a homework, too.

More about the DA is in my book Uncertainty.

February 7, 2018 | 1 Comment

Free Probability-Statistics Class: Predictive Case Study 1, Part XI


We left off with comparing the standard, out-of-the-box linear regression with our multinomial predictive observable model. The great weaknesses of the regression were probability leakage (giving positive probability to impossible values) and that the normal gives densities and not probabilities. We can’t fix the leakage with this model (it’s a built-in shortcoming of this model), but we can generate probabilities.

Now the densities are predictions from the normal regression model, but they are not in a form that can be used. In order to create probabilities from densities, we need to make a decision. The densities are of course easily transformed into cumulative distributions, which are probabilities, but they will give positive probabilities to an infinity of results (all numbers along the continuum). We only care about our decision points, which for our fictional Dean are the five values 0, 1, 2, 3, 4.

The decision we need to make is in how to cut the infinity into 5 blocks. There is of course no unique way to do this. But it may be reasonable to cut the values at the midpoints of the five values. For example, given the regression, the decision probability of having a CGPA = 0 would be the regression probability of being between 0 and 0.05. For 1, it would be 0.5 to 1.5, and so on. That’s easy to do.

Mixing code and output in the obvious way:

caps = d[-length(d)] + diff(d)/2
caps = c(min(d),caps,max(d))

# output
> caps
[1] 0.0 0.5 1.5 2.5 3.5 4.0

I used caps because R has a native cut function inconvenient here (it outputs factors, and we want numbers). Let’s next recreate the picture of the predictions for good and bad students. Then we’ll add on top of it the regression probabilities.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4)) #bad and good student
a=predict(fit, newdata = y, type='prob')$p #our mulitnomial moels

plot(levels(x$cgpa), a[1,],type='h',xlab='College GPA',ylab='Probability',ylim=c(0,1.1*max(a)),lwd=5)
  lines(levels(x$cgpa), a[2,],type='h',lty=2,col=2,lwd=3)
  legend('topright', c("SAT =  400, HGPA = 1","SAT = 1500, HGPA = 4"),lty=1:2,col=1:2,bty='n',lwd=2) 
s1 = obs.glm(fit.n,y[1,]) # the prediction from the regression
s2 = obs.glm(fit.n,y[2,])
 m = s1$central-s2$central  # plotting up the probability DENSITIES, NOT PROBABILITIES
 s = sqrt(s1$scale^2+s2$scale^2)
 w = seq(min(s1$central,s2$central)-3*s, max(s1$central,s2$central)+3*s, length.out = 100)

# regression probabilities at the caps
p = matrix(0,2,length(levels(x$cgpa))) # storage
for (i in 1:2){
     d=pnorm(caps,b$central,b$scale) # the predictions
     p[i,] = diff(d) # d[1] prob < 0; 1-(sum(diff(d))+d[1]) prob >4

Green solid is for the bad student, regression model; blue dashed is for the good student, regression model. You can see the spikes follow, more or less, the densities. The black solid is our original multinomial model for the bad student, and the red dashed the good student.

Which model is better? There is no way to tell, not with this plot. They are simply two different predictions from two different models. The only way to gauge goodness is to—all together now—wait for new data which has never been seen or used in any way, and then compare both models’ predictions against what happened.

Nevertheless, we can gather clues and build another, different predictive model, which predicts how well our original models will perform. Be sure you understand the distinction! We have an original predictive model. Then we create a model on how well this original predictive model will perform. These are not the same models! This right here is another major departure of the predictive method over classical counterparts.

Let’s first look at the regression predictions a little more closely (mixing code and output):

> p
             [,1]        [,2]       [,3]        [,4]         [,5]
[1,] 2.152882e-01 0.568404758 0.12265228 0.002511169 4.024610e-06
[2,] 1.109019e-06 0.001036698 0.07554629 0.511387677 2.646423e-01

> rowSums(p)
[1] 0.9088604 0.8526141

The first row is the bad student (low SAT and HGPA) at each of the five CGPAs (here labeled by their matrix notation), and the second the good student (high SAT and HGPA). The rowSums(p) gives the total probability of all possibilities. This should be 1 (right?). It isn’t; it is less, and that is because of probability leakage.

You can see that leakage is conditional on the assumptions, just like probability. Leakage isn’t constant. (It also depends on how we define our caps/cut points.) We earlier guessed, looking only at the densities, that it wasn’t that bad. But it turns out to be pretty large. We’re missing about 10% probability from the first prediction, and about 15% from the second. These are the probabilities for impossible events.

The leakage does not mean the model is useless, but it is evidence the model is sub optimal. It also does not mean our multinomial model is better, but at least our multinomial model does not have leakage. The multinomial model, for instance, can give large probabilities to events that did not happen, while the leakage regression model gives decent probabilities to what happened. We’ll have to see.

And that’s what we’ll do next time. It’s too much to try and begin formal verification this week.

Homework Assume the Dean wants to do CGPAs at 0, 0.5, …, 3.5, and 4. Rerun the code from the beginning, and end with the plot seen above, complete with the regression model using the obvious cut points. Notice anything different?