
Category: Class – Applied Statistics

June 21, 2017 | 6 Comments

Free Statistics Class: Predictive Case Study 1, Part I

Regular readers know Uncertainty proposes we go back to the old way of examining and making conclusions about data, and eschew many innovations of the 20th Century. No p-values, no tests, no posteriors. Just plain probability statements about observables and a rigorous separation of probability from decision.

These criticisms you know (or ought to by now). So why not let’s do a case study or three, and take our time doing so. Case Study 1 uses the same data presented in Uncertainty. We’re interested in quantifying our uncertainty in a person’s end-of-first-year College GPA given we know their SAT score, high school GPA, and perhaps another measure we might have.

Now right off, we know we haven’t a chance to discover the cause—actually causes—of a person’s CGPA. These are myriad. A GPA is comprised of scores/grades per class, and the causes of the score in each class are multitudinous. How much one drank the evening before a quiz, how many hours put in on a term paper, whether a particular book was available at a certain time, and on and on.

It is equally obvious a person’s HGPA or SAT does not and cannot cause a person’s CGPA. Some of the same causes responsible for the HGPA, SAT might appear in the list of causes for CGPA, but it’s a stretch to say they’re identical. We could say “diligence” or “sloth” are contributory causes, but since these cannot be quantified (even though some might attempt such a maneuver), they cannot take their place in a numerical analysis.

Which brings up the excellent question: why do a numerical analysis at all?

Do not skip lightly over this. For in that query is the foundation of all we’ll do. We’re doing a numerical, as opposed to the far more common qualitative (which form most of our judgments), study because we have in mind a decision we will make. Everything we do must revolve around that decision. Since, of course, different people will make different decisions, the method of analysis would change in each case.

It should be clear the decision we cannot make is about what causes CGPA. Nor can we decide how much “influence” SAT or HGPA has on CGPA, because “influence” is a causal word. We cannot “control” for SAT or HGPA on CGPA because, again, “control” is a causal word, and anyway HGPA and SAT were in no way caused, i.e. controlled, by any experimenter.

All we can do, then, if a numerical analysis is our goal, is to say how much our uncertainty in CGPA changes given what we know about SAT or HGPA. Anything beyond that is beyond the data we have in hand. And since we can make up causal stories until the trump of doom, we can always come up with a causal explanation for what we see. But our explanation could be challenged by somebody else who has their own story. Presuming no logical contradiction (say a theory insists SAT scores that we observed are impossible), our “data” would support all causal explanations.

This point is emphasized to the point we’re sick of hearing it because the classic way of doing statistics is saturated in incorrect causal language. We’re trying to escape that baggage.

So just what decision do I want to make about CGPA?

I could be interested in my own or in another individual’s. Let’s start with that by thinking what CGPA is. Well, it’s a score. Every class, in the fictional college we’re imagining, awards a numerical grade, F (= 0) up to A+ (A = 4, A+ = 4.33, and so on). CGPA = the sum of the scores over all classes divided by the number of classes. That’s several numbers we need to know.

How many classes will there be? In this data, I don’t know. That is to say, I do not know the precise number for any individual, but I do know it must be finite. Experience (which is not part of the data) says it’s probably around 10-12 for a year. But who knows? We also can infer that each person has at least one class—but it could be that some have only one class. Again, who knows?

So number of classes is equal to or greater than one and finite. So, given the scoring system for grades, that means CGPA must be of finite precision. Suppose a person has only one class, then the list of possible CGPAs is 0, 0.33, …, 4, 4.33 and none other. If a person has two classes, then the possibilities are 0, 0.165, 0.33, and so forth. However many classes there are, the final list will be a discrete, finite set of possible CGPAs, which will be known to us given the premises about the grading system.
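The enumeration just described can be sketched in R. This is only a sketch: the grade scale below (0, 0.33, 0.67, 1, ..., 4, 4.33) is my assumption filled in from the "0, 0.33, …, 4, 4.33" pattern, and the name possible_cgpa is made up for illustration. The sums are done in whole hundredths of a point so that equal totals compare as equal.

```r
# Assumed grade scale, in hundredths of a point: 0, 33, 67, 100, ..., 400, 433
grades <- sort(c(0:4 * 100, 0:4 * 100 + 33, 0:3 * 100 + 67))

# All CGPAs attainable with a given number of classes (hypothetical helper)
possible_cgpa <- function(n_classes) {
  totals <- 0                                  # attainable grade totals so far
  for (k in seq_len(n_classes)) {
    # add every possible grade to every attainable total, keep distinct values
    totals <- unique(as.vector(outer(totals, grades, "+")))
  }
  sort(totals) / (100 * n_classes)
}

possible_cgpa(1)              # the 14 single-class grades under the assumed scale
head(possible_cgpa(2), 4)     # 0, 0.165, 0.33, 0.335
length(possible_cgpa(12))     # how many distinct CGPAs 12 classes allow
```

Change the grades vector to whatever scale your own college uses; the point is only that the list is finite and can be written down before any data are seen.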

Suppose a student had 12 classes, then his score (CGPA) might be (say) 2.334167. That’s 7 digits of precision! This number is one of lots of different possible grades (these begin with 0, 0.0275, 0.055, 0.0825, …). And there is more than one way to get some of these grades. A person with a CGPA of 2 might have had 12 classes with all C’s (= 2), or 12 with half A’s and half F’s; and there are other combinations that lead to CGPA = 2. And so now we have to ask ourselves just what about the CGPA we want to know.

We’ve reached our first branching point! And the end of today’s lesson. See if you can guess where this is going.

I’ll answer all pertinent questions, but please look elsewhere on the site (or in Uncertainty) for criticisms of classical methods. Non-pertinent objections will be ignored.

June 12, 2017 | 19 Comments

The Gremlins Of MCMC: Or, Computer Simulations Are Not What You Think

“Let me get this straight. You said what makes your car go?”

“You heard me. Gremlins.”

“Gremlins make your car go.”

“Look, it’s obvious. The car runs, doesn’t it? It has to run for some reason, right? Everybody says that reason is gremlins. So it’s gremlins. No, wait. I know what you’re going to say. You’re going to say I don’t know why gremlins make it go, and you’re right, I don’t. Nobody does. But it’s gremlins.”

“And if I told you instead your car runs by a purely mechanical process, the result of internal combustion causing movement through a complex but straightforward process, would that interest you at all?”

“No. Look, I don’t care. It runs and that it’s gremlins is enough explanation for me. I get where I want to go, don’t I? What’s the difference if it’s gremlins or whatever it is you said?”

MCMC

That form of reasoning is used by defenders of simulations, a.k.a. Monte Carlo or MCMC methods (the other MC is for Markov Chain), in which gremlins are replaced by “randomness” and “draws from distributions.” Like the car run by gremlins, MCMC methods get you where you want to go, so why bother looking under the hood for more complicated explanations? Besides, doesn’t everybody agree simulations work by gremlins—I mean, “randomness” and “draws”?

Here is an abbreviated example from Uncertainty which proves it’s a mechanical process and not gremlins or randomness that accounts for the success of MCMC methods.

First let’s use gremlin language to describe a simple MCMC example. Z, I say, is “distributed” as a standard normal, and I want to know the probability Z is less than -1. Now the cumulative normal distribution has no closed-form analytic expression, meaning I cannot just plug in numbers and calculate an answer. There are, however, many excellent approximations to do the job near enough, meaning I can with ease calculate this probability to reasonable accuracy. In R, typing pnorm(-1) does so, and gives 0.1586553. This gives us something to compare our simulations to.

I could also get at the answer using MCMC. To do so I randomly—recall we’re using gremlin language—simulate a large number of draws from a standard normal, and count how many of these simulations are less than -1. Divide that number by the total number of simulations, and there is my approximation to the probability. Look into the literature and you will discover all kinds of niceties to this procedure (such as computing how accurate the approximation is, etc.), but this is close enough for us here. Use the following self-explanatory R code:


n = 10000
z = rnorm(n)
sum(z < -1)/n

I get 0.158, which is for applications not requiring accuracy beyond the third digit peachy keen. Play around with the size of n: e.g., with n = 10, I get for one simulation 0.2, which is not so hot. In gremlin language, the larger the number of draws the closer will the approximation "converge" to the right answer.

All MCMC methods are the same as this one in spirit. Some can grow to enormous complexity, of course, but the base idea, the philosophy, is all right here. The approximation is seen as legitimate not just because we can match it against a near-analytic answer, because we can't do that for any situation of real interest (if we could, we wouldn't need simulations!). It is seen as legitimate because of the way the answer was produced. Random draws imbued the structure of the MCMC "process" with a kind of mystical life. If the draws weren't random---and never mind defining what random really means---the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.

Of course, nobody speaks in this way. Few speak of the process at all, except to say it was gremlins; or rather, "randomness" and "draws". It's stranger still because the "randomness" is all computer-generated, and it is known computer-generated numbers aren't "truly" random. But, somehow, the whole thing still works, like the randomness candle has been swapped for a (safer!) electric version, and whatever entities were watching over the ceremony were satisfied the form has been met.

Mechanics

Now let's do the whole thing over in mechanical language and see what the differences are. By assumption, we want to quantify our uncertainty in Z using a standard normal distribution. We seek Pr(Z < -1 | assumption). We do not say Z "is normally distributed", which is gremlin talk. We say our uncertainty in Z is represented using this equation by assumption.

One popular way of "generating normals" (in gremlin language) is to use what's called a Box-Muller transformation. Any algorithm which needs "normals" can use this procedure. It starts by "generating" two "random independent uniform" numbers U_1 and U_2 and then calculating this creature:

Z = \sqrt{-2 \ln U_1} \cos(2 \pi U_2),

where Z is now said to be "standard normally distributed." We don't need to worry about the math, except to notice that it is written as a causal, or rather determinative, proposition: "If U_1 is this and U_2 is that, Z is this with certainty." No uncertainty enters here; U_1 and U_2 determine Z. There is no life to this equation; it is (in effect) just an equation which translates a two-dimensional straight line on the interval 0 to 1 (in 2-D) to a line with a certain shape which runs from negative infinity to positive infinity.

To get the transformation, we simply write down all the numbers in the paired sequence (0.01, 0.01), (0.01, 0.02), ..., (0.99, 0.99). The decision to use two-digit accuracy was mine, just as I had to decide n above. This results in a sequence of pairs of numbers (U_1, U_2) of length 9801. For each pair, we apply the determinative mapping of (U_1, U_2) to produce Z as above, which gives (3.028866, 3.010924, ..., 1.414971e-01). Here is the R code (not written for efficiency, but transparency):


ep = 0.01 # the (st)ep
u1 = seq(ep, 1-ep, by = ep) # gives 0.01, 0.02, ..., 0.99
u2 = u1

z = NA # start with a placeholder; z grows as values are assigned
k = 0 # just a counter
for (i in u1){
  for (j in u2){
    k = k + 1
    z[k] = sqrt(-2*log(i))*cos(2*pi*j) # the transformation
  }
}
z[1:10] # shows the first 10 numbers of z

The first 10 numbers of Z map to the pairs (0.01, 0.01), (0.01, 0.02), (0.01, 0.03), ..., (0.01, 0.10), because the inner loop runs over U_2. There is nothing at all special about the order in which the (U_1, U_2) pairs are input. In the end, as long as the "grid" of numbers implied by the loop is fed into the formula, we'll have our Z. We do not say U_1 and U_2 are "independent". That's gremlin talk. We speak of Z in purely causal terms. If you like, try this:

plot(z)
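The claim that input order does not matter is easy to check. Here is a small sketch, not part of the original example: it builds the same 99 x 99 grid with expand.grid, feeds the pairs in forward and in reverse order, and compares the results. The names box_muller, z_fwd and z_rev are mine.

```r
ep <- 0.01
u  <- seq(ep, 1 - ep, by = ep)
grid <- expand.grid(u1 = u, u2 = u)          # all 99 x 99 pairs

# the same determinative mapping as in the loop
box_muller <- function(u1, u2) sqrt(-2 * log(u1)) * cos(2 * pi * u2)

z_fwd <- box_muller(grid$u1, grid$u2)
backwards <- rev(seq_len(nrow(grid)))        # same pairs, reverse order
z_rev <- box_muller(grid$u1[backwards], grid$u2[backwards])

identical(sort(z_fwd), sort(z_rev))          # TRUE: same set of Z values
sum(z_fwd < -1) == sum(z_rev < -1)           # TRUE: same count below -1
```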

We have not "drawn" from any distribution here, neither uniform nor normal. All that has happened is some perfectly simple math. And there is nothing "random". Everything is determined, as shown. The mechanical approximation is got the same way:

sum(z < -1)/length(z) # the denominator counts the size of z

which gives 0.1608677, which is a tad high. Try lowering ep, which is to say, try increasing the step resolution and see what that does. It is important to recognize the mechanical method will always give the same answer (with same inputs) regardless of how many times we compute it. Whereas the MCMC method above gives different numbers. Why?
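For those who want to try several step sizes at once, here is one way to wrap the mechanical calculation in a function. It is a sketch only: mech_prob and the particular step sizes are my own choices, and outer() stands in for the double loop (it visits the same grid, in a different order).

```r
# Mechanical approximation to Pr(Z < cutoff) on a grid with step ep
mech_prob <- function(ep, cutoff = -1) {
  u <- seq(ep, 1 - ep, by = ep)
  # apply the transformation to every (u1, u2) pair on the grid
  z <- as.vector(outer(u, u, function(u1, u2) sqrt(-2 * log(u1)) * cos(2 * pi * u2)))
  mean(z < cutoff)                      # same as sum(z < cutoff)/length(z)
}

steps <- c(0.05, 0.01, 0.005, 0.002)    # illustrative step sizes
data.frame(ep = steps, approx = sapply(steps, mech_prob), target = pnorm(-1))
```

Run it twice: the table does not change, because nothing in it can change.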

Gremlins slain

Here is the gremlin R code, which first "draws" from "uniforms", and then applies the transformation. The ".s" are to indicate simulation.

n = 10000
u1.s = runif(n)
u2.s = runif(n)
z.s = sqrt(-2*log(u1.s))*cos(2*pi*u2.s)
sum(z.s < -1)/n

The first time I ran this, I got 0.1623, which is much worse than the mechanical, but the second I got 0.1589 which is good. Even in the gremlin approach, though, there is no "draw" from a normal. Our Z is still absolutely determined from the values of (u1.s, u2.s). That is, even in the gremlin approach, there is at least one mechanical process: calculating Z. So what can we say about (u1.s, u2.s)?

Here is where it gets interesting. Here is a plot of the empirical cumulative distribution of U_1 values from the mechanical procedure, overlaid with the ECDF of u1.s in red. It should be obvious the plots for U_2 and u2.s will be similar (but try!). Generate this yourself with the following code:


plot(ecdf(u1),xlab="U_1 values", ylab="Probability of U1 < value", xlim=c(0,1),pch='.')
lines(ecdf(u1.s), col=2)
abline(0,1,lty=2)

The values of U_1 are a rough step function; after all, there are only 99 values, while u1.s is of length n = 10000.

Do you see it yet? The gremlins have almost disappeared! If you don't see it---and do try and figure it out before reading further---try this code:

sort(u1.s)[1:20]

This gives the first 20 values of the "random" u1.s sorted from low to high. The values of U_1 were 0.01, 0.02, ... automatically sorted from low to high.

Do you see it yet? All u1.s is is a series of ordered numbers on the interval from 1e-6 to 1 - 1e-6. And the same for u2.s. (The 1e-6 is R's native display resolution for this problem; this can be adjusted.) And the same for U_1 and U_2, except the interval is a mite shorter! What we have are nothing but ordinary sequences of numbers from (roughly) 0 to 1! Do you have it?

The answer is: The gremlin procedure is identical to the mechanical!

Everything in the MCMC method was just as fixed and determined as the other mechanical method. There was nothing random, there were no draws. Everything was simple calculation, relying on an analytic formula somebody found that mapped two straight lines to one crooked one. But the MCMC method hides what's under the hood. Look at this plot (with the plot screen maximized; again, this is for transparency not efficiency):

plot(u1.s,u2.s, col=2, xlab='U 1 values',ylab='U 2 values')
u1.v = NA; u2.v = NA
k = 0
for (i in u1){
  for (j in u2){
    k = k + 1
    u1.v[k] = i
    u2.v[k] = j
  }
}
points(u1.v,u2.v,pch=20) # these are (U_1, U_2) as one long vector of each

The black dots are the (U_1, U_2) pairs and the red the (u1.s, u2.s) pairs fed into the Z calculation. The mechanical is a regular grid and the MCMC-mechanical is also a (rougher) grid. So it's no wonder they give the same (or similar) answers: they are doing the same things.

The key is that the u1.s and u2.s themselves were produced by a purely mechanical process as well. R uses a formula no different in spirit from the one for Z above, which if fed the same numbers always produces the same output (stick in a known W which determines u1.s, etc.). The formula is called a "pseudorandom number generator", where by "pseudorandom" they mean not random; purely mechanical. Everybody knows this, and everybody knows this, too: there is no point at which "randomness" or "draws" ever comes into the picture. There are no gremlins anywhere.
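You can watch the generator behave mechanically yourself. In this sketch set.seed() plays the role of the known input; the seed value 2017 is arbitrary and the names first and second are mine.

```r
set.seed(2017)             # fix the generator's starting state (arbitrary value)
first  <- runif(5)
set.seed(2017)             # same starting state again
second <- runif(5)

identical(first, second)   # TRUE: same input, same "random" numbers
```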

Now I do not and in no way claim that this grunt-mechanical, rigorous-grid approach is the way to handle all problems or that it is the most efficient. And I do not say the MCMC car doesn't get us where we are going. I am saying, and it is true, there are no gremlins. Everything is a determinate, mechanical process.

So what does that mean? I'm glad you asked. Let's let the late-great ET Jaynes give the answer. "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought."

We can believe in gremlins if we like, but we can do better if we understand how the engine really works.

There are lots more details, like the error of approximation and so forth, which I'll leave to Uncertainty (which does not have any code).

Bonus code

The value of -1 was nothing special. We can see the mechanical and MCMC procedures produce normal distributions which match almost everywhere. To see that, try this code:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal")
s = seq(-4,4,by=ep)
lines(s,pnorm(s),lty=2,col=2)
lines(ecdf(z.s),lty=3,col=3)

This is the (e)cdf of the distributions: mechanical Z (black solid), gremlin (green dotted), analytic approximation (red dashed). The step in the middle is from the crude step in the mechanical. Play with the limits of the axis to "blow up" certain sections of the picture, like this:

plot(ecdf(z),xlab="Possible values of Z", ylab="Probability of Z < value", main="A standard normal", xlim=c(-1,1))
s = seq(-4,4,by=ep)
lines(s,pnorm(s),lty=2,col=2)
lines(ecdf(z.s),lty=3,col=3)

Try xlim=c(-4,-3) too.

Homework

Find the values of U_1 and U_2 that correspond to Z = -1. Using the modern language, what can you say about these values in relation to the (conditional!) probability Z < -1? Think about the probabilities of the Us.

What other simple transforms can you find that correspond to other common distributions? Try out your own code for these transforms.

May 2, 2017 | 35 Comments

The Solution To The Doomsday Argument

The Doomsday Argument! No, not global warming. The one that predicts the total number of humans who will ever live. It’s also called the Carter catastrophe; the same Carter famous for the anthropic principle. Here’s the Wiki article (HT to reader Nate West).

To solve this problem, the only rule we need is this: All probability is conditional—and conditional only on the information provided. The idea is that you’re born, you notice your birth, and you reason that your place in the order of all human births is nothing special. From that, can we conclude how many more of us we expect? This situation is analogous, at first, to balls in a bag.

Our evidence is X = “There are N balls labeled 1 through N in a bag, from which only one will be removed.” The probability of Y = “The ball will have label j, where j is from 1 to N inclusive” is 1/N, via the statistical syllogism. We deduce via the language used that N is finite (no bag can hold an infinite amount of any real thing).

Reach into the bag and pull out the ball B. It will have a label; call it B = j. Our evidence is now augmented: we have in toto X’ = “X and The ball has label j”. What can we say about N? Well, given X’, the probability N is less than j is 0, and the probability N is at least j is 1, both of which are obvious. But what about these interesting and relevant probabilities (both given X’, naturally): “N equals j”, “N is greater than j”?

We do not know.

Why? Because there is no information in X or X’ about the possible values of N, except that N must be at least equal to j (given X’ and not X), information which is deduced. Now mentally you might add information that is not provided, by, say, thinking to yourself, “This j is awfully low and that’s such a big bag; therefore, surely N is large.” Or “I know this Briggs, who is a trickster. He made the bag big on purpose. N is small.” Or anything, endlessly. None of these additions are part of the problem (the stated evidence), however, and all such moves are “illegal” in probability. You cannot use information not provided. It is against the law!

Now suppose we legally augment our X and, for fun, say that N is some number in the set S. We don’t need to know much about S, except that it exists, is finite, and contains only natural numbers. Thus, X now equals “There are N balls labeled 1 through N in a bag, from which only one will be removed; and N is a number in the set S.” Given X, the probability “N = s_i” (s_i being any one member of S) is 1/#S, where “#S” stands for the number of elements in S (its cardinality, if you like), and where I’ll assume the s_i are increasing in i. What about the probability that the ball withdrawn has label j? Here it gets tricky, so let’s be careful.

The key lies in realizing the bounds of j are between 1 and the largest value of S. First suppose N = s_1. We want:

Pr(B = j | N = s_1, X).

This is 1/s_1 for j = 1 to s_1, and 0 for all those j up to s_I (the largest value of S). Now

Pr(B = j | N = s_2, X)

equals 1/s_2 for j = 1 to s_2, and 0 for all values up to s_I. From this, we notice we have to be careful about specifying j precisely. From total probability we know

Pr(B = j | X ) = Pr(B = j | N = s_1, X) * Pr(N=s_1|X) + … + Pr(B = j | N = s_I, X) * Pr(N=s_I|X)

and where knowledge of j is relevant to the probability. If j = 1, then

Pr(B = 1 | X ) = [(1/s_1) + … + (1/s_I)] * (1/#S)

but if j is a number larger than, say, s_1 but no larger than s_2, then (call this j’)

Pr(B = j’ | X ) = [0 + (1/s_2) + … + (1/s_I)] * (1/#S)

and so forth for other j (don’t forget S is known).
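The total-probability sum is short enough to code directly. What follows is a sketch with a made-up S = {3, 5, 8}, chosen only because it is small; pr_B is a hypothetical helper name, and the 1/#S weights are those deduced from X above.

```r
S <- c(3, 5, 8)   # hypothetical set, chosen only for illustration

# Pr(B = j | X) by total probability over the members of S
pr_B <- function(j, S) {
  given_N <- ifelse(j <= S, 1 / S, 0)   # Pr(B = j | N = s_i, X)
  sum(given_N * (1 / length(S)))        # weight each by Pr(N = s_i | X) = 1/#S
}

probs <- sapply(1:max(S), pr_B, S = S)  # j = 1, ..., s_I
probs
sum(probs)                              # the probabilities over all j sum to 1
```

Notice the zeros entering as j passes each member of S: for j = 4 the term for N = 3 drops out, just as in the j’ formula.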

The ball is withdrawn and B = j. Can we now say anything more about N? As before, there is 0 probability N is less than j, and so if j is greater than some s_i, there is 0 probability N equals those s_i. We can do more, using the good reverend’s rule, but it’s still tricky:

Pr(N = s_i | B = j, X) = Pr( B = j | N = s_i, X) * Pr( N = s_i | X) / Pr( B = j | X).

First suppose j = 1, then

Pr(N = s_i | B = 1, X) = [(1/s_i) * (1/#S)] / ([(1/s_1) + … + (1/s_I)] * (1/#S))

     = (1/s_i) / [ 1/s_1 + 1/s_2 + … + 1/s_I]

If you stare at that fraction for a moment, and recalling that the s_i are given in increasing number, you realize that values of smaller N are more probable than larger values. As a for-instance, suppose S = {20,21,…,40}, which has cardinality 21. Given X, the probability “B = 1” is (1/20 + 1/21 + … + 1/40) * (1/21) = 0.03480016. Thus Pr(N = 20 | B = 1, X) = 0.06841786, Pr(N = 21 | B = 1, X) = 0.06515986, etc. out to Pr(N = 40 | B = 1, X) = 0.03420893. Notice that these probabilities do not change for j between 1 and 20.

In this same example, next let j = 21, then

Pr(N = s_i | B = 21, X) = Pr( B = 21 | N = s_i, X) * Pr( N = s_i | X) / Pr( B = 21 | X).

For “N = 20”, the first term on the right equals 0, and so Pr(N = 20 | B = 21, X) = 0, as desired. For “N = 21”, we have

Pr(N = 21 | B = 21, X) = Pr( B = 21 | N = 21, X) * Pr( N = 21 | X) / Pr( B = 21 | X).

     = [ (1/21) * (1/21) ] / ([0 + 1/21 + 1/22 + … + 1/40] * (1/21))

     = (1/21) / [0 + 1/21 + 1/22 + … + 1/40] = 0.06994537,

and likewise Pr(N = 22 | B = 21, X) = 0.06676604, out to Pr(N = 40 | B = 21, X) = 0.03672132.
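The whole calculation for S = {20, 21, …, 40} fits in a few lines of R, which makes it easy to check the numbers or try other j. This is a sketch; posterior_N is a name I made up, and the equal 1/#S weights are the ones deduced from X.

```r
S <- 20:40

# Pr(N = s_i | B = j, X) for every member of S, via Bayes's rule
posterior_N <- function(j, S) {
  prior <- rep(1 / length(S), length(S))   # Pr(N = s_i | X) = 1/#S
  lik   <- ifelse(j <= S, 1 / S, 0)        # Pr(B = j | N = s_i, X)
  joint <- lik * prior
  joint / sum(joint)                       # divide by Pr(B = j | X)
}

post21 <- posterior_N(21, S)
post21[S == 20]              # 0: N = 20 is impossible once B = 21 is seen
round(post21[S == 21], 8)    # 0.06994537
round(post21[S == 40], 8)    # 0.03672132

posterior_N(1, S)            # the j = 1 case; identical for any j from 1 to 20
```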

Collecting all these tidbits leads to the conclusion that smaller (but not impossible) values of N are always more likely than larger, regardless of the value of j. Why? That’s easy. Before we see B, the possible values of N are s_1, s_2, and so on up to s_I, each equally likely. After we see B, some values of N (from S) might now be impossible, but since j will always be less than any remaining possible larger members of S, smaller values of N are closer to j than larger, thus smaller values are more likely. Simple as that.

What does this have to do with Doomsday? Everything. The crucial step was in conjuring the set S. Where did that come from? I made it up. S was known throughout the second part of the calculations and unknown through the first part. When S was unknown, N was unknown, and there was nothing we could say about N except that it had to be as large as j. I mean nothing in its literal, logical sense. In that case, given only that you witness your birth order, your B = j that is, we are blind about the future of humanity.

When S was known, we had a rough idea of what N was, which we tightened slightly by learning where N might not be (by removing the ball). But for an S with large cardinality, we aren’t learning much by viewing B. S is what we started with, and something very like S is what we ended with. But this is cheating because I made the S up. We wanted N, of which we are ignorant, and then we pretend we know an S that tells us something but not everything about N! All the other solutions to the Doomsday argument I have seen also make up S, but then they add an extra layer of cheating. We posited a discrete finite S, from which we deduced that N might equal any of its members with equal probability (before seeing B). But those who conjure up more creative S often fix the set so that smaller values of S are more likely (hence smaller values of N are more likely, even before we see B). Some form of exponential “distribution” for S is popular. Some even use non-probability arguments (called “improper priors”), which is triply cheating.

Once S is fixed, however it is fixed, the calculations flow in the same manner as above, but it’s easy to see that smaller values of N are always going to be more likely than larger, and that’s because the j will always be smaller (or no greater) than the maximum value of S. And given that some let S toodle out to infinity, it’s no shock at all to discover that N is not expected to be big.

Thus the Doomsday Argument is really a non-problem which includes its own answer in its formulation, which is cheating. Of course, it makes perfect sense to ask the question of how many of us there will be left, but trying to discover the answer using only your birth order is doomed to failure (beyond proving that N must be at least as large as j). Since all probability is conditional on only the information supplied, many different answers for our future numbers are possible. It’s easy to think of probative information: demographics, politics, epidemics, apocalypses (rocks from the sky, Christ’s return, etc.), and on and on. (Of course, some of these sets of information may lead to the guesses people have made about S.) I do not (now) have a good answer how to use these to put uncertainty on (the real) N.

Update Bayes’s theorem isn’t all that.

The difficulty lies in misunderstanding Bayes’s theorem, which some mistakenly write like this:

Pr(N = s_i | B = j) = Pr( B = j | N = s_i) * Pr( N = s_i ) / Pr( B = j ),

where the evidence about N in X is left off (finding the denominator is no problem because Pr( B = j ) = SUM_i Pr( B = j | N = s_i) * Pr( N = s_i )). Pr( N = s_i ) is thus “naked” (and violates the rule that all probability is conditional), yet users of Bayes’s theorem are trained to posit “priors” like this, and so posit one they do. It seems, say critics of the theory, that these priors are pulled from thin air. The critics are right. It’s completely arbitrary to conjure a Pr( N = s_i ), and so the resulting Pr(N = s_i | B = j) cannot be trusted. (I have much more about this kind of thing in my forthcoming book.)

Of course, I made up my own “prior”, but referenced as being a deduction from X. The probability Pr(N = s_i | B = j, X) is thus true. The attention then focuses on X, where it belongs. Why this X? No reason at all. If we’re after the best information about N, that is what should go into X. But it has to be information that is not N itself, like my S was. My S was merely a presumption that I already knew a lot about N; it was N by proxy, but a fuzzy proxy. Cheating, like I said.

It’s not Bayes’s theorem that’s the problem. It works just fine when we supplied information in X about S. But it also worked dandy when X was just “There are N balls labeled 1 through N in a bag, from which only one will be removed.” I didn’t display the equation at the time, but it’s there. I’ll leave it as homework for you to show.

Update I’m graduating a comment I made in reply to Steve Brookline to the main post, because it highlights what I think is the central error people make in the DA. SB’s comments should be examined for orientation. I’m repeating them here in concise form.

A standard application of the DA starts by asking for this: Pr(N < 20j) (the 20 comes from the magic number in statistics). Note the missing conditions. Accepting the bare notation, then Pr(N < 20j) = Pr(N/20 < j) = Pr(j > N/20) = 1 – Pr(j <= N/20) = 1 – 0.05 = 0.95. It is said Pr(j <= N/20) = 0.05 because j is “uniform” or is “uniformly distributed”, as if probability has life. The fatal error has been made, because we notice that this result appears to hold regardless what value N or j has. But there just is no such thing as “Pr(N < 20j)”.

We have to be careful with the notation. There is no such thing as unconditional probability, and when you drop the conditions, which often makes manipulating the equations easier, you run the risk of introducing error, which is what happens in the standard doomsday argument. Here’s what we want.

Pr(N < 20*j | B = j, X) = Pr(B = j | N < 20*j, X) * Pr(N < 20*j | X) / Pr(B = j | X).

(For why we want this, see SB’s comments.) Now X can be anything relevant; it at least says there are balls 1 through N, but it must also say something about N (directly or implied).

Suppose X contains information that N is in the set {1, 2, …, 19}. Then Pr(N < 20*j |X) = 1 for any j. Never forget j runs from 1 to N, which is where things go awry: j is (in the classical language) dependent on N; in the new (and proper) language, knowledge of N is relevant to knowledge of j.

This is it: it appears, because of loose notation, many forget that j and N are related. Steve used the notion of cutting a string; but of course, that can only be done quantumly (i.e. discretely), so the example is the same. Knowledge of the place j where you cut depends on knowledge of the length of the string N, and vicesy versey.

You can work it out, but the result is the right-hand-side is 1/1, and thus Pr(N < 20*j | B = j, X) = 1, as expected. So right here is all the proof I need to show that at least one “prior” on N ruins that 95% finding.

Here’s another one. Suppose X says N = 20. Then Pr(N < 20*j |X) = 0 for j = 1, and Pr(N < 20*j |X) = 1 for j > 1. Again, you can work it out, but it amounts to the same thing, that Pr(N < 20*j | B = j, X) = 0 when j = 1, else it equals 1 for all other j.
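Both of these cases can be checked numerically. The sketch below assumes, as in the ball examples, that X makes every member of S equally probable for N before B is seen (so the 1/#S weights cancel); pr_less_20j is a made-up name.

```r
# Pr(N < 20*j | B = j, X), where X says N is in S with equal probability
pr_less_20j <- function(j, S) {
  lik  <- ifelse(j <= S, 1 / S, 0)   # Pr(B = j | N = s_i, X); the 1/#S prior cancels
  post <- lik / sum(lik)             # Pr(N = s_i | B = j, X)
  sum(post[S < 20 * j])              # add up the members of S below 20*j
}

pr_less_20j(1, 1:19)   # 1: every possible N is below 20
pr_less_20j(5, 1:19)   # 1: still 1 for any other allowable j
pr_less_20j(1, 20)     # 0: N = 20 is not less than 20
pr_less_20j(2, 20)     # 1: N = 20 is less than 40
```

Nowhere does 0.95 appear; the answer depends on what X says about N.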

Again, suppose X says N is in set {20, 21, …, 40}. Starts to get interesting. I leave this one as a homework, too.

More about the DA is in my book Uncertainty.