# Rumsfeld and Keynes on Probability

In 2002, ex-sailor and then Secretary of Defense Donald Rumsfeld said the following:

As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.

Many in the mainstream press chuckled warmly upon hearing this. “What a typical, right-wing dolt!” they thought to themselves. Some said so out loud. For a brief while, it was enough to induce a laugh among our betters to merely say, “Unknown unknowns.” It might still work. Try it.

Civilian reporter Hart Seely was so tickled that he arranged Rumsfeld’s words in the form of verse. He did so because he thought that the words resembled “jazzy, impromptu riffs.”

Luckily, Mr Seely was known as a “humorist”, so his readers knew that they should laugh. And so they did. (Read a sample of verbal chuckles accompanying this YouTube video.)

However, not only was Mr Rumsfeld right in what he said, he expressed what he said beautifully and concisely. Worse news for our cultural guardians, it turns out that Rumsfeld was echoing the sainted John Maynard Keynes.

In 1937, Keynes said:

By ‘uncertain’ knowledge I do not mean merely to distinguish what is known for certain from what is only probable…The sense in which I am using the term is that in which the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of invention, or the position of private wealth owners in the social system in 1970. About these matters there is no scientific basis on which to form any calculable probability whatever. We simply do not know.

This is from his 1937 article “The General Theory of Employment”, and it builds on his shockingly neglected *Treatise on Probability*. For Keynes, probability was a branch of logic. He divided statements into three rough, overlapping categories: those which could have a quantitative probability assessed, those that were only comparative, and those which are impossible to quantify.

An example of the first kind, well known to regular readers. Premise: We have a six-sided object, just one side of which is labeled “6”, which will be tossed once; only one side can show. Given that premise, the probability of the statement, “A ‘6’ will show” is 1 in 6.

An example of the first kind with data. Premise: the data of age of death and select biological characteristics for a large group of people. Given that premise (data can be a premise), and another about how that data is modeled, the probability of statements like, “John Smith will die aged 70 or older” can be computed to a reasonable degree. These probabilities allow insurance agents to accost you.

A comparative example. Premise: From a bag in which there are more of A than B, one item will be drawn. Given that premise, we can do no better than to say that drawing an A is more likely than drawing a B. Many examples with vague data will suggest themselves.

A Rumsfeldian example of unknown unknowns. Before the fauna of Australia was well known, the swan was often used to illustrate logical arguments. Premise: All swans are white; Art is a swan. The conclusion, “Art is white” logically follows (it has probability 1).

While the European discovery of the black swan in no way invalidates the “white swan” argument, because the premise is assumed true, finding that not all swans were white was a bit of a shock. Nobody expected it. The surprise was such that philosophers dropped swans from their menagerie and switched to the humanity of Socrates for their stock example.

Nassim Nicholas Taleb used this example in his popular book *The Black Swan*. He, too, cautions that people are too sure of themselves and that many events remain unpredictable (though he at least intimates that he has developed an investment system that is more immune to uncertainty than systems offered by others).

Of the three Keynesian categories, I believe it is the third which is the largest. That is, it is the unknown unknowns that outnumber all other kinds of knowledge. I cannot prove this: there can obviously be no proof. But all experience, and evidence that the universe is vast and ancient, suggests it is so.

Explicit acknowledgment of this was given by Harold Macmillan when he was asked what was most likely to blow governments off course. “Events, dear boy. Events.” This is no different than saying, “Unknown unknowns, dear boy.”

Looking into the future is exactly like peering through a thick fog. You can see what is right next to you with great detail. Items a little further off lose detail, while those beyond are completely opaque. And what you can see is different from what your neighbor can see.

A couple of us wrote an Interoffice Memo on this issue back in the early 1970s. We decided that x is very likely the most well-known unknown followed closely by y, and that z appears frequently as an unknown.

A few letters from the Greek alphabet are also well-known unknowns. We finally concluded that Xi is very likely the most unknown unknown because most people don’t even know its name and simply called it squiggly.

(It is much better when you can actually display squiggly in the text.)

Surely the most potentially dangerous bits of knowledge are ‘the things we know that ain’t so.’

And, I would submit, for most of us these outnumber all but unknown unknowns in our store of knowledge.

An economist, an engineer, and a statistician are riding a train, having just crossed the English–Scottish border.

They look out the window and see a black sheep grazing on a hillside near the track.

“I see that sheep in Scotland are black”, says the economist.

“No”, corrects the engineer, “some sheep in Scotland are black”.

The statistician replies, “We can only say that there is at least one sheep in Scotland that is black on at least one side”.

For some reason Briggs’ post reminded me of this old chestnut.

Interesting take on Rumsfeld, Keynes and Taleb. We might posit that there exists a fourth category: the unknown knowns, representing the things we think we know but really don’t. An example of this might be climate science!

Makes sense to this typical right wing dolt.

Rumsfeld’s statement was a useful IQ test – anyone who thought it stupid was stupid. It’s nice to think that Rumsfeld had a useful purpose.

Quite. I am no fan of Rumsfeld, yet another politico bureaucrat who imagines that his superior intelligence untempered by experience or study qualified him to run a war, in fact that most dangerous of all wars, a military adventure.

But at the time I pointed out to people that his statement quoted above was a pretty succinct summary. And a good one.

Back in the 60’s and 70’s there was much ado here about lateral thinking and several charlatans like Uri Geller pushing the paranormal etc, with people speculating wildly about it all. In fact it is usually very difficult, even if you are familiar with stage magical techniques, to deconstruct a trick backwards from what you see unless it is merely a version of a well known method. At that time it was fashionable to ask scientists to try to do this, can’t be explained by science etc, but they are the least equipped by nature and training to do that: remember Sir Oliver Lodge?

And I used to point out, quite a bore on the subject I must have been, that if you do not know how the illusion has been done you do not know: all and any explanation from little green men from Mars to elaborate mechanical devices is equally valid. And equally useless.

One modern illusionist who uses this misdirection to great effect is Derren Brown, I don’t know whether he performs in the USA, but he is very entertaining even to jaundiced old eyes like mine.

This is not Occam’s razor by the way, although people tend to imagine that it is, rather as I used to repeat ad nauseam, if you know how it is done you do know, if you do not know you do not know.

Although speculating on such trivial things over a beer can be quite amusing.

Kindest Regards

Dan Hughes: Do you mean like this ƺ?

For some reason, what Rumsfeld said doesn’t make me chuckle, but this does. I can just imagine how flattered Keynes and Rumsfeld would feel by it. This post has the name Rumsfeld in it but is otherwise fine.

Aren’t there also unknown knowns? For example, the unproven Goldbach’s conjecture (every even number greater than 2 can be written as the sum of two primes). Maybe the conjecture should be classified as an unknown unknown.

Do we really know that we know that there are unknowns that we don’t know that we don’t know about, but think we know about but don’t?

“We have a six-sided object, just one side of which is labeled “6”, which will be tossed once; only one side can show. Given that premise, the probability of the statement, “A ‘6’ will show” is 1 in 6.”

I would say that this is simply an uninformed prior. This probability assessment represents nothing much about the object. It represents our state of knowledge – our mental model of the object – which is that we have no information that would lead us to believe that any one side is more likely to come up than another.

So in my Bayesian world probability is the language we use to speak about uncertainties, and talk of unknown unknowns is more complicated.

This doesn’t, by the by, make Rumsfeld or Keynes stupid, for both had obviously put more thought into it than your average journalist is capable of.

“We have a six-sided object, just one side of which is labeled “6”, which will be tossed once; only one side can show. Given that premise, the probability of the statement, “A ‘6’ will show” is 1 in 6.”

I agree with txslr’s comment. Let me make this clearer by stating an analogous experiment:

“We have an opaque bag of balls, each coloured red, orange, yellow, green, indigo, or blue. A single ball is drawn from the bag.” You surely cannot conclude that P(a red ball is drawn) = 1/6!

Your conclusion with the “six-sided object” is reached only by assuming the object is an unbiased cube.

— Rafe

Michel de Montaigne beat Will Rogers (sort of) by several centuries:

“Nothing is so firmly believed as that which is least known.”

==========================================

And now for something completely off-topic (because I don’t know where else to put it, and I thought you (William) would be interested):

Melody Gardot.

Since you seem such a fan of The Great American Songbook, I thought you’d be heartened to know that 24-year-old Ms Gardot has been proving that that vein is far from exhausted. Her latest, My One And Only Thrill, is a disc of classic torch ballads – complete with silky string arrangements – except that she wrote most of them (and her back-story is inspirational).

Just trust me & buy it. And then compare & contrast with Billy Strayhorn.

Rafe,

Nope. In fact, I agree that the probability that, given your premise, a red ball is drawn is 1/6. Incidentally, no probability is unconditional, so it is better to write it as P(red | Evidence), where the Evidence is your premise.

To say “unbiased” (I show in another paper) is to make your argument circular. What is “unbiased” but another way to say “each ball is equi-probable”?

I guess another way to look at it is, if the probability isn’t 1/6, what is it? More? Less? Why? If you have to pick a number, that number has to be 1/6.

But it still bothers me that there should be a single number at all. What if you have an unbiased die which has a 50% chance of having six sides and a 50% chance of having 4 sides – does it make sense to say that the chance of getting a 1 is 5/24? Why do we have to average it out, can’t we say there’s a 50% chance that the chance is 1/4 and a 50% chance that the chance is 1/6?

Sure, 5/24 is the expected probability in my example, but you’d frown upon anybody quoting a single number to represent data that clearly have two distinct clusters, and I don’t see why that argument shouldn’t also apply to probability statements drawn on uncertain or incomplete evidence.
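For readers who want to check the arithmetic, here is a minimal sketch of the two-scenario die described above. It assumes, as the comment does, a 50% chance of six sides and a 50% chance of four sides, and it treats the sides within each scenario as equi-probable. The names `scenarios`, `chance_given_sides` and `marginal` are illustrative only.

```python
from fractions import Fraction

# Assumed setup from the comment above: the die has six sides with
# probability 1/2 and four sides with probability 1/2.
scenarios = {
    6: Fraction(1, 2),  # number of sides -> probability of that scenario
    4: Fraction(1, 2),
}

# Chance of rolling a 1 inside each scenario, taking sides as equi-probable.
chance_given_sides = {sides: Fraction(1, sides) for sides in scenarios}

# Law of total probability: weight each conditional chance by its scenario.
marginal = sum(
    weight * chance_given_sides[sides] for sides, weight in scenarios.items()
)

print(marginal)            # 5/24
print(chance_given_sides)  # the two separate chances the single number averages
```

The single number 5/24 and the pair (1/6, 1/4) with their weights are both computed from the same premises; which of the two is the better summary is exactly the question being argued in the thread.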

Going back to the original point, then, do we have enough information to say 1/6, or should we just say we don’t know? If the latter, then this, too, is not something we can be quantitative about – it’s more like guessing that all swans are white, because we’ve never seen a black one. And that’s something you quoted as an unknown unknown. Following on from that, it’s hard to see how anything is really known in the real world – there’s almost always some piece of evidence that could turn up and change everything unexpectedly.

So maybe we hardly ever know anything, and should always admit that.

George,

You’re on the right path. Suppose our premise is: An empirically measurable event S might occur. Given that premise, the probability of S is greater than 0 and less than 1.

And you cannot do better.

In the n-sided object and bag-with-n-items cases, we have *positive, definite* evidence to compute a numerical probability. In the measurable-event S case, we do not.

Also, the premise “unbiased” is not needed, and if used leads to circularity.

Re: Briggs at 17 February 2010 at 6:03 am

I am not arguing that one must assume a flat distribution, merely that you must implicitly do so to deduce P(a red ball is drawn) = 1/6.

For my money, P(red | evidence) cannot be decided without further evidence or making some assumptions. Concluding P(red | evidence) = 1/6 seems to me like saying P(X is guilty | X is the only person on trial) = 1/2.

I believe I’ve read everything on your blog concerning P(whatever) = 1/6, but I can’t say I’ve seen anything concrete enough to turn into a proof.

To make things clear (and I am genuinely interested in this), could you present, say, a natural deduction proof of P(red | evidence) = 1/6? That would either convince everybody once and for all or allow us to put our finger on the contentious step in the argument.

Rafe,

It was hidden away under my Resume page. Read this paper, especially the section on “Mathematical Attempts.”

For your P(X is guilty | X is the only person on trial), I would say the answer is

0 < P(X is guilty | X is the only person on trial) < 1

and that is the best you can do. Given *that* premise.

If there is general interest, I can post the second permutation argument. It’s not easy, though.

I side with Rafe on this one. Saying the probability is exactly 1/6 is a pretty strong statement given the premise. Would you also say that the probability of getting k 6’s in n rolls is exactly C(n,k) (1/6)^k (1-1/6)^(n-k) ?

How is this different from saying that P(the die is unbiased| you are given a die)=1 ?
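The binomial expression in the question above can be evaluated exactly. The sketch below assumes what the formula itself assumes: independent rolls, each showing a 6 with probability 1/6. Whether those assumptions follow from the premise is the point in dispute. The function name `prob_k_sixes` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def prob_k_sixes(n: int, k: int, p: Fraction = Fraction(1, 6)) -> Fraction:
    """Binomial probability of exactly k sixes in n independent rolls,
    i.e. C(n,k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly one 6 in two rolls: 2 * (1/6) * (5/6).
print(prob_k_sixes(2, 1))  # 5/18

# Sanity check: the probabilities over k = 0..n sum to one.
total = sum(prob_k_sixes(10, k) for k in range(11))
print(total)  # 1
```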

SteveBrooklineMA,

With that premise, I would say that, yes. Why I believe this is best laid out in the paper I linked earlier. I’m sure we’ll talk more about this later.

William, thanks for the link to your paper; I’m printing it out now. Hopefully it will clear up my confusion. I had thought you were arguing that P(X = 1 | X in 1..n) = 1/n, but clearly that’s not the case for the P(X = guilty | X is on trial) problem.

Hi William,

I’ve just finished reading your paper and, assuming I’ve understood it, you argue that Williams’ statistical syllogism allows us to say P(X = heads | X = heads \/ X = tails) = 1/2. However, you immediately go on to say, “This probability assignment, made explicit in the form of the statistical syllogism, is *derived* from ASSUMING UNIFORM PROBABILITY across the individual events that make up the ‘sample space'” [emphasis mine]. Later on you say “The uniform probability assumption over events that is used to derive the statistical syllogism is just true”. Here I hit the same sticky spot: if it’s just true, then there’s no need to make it an assumption! But if it is true, where’s the proof?

Cheers,

— Rafe

I’ve read the paper too, and have questions similar to Rafe’s. We’ve all had an interesting discussion about this question before:

http://wmbriggs.com/blog/?p=1277

The statement “P(the die is unbiased | you are given a die)=1” where “die” here can even mean a more general six sided object as described above seems unreasonable. You’re certain all dice are unbiased?

SteveBrooklineMA & Rafe,

The language we use to describe evidence is crucial. So I would, as I say in the paper, never say “unbiased”, as all probability assignments become circular.

And the principle upon which the statistical syllogism relies is axiomatic. No proof. But neither has there been any disproof.

When I return to my books in a week or so, we’ll probably come back to this.

At the risk of belaboring a point, what the probability assessment of 1/6 reflects is not some sort of Platonic probability that is associated with that 6-sided die. It reflects your state of knowledge. I sometimes teach classes on probabilities and one of my favorite pedagogical tools is to have the class bet on the outcome of a coin toss. When I ask the class members to assess the probability of getting heads on the next toss I almost always get someone who says “If the coin is fair, it is 50%”. My rote response is “You’ve caught me. This coin is NOT fair. It yields one result much more than the other. Now, what is the chance that it will come up heads?”

A small amount of squirming now begins. Then I’ll say, “In fact, both sides of this coin are the SAME! NOW what is that chance that it will come up heads?”

If that doesn’t hammer the point home I’ll put the coin in a cup, shake it up and put the cup, with the coin inside, upside down on the desk. “Now the coin toss has ALREADY HAPPENED! What is the probability that it is heads?”

Somewhere along in there folks figure out that the probability assessment is a representation of their state of knowledge – a way of talking about uncertainty.

Amen, jl, exactly so.

Thanks jl, your comment helps me understand Briggs’ and your position.

I still have reservations though. I think what’s going on in your students’ heads during this (very cool) exercise is a confusion, or a blending or an identification of the concepts E(Pr(heads)) and Pr(heads). Here E() is some sort of not well-defined expectation, perhaps based on life experience. When you ask them for “the probability of getting heads” their understanding is that you are asking for Pr(heads) for a coin they know nothing about. Their experience tells them that Pr(heads) may take on many values, depending on the coin. There are many coins in their experience, not all of which have Pr(heads)=1/2 exactly. Thus their concept of Pr(heads) for an unknown coin is that it is not a single number, but is rather some distribution on the values between 0 and 1. While they do not know that distribution exactly, they estimate from experience that the mean value for Pr(heads) is very near 1/2, and that the peak of the distribution is very near 1/2 as well. When you repeatedly ask the question as if you are expecting it to be answerable, they begin to believe that you are not asking what Pr(heads) is, because they believe that question to be un-answerable. Instead, they begin to believe you are asking for the value of E(Pr(heads)) or for the point at which the distribution of Pr(heads) is maximal. The students believe that both of these statements are answerable, their best estimate for the answer being 1/2.

So while I agree that the students are figuring out something during this process, it’s not clear to me that they are reassessing their concept of probability as a way of talking about uncertainty. They may instead be reassessing what it is you are asking for, or they may be blending the concepts of Pr(heads) and E(Pr(heads)) together.

Steve,

Well put and a good point. I think of what you call E(Pr(Heads)) as being an assessment based on a mental model of the physical world and the Pr(Heads) as not really existing. So, in the same class, before we play the coin toss but after everyone has bid for the chance to play the coin toss game (with a $100 prize if they call it correctly), I offer the same deal on the toss of a thumbtack. A normal thumbtack can land either point up or point down (resting on the point AND the head). I typically find that about 1/3 of students will bid the same amount to play the coin game as they bid to play the thumbtack game, 1/3 will bid more for the thumbtack, and the 1/3 remaining will bid less.

My next step is to ask the people who bid more for the thumbtack game to explain why they did so. They always say that they have a view as to how the thumbtack will behave – they have a mental model – and they believe it is more likely to land one way than the other.

Then I press the people who bid less on the thumbtack for their reasons. They inevitably say that they bid less because they “understand” how coins work when tossed, but they have no experience with tossed thumbtacks. That is, they have no mental model of thumbtack tossing.

This is when I try to convince them that, if they have absolutely no knowledge about how the thumbtack will behave then the ONLY thing they know is that there are two possible outcomes, in which case the correct probability assessment is 50% for whichever side they call. Put another way, if there are two possible outcomes to an uncertain event, total ignorance is demonstrated in an assessment of 50-50, and so your bid for the thumbtack should NEVER be less than the coin. You can never know less in a binary trial (that you know is binary) than you know in a standard coin toss.

Some of the people who bid less for the thumbtack will still, at this point, not get my point. My final pedagogical device is to say, “I’m telling you that the coin toss is the worst that you can do, so you can always turn your thumbtack into a coin toss by… randomizing your selection! Toss a coin, and if it comes up heads pick pin-up, and if it comes up tails pick pin-down. Now, at no cost to you, the confusing thumbtack game has been turned into the simple coin toss. Which proves, once more, that you can’t do worse than a coin toss.”
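The randomizing argument can be checked with a few lines of exact arithmetic. This is only a sketch under stated assumptions: the tack lands pin-up with some fixed but unknown probability, and the player's call is made independently of the toss. The names `win_probability`, `p_pin_up` and `p_call_up` are hypothetical.

```python
from fractions import Fraction


def win_probability(p_pin_up: Fraction, p_call_up: Fraction) -> Fraction:
    """Chance of a correct call when the tack lands pin-up with probability
    p_pin_up and the player, independently, calls pin-up with probability
    p_call_up."""
    return p_call_up * p_pin_up + (1 - p_call_up) * (1 - p_pin_up)


# Try several values for the unknown behaviour of the thumbtack.
tack_behaviours = [Fraction(0), Fraction(1, 10), Fraction(1, 2),
                   Fraction(9, 10), Fraction(1)]

# Calling by a fair coin flip: the result does not depend on the tack.
randomized = [win_probability(p, Fraction(1, 2)) for p in tack_behaviours]
print(randomized)

# Always calling pin-up: the result is whatever the tack's own chance is.
always_up = [win_probability(p, Fraction(1)) for p in tack_behaviours]
print(always_up)
```

With a coin-flip call every entry in `randomized` equals 1/2, whereas a fixed call simply inherits the tack's unknown chance, which is the sense in which randomizing turns the thumbtack game into the coin game.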

jl, I have a hard time accepting that P(X = x) is a statement of knowledge. This whole debate is about what you can say about P(X = x) when (a) you know X has just n possible states and (b) you don’t know the probability distribution of those states. I, and I think SteveBrooklineMA, claim you can say nothing about P(X = x) in this case. You and William seem to be claiming that, given (b), you can infer “the probability distribution is flat”. This bugs me on a moral level because it reeks of non-monotonic logic (i.e., it’s built on sand).

For example, if you say “I have already tossed a biased coin, what is the probability that heads is showing?”, then we *know* from the premise that P(heads) =/= 1/2. However, your reasoning, as I understand it, claims that P(heads) = 1/2.

Also, I don’t think whether you are talking about something that will happen or something that has happened makes any difference.

The only way I can see around this difference of position is if you and William are saying that, yes, in the absence of knowledge of the distribution one must assume it is flat (and that you have plausible arguments to say this is as good as one can do). If so, then we can proceed on that basis, but I’d like to see the nuts and bolts of the argument written down in, say, some modal logic of knowledge.

I’m intrigued by this whole discussion. Either one of us is wildly wrong or we are talking at cross purposes because we haven’t made our starting points clear enough.

Rafe,

I promise to hit this later, after my travels (another week). If I am tardy, please feel free to email and remind me.

Looks like I am agreeing with Rafe. I especially like his biased coin flip point. Sorry to Briggs for taking over his blog… 🙂

Here’s my take on this. Suppose you have a coin I know nothing about. It might even be thumbtack shaped. All I know is that when flipped it will come up heads or tails. Suppose I have a coin I think is fair. We flip the coins…

1) When both coins are in the air, I think there is a 50-50 chance that the coins will come up matching. This is because I assume the coins are independent. If I let p be the probability that your coin will come up heads, then Prob(coins agree)=Prob(HH or TT) = Prob(HH) +Prob(TT) =1/2*p +1/2*(1-p)=1/2. So the probability is 1/2 regardless of what p is.

2) If your coin lands first and shows H to me, then at that point I’d say the chance the coins will match is 1/2. This is because my coin is fair, so there is a 50-50 chance it will come up heads and match yours.

3) If my coin lands first and shows H to me, then at that point I can no longer say what the probability that the coins will match is. All I know is that the probability of matching is the probability p that your coin will come up heads, and I don’t know what that p is. Based on life experience, MAXENT or some other thought process criticized in Prof Briggs’ paper, I might estimate that E(p)=1/2. Is that the same as saying p=1/2? I’m not so sure.

Note case 3 above pretty much happens at the start of every Super Bowl. The coin is flipped, and some guy calls heads or tails (uniformly??) at random before it lands. Can we say at that point the probability of his being right is 1/2? I’m not so sure.
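Cases 1 and 3 above can be made concrete by enumerating the four joint outcomes. The sketch assumes, as the comment does, that the two coins are independent and that one of them is fair; `p` stands for the unknown chance that the other coin shows heads. The helper names `joint`, `p_match` and `p_match_given_mine` are illustrative, not anything from the thread.

```python
from fractions import Fraction
from itertools import product


def joint(p: Fraction) -> dict:
    """Joint distribution of (my fair coin, your coin with P(heads) = p),
    assuming the two flips are independent."""
    mine = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
    yours = {"H": p, "T": 1 - p}
    return {(a, b): mine[a] * yours[b] for a, b in product("HT", repeat=2)}


def p_match(p: Fraction) -> Fraction:
    """Case 1: chance the coins agree while both are still in the air."""
    return sum(pr for (a, b), pr in joint(p).items() if a == b)


def p_match_given_mine(p: Fraction, face: str) -> Fraction:
    """Case 3: chance the coins agree once my fair coin has shown `face`."""
    dist = joint(p)
    p_face = sum(pr for (a, _), pr in dist.items() if a == face)
    p_both = sum(pr for (a, b), pr in dist.items() if a == face and b == face)
    return p_both / p_face


for p in [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]:
    print(p, p_match(p), p_match_given_mine(p, "H"))
```

In case 1 the answer is 1/2 for every `p`; in case 3 the answer is `p` itself, so any number quoted there has to come from whatever is assumed about `p`.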