## Looks like an own goal to me

Friend of humanity, meteorologist, and philosopher Tom Hamill reminds us of this clip:

Which reminds me of this clip, one of the very few songs whose lyrics I have managed to memorize:

# Category: Philosophy

October 11, 2008 | 7 Comments

## Looks like an own goal to me

October 9, 2008 | 21 Comments

## Why probability isn’t relative frequency: redux

September 22, 2008 | 35 Comments

## Not all uncertainty can be quantified

September 16, 2008 | 20 Comments

## The limits of statistics: black swans and randomness

The philosophy of science, empiricism, a priori reasoning, epistemology, and so on.

(Pretend, if you have, that you haven’t read my first weak attempt. I’m still working on this, but this gives you the rough idea, and I didn’t want to leave a loose end. I’m hoping the damn book is done in a week. There might be some LaTeX markup I forgot to remove. I should note that I am more than half writing this for other (classical) professor types who will understand where to go and what some implied arguments mean. I never spend much time on this topic in class; students are ready to believe anything I tell them anyway.)

For frequentists, probability is defined to be the frequency with which an event happens in the limit of “experiments” where that event can happen; that is, given that you run a number of “experiments” that approach infinity, then the ratio of those experiments in which the event happens to the total number of experiments is *defined* to be the probability that the event will happen. This obviously cannot tell you what the probability is for your well-defined, possibly unique, event happening now, but can only give you probabilities in the limit, after an infinite amount of time has elapsed for all those experiments to take place. Frequentists obviously never speak about propositions of unique events, because in that theory there *can be no unique events*. Because of the reliance on limiting sequences, frequentists can never know, with certainty, the value of any probability.

There is a confusion here that can be readily fixed. Some very simple math shows that if the probability of A is some number p, and it’s physically possible to give A many chances to occur, the relative frequency with which A does occur will approach the number p as the number of chances grows to infinity. This fact—that the relative frequency sometimes approaches p—is what led people to the backward conclusion that probability *is* relative frequency.

Logical probabilists say that sometimes we can deduce probability, and both logical probabilists and frequentists agree that we can use the relative frequency (of data) to help guess something about that probability if it cannot be deduced^{1}. We have already seen that in some problems we can deduce what the probability is (the dice throwing argument above is a good example). In cases like this, we do not need to use any data, so to speak, to help us learn what the probability is. Other times, of course, we cannot deduce the probability and so use data (and other evidence) to help us. But this does not make the (limiting sequence of that) data *the* probability.

To say that probability is relative frequency means something like this. We have, say, observed some number of die rolls which we will use to inform us about the probability of future rolls. According to the relative frequency philosophy, those die rolls we have seen are embedded in an infinite sequence of die rolls. Now, we have only seen a finite number of them so far, so this means that most of the rolls are set to occur in the future. When and under what conditions will they take place? How will those as-yet-to-happen rolls influence the actual probability? Remember: these events have not yet happened, but the totality of them *defines* the probability. This is a very odd belief to say the least.

If you still love relative frequency, it’s even worse than it seems, even for the seemingly simple example of the die toss. What *exactly* defines the toss, what explicit reference do we use so that, if we believe in relative frequency, we can define the limiting sequence?^{2} Tossing *just this* die? *Any* die? And how shall it be tossed? What will be the temperature, dew point, wind speed, gravitational field, how much spin, how high, how far, for what surface hardness, what position of the sun and orientation of the Earth’s magnetic field, and on and on to an infinite list of exact circumstances, none of them having any particular claim to being the right reference set over any other?

You might be getting the idea that *every* event is unique, not just in die tossing, but for everything that happens: every physical thing that happens does so under very specific, unique circumstances. Thus, nothing can have a limiting relative frequency; there are no reference classes. Logical probability, on the other hand, is not a matter of physics but of information. We can make logical probability statements because we supply the exact conditioning evidence (the premises); once those are in place, the probability follows. We do not have to include every possible condition (though we can, of course, be as explicit as we wish). The goal of logical probability is to provide conditional information.

The confusion between probability and relative frequency was helped because people first got interested in frequentist probability by asking questions about gambling and biology. The man who initiated much of modern statistics, Ronald Aylmer Fisher^{3}, was also a biologist who asked questions like “Which breed of peas produces larger crops?” Both gambling and biological trials are situations where the relative frequencies of the events, like dice rolls or ratios of crop yields, can very quickly approach the actual probabilities. For example, drawing a heart out of a standard poker deck has logical probability 1 in 4, and simple experiments show that the relative frequency of experiments quickly approaches this. Try it at home and see.
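For readers who would rather try it on a computer than at the kitchen table, here is a minimal sketch of that experiment. It assumes draws are made with replacement from a well-shuffled 52-card deck; the function name `heart_frequency` and the seed are illustrative choices, not anything from the book. Note that the 1 in 4 was deduced before any card was drawn; the simulation only shows the relative frequency drifting toward it.

```python
import random

def heart_frequency(n_draws, rng):
    """Relative frequency of hearts in n_draws draws (with replacement) from a 52-card deck."""
    suits = ["hearts", "diamonds", "clubs", "spades"]
    deck = [(rank, suit) for suit in suits for rank in range(1, 14)]
    hearts = sum(1 for _ in range(n_draws) if rng.choice(deck)[1] == "hearts")
    return hearts / n_draws

rng = random.Random(2008)  # fixed seed (arbitrary) so the run is repeatable
for n in (10, 100, 1_000, 10_000, 100_000):
    freq = heart_frequency(n, rng)
    print(f"{n:>7} draws: relative frequency of hearts = {freq:.4f}")
```

The small runs can land well away from 0.25, while the large ones settle close to it. No finite run ever delivers the limit itself.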

Since people were focused on gambling and biology, they did not realize that some arguments that have a logical probability do not equal their relative frequency (of being true). To see this, let’s examine one argument in closer detail. This one is from Sto1983, Sto1973 (we’ll explore this argument again in Chapter 15):

Bob is a winged horse

————————————————–

Bob is a horse

The conclusion given the premise has logical probability 1, but has no relative frequency because there are no experiments in which we can collect winged horses named Bob (and then count how many are horses). This example, which might appear contrived, is anything but. There are many, many other arguments like this; they are called *counterfactual arguments*, meaning they start with a premise that we know to be false. Counterfactual arguments are everywhere. At the time I am writing, a current political example is “If Barack Obama did not get the Democrat nomination for president, then Hillary Clinton would have.” A sad one, “If the Detroit Lions would have made the playoffs last year, then they would have lost their first playoff game.” Many others start with “If only I had…” We often make decisions based on these arguments, and so we often have need of probability for them. This topic is discussed in more detail in Chapter 15.

There are also many arguments in which the premise is not false and there does not, or cannot, exist any relative frequency of its conclusion being true; however, a discussion of these would take us further than we want to go in this book.^{4}

Haj1997 gives examples of fifteen—count ’em—fifteen more reasons why frequentism fails and he references an article of fifteen more, most of which are beyond what we can look at in this book. As he says in that paper, “To philosophers or philosophically inclined scientists, the demise of frequentism is familiar”. But word of its demise has not yet spread to the statistical community, which tenaciously holds on to the old beliefs. Even statisticians who follow the modern way carry around frequentist baggage, simply because to become a statistician you are *required* to first learn the relative frequency way before you can move on.

These detailed explanations of frequentist peculiarities are to prepare you for some of the odd methods and the even odder interpretations of these methods that have arisen out of frequentist probability theory over the past ~ 100 years. We will meet these methods later in this book, and you will certainly meet them when reading results produced by other people. You will be well equipped, once you finish reading this book, to understand common claims made with classical statistics, and you will be able to understand its limitations.

(One of the homework problems associated with this section)

**Extra.** A current theme in statistics is that we should design our procedures in the modern way but such that they have good relative frequency properties. That is, we should pick a procedure for the problem in front of us that is not necessarily optimal for that problem, but that when this procedure is applied to similar problems the relative frequency of solutions across the problems will be optimal. Show why this argument is wrong.

———————————————————————

^{1}The guess is usually about a parameter and not the probability; we’ll learn more about this later.

^{2}The book by Coo2002 examines this particular problem in detail.

^{3}While an incredibly bright man, Fisher showed that all of us are imperfect when he repeatedly touted a ridiculously dull idea: eugenics. He figured that you could breed the idiocy out of people by selectively culling the less desirable. Since Fisher also has a strong claim on the title Father of Modern Genetics, many other intellectuals—all with advanced degrees and high education—at the time agreed with him about eugenics.

^{4}For more information see Chapter 10 of Sto1983.

(This essay will form, when re-written more intelligently, part of Chapter 15, the final Chapter, of my book. Which is coming….soon? The material below is not easy nor brief, folks. But it is very important.)

To most of you, what I’m about to say will not be in the least controversial. But to some others, the idea that not all risk and uncertainty can be quantified is somewhat heretical.

However, the first part of my thesis is easily proved; I’ll prove the second part below.

Let some evidence we have collected—never mind how—be E = “Most people enjoy Butterfingers”. We are interested in assessing the truth of this statement: A = “Joe enjoys Butterfingers.” We do not know whether A is true or false, and so we will quantify our uncertainty in A using probability, which is written like this

#1 Pr( A | E )

and which reads “The probability that A is true *given* the evidence E”. (The vertical bar “|” means “given.”)

In English, the word *most* at least means *more than half*; it could even mean *a lot more than a half*, or even *nearly all*—there is certainly ambiguity in its definition. But since *most* at least means *more than half*, we can partially answer our question, which is written like this

#2 0.5 < Pr( A | E ) < 1

and which reads “The probability that A is true is greater than a half but not certain *given* the evidence E.” This answer is the best we can do with the given evidence.

This answer is a quantification of sorts, but it is not a direct quantification like, say, the answer “The probability that A is true is 0.673.”

It is because there is ambiguity in the evidence that we cannot completely quantify the uncertainty in A. That is, the inability to articulate the precise definition of “most people” is the reason we cannot exactly quantify the probability of A.

The first person to recognize this, to my knowledge, was John Maynard Keynes in his gorgeous, but now little read, *A Treatise on Probability*, a book which argued that all probability statements were statements of logic. To Keynes—and to us—all probability is conditional; you cannot have a probability of A, but you can have a probability of A with respect to certain evidence. Change the evidence and change the probability of A. Stating a probability of A unconditional on any evidence disconnects that statement from reality, so to speak.

**Other Theories of Probability**

For many reasons, Keynes’s eminently sensible idea never caught on and instead, around the same time his book was published, probability theory bifurcated into two antithetical paths. The first was called *frequentism*: probability was defined to be that number which is the ratio of experiments in which A will be true divided by the total number of experiments as that number of experiments goes to infinity^{1}. This definition makes it *difficult* (an academic word meaning *impossible*) to answer what is the probability that *Joe*, our Joe, likes Butterfingers. It also makes it *difficult* to define the probability for any event or events that are constrained to occur less than an infinite number of times (so far, this is all events that I know of).

The second branch was *subjective Bayesianism*. To this group, all probabilities are experiences, feelings that give rise to numbers which are the results of bets you make with yourself or against Mother Nature (nobody makes bets with God anymore). To get the probability of A you poll your inner self, first wondering how you’d feel if A were true, then how you’d feel if A were false. The sort of ratio, or cut point, where you would feel equally good or bad becomes the probability. Subjective Bayesianism, then, was a perfect philosophy of probability for the twentieth century. It spread like mad starting in the late 1970s and still holds sway today; it is even gaining ground on frequentism.

What both of these views have in common is the belief that any statement can be given a precise, quantifiable probability. Frequentism does so by assuming that there always exists a class of events—which is to say, hard data—to which you can compare the A before you. Subjective Bayesianism, as we have seen, can always pull probabilities for any A out of thin air. In every conceivable field, journal articles using these techniques multiply. It doesn’t help that the many times probability estimates are offered in learned publications, they are written in dense mathematical script. Anything that looks so complicated *must* be right!

**Mathematics**

The problem is not that the mathematical theories are wrong; they almost never are. But because the math is right does not imply that it is applicable to any real-world problems.

The math often is applicable, of course; usually for simple problems and in small cases the results of which would not be in much dispute even without the use of probability and statistics. Take, for example, a medical trial with two drugs, D and P, given to equal numbers of patients for an explicitly definable disease that is either absent or present. As long as no cheating took place and the two groups of patients were balanced, then if more patients got better using drug D, that drug is probably better. In fact, just knowing that drug D performed better (and no cheating and balance) is evidence enough for a rational person to prefer D over P.

All that probability can do for you in cases like this is to clean up the estimates of how much better D might be than P in new groups of patients. As long as no cheating took place and the patients were balanced, the textbook methods will give you reasonable answers. But suppose the disease the drugs treat is not as simply defined. Let’s write what we just said in mathematical notation so that certain elements become obvious.

#3 Pr ( D > P | Trial Results & No Cheating & Patients Like Before) > 0.5.

This reads, the probability that somebody gets better using drug D rather than P *given* the raw numbers we had from the old trial (including the old patient characteristics) *and* that no cheating took place in that trial *and* the new patients who will use the drugs “look like” the patients from the previous trial, is greater than 50% (and less than certain).

Now you can see why I repeatedly emphasized that part of the evidence that usually gets no emphasis: no cheating and patients “like” before. Incidentally, it might appear that I am discussing only medical trials and have lost sight of the original thread. I have not, which will become obvious in a moment.

Suppose the outcome of applying a sophisticated probability algorithm gave us the estimate of 0.72 for equation #3. Does writing this number more precisely help if you suppose you are the doctor who has to prescribe either D or P? Assume that no cheating took place in the old trial; then drug D is better if the patient in front of you is “like” the patients from the old trial. What is the probability she is so (given the information from the old trial)?

The word *like* is positively loaded with ambiguity. Not to be redundant, but write out the last question mathematically.

#4 Pr ( My patient like the others | Patient characteristics from previous trial)

The reason to be verbose in writing out the probability conditions is that it puts the matter starkly. It forces you, unlike the old ways of frequentism and subjective Bayesianism, to specify as completely as possible the circumstances that form your estimate. Since all probability is conditional, it should always be written as such so that it is always seen as such. This is necessary because it is not just the probability from equation #3 that is important; equation #4 is, too. If you are the doctor, you do not—you *should* not—focus solely on probability #3 because what you really want is this:

#5 Pr ( D > P *&* My patient like before | Trial Results & No Cheating & Patients Character)

which is just #3 × #4. I am in no way arguing that we should abandon formal statistics which produces quantifications like equation #3. But I am saying that since, as we already know, exactly quantifying #4 is nearly impossible, we will be *too confident* of any decisions we make if we, as is common, substitute probability #3 for #5 because, no matter what, the probability of #3 *and* #4 both is always less than the probability of #3.
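To make the arithmetic concrete, here is a small sketch. The 0.72 is the estimate for equation #3 quoted above; the bounds on equation #4 (0.5 to 0.9) are purely hypothetical numbers chosen for illustration, as is the function name `joint_probability_bounds`. Since #4 can only be bounded, #5 can only be bounded too.

```python
def joint_probability_bounds(p3, p4_low, p4_high):
    """Bounds on #5 = #3 x #4 when #4 is only known to lie in [p4_low, p4_high]."""
    if not (0.0 <= p3 <= 1.0 and 0.0 <= p4_low <= p4_high <= 1.0):
        raise ValueError("probabilities must lie in [0, 1] with p4_low <= p4_high")
    return p3 * p4_low, p3 * p4_high

p3 = 0.72                    # estimate for equation #3, as quoted in the text
p4_low, p4_high = 0.5, 0.9   # hypothetical bounds for equation #4 (an assumption)

low, high = joint_probability_bounds(p3, p4_low, p4_high)
print(f"#5 lies between {low:.3f} and {high:.3f}; #3 alone is {p3:.2f}")
```

With these made-up bounds, #5 falls somewhere between 0.360 and 0.648, so reporting 0.72 alone overstates what the doctor actually knows about the patient in front of her.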

Appropriate caveats and exceptions are usually delineated in journal articles when using the old methods, but they are buried in the text, which causes them to be given less weight than they deserve, and which gives the reader a false sense of security. Because, in the end, we are left with the suitably highlighted number from equation #3, that comforting exact quantification reached by implementing impressive mathematical methods. That final number, which we can now see is not final at all, is tangible, and is held on to doggedly. All the evidence to the right of the bar is forgotten or downplayed because it is difficult to keep in mind.

The result to equation #3 is produced, too, only from the “hard data” of the trial, the actual physical measurements from the patients. These numbers have the happy property that they can be put into spreadsheets and databases. They are real. So real that their importance is magnified far beyond their capacity to provide all the answers. They fool people into thinking that equation #3 is the final answer, which it never is. It is always equation #5 that is important to making new decisions. Sometimes, in simple physical cases, probabilities #3 and #5 are so close as to be practically equal; but when the situation is complex, as it always is when involving humans, these two probabilities are not close.

**Everything That Can Happen**

The situation is actually even worse than what we have discussed so far. Probability models, the kind that spit out equation #3, are fit to the “hard data” at hand. The models that are chosen are usually picked because of habit and familiarity, but responsible practitioners also choose the models so that they fit the old data well. This is certainly a rational thing to do. The problem is that, since probability models are only designed to say something about *future* data, the *old* data does not always encompass everything that can happen and so we are limited in what we can say about the future. All we can say for certain is what has happened before might happen again. But it’s anybody’s guess whether what *hasn’t* happened before might happen in the future.

The probability models fit the *old* data well, but nobody can ever know how well they will fit *future* data. The result is that overreliance on “hard data” means that probabilities of extreme events are underestimated and mundane events overestimated. The simple way to state this is that the system is built to engender overconfidence.^{2}
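A toy simulation can show how this underestimation arises. Everything in it is an assumption made for illustration: the "true" process is taken to be a heavy-tailed Student-t with 3 degrees of freedom, the habitual model is a normal distribution fit to 500 old observations, and an "extreme event" is defined as anything more than four fitted standard deviations above the fitted mean. None of these choices comes from real data.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def student_t(df, rng):
    """One draw from a Student-t distribution with df degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

rng = random.Random(2008)  # fixed seed (arbitrary) so the run is repeatable
old_data = [student_t(3, rng) for _ in range(500)]       # the "hard data" we fit to
new_data = [student_t(3, rng) for _ in range(100_000)]   # the future we have not seen

fitted = NormalDist(mean(old_data), stdev(old_data))     # the habitual model choice
threshold = fitted.mean + 4 * fitted.stdev               # our definition of "extreme"

model_prob = 1.0 - fitted.cdf(threshold)                 # what the fitted model claims
observed_freq = sum(x > threshold for x in new_data) / len(new_data)

print(f"fitted normal says Pr(extreme) = {model_prob:.6f}")
print(f"frequency in new data          = {observed_freq:.6f}")
```

The fitted normal assigns the extreme event a probability of roughly three in a hundred thousand, while the new data from the heavy-tailed process exceed the threshold far more often than that. The model fits the old data tolerably and is still badly overconfident about the tails.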

**Decision Analysis**

You’re still the doctor and you still have to prescribe D or P (or nothing). No matter what you prescribe *something* will happen to the patient. What? And when? Perhaps the malady clears up, but how soon? Perhaps the illness is merely mitigated, but by how much? You not only have to figure out what treatment is better, but what will happen if you apply that treatment. This is a very tricky business, and is why, incidentally, there is such a variance in the ability of doctors.^{3} Part of the problem is explicitly defining what is meant by “the patient improves.” There is ambiguity in that word *improve*, in what will happen when either of the drugs is administered.

There are two separate questions here: (1) defining events and estimating their probability of occurring and (2) estimating what will happen given those events occur. Going through both of the steps is called computing a *risk* or *decision analysis*. This is an enormously broad subject which we won’t do more than touch on, only to show where more uncertainty comes in.

We have already seen that there is ambiguity in computing the probability of events. The more complex these events the more imprecise the estimate. It is also often the case that part (2) of the risk analysis is the most difficult. The events themselves cannot be articulated, either completely or unambiguously. In simple physical systems they often can be, of course, but in complex ones like the climate or ecosystems they are not. Anything involving humans is automatically complex.
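The two steps can be sketched in a few lines, under heavy assumptions. Step (1) supplies a probability for an event, here only as a lower and upper bound; step (2) supplies a cost for each outcome. All the numbers below (a probability between 0.1 and 0.4 that the patient fails to improve, a cost of 10 if so and 1 otherwise) and both function names are hypothetical, invented only to show how imprecision in step (1) carries through to the final answer.

```python
def expected_cost(outcomes):
    """Expected cost given (probability, cost) pairs whose probabilities sum to 1."""
    total = sum(p for p, _ in outcomes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * cost for p, cost in outcomes)

def expected_cost_range(p_low, p_high, cost_if_event, cost_otherwise):
    """Range of expected cost when the event probability is only bounded.

    Expected cost is linear in the probability, so the extremes occur at the bounds.
    """
    costs = [
        expected_cost([(p, cost_if_event), (1.0 - p, cost_otherwise)])
        for p in (p_low, p_high)
    ]
    return min(costs), max(costs)

# Hypothetical inputs: Pr(patient fails to improve) is somewhere in [0.1, 0.4].
low, high = expected_cost_range(0.1, 0.4, cost_if_event=10.0, cost_otherwise=1.0)
print(f"expected cost lies between {low:.2f} and {high:.2f}")
```

With these invented inputs the expected cost is somewhere between 1.90 and 4.60. And this sketch is the easy case: it assumes the events and their costs can be written down at all, which is exactly what fails in complex situations.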

Take the current (!) financial crisis as an example. Many of the banks and brokerages failed to define both the events that are now happening and the extent of the cost of those events. How much will it cost to clean it up? Nobody knows. This is the proper answer. We might be able to bound it—more than half a billion, say—and that might be the best anybody can say (except that I have been asked to pay for it).

**Too Much Certainty**

What the older statistical methods and the strict reliance on hard data and fancy mathematics have done is to create a system where there is too much certainty when making conclusions about complex events. We should all, always, take any result and realize that it is conditional on everything being just so. We should realize that those just-so conditions that obtained in the past might not obtain in the future.

Well, you get the idea. There is already far too much information to assimilate in one reading (I’m probably just as tired of going on and on as you are of reading all this!). As always, discussion is welcome.

—————————

^{1}Another, common, way to say infinity is the euphemism “in the long run”. Keynes has famously said that “In the long run we shall all be dead.” It’s always been surprising to me that the same people who giggle at this quip ignore its force.

^{2}There is obviously a lot more to say on this subject, but we’ll leave it for another time.

^{3}A whole new field of medicine has emerged to deal with this topic. It is called *evidence based medicine*. Sounds good, no? What could be wrong with evidence? And it’s not entirely a bad idea, but there is an overreliance on the “hard data” and a belief that only this hard data can answer questions. We have already seen that this cannot be the case.

The author of *Fooled by Randomness* and *The Black Swan*, Nassim Nicholas Taleb, has penned the essay THE FOURTH QUADRANT: A MAP OF THE LIMITS OF STATISTICS over at Edge.org (which I discovered via the indispensable Arts & Letters Daily).

Taleb’s central thesis and mine are nearly the same: “Statistics can fool you.” Or “People underestimate the probability of extreme events”, which is another way of saying that people are too sure of themselves. He blames the current crisis on Wall Street on people misusing and misunderstanding probability and statistics:

This masquerade does not seem to come from statisticians—but from the commoditized, “me-too” users of the products. Professional statisticians can be remarkably introspective and self-critical. Recently, the American Statistical Association had a special panel session on the “black swan” concept at the annual Joint Statistical Meeting in Denver last August. They insistently made a distinction between the “statisticians” (those who deal with the subject itself and design the tools and methods) and those in other fields who pick up statistical tools from textbooks without really understanding them. For them it is a problem with statistical education and half-baked expertise. Alas, this category of blind users includes regulators and risk managers, whom I accuse of creating more risk than they reduce.

I wouldn’t go so far as Taleb: the masquerade also often comes from classical statistics and statisticians, too. Many of the statistical methods that are taught to non-statisticians had their origin in the early and middle part of the 20th century before there was access to computers. In those days, it was rational to make gross approximations, assume uncertainty could always be quantified by normal distributions, guess that everything was linear. These simplifications allowed people to solve problems by hand. And, really, there was no other way to get an answer without them.

But everything is now different. The math is new, our understanding of what probability is has evolved, and everybody knows what computers can do. So, naturally, what we teach has changed to keep pace, right?

Not even close to right. Except for the modest introduction of computers to read in canned data sets, classes haven’t changed one bit. The old gross approximations still hold absolute sway. The programs on those computers are nothing more than implementations of the old routines that people did by hand—*many professors still require their students to compute statistics by hand!* Just to make sure the results match what the computer spits out.

It’s rare to find an ex-student of a statistics course who didn’t hate it (“You’re a statican [sic]? I always hated statistics!” they say brightly). But it’s just as rare to find a person who had, in the distant past, one or two courses who doesn’t fancy himself an expert (I can’t even list the number of medical journal editors who have told me my new methods were wrong). People get the idea that if they can figure out how to run the software, then they know all they need to.

Taleb makes the point that these users of packages necessarily take a too limited view of uncertainty. They seek out data that confirms their beliefs (this obviously is not confined to probability problems), fit standard distributions to them, and make pronouncements that dramatically underestimate the probability of rare events.

Many times rare events cause little trouble (the probability that you walk on a particular blade of grass is very low, but when that happens, nothing happens), but sometimes they wreak havoc of the kind happening now with Lehman Brothers, AIG, WAMU, and on and on. Here, Taleb starts to mix up estimating probabilities (the “inverse problem”) with risk in his “Four Quadrants” metaphor. The two areas are separate: estimating the probability of an event is independent of what will happen if that event obtains. There are ways to marry the two areas in what is called Decision Analysis.

That is a minor criticism, though. I appreciate Taleb’s empirical attempt at creating a list of easy to, hard to, and difficult to estimate events along with their monetary consequences should the events happen (I have been trying to build such a list myself). Easy to estimate/small consequence events (to Taleb) are simple bets, medical decisions, and so on. Hard to estimate/medium consequence events are climatological upsets, insurance, and economics. Difficult to estimate/extreme consequence events are societal upsets due to pandemics, leveraged portfolios, and other complex financial instruments. Taleb’s bias towards market events is obvious (he used to be a trader).

A difficulty with Taleb is that he writes poorly. His ideas are jumbled together, and it often appears that he was in such a hurry to get the words on the page that he left half of them in his head. This is true for his books, too. His ideas are worth reading, however, though you have to put in some effort to understand him.

I don’t agree with some of his notions. He is overly swayed by “fractal power laws”. My experience is that people often see power laws where they are not. Power laws, and other fractal math, give appealing, pretty pictures that are too psychologically persuasive. That is a minor quibble. My major problem is philosophical.

Taleb often states that “black swans”, i.e. extremely rare events of great consequence, are impossible to predict. Then he faults people, like Ben Bernanke, for failing to predict them. Well, you can’t predict what is impossible to predict, no? Taleb must understand this, because he often comes back to the theme that people underestimate uncertainty of complex events. Knowing this, people should “expect the unexpected”, a phrase which is not meant glibly, but is a warning to “increase the area in the tails” of the probability distributions that are used to quantify uncertainty in events.

He claims to have invented ways of doing this using his fractal magic. Well, maybe he has. At the least, he’ll surely get rich by charging good money to learn how his system works.