
Category: Class – Applied Statistics

August 28, 2017 | 9 Comments

Taleb’s Curious Views On Probability — Part I: Probability Does Not Exist

Ye Olde Statistician points us to an essay (a book chapter?) by our old pal Nassim Nicholas Taleb called “The Logic of Risk Taking”. Let’s examine it.

You, dear reader, do not have a probability of being flattened while crossing the street. Nobody does. Nobody has any probability of anything. Nothing has a probability of anything.

The reason is this (quoting de Finetti in word and typeface): PROBABILITY DOES NOT EXIST.

You cannot have in abundance or in fraction that which does not exist. Yet Taleb says, “the risk of being killed as a pedestrian is one per 47,000 years.” Ignore the number; the proposition itself will not sound wrong to most. It is wrong. Since probability does not exist, there is no blanket risk of you being killed as a pedestrian.

Probability, absolutely all of it all of the time, is conditional. You walk to a corner and desire to cross. At this point you must form premises on which to act. You might say, “I might get hit”, which adds nothing to your ability to form a probability of “I will get hit”. (This, and everything else, is proved in Uncertainty.)

You might instead think, “There are no cars coming anywhere”, and form a very low probability of “I will get hit”. Or you might say, “If I hurry, I can make it.” A higher probability.

Now suppose you are an actuary (a statistician with less personality, as the joke goes) and want to guess how many pedestrians will go to their reward next year for having the audacity to cross the street. No easy job, that. Are you limiting this to the once United States? Everywhere? You still need premises to form probabilities of propositions like “There will be X killed”. Which premises?

Well, you might take the number flattened last year and use that as a base for some ad hoc model, which may or may not be useful in making predictions. You could form premises state-by-state, and then feed these into an ad hoc model. Or county-by-county. Or city-by-city. Or individual-by-individual.

Have the idea?

Change the premises, i.e. assumptions, data, and the like, and you change the probability. I don’t know what premises Taleb used to arrive at “one per 47,000 years”, but they must exist somewhere, at least in his imagination.

That probability depends on assumptions is the very point made in the last two articles discussing Taleb and the precautionary principle (here and here). Other words for assumptions and premises are model and theory.

Now suppose you meet the actuary on his lunch hour and he tells you of his recent calculation, a model with various assumptions that led him to state “You’ll be dead street meat at the rate of one per 47,000 years”. This might form your new premise, from which you deduce (circularly) you have that chance of being killed.

When you get to the intersection, and you insist on using the actuary’s number (he being an expert), it means ignoring all that is before you except that it is an intersection which you will cross. So if you live on a New York City avenue, it means ignoring that Access-a-Ride bus rocketing your direction towards the red light at which, by law, the texting wild-eyed driver must stop.

If you believe probability exists, and you believe an expert has discovered the probability for your particular situation, and Taleb is an expert, then ignoring circumstance is the rational thing to do; it is the only thing to do and stay consistent with your belief that probability exists.

Or you could chuck the idea that probability exists into the trash heap and hope the Access-a-Ride bus meets that perpendicular oncoming City bus and duels it out with him.

We need this demonstration that probability does not exist as a baseline to discuss the remainder of Taleb’s article. The 47,000-year figure, for instance, comes from this:

About every time I discuss the precautionary principle, some overeducated pundit suggests that “we cross the street by taking risks”, so why worry so much about the system? This sophistry usually causes a bit of anger on my part. Aside from the fact that the risk of being killed as a pedestrian is one per 47,000 years, the point is that my death is never the worst case scenario unless it correlates to that of others.

I gather his over-educated pundit meant “We take risks by crossing the street”, which is true—but only on the premise that all actions possess a risk, that all risk is contingent.

I do not know if Taleb believes probability exists; he at times appears to imply it, at other times perhaps not. I’m not familiar enough with his writings to know if he has made a direct statement on the matter. So if you love Taleb, there’s no reason to become upset with me.

More to come…

July 31, 2017 | 5 Comments

Example Of How To Eliminate P-values & Their Replacement

Many, many more details are available in Uncertainty: The Soul of Modeling, Probability & Statistics and at this page.

Last time we learned that the way to do probability models was this:

(1) Pr( Y | X D M )

where Y is the proposition of interest, X an assumption or supposition, D some past observations, and M a group of premises which comprise our model, propositions which “map” or relate the X and D to Y. Nearly always, M is not causal, merely correlational. Causal models are as rare [as me remembering to fill in a hilarious simile].

As a for-instance we assumed Y = “The patient improves”, and X_0 = “The old treatment”, X_1 = “The New & Improved! treatment.” D are a group of observations of treatment, whether the patient improved, and any number of things we think might be related in a correlational way to Y. By “correlational” way we mean something in the causal path, or a partial cause, or something related to a cause or partial cause. If we had the causes of Y, that would be our model, and we would, scientifically speaking, be done forevermore.

M is almost always ad hoc. The usual excuse is laziness. “We’re using logistic regression,” says researcher. Why? Because the people before him used logistic regression. M can be deduced in many cases, but it is hard, brutal work—though only because our best minds have not set themselves to creating a suite of these kinds of models as they have for parameter-centric models.

Parameters do not exist (parameters in a logistic regression relate the X to the Y, etc.). They are not ontic. Because M is ad hoc, parameters are ad hoc. Which is what makes the acrimony over “priors” on parameters so depressing. By the time we’ve reached thinking about priors, we are already two or three levels of ad hociness down the hole. What’s a little more?

As I say, M can be deduced, which means there are no parameters anywhere ever. But, as it is, we can “integrate them out”, and we must do so, because (again) parameters do not exist, and because certainty in some unobservable non-existent parameters in some ad hoc model does not, it most certainly does not, translate into certainty about Y. But, of course, everybody acts as if it does.

So our cry is not only “Death to P-Values!” but “Death to Parameters!”

If we are using a parameterized model, as all regression models are, the propositions about which priors we are using are just part of M; they are part of the overall ad hociness. Point is, our bookkeeping in (1) is complete.
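Here is a minimal sketch of what “integrating out” a parameter looks like in practice. Everything specific in it is an assumption made only for illustration: the choice of M (each patient improves independently with unknown chance theta, with a Beta(a, b) prior on theta), the function name predictive_prob, and the patient counts, which are made up and are not real data. It is not offered as the right M for any real problem.

```r
# ASSUMPTION (illustrative only): M says each patient improves independently
# with unknown chance theta, and the prior on theta is Beta(a, b).
# ASSUMPTION: the counts passed in below are made up; they are not real data.
predictive_prob <- function(improved, n, a = 1, b = 1) {
  # Under this M, the posterior of theta given D is Beta(a + improved, b + n - improved).
  posterior <- function(theta) dbeta(theta, a + improved, b + n - improved)
  # Pr(Y | X D M) is the integral of theta * posterior(theta) over [0, 1];
  # theta is integrated out and does not appear in the answer.
  integrate(function(theta) theta * posterior(theta), lower = 0, upper = 1)$value
}

# Hypothetical D: 9 of 20 patients improved under the old treatment,
# 13 of 20 under the new one.
pr_old <- predictive_prob(improved = 9,  n = 20)
pr_new <- predictive_prob(improved = 13, n = 20)

# The prior is part of M: change it and the probability of Y changes.
pr_old_other_prior <- predictive_prob(improved = 9, n = 20, a = 5, b = 5)

print(c(old = pr_old, new = pr_new, old_other_prior = pr_old_other_prior))
```

For this particular M the integral has a closed form, (a + improved) / (a + b + n), which the numerical answers should match. The point of the sketch is only that the final statement is about Y, the observable, and that every ad hoc choice in M moves it.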

Enough introduction. Let’s get down to a fictitious, wholly made up, imaginary example using our scenario.

M contains a list of correlates; these are the X (M is more than the X, of course). As is usual, we suppose there are p of them, i.e. X is the compound proposition X_1 & X_2 & … & X_p. Just to hammer home the point, ideally X are those observations which give the cause of Y. Barring that, they should be related to the cause or causes. Barring that, and as is most usual, X will be—can you guess?—ad hoc.

With so much ad hociness you might ask, “Why do people take statistical models so seriously?” And you would be right to ask that—just as you are right suspecting the correct answer to that question.

Anyway, suppose X_j = “Physician’s sock color is blue”, a 0-1 “variable”. We can then compute these two probabilities:

(1) Pr( Y | X D M ),

(2) Pr( Y | X_(-j) D_(-j) M_(-j) ) = Pr( Y | [X D M]_(-j) ).

Equation (1) is the “full” M, and eq. (2) is the model sans socks. Which of these two probabilities is the correct one?


Since all probability is conditional, and we pick the X and the X are not the causes, both probabilities are correct.

Suppose we observed (1) = 0.49876 and (2) = 0.49877. This means exactly what the equations say they mean. In (1), it is the probability the patient gets better assuming all the old data including physician sock color; in (2) it is the probability the patient improves assuming all data but socks. Both assume the model.

Now I ask you the following trick question, which will be very difficult for those brought up under classical statistics to answer: Is there a difference between (1) and (2)?

The answer is yes. Yes, 0.49876 does not equal 0.49877. They are different.

Fine. Question two: is the difference of 0.00001 important?

The answer is there is no answer. Why? Because probability is not decision. To one decision maker, interested in statements about all of humanity, that difference might make a difference. To a second decision maker, that difference is no difference at all. Fellow number two drops socks from his model. The statistician has nothing to say about the difference, nor should he. The statistician only calculates the model. The decision maker uses it.
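A small sketch of that separation, assuming (purely for illustration) that each decision maker ignores differences below a personal threshold. The two thresholds and the function name keeps_socks are hypothetical; only the two probabilities come from the example above.

```r
p_with_socks    <- 0.49876  # Pr(Y | X D M), from the example
p_without_socks <- 0.49877  # Pr(Y | [X D M]_(-j)), from the example

# The statistician's job ends here: report the two numbers and their difference.
difference <- abs(p_with_socks - p_without_socks)

# ASSUMPTION: a decision maker keeps socks in the model only if the difference
# exceeds a threshold he brings himself. The thresholds below are hypothetical.
keeps_socks <- function(difference, threshold) difference > threshold

decision_maker_1 <- keeps_socks(difference, threshold = 1e-6)  # cares about tiny differences
decision_maker_2 <- keeps_socks(difference, threshold = 1e-3)  # does not

print(c(difference = difference))
print(c(maker_1_keeps_socks = decision_maker_1, maker_2_keeps_socks = decision_maker_2))
```

The difference is a matter of arithmetic. Whether it matters is settled by the threshold, and the threshold comes from outside the model.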

That’s it. That’s how all of statistics should work. There remains only one small thing to note about the Xs.

Which X?

It is this: unless we are dealing with causes, the list of X is infinite. Infinite is a big number. Who gets to decide which X to include and which to leave out? Who indeed. To include any X is to assume implicitly that there is a causal connection, however weak or distantly related, to Y. These implicit premises are in M, but of course are not written out. (The mistake most make is reification; the mathematical model becomes more important than reality.)

Sock color could be causally related, weakly and distantly, to patient health. It could be that more of those docs with blue socks wear manly shoes (i.e. leather) and since manly shoes cost more, some of these docs have more money, and perhaps one reason some of these docs have more money is because they are better docs and see more or wealthier patients.

You can always tell stories like this; indeed, you must, and you do. If you did not, you would have never put the X in the model in the first place. The most important thing to recognize is this: probability is utterly and forever silent on the veracity of any causal story (unless cause is complete and known). This is why hypothesis testing (p-values, Bayes factors, etc.) is always fallacious. It mixes up probability with decision.

June 26, 2017 | 3 Comments

Free Statistics Class: Predictive Case Study 1, Part II


We began our first predictive analysis, and spent a lot of time with it. But we still haven’t got to the main question!

And that is how it should be.

Since the predictive method separates probability from decision and emphasizes decision, we should spend most of our time with defining the decision. The probability part will be easy, and is just math. So, as they say on the planes, but here I mean it, sit back, relax, and enjoy the flight.

When we left off, we were exploring CGPA. If a person was only taking one class, there were 14 possible CGPAs (0, 0.33, …, 4.33). Now, if all we knew was the scoring (grading) system and that person was taking just one class, then we deduce the probability (from the symmetry of logical constants leading to the statistical syllogism) of a CGPA of, e.g., 4 as 1/14—and the same for the other possibilities.

Because all probability is conditional on a specified list of premises, and only on that list, it’s well to be explicit. The probability CGPA equals, say, 0, needs givens. Those assumptions, premises, givens, truths, are the list itself, (0, 0.33, …, 4.33), the implicit premise that CGPA must be one of these; or the explicit premise there is only one class and explicit rules of the scoring system which together imply the list. Notice we do not allow an “incomplete”. Why not? Why not indeed? It is an assumption on our part and nothing else. If we assumed an incomplete, the probability changes (homework: how?). Remember: the all in “if all we knew…” is as rigorous as can be. We calculate the probability on these premises and none other. Probability is not subjective, except in the sense that we choose the premises: after the premises are chosen, probability is deduced.
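To make the bookkeeping concrete, here is a minimal sketch of the one-class deduction. The grade list is the one used in this post. The second half is only one possible reading of the homework, namely treating “incomplete” as a fifteenth equally weighted outcome; that reading is an assumption added here, not a stated answer.

```r
# Premise: the scoring system, i.e. the list of possible grades.
s <- c(0, .33, .67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3, 3.33, 3.67, 4, 4.33)

# Premise: exactly one class, so CGPA must be one element of s.
# Statistical syllogism: with nothing else known, each possibility gets 1/length(s).
pr_one_class <- setNames(rep(1 / length(s), length(s)), s)

# ASSUMPTION (one reading of the homework, not a stated answer):
# add "incomplete" as a fifteenth possible outcome with no other information.
outcomes <- c(as.character(s), "incomplete")
pr_with_incomplete <- setNames(rep(1 / length(outcomes), length(outcomes)), outcomes)

print(pr_one_class[["4"]])        # probability of CGPA = 4 on the original premises
print(pr_with_incomplete[["4"]])  # same proposition, different premises
```

Same proposition, different premises, different probability. Nothing about the student changed between the two lines.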

If the person were taking 2 classes, there are 196 different possible grades (from 14^(number of classes); see this document on permutations), of which only 42 are unique (0, 0.165, 0.330, 0.335, …, 4.33). If all we knew were the scoring system and that there were 2 classes, the chance of a CGPA = 3 is 9/196 = 0.046. Use this self-explanatory R code to play (but don’t push r much beyond 5; the code below needs only base R; it is not meant to be efficient, but explicative; if you can’t follow the code, don’t worry, just use it).

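# possible grades; a premise set by us
s = c(0, .33, .67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3, 3.33, 3.67, 4, 4.33)

r = 2 # number of classes; another premise
# every possible combination of r grades, one row per combination (14^r rows)
result = as.matrix(expand.grid(rep(list(s), r)))
# CGPA of each combination: sum of its grades divided by the number of classes
cgpa = apply(result, 1, function(x) sum(x)/r)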

Calling table(cgpa) gives a count of possibilities for each CGPA; i.e. with r = 2 there is only 1 way to get 0, 2 ways to get 0.165, and so on.
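As a check on the numbers quoted for two classes, here is a self-contained sketch that recomputes them. The rounding to six decimals is an addition made here so that floating-point noise does not split equal CGPAs or spoil the equality test; the variable names n_total, n_unique and pr_cgpa_3 are likewise just labels chosen for this sketch.

```r
# Premises: the scoring system and the number of classes.
s <- c(0, .33, .67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3, 3.33, 3.67, 4, 4.33)
r <- 2

# One row per possible combination of grades.
grid <- as.matrix(expand.grid(rep(list(s), r)))

# Round so that, e.g., 2.33 + 3.67 and 3 + 3 give the identical CGPA.
cgpa <- round(rowMeans(grid), 6)

counts    <- table(cgpa)
n_total   <- length(cgpa)            # number of possible combinations
n_unique  <- length(counts)          # number of distinct CGPAs
pr_cgpa_3 <- sum(cgpa == 3) / n_total

print(c(total = n_total, unique = n_unique))
print(pr_cgpa_3)
```

With r = 2 this should reproduce the 196 combinations, the 42 unique CGPAs, and the 9/196 chance of a CGPA of 3.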

Again, the “all” in the “all you know” cannot be stressed too highly. All probability is conditional on the information assumed, and only on that information, so the probabilities above are only valid assuming just the premises given and none other. In particular, it does not matter what you might know about a person and their study habits, or the school, or anything else. The probabilities are true given the premises. Whether these are the right premises for the question we want to answer is another question which we’ll explore — in depth — later.

Now, what if all we knew were the scoring system and that the person were going to take 1 or 2 classes? Suppose we’re interested in a CGPA of 3 again. If 1 class, the probability is 1/14; if 2 classes, it’s 9/196. And since we don’t know if 1 or 2 classes, we apply the statistical syllogism again, and deduce 1/2 * 1/14 + 1/2 * 9/196 = 0.059.
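The same deduction can be sketched in code. It uses 1/14 for the one-class case, since each of the 14 single grades is equally probable on the premises given, and weights the one-class and two-class cases equally. The helper name pr_cgpa and the rounding to six decimals are additions for this sketch only.

```r
s <- c(0, .33, .67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3, 3.33, 3.67, 4, 4.33)

# Probability of a target CGPA given exactly r classes and the scoring system,
# found by counting combinations (only practical for small r).
pr_cgpa <- function(target, r, grades = s) {
  grid <- as.matrix(expand.grid(rep(list(grades), r)))
  cgpa <- round(rowMeans(grid), 6)   # rounding guards against floating-point noise
  mean(cgpa == round(target, 6))
}

pr_1 <- pr_cgpa(3, r = 1)   # one class
pr_2 <- pr_cgpa(3, r = 2)   # two classes

# Premise: 1 or 2 classes and nothing else known, so each gets weight 1/2.
pr_1_or_2 <- 0.5 * pr_1 + 0.5 * pr_2
print(pr_1_or_2)
```

Change the premise about the number of classes (say, 1, 2 or 3) and both the weights and the answer change.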

You can see that we in principle can derive exact answers—though the counting will grow difficult. For a “full load” of 12 classes, there are 14^12 possible grades (5.7e13), of which only about 1,000 are unique.
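Listing all 14^12 combinations is out of the question, but the number of unique CGPAs can still be counted by building up the set of reachable grade totals one class at a time. This is a sketch under one added convention: grades are held in hundredths (0, 33, 67, …) so the arithmetic is exact. The function name count_unique_cgpa is made up for the illustration.

```r
# The scoring system in hundredths of a grade point, to keep sums exact.
s100 <- c(0, 33, 67, 100, 133, 167, 200, 233, 267, 300, 333, 367, 400, 433)

# Number of distinct CGPAs for r classes. Distinct grade totals correspond
# one-to-one with distinct CGPAs, because CGPA = total / r.
count_unique_cgpa <- function(r, grades100 = s100) {
  totals <- 0   # with zero classes the only total is 0
  for (i in seq_len(r)) {
    # add every possible grade to every reachable total, then drop duplicates
    totals <- unique(as.vector(outer(totals, grades100, "+")))
  }
  length(totals)
}

print(count_unique_cgpa(1))
print(count_unique_cgpa(2))
print(count_unique_cgpa(12))               # unique CGPAs for a full load
print(format(14^12, big.mark = ","))       # all combinations for a full load
```

The set of totals never holds more than a few thousand entries, so this runs instantly even though the number of combinations is astronomical.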

Two of these possibilities are 1.860833 and 1.861667. We could, of course, compute the probability of these CGPAs given the by-now usual premises. But is that what we want? Is this the decision? Compute the probability of barely distinguishable grade points?

It could be that we care about such small differences. If we do, then we have the apparatus to solve the problem. Not for incorporating SAT or HGPA or past observations yet, but for our “naked” premises. We’ll come to that other information in time. But let’s be clear what we’re trying to do first or we risk making all the usual mistakes.

Now, I do not care about such small differences, and neither do most people. I just do not want to differentiate (though I could if I wanted) between, e.g. 1.860833 and 1.861667. To the nearest, say, tenth place is good enough for the decision I want to make about CGPA. Yet small differences are important if our goal is ordering; if, say, we want to predict who has the highest or lowest CGPA and that kind of thing. We’re not doing that here. Our decision is quantifying uncertainty in CGPA for individual people and accuracy to the 6th decimal place isn’t that interesting to me; to you it might be.

We have a decision about our decision to make: keep the small differences, which carry computational burdens and produce not very interesting answers, or make an approximation. Pay attention here. Tradition (classical methods) approximates the finite discrete CGPA as a continuous number, usually on the real line, a.k.a. the continuum. This approximation is so common that few pause to think it is an approximation! But, of course, it is, and a crude one.

If this most important point has not sunk in, then stop and think on it. [1]

One difficulty with the traditional approximation is that it says the probability of any caCGPA (the “ca” prefix is for the continuous approximation) is 0, which is dissatisfying (the continuum is a strange place!). The benefit is that all sorts of canned software is ready for use, and the math is much easier. Whether these benefits are worth it is the point in question and cannot be assumed true in all problems.

Besides the continuum, another approximation is to compress CGPA. It is already finite and discrete: we keep that nature, but further reduce the level of detail. I don’t care about the differences between 1.860833 and 1.861667, but suppose I do care about the difference between 1 and 2, and between 2 and 3, and 3 and 4.

That is, one compression is to put CGPA on the set (0, 1, 2, 3, 4). There are no computational difficulties with such a small set; all probability statements based on it are readily calculated. Number of classes has much less effect on this set, too.
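A sketch of one such compression for two classes. How a fine-grained CGPA is mapped onto (0, 1, 2, 3, 4) is itself a premise; rounding to the nearest whole grade point, with halves rounded up, is an assumption made here for illustration and is not specified above. Grades are again held in hundredths so the boundary cases are exact.

```r
# The scoring system in hundredths of a grade point.
s100 <- c(0, 33, 67, 100, 133, 167, 200, 233, 267, 300, 333, 367, 400, 433)
r <- 2   # number of classes; a premise

# Grade total (in hundredths) of every possible combination of r grades.
totals100 <- rowSums(as.matrix(expand.grid(rep(list(s100), r))))

# ASSUMPTION: compress by rounding CGPA to the nearest whole grade point,
# halves rounded up. A different rule is a different premise.
compressed <- floor(totals100 / (100 * r) + 0.5)

# Probability of each compressed CGPA, given only the scoring system and r.
pr_compressed <- table(compressed) / length(compressed)
print(round(pr_compressed, 3))
```

The table has five entries no matter how many classes are assumed, which is why this compression carries no computational burden.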

It’s a crude compression, true. Still, that doesn’t mean a useless one. It depends—as all things do—on the decisions I want to make with CGPA. If I’m a Dean of some sort (Heaven forfend), this compression may be perfect, and I can even consider going cruder, say, (0-2, 3-4).

Or again, it may be too crude at that. Maybe every tenth is more what I’m looking for, especially if I’m considering eligibility for some scholarship.

We’ll see what these approximations do next time.

I’ll answer all pertinent questions, but please look elsewhere on the site (or in Uncertainty) for criticisms of classical methods. Non-pertinent objections will be ignored.

[1] You may argue that CGPA is embedded (in some mathematical sense) in an infinite sequence, and thus CGPA would live on the continuum, and thus the continuous is no longer an approximation. Since probability is conditional, accepting this condition works in the math. But, of course, CGPA is not embedded in any infinite sequence. Nothing is, because nothing contingent is infinite. So we’re back to the continuous as an approximation.

June 21, 2017 | 6 Comments

Free Statistics Class: Predictive Case Study 1, Part I

Regular readers know Uncertainty proposes we go back to the old way of examining and making conclusions about data, and eschew many innovations of the 20th Century. No p-values, no tests, no posteriors. Just plain probability statements about observables and a rigorous separation of probability from decision.

These criticisms you know (or ought to by now). So why not let’s do a case study or three, and take our time doing so. Case Study 1 uses the same data presented in Uncertainty. We’re interested in quantifying our uncertainty in a person’s end-of-first-year College GPA given we know their SAT score, high school GPA, and perhaps another measure we might have.

Now right off, we know we haven’t a chance to discover the cause—actually causes—of a person’s CGPA. These are myriad. A GPA is comprised of scores/grades per class, and the causes of the score in each class are multitudinous. How much one drank the evening before a quiz, how many hours put in on a term paper, whether a particular book was available at a certain time, and on and on.

It is equally obvious a person’s HGPA or SAT does not and cannot cause a person’s CGPA. Some of the same causes responsible for the HGPA, SAT might appear in the list of causes for CGPA, but it’s a stretch to say they’re identical. We could say “diligence” or “sloth” are contributory causes, but since these cannot be quantified (even though some might attempt such a maneuver), they cannot take their place in a numerical analysis.

Which brings up the excellent question: why do a numerical analysis at all?

Do not skip lightly over this. For in that query is the foundation of all we’ll do. We’re doing a numerical, as opposed to the far more common qualitative (which form most of our judgments), study because we have in mind a decision we will make. Everything we do must revolve around that decision. Since, of course, different people will make different decisions, the method of analysis would change in each case.

It should be clear the decision we cannot make is about what causes CGPA. Nor can we decide how much “influence” SAT or HGPA has on CGPA, because “influence” is a causal word. We cannot “control” for SAT or HGPA on CGPA because, again, “control” is a causal word, and anyway HGPA and SAT were in no way caused, i.e. controlled, by any experimenter.

All we can do, then, if a numerical analysis is our goal, is to say how much our uncertainty in CGPA changes given what we know about SAT or HGPA. Anything beyond that is beyond the data we have in hand. And since we can make up causal stories until the trump of doom, we can always come up with a causal explanation for what we see. But our explanation could be challenged by somebody else who has their own story. Presuming no logical contradiction (say a theory insists SAT scores that we observed are impossible), our “data” would support all causal explanations.

This point is emphasized to the point we’re sick of hearing it because the classic way of doing statistics is saturated in incorrect causal language. We’re trying to escape that baggage.

So just what decision do I want to make about CGPA?

I could be interested in my own or in another individual’s. Let’s start with that by thinking what CGPA is. Well, it’s a score. Every class, in the fictional college we’re imagining, awards a numerical grade, F (= 0) up to A+ (A = 4, A+ = 4.33, and so on). CGPA = sum of the scores across classes divided by number of classes. That’s several numbers we need to know.

How many classes will there be? In this data, I don’t know. That is to say, I do not know the precise number for any individual, but I do know it must be finite. Experience (which is not part of the data) says it’s probably around 10-12 for a year. But who knows? We also can infer that each person has at least one class—but it could be that some have only one class. Again, who knows?

So number of classes is equal to or greater than one and finite. So, given the scoring system for grades, that means CGPA must be of finite precision. Suppose a person has only one class, then the list of possible CGPAs is 0, 0.33, …, 4, 4.33 and none other. If a person has two classes, then the possibilities are 0, 0.165, 0.33, and so forth. However many classes there are, the final list will be a discrete, finite set of possible CGPAs, which will be known to us given the premises about the grading system.

Suppose a student had 12 classes, then his score (CGPA) might be (say) 2.334167. That’s 7 digits of precision! This number is one of lots of different possible grades (these begin with 0, 0.0275, 0.055, 0.0558, 0.0825, …). And there is more than one way to get some of these grades. A person with a CGPA of 2 might have had 12 classes with all C’s (= 2), or 12 with half A’s and half F’s; and there are other combinations that lead to CGPA = 2. And so now we have to ask ourselves just what about the CGPA we want to know.
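To see how many routes lead to the same CGPA, here is a sketch that counts them without listing every combination. It assumes the grade scale 0, 0.33, 0.67, …, 4.33 (the F to A+ system described above, held in hundredths so sums are exact) and counts ordered assignments of grades to classes. The function name count_ways is made up for the illustration.

```r
# Assumed grade scale in hundredths of a grade point: F = 0 up to A+ = 433.
s100 <- c(0, 33, 67, 100, 133, 167, 200, 233, 267, 300, 333, 367, 400, 433)

# ways[k + 1] = number of ordered grade assignments over r classes whose
# grade total is k hundredths. Built up one class at a time.
count_ways <- function(r, grades100 = s100) {
  ways <- 1   # zero classes: exactly one way to have a total of 0
  for (i in seq_len(r)) {
    new_ways <- numeric(length(ways) + max(grades100))
    for (g in grades100) {
      idx <- seq_along(ways) + g           # shift every total up by grade g
      new_ways[idx] <- new_ways[idx] + ways
    }
    ways <- new_ways
  }
  ways
}

# Two of the routes to CGPA = 2 mentioned above.
print(mean(rep(2, 12)))                # all C's
print(mean(c(rep(4, 6), rep(0, 6))))   # half A's, half F's

ways12 <- count_ways(12)
ways_to_cgpa_2 <- ways12[2 * 12 * 100 + 1]   # grade total of 2400 hundredths
print(ways_to_cgpa_2)                        # number of routes to CGPA = 2
print(sum(ways12 > 0))                       # number of distinct CGPAs
```

Dividing ways_to_cgpa_2 by sum(ways12) would give the probability of CGPA = 2 on the premises of 12 classes and this scoring system, and nothing else.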

We’ve reached our first branching point! And the end of today’s lesson. See if you can guess where this is going.

I’ll answer all pertinent questions, but please look elsewhere on the site (or in Uncertainty) for criticisms of classical methods. Non-pertinent objections will be ignored.