Category: Class – Applied Statistics

January 9, 2018 | 21 Comments

Free Data Science Class: Predictive Case Study 1, Part VII


This is our last week of theory. Next week the practical side begins in earnest. However much fun that will be, and it will be a jolly time, this is the more important material.

Last time we learned the concept of irrelevance. A premise is irrelevant if when it is added to the model, the probability of our proposition of interest does not change. Irrelevance, like probability itself, is conditional. Here was our old example:

    (7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
    (7c) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,

In the context of the premises “grading rules, old obs, math”, “sock color” was irrelevant because the probability of “CGPA = 4” did not change when adding it. It is not that sock color is unconditionally irrelevant. For instance, we might have

    (7d) Pr(CGPA = 3 | grading rules, old obs, sock color, math) = 0.10,
    (7e) Pr(CGPA = 3 | grading rules, old obs, math) = 0.12,

where now, given a different proposition of interest, sock color has become relevant. Whether it is useful is, and always will be, whether it is pertinent to any decisions we would make about CGPA = 3. We might also have:

    (7f) Pr(CGPA = 4 | grading rules, old obs, sock color) = 0.041,
    (7g) Pr(CGPA = 4 | grading rules, old obs) = 0.04,

where sock color becomes relevant to CGPA = 4 absent our math (i.e. model) assumptions. Again, all relevance is conditional. And all usefulness depends on decision.

Decision is not unrelated to knowledge about cause. Cause is not something to be had from probability models; it is something that comes before them. Failing to understand this is the cause (get it!) of confusion generated by p-values, hypothesis tests, Bayes factors, parameter estimates, and so on. Let’s return to our example:

    (7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051.

Sock color is relevant. But does sock color cause a change in CGPA? How can it? Doubtless we can think of a story. We can always think of a story. Suppose sock color indicates the presence of white or light colored socks (then, the absence of sock color from the model implies dark color or no hosiery). We might surmise light color socks reflect extra light in examination rooms, tiring the eyes of wearers so that they will be caused to miss questions slightly more frequently than their better apparelled peers.

This is a causal story. It might be true. You don’t know it isn’t. That is, you don’t know unless you understand the true cause of sock color on grades. And, for most of us, this is no causation at all. We can tell an infinite number of causal stories, all equally consistent with the calculated probabilities, in which sock color affects CGPA. There cannot be proof they are all wrong. We therefore have to use induction (see this article) to infer sock color by its nature is acausal (to grades). We must grasp the essence of socks and sock-body contacts. This is perfectly possible. But it is something we do beyond the probabilities, inferring from the particular observations to the universal truth about essence. Our comprehension of cause is not in the probabilities, nor in the observations, but in the intellectual leap we make, and must make.

This is why any attempt to harness observations to arrive at causal judgments must fail. Algorithms cannot leap into the infinite like we can. Now this is a huge subject, beyond that which we can prove in this lesson. In Uncertainty, I cover it in depth. Read the Chapter on Cause and persuade yourself of the claims made above, or accept them for the sake of argument here.

What follows is that any kind of hypothesis test (or the like) must be making some kind of error, because it is claiming to do what we know cannot be done. It is claiming to have identified a cause, or a cause-like thing, from the observations.

Now classical statistics will not usually say that “cause” has been identified, but it will always be implied. In a regression for Income on Sex, it will be claimed (say) “Men make more than women” based on a wee p-value. This implies sex causes income “gaps”. Or we might hear, if the researcher is trying to be careful, “Sex is linked to income”. “Linked to” is causal talk. I have yet to see any definition (and they are all usually long-winded) of “linked to” that did not, in the end, boil down to cause.

There is a second type of cause to consider, the friend-of-a-friend cause, or the cause of a cause (or of a cause etc.). It might not be that sock color causes CGPAs to change, but that sock color is associated with another cause, or causes, that do. White sock color sometimes, we might say to ourselves, is associated with athletic socks, and athletic socks are tighter fitting, and it’s this tight fit that causes (another cause) itchiness, and the itchiness sometimes causes distraction during exams. This is a loose causal chain, but an intact one.

As above, we can tell an infinite number of these cause-of-a-cause stories, the difference being that here it is much harder to keep track of the essences of the problem. Cause isn’t always so easy! Just ask physicists trying to measure effects of teeny weeny particles.

If we do not have, or can not form, a clear causal chain in our mind, we excuse ourselves by saying sock color is “correlated” or (again) “linked to” CGPA, with the understanding that cause is mixed in somehow, but we do not quite know how to say so, or at least not in every case. We know sock color is relevant (to the probability), but the only way we would keep it in the model, as said above, is if it is important to a decision we make.

Part of any decision, though, is knowledge of cause. If we knew the essences of socks, and the essence of all things associated with sock color, and we judge that these have no causal power to change CGPA, then it would not matter if there were any difference in calculated probabilities between (7a) and (7b). We would expunge sock color from our model. We’d reason that even a handful of beans tossed onto the floor can take the appearance of a President’s profile, but we’d know the pattern was in our minds and not caused intentionally by the bean-floor combination.

If we knew that sock color is, sometimes and in some but not necessarily all instances, in the causal chain of CGPA (as in, for instance, tightness and itchiness), then we might include sock color in our model, but only if it were important for decision.

If we are ignorant (but perhaps only suspicious) of the causal chain of sock color, which for some observations in some models we will be, we keep the observation only if the decision would change.

Note carefully that it is only knowledge of cause or decision that leads to our accepting or rejecting any observable from our model. It has nothing to do (per se) with any function of measurements. Cause and decision are king in the predictive approach. Not blind algorithms.

In retrospect, this was always obvious. Even classical statisticians (and the researchers using these methods) do not put sock color into their models of grade point. Every model begins with excluding an infinity of non-causes, i.e. of observations that can be made but that are known to be causally (if not probabilistically) irrelevant to the proposition of interest. Nobody questions this, nor should they. Yet to be perfectly consistent with classical theory, we’d have to try and “reject” the “null” hypotheses of everything under, over, around, and beyond the sun, before we were sure we found the “true” model.

Lastly, as said before and just as obvious, if we knew the cause of Y, we don’t need probability models.

Next week: real practical examples!

Homework: I do not expect to “convert” those trained in classical methods. These fine folks are too used to the language in those methods to switch easily to this one. All I can ask is that people read Uncertainty for a fuller discussion of these topics. The real homework is to find an example of or try to define “linked to” without resorting somewhere to causal language.

Once you finish that impossible task, find a paper that says its results (at least in part) were “due to” chance. Now “due to” is also causal language. Given that chance is only a measure of ignorance, and therefore cannot cause anything, and using the beans-on-floor example above, explain what it is people are doing saying results were “due to” chance.

December 20, 2017 | 5 Comments

Cliodynamics And The Lack Of A Hari Seldon

There will be no Hari Seldon. But there will be prophets.

If there is no Seldon, there will be no psychohistory, the fictional astonishingly accurate mathematical science predicting gross human movements Isaac Asimov created for his Foundation novels.

Seldon and his followers were supposed to have discovered mathematical tricks that turned history into a science. Input certain measures and out come trajectories which are not certain but close to it, especially as the number of people increases.

These same occult magic tricks are searched for in reality by any number of folks with access to a computer. On the one hand are the “artificial intelligence” set who believe, falsely, that human intelligence “has” an equation. These people confess not knowing Seldon’s equations, but are sure their well-greased abacuses will find them once the number of wooden rods and beads becomes sufficiently dense. For a comparison of wooden abacus to electronic computer, see this series.

On the other hand are those who might be classed as analytic historians. They’ve invented for themselves “cliodynamics” which is, according to Wikipedia, “a transdisciplinary area of research integrating cultural evolution, economic history/cliometrics, macrosociology, the mathematical modeling of historical processes during the longue durée, and the construction and analysis of historical databases.” Nice boast!

One cliodynamiticist is Peter Turchin, “an evolutionary anthropologist at the University of Connecticut and Vice President of the Evolution Institute”, who input the article “Entering the Age of Instability after Trump: Why social instability and political violence is predicted to peak in the 2020s.”

Turchin predicts a coming doom, a not unfamiliar theme to regular readers. He says he’s tracking “40 seemingly disparate…social indicators” which are “leading indicators of political turmoil”. He predicts peak turmoil in the 2020s. Which is close.

Some of his indicators: “growing income and wealth inequality, stagnating and even declining well-being of most Americans, growing political fragmentation and governmental dysfunction”, all well known, too, as Turchin admits. He pegs “elite overproduction” as the unsung measure of doom.

Elite overproduction generally leads to more intra-elite competition that gradually undermines the spirit of cooperation, which is followed by ideological polarization and fragmentation of the political class. This happens because the more contenders there are, the more of them end up on the losing side. A large class of disgruntled elite-wannabes, often well-educated and highly capable, has been denied access to elite positions.

This exists, but its importance is unknown. That we have lost the story and have turned inward and truly self-centered might have more destructive force. That, and our elites have largely lost their minds. All crises are spiritual crises. Whoever wins this coming war will be the greater spiritual force.

Turchin’s language is saturated in Seldonism.

I find myself in the shoes of Hari Seldon, a fictional character in Isaac Asimov’s Foundation, whose science of history (which he called psychohistory) predicted the decline and fall of his own society. Should we follow Seldon’s lead and establish a Cliodynamic Foundation somewhere in the remote deserts of Australia?

This would be precisely the wrong thing to do. It didn’t work even in Isaac Asimov’s fictional universe. The problem with secretive cabals is that they quickly become self-serving, and then mire themselves in internecine conflict. Asimov came up with the Second Foundation to watch over the First. But who watches the watchers? In the end it all came down to a uniquely powerful and uniquely benevolent super-robot, R. Daneel Olivaw.

Don’t wait up for telepathic robots to save civilization (as the abacus article argues).

Another important consideration is that in Foundation Seldon’s equations told him that it would be impossible to stop the decline of the Galactic Empire—Trantor must fall. In real life, thankfully, things are different. And this is another way in which the forecasts of cliodynamics differ from prophecies of doom. They give us tools not only to understand the problem, but also potentially to fix it.

But to do it, we need to develop much better science. What we need is a nonpolitical, indeed a fiercely non-partisan, center/institute/think tank that would develop and refine a better scientific understanding of how we got into this mess; and then translate that science into policy to help us get out of it.

Brother Turchin, it ain’t gonna happen. Empires fall. None yet has found the solution to eternal life. I don’t usually say this, but, Brother, trust your equations. Creating yet another think tank that issues policy reports is foredoomed. Save your time and money.

If there is any hope, and there always is, it is in a spiritual regeneration. Making that happen is not so easy.

December 19, 2017 | 5 Comments

Free Data Science Class: Predictive Case Study 1, Part VI


This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

Last time we completed this model:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

What we meant by “fixed math notions” gave us the multinomial posterior predictive, from which we made probabilistic predictions of new observables. Other ideas of “fixed math notions” would, of course, give us different models, and possibly different predictions. If we instead started from knowledge only of measurement, and grading rules, we could have deduced a model for new observables, too. This is done in Uncertainty. But the results won’t, in this very simple case for our good-sized n, be much different.
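As a reminder of what such a predictive looks like in code, here is a minimal sketch. It assumes one particular reading of “fixed math notions”: a multinomial for the five CGPA categories with a Dirichlet prior whose parameters are all 1. That prior, the function name, and the counts below are illustrative assumptions, not the class data.

```python
from fractions import Fraction

def posterior_predictive(counts, prior=1):
    """Pr(new observable = j | old obs, assumed Dirichlet-multinomial math).

    counts[j] is how many old observations fell in category j.
    prior is the (assumed) Dirichlet parameter, the same for every category.
    """
    total = sum(counts) + prior * len(counts)
    return [Fraction(c + prior, total) for c in counts]

# Hypothetical old observations for CGPA = 0, 1, 2, 3, 4 (made up for illustration).
old_obs = [4, 17, 59, 16, 4]
pred = posterior_predictive(old_obs)
for cgpa, p in enumerate(pred):
    print(f"Pr(CGPA = {cgpa} | grading rules, old obs, math) = {float(p):.3f}")
```

Change the prior, which is to say change the “fixed math notions”, and the predictions change with it, though not by much when n is good-sized.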

We next want to add other measurements to the mix. Besides CGPA, we also measured High School GPA, SAT scores (I believe these are in some old format; the data, you will recall, is very old and from an unknown source), and hours spent studying for the week. We want to construct models like this:

    (7) Pr(CGPA = 4 | grading rules, old observables, old correlates, math notions),

where “old observables” are measures of CGPA and “old correlates” are measures of things we think are “correlated” with the observable of interest.

This brings us to our next and most crucial questions. What is a “correlate” and why are we putting them in our models? Don’t we need to test the hypotheses, via wee p-values or Bayes factors, that these correlates are “significantly” “linked” to the observable? What about “chance”?

Here is the weakest point of classical statistics. Now we have no chance here of having a complete discussion of the meaning and answers of these questions. We’ll have a go, but the depth will be unsatisfactory. All I can do is point to Uncertainty, and to other articles on the subject, and hope the introduction here is sufficient to progress.

What many are after can’t be had. The information about why a correlate is important is not in the data, i.e. the measurements of the correlate itself. Because of this, no mathematical function of the data can tell us about importance, either. Importance is outside the measured data, as we shall see. Usefulness is another matter.

Under strict probability, which is the method we are using, a “correlate” is any measure or bit of evidence you put on the right hand side. Here is where ML/AI techniques also excel. For instance, a correlate might be, “sock color of student worn on their third day of class.” With that, we can calculate (7).

Suppose we calculate these:

    (7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,

and the same for every value of CGPA (here we only have 5 possible values, 0-4, but what is said counts for however we classify the observable). I mean, if the prediction is the same (exactly identical) probability whether or not we include sock color, then in this model in this context and given these old obs, the sock color is irrelevant to the uncertainty in CGPA.

If we change anything on the right hand sides of (7a) or (7b) such that we get

    (7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051,

then sock color is relevant to our uncertainty in CGPA. Relevance, then, is a conditional measure, just as probability is. If there is any difference (to within machine floating-point round off!) in probabilities for any CGPA (with these givens), then sock color is relevant.
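The relevance check can be put in a few lines. This is only a sketch: the function name and tolerance are my choices for illustration, and except for the CGPA = 4 values quoted above, the probabilities are made up so that each set sums to 1.

```python
import math

def is_relevant(pred_with, pred_without, tol=1e-12):
    """True if adding the premise changes the probability of ANY CGPA value.

    pred_with and pred_without map each CGPA value to its predictive
    probability, computed with and without the premise (e.g. sock color).
    tol stands in for "machine floating-point round off".
    """
    return any(
        not math.isclose(pred_with[k], pred_without[k], rel_tol=tol, abs_tol=tol)
        for k in pred_with
    )

# Hypothetical predictions; only the CGPA = 4 entries come from the text.
with_socks = {0: 0.10, 1: 0.20, 2: 0.35, 3: 0.30, 4: 0.05}
without_same = {0: 0.10, 1: 0.20, 2: 0.35, 3: 0.30, 4: 0.05}
without_more_obs = {0: 0.10, 1: 0.20, 2: 0.35, 3: 0.299, 4: 0.051}

print(is_relevant(with_socks, without_same))      # identical: irrelevant
print(is_relevant(with_socks, without_more_obs))  # any difference: relevant
```

Notice the function says nothing about importance or usefulness. It cannot, because those are not in the numbers.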

Irrelevance is, as you can imagine, hard to come by. Even a cloud, made up of water and cloud condensation nuclei, can resemble a duck, even though the CCN have no artistic intentions. As for importance, that’s entirely different.

Would you, as Dean (recall we are a college dean), make any different decision given (7a) = 0.05 and (7b) = 0.051? (You have to also consider all the other values of CGPA you said were important; because the probabilities must sum to 1, at least one other value will differ too.) If so, then sock color is useful. If not, then sock color is useless. Or of no use. Even though it is, strictly speaking, relevant.

Think about this decision. Think very hard. The decision you make might be different than the decision somebody else makes. The model (7a) may be useless to you and useful to somebody else.

And then you think to yourself, “You know, that 0.001 can make a big difference when I consider tens of thousands of students” (maybe this is a big state school). So (7a) becomes interesting.

Well, how much would it cost to measure the sock color of every student on the third day of their class? It can be done. But would it be worth it? And you have to know it if you use (7a) instead of (7b). It’s a requirement. Besides, if students knew about the measurement, and they caught wind that, say, red colors have higher probabilities of large CGPA than any other color, wouldn’t they, being students and by definition ignorant, wear red on that important day? That would throw off the model. (We will answer why next time.)

Now if you dismiss this example as fanciful and thus not interesting, you have failed to understand the point. For it is the cost and consequences of the decisions you make that decide whether a relevant “variable” is useful. (Irrelevant “variables” are useless by definition.) We must always keep this in mind. The examples coming will make this concept sharper.
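To make the bookkeeping concrete, here is a rough sketch of the Dean’s reasoning. Everything in it beyond the two probabilities from (7a) and (7b) is a hypothetical stand-in: the number of students, the function names, and the crude rule that a premise is useful only if it changes the decision and the change is worth more than the cost of measuring.

```python
def expected_count_shift(p_with, p_without, n_students):
    """Expected change in the number of students predicted at a CGPA value."""
    return abs(p_with - p_without) * n_students

def is_useful(decision_with, decision_without, value_of_change, measurement_cost):
    """A relevant premise is useful only if the decision changes AND the
    change is worth more than the cost of measuring the premise.
    Both money amounts are supplied by the decision maker, not by the data."""
    return decision_with != decision_without and value_of_change > measurement_cost

# 0.05 and 0.051 are from the text; 20,000 students is an assumed school size.
shift = expected_count_shift(0.05, 0.051, 20_000)
print(f"Expected shift: about {shift:.0f} students")

# Same decision either way: sock color is of no use, however relevant.
print(is_useful("hire no tutors", "hire no tutors", 5_000, 1_000))
```

The inputs that matter most, the decisions and their costs, come from you and not from any function of the measurements.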

“But, Briggs, what could sock color have to do with CGPA?”

Sounds like you’re asking a question about cause. Let’s save that for next time.

It’s Christmas Break! Class resumes on 9 January 2018.

December 13, 2017 | 7 Comments

What Is The Probability Of COVFEFE?

From a tweet from Taleb, who informs us the following question is part of the Indian Statistical Institute examination.

(5) Mr.Trump decides to post a random message on Facebook and he starts typing a random sequence of letters {Uk}k≥1 such that they are chosen independently and uniformly from the 26 possible english alphabets. Find out the expected time of the first appearance of the word COVFEFE.

Now it is too good to check whether this is really used by the ISI, but I hope it is. It is too delicious. (Yes, it was Twitter, not Facebook.)

Regular readers will recall we had a Covfefe Sing-along after Trump’s masterly tweet.

The night Donald Trump took to Twitter
Elites had a terrible fit
Trump warned the world of covfefe
And Tweet streams were filled up with sh—

—Shaving cream.
Be nice and clean.
Shave everyday and
you’ll always look keen…

The ISI’s COVFEFE problem has much to recommend it, because it is chock full of the language of modern probability that is so confusing. (Even my title misleads! Nothing “has” a probability!)

Now I learned my math from physicists, who do things to equations that make mathematicians shudder, but which are moves that are at least an attempt to hew to reality. There isn’t anything wrong with mathematician math, but the temptation to the Deadly Sin of Reification can be overwhelming. And why all those curly brackets? They intimidate.

I still recall in a math course struggling with some higher-order proofs from Billingsley (a standard work on mathematical probability) when a Russian mathematician made everything snap into clarity when he told me X, the standard notation for a “random variable” which all the books said “had” a distribution, “was a function”, whereas as a physicist I always saw it as an observable or proposition. It can, of course, be both, but if you ever want to apply the math, it is a proposition.

So here is Trump typing. What does it mean—think like a physicist and not a mathematician—to “independently and uniformly” choose letters? To choose requires a method of choosing. Some thing or things are causing the characters to appear on the screen. What? Trump closing his eyes and smacking his hands into the keys? Maybe. But, if so, then we have no hope of identifying the causes of what appears. If we don’t know the causes, we can’t answer how long it will take. We can’t solve the problem.

Enter probability, which can’t answer the question, but can answer similar ones, like “Given certain assumptions, what are the chances it takes X seconds?”

Since all probability is conditional on the assumptions made, the assumptions matter. What are they?

Choosing letters “independently” is causal language. “Uniformly” insists the probability of every letter being typed is equal, a circular definition, since what we want to know is the probability. Say instead “There are 26 letters, one of which must be typed once per time unit t, where knowledge of the letters typed previously tells us nothing about letters to be typed.”

Since COVFEFE (we’re working with all caps via the information given) is 7 letters, we want to characterize the uncertainty in the total time it takes to type this sequence.

Do we have all we need? Not quite. Again, think like a physicist and not a mathematician. How long is Trump going to sit at the computer? (Or play with his Portable Thinking Suppression Device (PTSD)?) It can’t be forever. That means there should be a chance we never see COVFEFE. On the other hand, if we assume Trump types forever, then it is obvious that not only must COVFEFE appear, but it must appear an infinite number of times!

Indeed, if we allow the mathematical possibility of eternal typing, not only will COVFEFE appear in infinite plenitude, Trump will also type the entire works of Shakespeare, not just once, but also an infinite number of times. And the entire corpus of all works that can be typed in 26 letters sans spacing. Trump’s a genius!

Well that escalated quickly. That’s because The Limit is a bizarre place. Our intuition breaks down.

We still have to decide how fast Trump can type. Maybe two to five letters per second, but not faster than that. But that’s the physicist in me speaking. Keyboards and fingers can’t be engineered for infinitely fast typing. A mathematician might allow one character per infinitesimal time unit. If so, we have another infinity that has crept in. If one infinity was weird, try mixing two.

Point is, since probability needs assumptions, we need to make explicit all of them. The problem doesn’t do that. We have to bring our knowledge of English grammar to bear, which we always do, and which is part of the conditions. It will be no surprise people can come to different answers.

Homework: Assume finite time in which to type, and discrete positive real time to type each letter; assume also the simple characters proposition I gave and then calculate the probability of COVFEFE at t = 0, 1, 2, … n typing time units (notice this adds the assumption that letters come regularly with no variation, another mathematical, non-physical assumption). And then calculate the first appearance by t = 0, 1, 2, … n. Then calculate the expected value (is it even interesting?). After you have that, what happens as n goes to infinity? (Is that even interesting?) And can you also have the time unit decrease to the infinitesimal?

Hint. The probability of seeing COVFEFE and not seeing COVFEFE must sum to 1. If n = 1, the (conditional on all these assumptions) probability of COVFEFE is 0, and not-COVFEFE is 1. Same with n = 2, 3, 4, 5, and 6. What about n = 7? And so on?
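For those who want to check their homework, here is one way the finite-n calculation might be sketched. It assumes exactly the premises listed above (26 letters, one per time unit, earlier letters telling us nothing about later ones), and the function names are mine. It leans on the fact that COVFEFE cannot overlap itself (no proper prefix equals a suffix), so the chance q(t) of not having seen the word by time t obeys q(t) = q(t-1) - p q(t-7), with p = 26^-7 and q(0) = … = q(6) = 1. Exact fractions are used so no round-off sneaks in.

```python
from fractions import Fraction

def has_self_overlap(word):
    """True if some proper prefix of word equals a suffix of word."""
    return any(word[:k] == word[-k:] for k in range(1, len(word)))

def prob_seen_by(word, n, alphabet_size=26):
    """Pr(word has appeared at least once in the first n letters | premises).

    Only valid for words with no self-overlap, such as COVFEFE.
    """
    if has_self_overlap(word):
        raise ValueError("recurrence assumes a word with no self-overlap")
    length = len(word)
    p = Fraction(1, alphabet_size ** length)
    q = [Fraction(1)] * length          # q[t] = 1 for t < length
    for t in range(length, n + 1):
        q.append(q[t - 1] - p * q[t - length])
    return 1 - q[n]

for n in (6, 7, 8, 1000):
    print(n, float(prob_seen_by("COVFEFE", n)))

# With eternal typing, the expected first-appearance time for a word with
# no self-overlap is alphabet_size ** len(word) time units.
print(26 ** 7)  # 8031810176
```

At five letters a second that expected value is roughly fifty years of uninterrupted typing, which is the physicist’s way of asking whether the number is even interesting.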