

January 16, 2018

Free Data Science Class: Predictive Case Study 1, Part VIII


We’re continuing with the CGPA example. The data is online, of unknown origin, but good enough to use as an example.

We will build a correlational model, keeping ever in mind this model’s limitations. It can say nothing about cause, for instance.

As we discussed in earlier lessons, the model we will build is in reference to the decisions we will make. Our goal in this model is to make decisions regarding future students’ CGPAs given we have guesses of, or know, their HGPA, SAT, and possibly hours spent studying. We judge at least the first two to be in the causal path of CGPA. Our initial decision cares about getting CGPA to the nearest point (if you can’t recall why this is most crucially important — review!).

It would be best if we extended our earlier measurement-deduced model, so that we have the predictive model from the get go (if you do not remember what this means — review!). But that’s hard, and we’re lazy. So we’ll do what everybody does and use an ad hoc parameterized model, recognizing that all parameterized models are always approximations to the measurement reality.

Because this is an ad hoc parameterized model, we have several choices. Every choice is in response to a premise we have formed. Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises. All probability follows on our assumptions.

Now the multinomial logistic regression forms a parameter for every category—here we have 5, for CGPA = 0-4—and says those parameters are functions of parameterized measurements in a linear way. The math of all this is busy, but not too hard. Here is one source to examine the model in detail.

For instance, the parameter for CGPA = 0 is itself said to be a linear function of parameterized HGPA and SAT.
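For readers who want to see the form written out, here is a standard textbook sketch of a multinomial logit (a sketch only, not the exact parameterization the software below uses), with one category’s coefficients fixed at zero so the model is identifiable:

$$\Pr(\mathrm{CGPA} = i \mid \mathrm{HGPA}, \mathrm{SAT}, \beta) = \frac{\exp(\beta_{0,i} + \beta_{1,i}\,\mathrm{HGPA} + \beta_{2,i}\,\mathrm{SAT})}{\sum_{j=0}^{4} \exp(\beta_{0,j} + \beta_{1,j}\,\mathrm{HGPA} + \beta_{2,j}\,\mathrm{SAT})}, \qquad i = 0,\dots,4.$$

The betas are the parameters; they exist only to make the machinery go.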

These parameters do not exist, give no causal information, and are of no practical interest (no matter how interesting they are mathematically). For instance, they do not appear in what we really want, which is this:

    (8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,4.

We do not care about the parameters, which are only mathematical entities needed to get the model to work. But because we do not know the value of the parameters, the uncertainty in them, as it were, has to be specified. That is, a “prior” for them must be given. If we choose one prior, (8) will give one answer; if we choose a different prior, (8) will (likely) give a different answer. Same thing if we choose a different parameterized model: (8) will give different answers. This does not worry us because we remember all probability is conditional on the assumptions we make. CGPA does not “have” a probability! Indeed, the answers (8) gives using different models are usually much more varied than the answers given using the same model but different priors.

What prior should we use? Well, we’re lazy again. We’ll use whatever the software suggests, remembering other choices are possible.

Why not use the MNP R Package for “Fitting the Multinomial Probit Model”? But, wait. Probit is not the same as Logit. That’s true, so let’s update our ad hoc premise to say we really had in mind a multinomial probit model. If you do not have MNP installed, use this command, and follow the subsequent instructions about choosing a mirror.

install.packages('MNP', dependencies = TRUE)

There are other choices besides MNP, but unfortunately the software for multinomial regressions is not nearly as developed and as bulletproof as for ordinary regressions. MNP gives the predictive probabilities we want. But we’ll see that it can break. Besides that, our purpose is to understand the predictive philosophy and method, not to tout for a particular ad hoc model. What happens below goes for any model that can be put in the form of (8). This includes all machine learning, AI, etc.

The first thing is to ensure you have downloaded the data file cgpa.csv, and also the helper file briggs.class.R, which contains code we’ll use in this class. Warning: this file is updated frequently! For all the lawyers, I make no guarantee about this code. It might even destroy your computer, cause your wife to leave you, and encourage your children to become lawyers. Use at your own risk. Ensure Windows did not change the name of cgpa.csv to cgpa.csv.txt.

Save the files in a directory you create for the class. We’ll store that directory in the variable path. Remember, # comments out the rest of what follows on a line.

path = 'C:/Users/yourname/yourplace/' # for Windows
#path = '/home/yourname/yourplace/' # for Apple, Linux
# find the path to your file by looking at its properties
# everything in this class is in the same directory

source(paste(path,'briggs.class.R',sep='')) # runs the class code
x = read.csv(paste(path,'cgpa.csv',sep='')) 
 x$cgpa.o = x$cgpa # keeps an original copy of CGPA
 x$cgpa = as.factor(roundTo(x$cgpa,1)) # rounds to nearest 1
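The helper roundTo() comes from briggs.class.R. If you are curious what such a helper might look like, here is a minimal sketch; this is an assumption for illustration, and the definition in the class file may differ:

# hypothetical sketch only; the real roundTo() is defined in briggs.class.R
roundTo = function(x, to = 1) round(x / to) * to # round x to the nearest multiple of 'to'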

You should see this:

>  summary(x)
 cgpa        hgpa            sat           recomm          cgpa.o     
 0: 4   Min.   :0.330   Min.   : 400   Min.   : 2.00   Min.   :0.050  
 1:17   1st Qu.:1.640   1st Qu.: 852   1st Qu.: 4.00   1st Qu.:1.562  
 2:59   Median :1.930   Median :1036   Median : 5.00   Median :1.985  
 3:16   Mean   :2.049   Mean   :1015   Mean   : 5.19   Mean   :1.980  
 4: 4   3rd Qu.:2.535   3rd Qu.:1168   3rd Qu.: 6.00   3rd Qu.:2.410  
        Max.   :4.250   Max.   :1500   Max.   :10.00   Max.   :4.010  
> table(x$cgpa)

 0  1  2  3  4 
 4 17 59 16  4 

The measurement recomm we’ll deal with later. Next, the model.

require(MNP) # loads the package

fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=2000)
#fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=10000)

The model call is obvious enough, even if burnin = 2000, n.draws=2000 is opaque.

Depending on your system, the model fit might break. You might get an odd error message ("TruncNorm: lower bound is greater than upper bound"), or one about inverting a matrix, which you can investigate if you are so inclined (the problem is in a handful of values of sat, and in how the model starts up). This algorithm uses MCMC methods, and therefore cycles through a loop of size n.draws. All we need to know about this (for now) is that because this is a numerical approximation, larger numbers give less sloppy answers. Try n.draws=10000, or even five times that, if your system allows you to get away with it. The more draws you ask for, the longer the fit takes.
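The answers will also wander from run to run because of the MCMC draws. If you want a run to be repeatable on your own machine, one option (a suggestion, not part of the class code) is to fix R's random seed before fitting; the answers are still approximations, just the same approximations each time:

set.seed(101) # any fixed number will do; makes the draws repeatable on your machine
fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws = 10000)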

We can look at the output of the model like this (this is only a partial output):

> summary(fit)

mnp(formula = cgpa ~ sat + hgpa, data = x, n.draws = 50000, burnin = 2000)

                    mean    std.dev.       2.5%   97.5%
(Intercept):1 -1.189e+00  2.143e+00 -7.918e+00  0.810
(Intercept):2 -1.003e+00  1.709e+00 -5.911e+00  0.664
(Intercept):3 -8.270e+00  3.903e+00 -1.630e+01 -1.038
(Intercept):4 -2.297e+00  3.369e+00 -1.203e+01 -0.003
sat:1          9.548e-04  1.597e-03 -3.958e-04  0.006
sat:2          1.065e-03  1.488e-03 -7.126e-06  0.005
sat:3          4.223e-03  2.655e-03  2.239e-05  0.010
sat:4          1.469e-03  2.202e-03  1.704e-06  0.008
hgpa:1         9.052e-02  3.722e-01 -5.079e-01  0.953
hgpa:2         1.768e-01  3.518e-01 -2.332e-01  1.188
hgpa:3         1.213e+00  6.610e-01  1.064e-01  2.609
hgpa:4         3.403e-01  5.242e-01 -7.266e-04  1.893

The Coefficients are the parameters spoken of above. The mean etc. are the estimates of these unobservable, not-very-interesting entities. Just keep in mind that a large coefficient does not mean its effect on the probability of CGPA = i is itself large.

We do care about the predictions. We want (8), so let's get it. Stare at (8). On the right hand side we need to guess values of SAT and HGPA for a future student. Let's do that for two students, one with a low SAT and HGPA, and another with high values. You shouldn't have to specify values of CGPA, since these are what we are predicting, but that's a limitation of this software.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4))
a=predict(fit, newdata = y, type='prob')$p

The syntax is decided by the creators of the MNP package. Anyway, here's what I got. You will NOT see the exact same numbers, since the answers are helter-skelter numerical approximations, but you'll be close.

> a
            0          1         2      3            4
[1,] 0.519000 0.24008333 0.2286875 0.0115 0.0007291667
[2,] 0.000125 0.04489583 0.1222917 0.6900 0.1426875000

There are two students, so two rows of predictions, each giving probabilities for the five categories. This says that student (sat=400, hgpa=1) will most likely see a CGPA = 0. And for (sat=1500, hgpa=4), the most likely is a CGPA = 3. You can easily play with other scenarios. But, and this should be obvious, if (8) was our goal, we are done!
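As a sketch of one way to play with more scenarios, you can sweep SAT over a grid while holding HGPA fixed (the cgpa column is only there because of the software limitation noted above; its value does not affect the prediction):

y.grid = data.frame(cgpa = "4", sat = seq(400, 1500, by = 100), hgpa = 2)
a.grid = predict(fit, newdata = y.grid, type='prob')$p
round(a.grid, 3) # one row of predictive probabilities per scenario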

Next time we'll build on the scenarios, explore this model in more depth, and compare our model with classical ones.

Homework Play with other scenarios. Advanced students can track down the objectionable values of sat that cause grief in the model fit (I wrote a script to do this, and know which ones they are). Or they can change the premises by changing the starting values of the parameters. We didn't do that above, because most users never will, relying instead on the software to work "automatically".

The biggest homework is to think about the coefficients with respect to the prediction probabilities. Answer below!

January 9, 2018

Free Data Science Class: Predictive Case Study 1, Part VII


This is our last week of theory. Next week the practical side begins in earnest. However much fun that will be, and it will be a jolly time, this is the more important material.

Last time we learned the concept of irrelevance. A premise is irrelevant if, when it is added to the model, the probability of our proposition of interest does not change. Irrelevance, like probability itself, is conditional. Here was our old example:

    (7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
    (7c) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,

In the context of the premises “grading rules, old obs, math”, “sock color” was irrelevant because the probability of “CGPA = 4” did not change when adding it. It is not that sock color is unconditionally irrelevant. For instance, we might have

    (7d) Pr(CGPA = 3 | grading rules, old obs, sock color, math) = 0.10,
    (7e) Pr(CGPA = 3 | grading rules, old obs, math) = 0.12,

where now, given a different proposition of interest, sock color has become relevant. Whether it is useful depends, and always will depend, on whether it is pertinent to any decisions we would make about CGPA = 3. We might also have:

    (7f) Pr(CGPA = 4 | grading rules, old obs, sock color) = 0.041,
    (7g) Pr(CGPA = 4 | grading rules, old obs) = 0.04,

where sock color becomes relevant to CGPA = 4 absent our math (i.e. model) assumptions. Again, all relevance is conditional. And all usefulness depends on decision.

Decision is not unrelated to knowledge about cause. Cause is not something to be had from probability models; it is something that comes before them. Failing to understand this is the cause (get it!) of confusion generated by p-values, hypothesis tests, Bayes factors, parameter estimates, and so on. Let’s return to our example:

    (7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051.

Sock color is relevant. But does sock color cause a change in CGPA? How can it? Doubtless we can think of a story. We can always think of a story. Suppose sock color indicates the presence of white or light-colored socks (then, the absence of sock color from the model implies dark colors or no hosiery). We might surmise light-colored socks reflect extra light in examination rooms, tiring the eyes of wearers so that they are caused to miss questions slightly more frequently than their better apparelled peers.

This is a causal story. It might be true. You don’t know it isn’t. That is, you don’t know unless you understand the true causal relation, if any, between sock color and grades. And, for most of us, there is no causation at all. We can tell an infinite number of causal stories, all equally consistent with the calculated probabilities, in which sock color affects CGPA. There cannot be proof they are all wrong. We therefore have to use induction (see this article) to infer that sock color by its nature is acausal (with respect to grades). We must grasp the essence of socks and sock-body contacts. This is perfectly possible. But it is something we do beyond the probabilities, inferring from the particular observations to the universal truth about essence. Our comprehension of cause is not in the probabilities, nor in the observations, but in the intellectual leap we make, and must make.

This is why any attempt to harness observations to arrive at causal judgments must fail. Algorithms cannot leap into the infinite like we can. Now this is a huge subject, beyond that which we can prove in this lesson. In Uncertainty, I cover it in depth. Read the Chapter on Cause and persuade yourself of the claims made above, or accept them for the sake of argument here.

What follows is that any kind of hypothesis test (or the like) must be making some kind of error, because it is claiming to do what we know cannot be done. It is claiming to have identified a cause, or a cause-like thing, from the observations.

Now classical statistics will not usually say that “cause” has been identified, but it will always be implied. In a regression for Income on Sex, it will be claimed (say) “Men make more than women” based on a wee p-value. This implies sex causes income “gaps”. Or we might hear, if the researcher is trying to be careful, “Sex is linked to income”. “Linked to” is causal talk. I have yet to see any definition (and they are all usually long-winded) of “linked to” that did not, in the end, boil down to cause.

There is a second type of cause to consider, the friend-of-a-friend cause, or the cause of a cause (or of a cause etc.). It might not be that sock color causes CGPAs to change, but that sock color is associated with another cause, or causes, that do. White sock color sometimes, we might say to ourselves, is associated with athletic socks, and athletic socks are tighter fitting, and it’s this tight fit that causes (another cause) itchiness, and the itchiness sometimes causes distraction during exams. This is a loose causal chain, but an intact one.

As above, we can tell an infinite number of these cause-of-a-cause stories, the difference being that here it is much harder to keep track of the essences of the problem. Cause isn’t always so easy! Just ask physicists trying to measure effects of teeny weeny particles.

If we do not have, or cannot form, a clear causal chain in our mind, we excuse ourselves by saying sock color is “correlated” or (again) “linked to” CGPA, with the understanding that cause is mixed in somehow, but we do not quite know how to say so, or at least not in every case. We know sock color is relevant (to the probability), but the only way we would keep it in the model, as said above, is if it is important to a decision we make.

Part of any decision, though, is knowledge of cause. If we knew the essences of socks, and the essence of all things associated with sock color, and we judge that these have no causal power to change CGPA, then it would not matter if there were any difference in calculated probabilities between (7a) and (7b). We would expunge sock color from our model. We’d reason that even a handful of beans tossed onto the floor can take the appearance of a President’s profile, but we’d know the pattern was in our minds and not caused intentionally by the bean-floor combination.

If we knew that, sometimes and in some but not necessarily all instances, sock color is in the causal chain of CGPA (as, for instance, with tightness and itchiness), then we might include sock color in our model, but only if it were important for the decision.

If we are ignorant of (but perhaps only suspicious about) the causal chain of sock color, which for some observations in some models we will be, we keep the observation only if the decision would change.

Note carefully that it is only knowledge of cause or decision that leads us to accept or reject any observable in our model. It has nothing to do (per se) with any function of the measurements. Cause and decision are king in the predictive approach. Not blind algorithms.

In retrospect, this was always obvious. Even classical statisticians (and the researchers using these methods) do not put sock color into their models of grade point. Every model begins by excluding an infinity of non-causes, i.e. of observations that can be made but that are known to be causally (if not probabilistically) irrelevant to the proposition of interest. Nobody questions this, nor should they. Yet to be perfectly consistent with classical theory, we’d have to try to “reject” the “null” hypotheses of everything under, over, around, and beyond the sun, before we were sure we found the “true” model.

Lastly, as said before and just as obvious, if we knew the cause of Y, we don’t need probability models.

Next week: real practical examples!

Homework I do not expect to “convert” those trained in classical methods. These fine folks are too used to the language in those methods to switch easily to this one. All I can ask is that people read Uncertainty for a fuller discussion of these topics. The real homework is to find an example of or try to define “linked to” without resorting somewhere to causal language.

Once you finish that impossible task, find a paper that says its results (at least in part) were “due to” chance. Now “due to” is also causal language. Given that chance is only a measure of ignorance, and therefore cannot cause anything, and using the beans-on-floor example above, explain what it is people are doing saying results were “due to” chance.

December 20, 2017

Cliodynamics And The Lack Of A Hari Seldon

There will be no Hari Seldon. But there will be prophets.

If there is no Seldon, there will be no psychohistory, the fictional, astonishingly accurate mathematical science of predicting gross human movements that Isaac Asimov created for his Foundation novels.

Seldon and his followers were supposed to have discovered mathematical tricks that turned history into a science. Input certain measures and out come trajectories which are not certain but close to it, especially as the number of people increase.

These same occult magic tricks are searched for in reality by any number of folks with access to a computer. On the one hand are the “artificial intelligence” set who believe, falsely, that human intelligence “has” an equation. These people confess not knowing Seldon’s equations, but are sure their well greased abacuses will find them once the number of wooden rods and beads become sufficiently dense. For a comparison of wooden abacus to electronic computer, see this series.

On the other hand are those who might be classed as analytic historians. They’ve invented for themselves “cliodynamics” which is, according to Wikipedia, “a transdisciplinary area of research integrating cultural evolution, economic history/cliometrics, macrosociology, the mathematical modeling of historical processes during the longue durée, and the construction and analysis of historical databases.” Nice boast!

One cliodynamicist is Peter Turchin, “an evolutionary anthropologist at the University of Connecticut and Vice President of the Evolution Institute”, who wrote the article “Entering the Age of Instability after Trump: Why social instability and political violence is predicted to peak in the 2020s.”

Turchin predicts a coming doom, a not unfamiliar theme to regular readers. He says he’s tracking “40 seemingly disparate…social indicators” which are “leading indicators of political turmoil”. He predicts peak turmoil in the 2020s. Which is close.

Some of his indicators: “growing income and wealth inequality, stagnating and even declining well-being of most Americans, growing political fragmentation and governmental dysfunction”, all well known, too, as Turchin admits. He pegs “elite overproduction” as the unsung measure of doom.

Elite overproduction generally leads to more intra-elite competition that gradually undermines the spirit of cooperation, which is followed by ideological polarization and fragmentation of the political class. This happens because the more contenders there are, the more of them end up on the losing side. A large class of disgruntled elite-wannabes, often well-educated and highly capable, has been denied access to elite positions.

This exists, but its importance is unknown. That we have lost the story and have turned inward and truly self-centered might have more destructive force. That, and our elites have largely lost their minds. All crises are spiritual crises. Whoever wins this coming war will be the greater spiritual force.

Turchin’s language is saturated in Seldonism.

I find myself in the shoes of Hari Seldon, a fictional character in Isaac Asimov’s Foundation, whose science of history (which he called psychohistory) predicted the decline and fall of his own society. Should we follow Seldon’s lead and establish a Cliodynamic Foundation somewhere in the remote deserts of Australia?

This would be precisely the wrong thing to do. It didn’t work even in Isaac Asimov’s fictional universe. The problem with secretive cabals is that they quickly become self-serving, and then mire themselves in internecine conflict. Asimov came up with the Second Foundation to watch over the First. But who watches the watchers? In the end it all came down to a uniquely powerful and uniquely benevolent super-robot, R. Daneel Olivaw.

Don’t wait up for telepathic robots to save civilization (as the abacus article argues).

Another important consideration is that in Foundation Seldon’s equations told him that it would be impossible to stop the decline of the Galactic Empire—Trantor must fall. In real life, thankfully, things are different. And this is another way in which the forecasts of cliodynamics differ from prophecies of doom. They give us tools not only to understand the problem, but also potentially to fix it.

But to do it, we need to develop much better science. What we need is a nonpolitical, indeed a fiercely non-partisan, center/institute/think tank that would develop and refine a better scientific understanding of how we got into this mess; and then translate that science into policy to help us get out of it.

Brother Turchin, it ain’t gonna happen. Empires fall. None yet has found the solution to eternal life. I don’t usually say this, but, Brother, trust your equations. Creating yet another think tank that issues policy reports is foredoomed. Save your time and money.

If there is any hope, and there always is, it is in a spiritual regeneration. Making that happen is not so easy.

December 19, 2017

Free Data Science Class: Predictive Case Study 1, Part VI


This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

Last time we completed this model:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

What we meant by “fixed math notions” gave us the multinomial posterior predictive, from which we made probabilistic predictions of new observables. Other ideas of “fixed math notions” would, of course, give us different models, and possibly different predictions. If we instead started from knowledge only of measurement, and grading rules, we could have deduced a model for new observables, too. This is done in Uncertainty. But the results won’t, in this very simple case for our good-sized n, be much different.
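For reference, under one common choice of “fixed math notions”, a uniform Dirichlet prior over the k = 5 categories, the posterior predictive takes the simple form

$$\Pr(\mathrm{CGPA} = i \mid \text{grading rules, old obs, math}) = \frac{n_i + 1}{n + k},$$

where $n_i$ is the number of old students in category $i$ and $n$ is the total number of old students. This is a hedged reconstruction, not necessarily the exact form used earlier in the class, but it is close enough to fix ideas.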

We next want to add other measurements to the mix. Besides CGPA, we also measured High School GPA, SAT scores (I believe these are in some old format; the data, you will recall, is very old and from an unknown source), and hours spent studying for the week. We want to construct models like this:

    (7) Pr(CGPA = 4 | grading rules, old observables, old correlates, math notions),

where “old observables” are measures of CGPA and “old correlates” are measures of things we think are “correlated” with the observable of interest.

This brings us to our next and most crucial questions. What is a “correlate” and why are we putting them in our models? Don’t we need to test the hypotheses, via wee p-values or Bayes factors, that these correlates are “significantly” “linked” to the observable? What about “chance”?

Here is the weakest point of classical statistics. Now we have no chance here of having a complete discussion of the meaning of, and answers to, these questions. We’ll have a go, but the depth will be unsatisfactory. All I can do is point to Uncertainty, and to other articles on the subject, and hope the introduction here is sufficient to progress.

What many are after can’t be had. The information about why a correlate is important is not in the data, i.e. the measurements of the correlate itself. Because of this, no mathematical function of the data can tell us about importance, either. Importance is outside the measured data, as we shall see. Usefulness is another matter.

Under strict probability, which is the method we are using, a “correlate” is any measure or bit of evidence you put on the right hand side. Here is where ML/AI techniques also excel. For instance, a correlate might be, “sock color of student worn on their third day of class.” With that, we can calculate (7).

Suppose we calculate these:

    (7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,

and the same for every value of CGPA (here we have only 5 possible values, 0-4, but what is said counts for however we classify the observable). If, I mean, the prediction is the same (exactly identical) probability whether or not we include sock color, then in this model, in this context, and given these old obs, sock color is irrelevant to the uncertainty in CGPA.

If we change anything on the right hand sides of (7a) or (7b) such that we get

    (7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
    (7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051,

then sock color is relevant to our uncertainty in CGPA. Relevance, then, is a conditional measure, just as probability is. If there is any difference (to within machine floating-point round off!) in the probabilities for any CGPA (with these givens), then sock color is relevant.
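To make the relevance check concrete, here is a minimal numerical sketch using a multinomial posterior predictive with a uniform prior (as sketched above). The sock measurement is entirely hypothetical, invented at random purely for illustration, and x is the class data loaded and rounded as in the practical lesson above:

set.seed(8)
x$sock = rbinom(nrow(x), 1, 0.5) # hypothetical 0/1 sock-color indicator, invented for illustration
k = length(levels(x$cgpa)) # number of CGPA categories (5)
# Pr(CGPA = 4 | old obs, math): (count + 1) / (n + k) under a uniform prior
p.without = (sum(x$cgpa == "4") + 1) / (nrow(x) + k)
# Pr(CGPA = 4 | old obs, sock = 1, math): condition on the sock-wearing subset
sub = x[x$sock == 1, ]
p.with = (sum(sub$cgpa == "4") + 1) / (nrow(sub) + k)
c(without.sock = p.without, with.sock = p.with)
# any difference beyond round off means sock color is (conditionally) relevant;
# whether it is useful depends on the decision you would make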

Irrelevance is, as you can imagine, hard to come by. Even a cloud, made up of water and cloud condensation nuclei, can resemble a duck, even though the CCN have no artistic intentions. As for importance, that’s entirely different.

Would you, as Dean (recall we are a college dean), make any different decision given (7a) = 0.05 and (7b) = 0.051? (You have to also consider all the other values of CGPA you said were important, and at least one other value will differ by at least 0.01.) If so, then sock color is useful. If not, then sock color is useless. Or of no use. Even though it is, strictly speaking, relevant.

Think about this decision. Think very hard. The decision you make might be different than the decision somebody else makes. The model (7a) may be useless to you and useful to somebody else.

And then you think to yourself, “You know, that 0.01 can make a big difference when I consider tens of thousands of students” (maybe this is a big state school). So (7a) becomes interesting.

Well, how much would it cost to measure the sock color of every student on the third day of their class? It can be done. But would it be worth it? And you have to know it if you use (7a) instead of (7b). It’s a requirement. Besides, if students knew about the measurement, and they caught wind that, say, red colors have higher probabilities of large CGPA than any other color, wouldn’t they, being students and by definition ignorant, wear red on that important day? That would throw off the model. (We’ll answer why next time.)

Now if you dismiss this example as fanciful and thus not interesting, you have failed to understand the point. For it is the cost and consequences of the decisions you make that decide whether a relevant “variable” is useful. (Irrelevant “variables” are useless by definition.) We must always keep this in mind. The examples coming will make this concept sharper.

“But, Briggs, what could sock color have to do with CGPA?”

Sounds like you’re asking a question about cause. Let’s save that for next time.

It’s Christmas Break! Class resumes on 9 January 2018.