# Free Data Science Class: Predictive Case Study 1, Part VII

Review!

This is our last week of theory. Next week the practical side begins in earnest. However much fun that will be, and it will be a jolly time, this is the more important material.

Last time we learned the concept of irrelevance. A premise is irrelevant if when it is added to the model, the probability of our proposition of interest does not change. Irrelevance, like probability itself, is conditional. Here was our old example:

(7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
(7c) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,

In the context of the premises “grading rules, old obs, math”, “sock color” was irrelevant because the probability of “CGPA = 4” did not change when adding it. It is not that sock color is unconditionally irrelevant. For instance, we might have

(7d) Pr(CGPA = 3 | grading rules, old obs, sock color, math) = 0.10,
(7e) Pr(CGPA = 3 | grading rules, old obs, math) = 0.12,

where now, given a different proposition of interest, sock color has become relevant. Whether it is useful depends, and always will depend, on whether it is pertinent to any decisions we would make about CGPA = 3. We might also have:

(7f) Pr(CGPA = 4 | grading rules, old obs, sock color) = 0.041,
(7g) Pr(CGPA = 4 | grading rules, old obs) = 0.04,

where sock color becomes relevant to CGPA = 4 absent our math (i.e. model) assumptions. Again, all relevance is conditional. And all usefulness depends on decision.
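The definition can be sketched in a few lines of Python. This is only an illustration: the function name `is_irrelevant` and the tolerance are assumptions, and the probabilities are simply the ones quoted in (7a) through (7g), not anything calculated from data.

```python
import math

def is_irrelevant(p_with: float, p_without: float, tol: float = 1e-12) -> bool:
    """A premise is irrelevant if adding it leaves the probability unchanged."""
    return math.isclose(p_with, p_without, rel_tol=0.0, abs_tol=tol)

# (label, Pr with sock color, Pr without sock color), numbers taken from the text
pairs = [
    ("CGPA = 4, with math (7a vs 7c)", 0.05, 0.05),
    ("CGPA = 3, with math (7d vs 7e)", 0.10, 0.12),
    ("CGPA = 4, without math (7f vs 7g)", 0.041, 0.04),
]

verdicts = {label: is_irrelevant(p_with, p_without) for label, p_with, p_without in pairs}
for label, irrelevant in verdicts.items():
    print(f"{label}: sock color is {'irrelevant' if irrelevant else 'relevant'}")
```

The same premise gets a different verdict in each row, because the proposition of interest or the other premises changed. The check says nothing about usefulness, which depends on the decision.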

Decision is not unrelated to knowledge about cause. Cause is not something to be had from probability models; it is something that comes before them. Failing to understand this is the cause (get it!) of confusion generated by p-values, hypothesis tests, Bayes factors, parameter estimates, and so on. Let’s return to our example:

(7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
(7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051.

Sock color is relevant. But does sock color cause a change in CGPA? How can it? Doubtless we can think of a story. We can always think of a story. Suppose sock color indicates the presence of white or light colored socks (then, the absence of sock color from the model implies dark color or no hosiery). We might surmise light color socks reflect extra light in examination rooms, tiring the eyes of wearers so that they will be caused to miss questions slightly more frequently than their better apparelled peers.

This is a causal story. It might be true. You don’t know it isn’t. That is, you don’t know unless you understand the true cause of sock color on grades. And, for most of us, this is no causation at all. We can tell an infinite number of causal stories, all equally consistent with the calculated probabilities, in which sock color affects CGPA. There cannot be proof they are all wrong. We therefore have to use induction (see this article) to infer sock color by its nature is acausal (to grades). We must grasp the essence of socks and sock-body contacts. This is perfectly possible. But it is something we do beyond the probabilities, inferring from the particular observations to the universal truth about essence. Our comprehension of cause is not in the probabilities, nor in the observations, but in the intellectual leap we make, and must make.

This is why any attempt to harness observations to arrive at causal judgments must fail. Algorithms cannot leap into the infinite like we can. Now this is a huge subject, beyond that which we can prove in this lesson. In Uncertainty, I cover it in depth. Read the Chapter on Cause and persuade yourself of the claims made above, or accept them for the sake of argument here.

What follows is that any kind of hypothesis test (or the like) must be making some kind of error, because it is claiming to do what we know cannot be done. It is claiming to have identified a cause, or a cause-like thing, from the observations.

Now classical statistics will not usually say that “cause” has been identified, but it will always be implied. In a regression for Income on Sex, it will be claimed (say) “Men make more than women” based on a wee p-value. This implies sex causes income “gaps”. Or we might hear, if the researcher is trying to be careful, “Sex is linked to income”. “Linked to” is causal talk. I have yet to see any definition (and they are all usually long-winded) of “linked to” that did not, in the end, boil down to cause.

There is a second type of cause to consider, the friend-of-a-friend cause, or the cause of a cause (or of a cause etc.). It might not be that sock color causes CGPAs to change, but that sock color is associated with another cause, or causes, that do. White sock color sometimes, we might say to ourselves, is associated with athletic socks, and athletic socks are tighter fitting, and it’s this tight fit that causes (another cause) itchiness, and the itchiness sometimes causes distraction during exams. This is a loose causal chain, but an intact one.

As above, we can tell an infinite number of these cause-of-a-cause stories, the difference being that here it is much harder to keep track of the essences of the problem. Cause isn’t always so easy! Just ask physicists trying to measure effects of teeny weeny particles.

If we do not have, or cannot form, a clear causal chain in our mind, we excuse ourselves by saying sock color is “correlated” or (again) “linked to” CGPA, with the understanding that cause is mixed in somehow, but we do not quite know how to say so, or at least not in every case. We know sock color is relevant (to the probability), but the only way we would keep it in the model, as said above, is if it is important to a decision we make.

Part of any decision, though, is knowledge of cause. If we knew the essences of socks, and the essence of all things associated with sock color, and we judged that these have no causal power to change CGPA, then it would not matter if there were any difference in calculated probabilities between (7a) and (7b). We would expunge sock color from our model. We’d reason that even a handful of beans tossed onto the floor can take the appearance of a President’s profile, but we’d know the pattern was in our minds and not caused intentionally by the bean-floor combination.
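The beans-on-the-floor point can be sketched with a small simulation. Everything here is an assumption made for illustration: the function `chance_gap`, the sample size, the five-point CGPA scale, and above all the use of simple relative frequencies as a stand-in for the calculated probabilities. Sock color is generated with no connection to CGPA, so any gap that appears is pattern without cause.

```python
import random

def chance_gap(n_students: int, seed: int) -> float:
    """Difference in the frequency of CGPA = 4 between light- and dark-sock
    students when sock color is generated independently of CGPA."""
    rng = random.Random(seed)
    counts = {"light": [0, 0], "dark": [0, 0]}  # [students, students with CGPA = 4]
    for _ in range(n_students):
        sock = rng.choice(["light", "dark"])  # no causal link to grades by construction
        cgpa = rng.randint(0, 4)              # CGPA drawn without reference to socks
        counts[sock][0] += 1
        counts[sock][1] += cgpa == 4
    # Guard against an empty group so the division is always defined.
    freq = {k: (hits / n if n else 0.0) for k, (n, hits) in counts.items()}
    return freq["light"] - freq["dark"]

# Five hypothetical "studies" of 200 students each.
gaps = [chance_gap(n_students=200, seed=s) for s in range(5)]
print(gaps)
```

The printed gaps will typically differ from zero even though sock color was built to be acausal. Nothing in the numbers themselves tells us that; only our knowledge of how the data were generated does.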

If we knew that, sometimes and in some but not necessarily all instances, sock color is in the causal chain of CGPA (as in for instance tightness and itchiness) then we might include sock color in our model but only if it were important for decision.

If we are ignorant (but perhaps only suspicious) of the causal chain of sock color, which for some observations in some models we will be, we keep the observation only if the decision would change.

Note carefully that it is only knowledge of cause or decision that lead to use accepting or rejecting any observable from our model. It has nothing to do (per se) with any function of measurements. Cause and decision are king in the predictive approach. Not blind algorithms.
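The decision half of this rule can be sketched as follows. The function `keep_premise` and both decision rules are hypothetical, invented only to show the logic; the probabilities are those of (7a) and (7b). Knowledge of cause, the other half of the rule, is not something code can supply.

```python
from typing import Callable

def keep_premise(p_with: float, p_without: float,
                 decide: Callable[[float], str]) -> bool:
    """Keep a premise only if it changes the decision, not merely the probability."""
    return decide(p_with) != decide(p_without)

def scholarship_rule(p: float) -> str:
    # Hypothetical rule: act only if Pr(CGPA = 4 | premises) reaches 0.10.
    return "offer" if p >= 0.10 else "decline"

def knife_edge_rule(p: float) -> str:
    # Hypothetical rule with a threshold that happens to fall between 0.05 and 0.051.
    return "offer" if p >= 0.0505 else "decline"

# (7a) and (7b) from the text: with and without sock color.
print(keep_premise(0.05, 0.051, scholarship_rule))  # False: same decision either way
print(keep_premise(0.05, 0.051, knife_edge_rule))   # True: the decision flips
```

The probabilities are identical in both calls. Only the decision differs, and with it the verdict on whether sock color earns its place in the model.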

In retrospect, this was always obvious. Even classical statisticians (and the researchers using these methods) do not put sock color into their models of grade point. Every model begins with excluding an infinity of non-causes, i.e. of observations that can be made but that are known to be causally (if not probabilistically) irrelevant to the proposition of interest. Nobody questions this, nor should they. Yet to be perfectly consistent with classical theory, we’d have to try and “reject” the “null” hypotheses of everything under, over, around, and beyond the sun, before we were sure we found the “true” model.

Lastly, as said before and just as obvious, if we knew the cause of Y, we wouldn’t need probability models.

Next week: real practical examples!

Homework: I do not expect to “convert” those trained in classical methods. These fine folks are too used to the language in those methods to switch easily to this one. All I can ask is that people read Uncertainty for a fuller discussion of these topics. The real homework is to find an example of or try to define “linked to” without resorting somewhere to causal language.

Once you finish that impossible task, find a paper that says its results (at least in part) were “due to” chance. Now “due to” is also causal language. Given that chance is only a measure of ignorance, and therefore cannot cause anything, and using the beans-on-floor example above, explain what it is people are doing saying results were “due to” chance.

### 21 replies »

1. So probability is not cause, and cause is not probability. Probabilities can only inform decisions when causes are obscure. Of course, when causes are obscure, then probabilities can not be accurately measured (or rather, estimated).

2. Briggs says:

McChuck, Almost. When causes are obscure, we can often accurately measure probabilities. Take the example of an ordinary die. Pr(6 | 6-sided object thrown once, with one side 6, and where one side must show) = 1/6 by deduction. The object can even be fictional, so there is no concern about measuring.
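A minimal sketch of that deduction, assuming the premises give no reason to favor any one side over another. The function name and the labels on the other five sides are hypothetical; the premises fix only the presence of a single 6.

```python
from fractions import Fraction

def pr_by_deduction(sides: list[int], target: int) -> Fraction:
    """Probability that `target` shows, deduced from the premises alone:
    exactly one side must show and the premises favor none of them."""
    return Fraction(sides.count(target), len(sides))

sides = [1, 2, 3, 4, 5, 6]  # labels other than 6 are hypothetical
print(pr_by_deduction(sides, 6))  # 1/6
```

No throw is simulated and no measurement is taken. The number follows from the premises, which is why the object can be fictional.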

3. JohnK says:

A very nice addition to the series. A careful reader could glean much. In addition, a very nice turning-around of what it would truly mean to “reject” the ‘null’ hypothesis. But one caveat, and one additional observation.

“Note carefully that it is only knowledge of cause or decision that lead to use [“use” = us] accepting or rejecting any observable from our model.”

Not quite. First, ‘excluding’ an observable is not coterminous with ‘rejecting’ it. Viz., a second infinity of possible observables are neither accepted nor rejected. These observables are excluded from a model, but neither accepted nor rejected, because they are not even considered.

If a specific observable is brought to mind, then and only then can it be specifically accepted or rejected (within a priori knowledge of cause, or future decision). Thus, one implicit problem with the sentence is that infinities (of observables, for example) is itself a problem for a finite mind. A second problem is the issue of false knowledge or wayward decision, leading to acceptance or rejection that goes down the rabbit hole. So the sentence as it stands obscures as much as it reveals.

A preliminary revision of the sentence: “Note carefully that when we accept or reject any specific observable from our model, we do so (validly) only by means of knowledge of cause or decision.”

Now the observation. Unbaptized Aristotelian induction is too weak a reed on which to found a grasp of the deeper problem of reason’s relation to, and grasp of, cause and truth. Aristotelian ‘induction’ implies, or better, generates, reason as autonomous, which serves neither Augustinian illumination, nor (the vastly and sadly neglected, even by St. Thomas) Thomist trahi a Deo.

The utter indigence of reason is not well-represented by Aristotelian induction, and at the limit, it is not represented at all. Why would the One, the Prime Mover, perfect, entire, and complete in Itself, (N.B.: Itself) even wish, let alone allow, any connection with anything not Itself, not-One, not perfect, corruptible — corrupt?

Plato ‘resolved’ this root problem not with reason but with myth, with the Fall of the Forms; which is not ideal, but much more satisfactory a solution than Aristotle’s, who solved the problem by ignoring it.

That the Way, the Truth, and the Light can touch us, and wishes to, is not in dispute. That we can even stretch out our hand and touch His side, and measure His Wound in millimeters, is entirely His doing, though we do it. But the very idea of such a thing is repugnant to the Greeks, as St. Paul knew well.

The absolutely decisive role of Jesus’s death in the salvation of reason – in the salvation, even the possibility, of induction – is not merely utterly unfamiliar to Aristotle; it is actively repugnant to Aristotelianism.

Aristotelian induction perpetuates a deep wound in Catholic theology, by assuming an autonomous reason, which of its essence can have no Catholic interest.

4. JH says:

Wharton Business school wishes to accept students who are more likely to earn a CGPA greater than 3.50 than not. Briggs is asked to construct a model for admission decisions. He mysteriously derives the following.

(7a) Pr(CGPA > 3.50 | sock color=red, old obs, sock color, whatever) = 0.51
(7b) Pr(CGPA > 3.50 | sock color=black, old obs, whatever) = 0.49,
(7c) Pr(CGPA > 3.50 |old obs, whatever) = 0.50.

Wharton: ”Dr. Briggs, evidently, you believe sock color is relevant. What is your recommendation?”

Briggs: “You might include sock color in the model if it were important to a decision you’d like to make.”

Wharton: “Hmm, it is important that we accept a student whose name is Trump. We will accept him stipulating the condition that he is to wear red socks throughout his student career here. More importantly, when people accuse us of giving preference to rich people, we have the model to show otherwise.”

In the mean time, the sale of black socks is plummeting.

“Luck prefers people who wear red socks,” said my wise grandma.

5. JH says:

Please first note deciding whether to drop a variable from a model is different from deciding whether there is an important/significant difference between male and female in their mean income.

If a method is employable, a modern statistician would happily adopt it. There are only types of methods, no more types of statisticians.

If people understand the math definitions of terms, there should not be any confusion. It’s not a frequentist confusion or a Bayesian confusion. No, one may not freely substitute the word “linear correlation” with “dependence” or “concordance” or whatnot. They have different theoretical math definitions and hence different meanings, though these may coincide in some cases.

I don’t include the term “relevance” in the above paragraph because its definition given here looks like the theoretical definition of dependence and strangely involves both variables and old observation, i.e., it is not a theoretical definition.

One solution is to begin with a theoretical definition of independence (dependence) and rename it as irrelevance (relevance). What I am going to write next is a variation of standard practices in reporting a case study or project results. State the objective or the decision to be made. Describe how and what data are ascertained (and why). Then, explain based on theory how a model/method can help us in the decision making. Of course, conclusions should always be reported.

6. Briggs – Your description of the six sided die left out the all-important color scheme of the die. 🙂

7. Briggs says:

JH,

Excellent example, for what you have described is exactly precisely the wrong way to do it. Which is to say, the wrong way is by having the statistician decide what is important and decisionable. You’ve identified an important weakness of significance testing, which works in just that way.

The Wharton dean in this case would rightly say, “Go back, forget the CGPA by every 1, and give me CGPA broken down by less than 3.5, and greater than or equal, and get rid of the socks.”

8. Briggs says:

JH,

The term relevance belongs to Keynes (and other logical probabilists). I wrote about this extensively in Uncertainty.

9. JH says:

Which is to say, the wrong way is by having the statistician decide what is important and decisionable. You’ve identified an important weakness of significance testing, which works in just that way.

So, who decides to use (7.a) and (7.b) (unnecessarily) to make a conclusion/decision of “you-decide” as to whether to keep Sock Color in the model, blockquoted below?

Why would YOU, the statistician, decide that the Wharton dean (not who I had in mind, but Dean it is) would rightly want you to tally CGPAs or get rid of the Sock Color that is relevant? Whose CGPAs are to be tallied? The mysterious old obs? Are you telling the Dean to change its objective set forth at the beginning of the scenario?

HA. I thought, by using the sock color proposed by you, and adopting your way and your conclusion of

“ we would keep it in the model, as said above, is if it is important to a decision we make,”

that I have given an example demonstrating how your way can possibly lead to a ridiculous conclusion, and is useful in helping people get the decision they want, i.e., providing a justification for the acceptance of Trump because it is important to the Dean to accept him. That is, your way may lead to a can of fermented fish.

Statisticians might not be able to decide others’ project objectives. They can help decide how to obtain data and what data to collect accordingly. Sometimes, zipped data would show up in their mailbox with a demand they mine their data and report their findings, without clear questions to be answered and formulated. Surely it is their job to see what conclusions they can possibly draw, what decisions can possibly be made based on the data by applying all appropriate methods available to them. Is this the wrong way? No. The main problem lies in whether a so-called statistician or a practitioner of statistics is competent. A thermometer is not to be blamed when the weather is cold. It is simply a tool.

10. JH says:

The term relevance belongs to Keynes (and other logical probabilists). I wrote about this extensively in Uncertainty.

Keynes does not use the term relevance the way you present here. He deals with propositional evidence, not observed data evidence. As pointed out by Jon Williamson in his book, Keynes’ theory only interprets single-case probabilities since propositions are the relata of the probability relation.

While it may be possible to use the logical relation between propositions to assign probabilities, I seriously doubt that people can find logical relations among observed data values to assign probabilities logically. Let me repeat myself, in the practice of statistical data analysis, every observed factor would be relevant to the response variable based on your definition here, which makes the discussion of “relevance” redundant.

11. Briggs says:

JH,

Ah, Williamson and his “relata”. I do not agree with Williamson, as I explain in Uncertainty. All our left hand sides are propositions, as said from the beginning, and in Uncertainty. Keynes and I are identical with relevance. See also Stove, the second half of The Rationality of Induction.

How you brought Trump into this is curious, but (wait for it) irrelevant.

12. JH says:

Are your mysterious “old obs” propositional evidence? Who knows! On what do you not agree with Williamson? His usage of the word relata? Or the fact that you cannot find logical relations in the observed DATA and therefore cannot logically deduce the probabilities, besides adopting the principle of indifference?

Change Trump to Obama or Bush if you wish. The point is that what you have presented here can lead to ridiculous decisions. All methods have their downfalls. If not, no more academic statisticians would be needed.

14. Briggs says:

All,

JH’s is a great comment because it highlights the fundamental and indeed irreconcilable difference in interpretations of probability.

I say all probability is like logic, a relation between propositions. These propositions can be anything, just like in logic, and can even be measurements. This philosophy is in the long tradition of Laplace, Keynes, Jeffreys, De Finetti, Jaynes, Franklin, Stove, and many others. This approach is not taught in statistics or math departments (all academic departments become insular), hence it is unfamiliar to most statisticians. It was to me, and I managed to come out of a pretty good program having heard next to nothing about it. Uncertainty (toot! toot!) describes the philosophy in full.

We can contrast this to other approaches, all of which focus on the math so much that they all, when it comes time to apply that math, engage in the Deadly Sin of Reification. Take the comment “you cannot find logical relations in observed DATA”. The sentiment behind this sees data, i.e. measurements on observables, as somehow alive, and possessing probability, as if it were a real thing. “DATA” is somehow unique, and removed from the dull world of propositions, because all “DATA” is amenable to math, whereas not all propositions are.

Well, I don’t have space here to (again) show why this view of probability is false, but it is. I’ve done it dozens of times, as have the authors I cited.

What I can do here is to ask those who hold a reification view of probability to go back and review the old material, or Uncertainty, or indeed any of the authors cited above. Here are two excellent non-Briggs references showing the main reification view (relative frequency) is false.

Mises redux—Redux: Fifteen arguments against finite frequentism, Alan Hajek, Erkenntnis, 209–227, 45, 1997.

Fifteen Arguments Against Hypothetical Frequentism, Alan Hajek, Erkenntnis, 211–235, 70, 2009.

15. JH says:

I see. So when I made the statement that you can or cannot find logic (or a pattern) in data (or in the sequence of numbers {2,4,6,8,10,…}), I also emit the sentiment that data (or the sequence of numbers) is somehow alive and unique, and this and that, and therefore this and that.

Magic me. No more comments on the curious emission of the strange sentiment, but I will repeat my questions.

1. On what do you disagree with Williamson, who, unlike you, has been trained in graduate school, and has expertise on causality, probability, logic and applications of formal reasoning within science and mathematics?

2. How do you deductively or logically assign probabilities to statements involving/conditioning on observed data, going beyond toy examples and theoretical examples and the principle of indifference? By enforcing assumptions? May I suggest that you use the CGPA data?

Data Science!

16. JH says:

Have you noticed that in Keynes’s book, A Treatise on Probability, there are no real-life (no rants about my using this term, please) data? He does not demonstrate in his book HOW data yield whatever probability statements or premises or propositions or whatnot are required for his arguments. Here are some typical roles of “data” in his book:

“When data cooperate as evidence in favor of a proposition…”
“If our initial data are of such a character…”

When. If. Assume.

How and why does the discussion turn to an attack on frequentist methods? I am sorry that I don’t want to spend time finding papers to support them or to attack logical probability or whatever. I am reminded of the book by Hajek et al. (2016), The Oxford Handbook of Probability and Philosophy (Oxford Handbooks). It contains updated research on probability and causation (and of course, objective literature reviews too) by various philosophers, including Williamson, in this area. I highly recommend it.

17. Briggs says:

JH,

“When data cooperate as evidence in favor of a proposition…” Exactly so, exactly so. And he does, of course, demonstrate how observations as premises modify probability conclusions. Just as I do in Uncertainty, and Jaynes does, and all the other authors I mention do, and indeed every Bayesian, even (e.g. posteriors etc. etc.).

My powers of convincing everybody of this are admittedly limited. But do let us know when you’ve had time to read the Hajek papers I’ve cited. I’ve read the entry you suggested and see nothing in it proving our position wrong. Hajek does in a fair manner quote a range of other authors, like Williamson, whose positions are wrong in some way. The entries of Hajek’s I cite explain why we must abandon limiting relative frequency.

18. JH says:

Of course, observations as premise modify probability outcome. Let’s not go around and around by repeating this or “probability is conditional.”

My question is how to assign probabilities using logical probability in statistical modeling (i.e., modelling data) to start with. “Modify” and “assign” are two different verbs.

You mean you have read the thick book before, or you finished reading it in less than 4 hours? In what paper did Hajek quote Williamson?

You know, you can show that Williamson is wrong in your practical case study, as promised for your online data science class.

Hajek’s 15 arguments against frequentism – My first thought is that in practice, relative frequencies/empirical distributions are used to approximate unknown probabilities, which is not what he is talking about. It is clarified by him, as stated in the paper,

The thesis before us is not that probability is approximately relative frequency, but that it is relative frequency. We have an identification of probability with relative frequency. Of course, it implies that we can approximate probability values as closely as we like with relative frequency values — anything approximates itself as closely as we like! — but it is a much stronger claim. The point about approximation might be appropriate in justifying relative frequentism as good methodology for discovering probabilities; but our topic is the analysis of probability, not its methodology.

19. Briggs says:

JH,

Oh, I see! First Hajek. Yep. That’s the question. Is probability defined as, identified as, limiting relative frequency? The answer is no.

Second, assign vs. modify. I should have caught this. In logic, there is no difference between the two. Both have the same schema:

p
_
q

or

Pr(q|p)

which is to say, a list of premises p and a proposition of interest q, from which its probability is deduced given p. You can call this the assignment. The deduction, as I have said often, does not always result in a number. If we add to p new premises (say, observations), then we have a new argument, as it were, and we re-deduce the probability of q. You can call this modification, if you like. But it’s all the same logical operation.

I think we can lay the blame for the assign-modify division to the subjective Bayesians, who are always saying they are modifying probabilities they assigned.
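One way to see that “assign” and “modify” are the same operation is a small sketch using Laplace’s rule of succession. The choice of this rule, the uniform-prior premise behind it, and the name `pr_next_success` are all assumptions made for illustration, not the method used in this series.

```python
from fractions import Fraction

def pr_next_success(successes: int = 0, trials: int = 0) -> Fraction:
    """Pr(next trial succeeds | premises), where the premises are a uniform
    prior on the unknown chance plus `successes` observed in `trials`."""
    return Fraction(successes + 1, trials + 2)

print(pr_next_success())       # 1/2, premises without observations
print(pr_next_success(7, 10))  # 2/3, same deduction with observations added
```

The first call could be called an assignment and the second a modification, yet both run the identical deduction. Only the list of premises differs.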

20. Bob Allan says:

Thank you for a very interesting post and series.

I’m trying to understand whether a method such as A/B testing (i.e., in the context of websites) falls into the set of methods which you criticize/describe as inaccurate/incorrect/would mostly lead to wrong conclusions (sorry, not sure how to phrase that). And if so can you please try to explain why? (so far the material in the posts has focused on observational studies, and not ‘active treatments’)

To my understanding, and trying to critically apply what you have taught, A/B testing:
* uses randomization to create a control cohort – I understand that that randomization is not truly random and far from perfect, but doesn’t it guarantee that participants and their various causal backgrounds will be balanced across cohorts sufficiently for this purpose? (again assuming the randomization is ‘sophisticated’ enough, the experiment is being run for long enough, and there are sufficient sample sizes in each cohort)

* intervenes only on the treatment cohort, which guarantees that any significant difference between the measured effects in control vs. treatment will be due (causally) to our treatment (again, there are plenty of assumptions around ‘significant difference’, how it is measured, sample sizes, p-values, etc., but aren’t there settings that would satisfy this in this use case?)

Again, apologies for how this question is phrased, would appreciate if you could try to answer.

(does the book cover this particular scenario in more depth? I’ve noticed it mentions randomized clinical trials but are they the same as web experiments? the major difference I see in web experiments is the scale – number of participants…)