
December 13, 2017

What Is The Probability Of COVFEFE?

Via a tweet from Taleb, who informs us that the following question is part of an Indian Statistical Institute examination.

(5) Mr.Trump decides to post a random message on Facebook and he starts typing a random sequence of letters {Uk}k≥1 such that they are chosen independently and uniformly from the 26 possible english alphabets. Find out the expected time of the first appearance of the word COVFEFE.

Now it is too good to check whether this is really used by the ISI, but I hope it is. It is too delicious. (Yes, it was Twitter, not Facebook.)

Regular readers will recall we had a Covfefe Sing-along after Trump’s masterly tweet.

The night Donald Trump took to Twitter
Elites had a terrible fit
Trump warned the world of covfefe
And Tweet streams were filled up with sh—

—Shaving cream.
Be nice and clean.
Shave everyday and
you’ll always look keen…

The ISI’s COVFEFE problem has much to recommend it, because it is chock full of the language of modern probability that is so confusing. (Even my title misleads! Nothing “has” a probability!)

Now I learned my math from physicists, who do things to equations that make mathematicians shudder, but which are moves that are at least an attempt to hew to reality. There isn’t anything wrong with mathematician math, but the temptation to the Deadly Sin of Reification can be overwhelming. And why all those curly brackets? They intimidate.

I still recall, in a math course, struggling with some higher-order proofs from Billingsley (a standard work on mathematical probability) when a Russian mathematician made everything snap into clarity. He told me that X, the standard notation for a “random variable” which all the books said “had” a distribution, “was a function”, whereas as a physicist I always saw it as an observable or proposition. It can, of course, be both, but if you ever want to apply the math, it is a proposition.

So here is Trump typing. What does it mean—think like a physicist and not a mathematician—to “independently and uniformly” choose letters? To choose requires a method of choosing. Some thing or things are causing the characters to appear on the screen. What? Trump closing his eyes and smacking his hands into the keys? Maybe. But, if so, then we have no hope of identifying the causes of what appears. If we don’t know the causes, we can’t answer how long it will take. We can’t solve the problem.

Enter probability, which can’t answer the question, but can answer similar ones, like “Given certain assumptions, what are the chances it takes X seconds?”

Since all probability is conditional on the assumptions made, the assumptions matter. What are they?

Choosing letters “independently” is causal language. “Uniformly” insists the probability of every letter being typed is equal, a circular definition, since what we want to know is the probability. Say instead “There are 26 letters, one of which must be typed once per time unit t, where knowledge of the letters typed previously tells us nothing about letters to be typed.”

Since COVFEFE (we’re working with all caps via the information given) is 7 letters, we want to characterize the uncertainty in the total time it takes to type this sequence.
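One deduction follows immediately from these assumptions: the probability that any particular run of 7 consecutive keystrokes spells the word is

     Pr(U_k U_{k+1} … U_{k+6} = COVFEFE | assumptions) = (1/26)^7 ≈ 1.25 x 10^-10,

i.e. one chance in 26^7 = 8,031,810,176.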

Do we have all we need? Not quite. Again, think like a physicist and not a mathematician. How long is Trump going to sit at the computer? (Or play with his Portable Thinking Suppression Device (PTSD)?) It can’t be forever. That means there should be a chance we never see COVFEFE. On the other hand, if we assume Trump types forever, then it is obvious that not only must COVFEFE appear, but it must appear an infinite number of times!

Indeed, if we allow the mathematical possibility of eternal typing, not only will COVFEFE appear in infinite plenitude, Trump will also type the entire works of Shakespeare, not just once, but also an infinite number of times. And the entire corpus of all works that can be typed in 26 letters sans spacing. Trump’s a genius!

Well that escalated quickly. That’s because The Limit is a bizarre place. Our intuition breaks down.

We still have to decide how fast Trump can type. Maybe two to five letters per second, but not faster than that. But that’s the physicist in me speaking. Keyboards and fingers can’t be engineered for infinitely fast typing. A mathematician might allow one character per infinitesimal time unit. If so, we have another infinity that has crept in. If one infinity was weird, try mixing two.

Point is, since probability needs assumptions, we need to make all of them explicit. The problem doesn’t do that. We have to bring our knowledge of English grammar to bear, which we always do, and which is part of the conditions. It will be no surprise that people can come to different answers.

Homework: Assume finite time in which to type, and discrete positive real time to type each letter; assume also the simple character proposition I gave and then calculate the probability of COVFEFE at t = 0, 1, 2, … n typing time units (notice this adds the assumption that letters come regularly with no variation, another mathematical, non-physical assumption). And then calculate the first appearance by t = 0, 1, 2, … n. Then calculate the expected value (is it even interesting?). After you have that, what happens as n goes to infinity? (Is that even interesting?) And can you also have the time unit decrease to the infinitesimal?

Hint. The probability of seeing COVFEFE and not seeing COVFEFE must sum to 1. If n = 1, the (conditional on all these assumptions) probability of COVFEFE is 0, and not-COVFEFE is 1. Same with n = 2, 3, 4, 5, and 6. What about n = 7? And so on?
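For readers who want to experiment, here is a minimal R sketch of the homework bookkeeping (my illustration, not part of the original problem): track how much of COVFEFE has just been typed as a Markov chain over matched-prefix lengths 0 through 7, with each letter carrying probability 1/26 per the assumptions.

# states 0..7 = length of the longest suffix of what has been typed
# that is a prefix of COVFEFE; state 7 means the word has appeared
pattern = strsplit("COVFEFE", "")[[1]]
m = length(pattern)

P = matrix(0, m + 1, m + 1)
for (s in 0:(m - 1)) {
  for (ch in LETTERS) {
    str = c(pattern[seq_len(s)], ch)  # matched prefix plus the new letter
    k = min(m, length(str))
    while (k > 0 && !identical(tail(str, k), pattern[seq_len(k)])) k = k - 1
    P[s + 1, k + 1] = P[s + 1, k + 1] + 1/26
  }
}
P[m + 1, m + 1] = 1  # absorbing: once typed, COVFEFE has appeared

# Pr(COVFEFE has appeared by keystroke n), conditional on the assumptions
prob.by.n = function(n) {
  v = c(1, rep(0, m))
  for (i in seq_len(n)) v = v %*% P
  v[m + 1]
}
prob.by.n(6)  # 0, as the hint says
prob.by.n(7)  # (1/26)^7

# expected keystrokes to first appearance: solve (I - Q) E = 1
Q = P[1:m, 1:m]                    # transient states only
E = solve(diag(m) - Q, rep(1, m))  # expected steps from each state
E[1]                               # 26^7 = 8,031,810,176 keystrokes

Since COVFEFE has no proper prefix that is also a suffix, the expected first-appearance time works out to exactly 26^7 keystrokes; at two to five letters per second, that is roughly 50 to 125 years of nonstop typing.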

December 12, 2017

Free Data Science Class: Predictive Case Study 1, Part V


This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

We have all we need if we want to characterize our uncertainty in future CGPAs given only the grading rules, the old observations, and the simple math notions of the multinomial model. I.e., this:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Given a set of observations of the number in each bucket, we can predict the probability of a new person having each possible CGPA. We are using the data found in Uncertainty, which is in a CSV file here.

CGPA comes with too much precision (I found the file online many, many years ago, and cannot rediscover its origins), with measurements to the hundredth place. It is therefore useful to have a function that rounds to the nearest specified decisionable fraction. I modified this roundTo function to do the job (we're using R, obviously).

roundTo <- function(y, num, down = FALSE) {
    resto = y%%num
    # to round down use '<='
    if(down){
        i = which(resto <= (num/2))
    } else {
        i = which(resto < (num/2))
    }
    # if you don't think you need binary subtract, try these,
    # which should both give 0; try other numbers in (0,1)
    # a=.88; a + 1 - a%%1 - 1
    # a=.89; a + 1 - a%%1 - 1
    y = y + `-`(num , resto)
    if(length(i)) y[i] = `-`(y[i] , num)
    return(y)
}

The reason for the back ticks is given in the comments. Since we're classifying into buckets, floating point math can make buckets which should be 0 into something like 10^-16, which is not 0, and which is also not interesting. Use of the binary subtract function fixes this. If you don't understand the code, don't worry about it, just use it.

Read the data into R (put your path to the csv file into path):

# path = 'C:/mydrive/mypath/' #windows; note direction of slashes
# path = '/home/me/mypath/' # unix, mac
x = read.csv(paste(path,'cgpa.csv',sep=''))

Then apply our function:
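Something like this (a sketch; I'm assuming the CSV's grade column is named cgpa, so adjust to the file's actual header):

tab = table(roundTo(x$cgpa, 1))  # bucket CGPA to the nearest whole point
tab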


 0  1  2  3  4 
 4 17 59 16  4 

I'm mixing code and output here, but you should be able to get the idea. There are n = 100 observations, most of which are CGPA = 2. The model (5) in code (mpp for multinomial posterior predictive):

mpp <- function(x, nd = 3){
  # nd = number of significant digits
  # flat Dirichlet prior (all parameters 1): posterior predictive
  x = (1 + x)/(sum(x) + dim(x))
  return(signif(x, nd))
}

This is model (5) in all its glory! Note that this is a bare-bones function. All code in this class is for illustration only, for ease of reading; nothing is optimized. This code does no error checking, doesn't handle missing values; it only spits out the answer given a table as input, like this (the signif rounds to significant digits):
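For instance, feeding it the whole-point bucket counts from above:

mpp(tab)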


     0      1      2      3      4 
0.0476 0.1710 0.5710 0.1620 0.0476 

Notice there is less than a 59% chance of a new CGPA = 2, but more than a 4/100 chance of a new CGPA = 4. The future is less certain than the past! Suppose we wanted finer gradations of CGPA, say to the nearest 0.5:
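The same two steps again (still assuming the cgpa column name):

tab5 = table(roundTo(x$cgpa, 0.5))  # nearest half point
tab5
mpp(tab5)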


  0 0.5   1 1.5   2 2.5   3 3.5   4 
  2   3   8  21  33  20   7   4   2  


     0    0.5      1    1.5      2    2.5      3    3.5      4 
0.0275 0.0367 0.0826 0.2020 0.3120 0.1930 0.0734 0.0459 0.0275 

Play with other values of num in roundTo(). We're done, really, with what we can do with (5), except, of course, for checking it on real new measurements. Which I don't have. And which brings up an important point.

The point of the predictive method is to make testable predictions, which we have just done. But we can't test them until we get new measurements. Yes, we can and will check the old data as if it were new, but this is always cheating, because as everybody does or should know, it is always possible to derive a model which fits data arbitrarily well. Schemes which split data into "training" and "testing" sets cheat too if they ever in any way use the results of the testing data to tweak the model. That is just using all the data in fitting/training. Though there are attempts and supposed techniques to reuse data, the only way to assess the performance of any model is to compare it against data that has never before been seen (by the model).

Model (5) can't be pushed further. But we do have other formal, measured information at hand, about which more in a moment. Of informal, non-quantifiable evidence, we are loaded. We can easily do this:

    (6) Pr(CGPA = 4 | grading rules, old observation, fixed math notions, E),

where E is a joint proposition carrying what you know about CGPA; things like, say, majors, schools, age, etc. Things which are not formally measured and may even be unmeasurable. After all, to what schools, times, places, people does (5) apply? Pay attention: this is the big question! By itself it says all schools, all times, all places, all peoples---as long as they conform to the formal grading rules.

Pause and consider this. (5) is universal. If the old observations came from, say, Sacred Heart Institute of Technology and we knew that, which we don't (recall I found this data maybe twenty years ago from a place only the Lord knows), then we might insist E = "The school is Sacred Heart only". Or E = "The school is like Sacred Heart." "Like" is not quantifiable, and will differ widely in conception between people. And that means (6) will be different for each different E. Each conception gives a different model!

Again, this is not a bug; it is a feature.

Notice that (6) is not (5), a trivial point, perhaps, but one that can be forgotten if it is believed there is a "true" model out there somewhere, where "true" is used in the sense that probability is real or that we can identify cause. We've already discussed this, so if you don't have it clear in your mind, review!

Next time we introduce SAT, HGPA, and time spent studying, and see what we can do with these formal measurements.

Homework: Using (5) and the data at hand, suppose there are n = 20 new students. What can you say about the predictions of the numbers of new students having CGPA = 4, etc.?
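One way to start on this (a sketch, not the full answer; it assumes the whole-point bucket counts are in tab as above): simulate. Draw bucket probabilities from the posterior implied by (5) with the flat Dirichlet, then draw the 20 new students.

# predictive for 20 new students under (5): Dirichlet draw via
# normalized gammas (flat prior: shape = counts + 1), then multinomial
B = 5000
sims = matrix(0, B, length(tab))
for (i in 1:B) {
  p = rgamma(length(tab), shape = as.numeric(tab) + 1)
  p = p / sum(p)
  sims[i, ] = rmultinom(1, 20, p)
}
colnames(sims) = names(tab)
table(sims[, "4"]) / B  # predictive frequencies for the number of new CGPA = 4 students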

December 5, 2017

Free Data Science Class: Predictive Case Study 1, Part IV


Code coming next week!

Last time we decided to put ourselves in the mind of a dean and ask for the chance of CGPA falling into one of these buckets: 0, 1, 2, 3, 4. We started with a simple model to characterize our uncertainty in future CGPAs, which was this:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Now the “fixed math notions” means, in this case, that we use a parameterized multinomial probability distribution (look it up anywhere). This model, via the introduction of non-observable, non-real parameters (little bits of math necessary for the equations to work out), gives the probability of belonging to each of the buckets, of which in this case there are 5: 0-4.

The parameters themselves are the focus of traditional statistical practice, in both its frequentist and Bayesian flavors. This misplaced concentration came about for at least two reasons: (a) the false belief that probabilities are real and thus so are parameters, at least “at” infinity and (b) the mistaking of knowledge of the parameters for knowledge of observables. The math for parameters (at infinity) is also easier than looking at observables. Probability does not exist, and (of course) we now know knowledge of the parameters is not knowledge of observables. We’ll bypass all of this and keep our vision fixed on what is of real interest.

Machine learning (and AI etc.) have parameters for their models, too, for the most part, but these are usually hidden away and observables are primary. This is a good thing, except that the ML community (we’ll lump all non-statistical probability and “fuzzy” and AI modelers into the ML camp) created for themselves new errors in philosophy. We’ll start discussing these this time.

Our “fixed math notions” are assumptions we made, but with good reason; they include selecting a “prior” on the parameters of the model. We chose the Dirichlet; many others are possible. Our notions also selected the model. Thus, as is made clear in the notation, (5) is dependent on the notions. Change them, change the answer to (5). But so what? If we change the grading rules we also change the probability. Changing the old observations also changes the probability.
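To make the notions concrete, here is the standard multinomial-Dirichlet result for our case (a sketch, with the flat Dirichlet, all its parameters 1, which is what the mpp code in Part V computes). If x_j old observations fell into bucket j, with 5 buckets and n observations in total, then (5) works out to

     Pr(CGPA in bucket j | grading rules, old observations, fixed math notions) = (x_j + 1)/(n + 5).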

There is an enormous amount of hand-wringing about the priors portion of the notions. Some of the concern is with getting the math right, which is fine. But much is because it is felt there are “correct” priors somewhere out there, usually living at Infinity, and if there are right ones we can worry we might not have found the right ones. There are also many complaints that (5) is reliant on the prior. But (5) is always reliant on the model, too, though few are concerned with that. (5) is dependent on everything we stick on the right hand side, including non-quantifiable evidence, as we saw last time. That (5) changes when we change our notions is not a bug, it is a feature.

The thought among both the statistical and ML communities is that a “correct” model exists, if only we can find it. Yet this is almost never true, except in those rare cases where we deduce the model (as is done in Uncertainty for a simple case). Even deduced models begin with simpler knowns or assumptions. Any time we use a parameterized model (or any ML model) we are making more or less ad hoc assumptions. Parameters always imply lurking infinities, either in measurement clarity or numbers of observations, infinities which will always be lacking in real life.

Let’s be clear: every model is conditional on the assumptions we make. If we knew the causes of the observable (here CGPA; review Part I) we could deduce the model, which would supply extreme probabilities, i.e. 0s and 1s. But since we cannot know the causes of grade points, we can instead opt for correlation models, as statistical and ML models are (any really complex model may have causal elements, such as in physics etc., but these won’t be completely causal and thus will be correlational in output).

This does not mean that our models are wrong. A wrong model would always misclassify and never correctly classify, and it would do so intentionally, as it were. This wrong model would be a causal model, too, only it would purposely lie about the causes in at least some instances.

No, our models are correlational, almost always, and therefore can’t be wrong in the sense just mentioned; neither can they be right in the causal sense. They can, however, be useful.

The conceit of statistical modelers is that, once we have in hand correlates to our observables, which in this case will be SAT scores and high school GPAs, if we make our sample size large enough, we’ll know exactly how SAT and HGPA “influence” CGPA. This is false. At best, we’ll sharpen our predictive probabilities to some extent, but we’ll hit a wall and will go no further. This is because SAT scores do not cause CGPAs. We may know all there is to know about some non-interesting parameters inside some ad hoc model, but this certainty will not transfer to the observables, which may be as murky as ever. If this doesn’t make sense, the examples will clarify it.

The similar conceit of the ML crowd is that if only the proper quantity of correlates are measured, and (as with the statisticians) measured in sufficient number, all classification mistakes will disappear. This is false, too. Because unless we can measure all the causes of each and every person’s CGPA, the model will err in the sense that it will not produce extreme probabilities. Perfect classification is a chimera—and a sales pitch.

Just think: we already know we cannot know all the precise causes of many contingent events. For example, quantum measurements. No parameterized model nor the most sophisticated ML/AI/deep learning algorithm in the world, taking all known measurements as input, will classify better than the simple physics model. Perfection is not ours to have.

Next time we finally—finally!—get to the data. But remember we were in no hurry, and that we purposely emphasized the meaning and interpretation of models because there is so much misinformation here, and because these are, in the end, the only parts that matter.

December 1, 2017

Parameters Aren’t What You Think

I was asked to comment on a post by Dan Simpson exploring the Bernstein-von Mises theorem.

This post fits in with the Data Science class, a happy coincidence, and has been so categorized.

A warning. Do not click on the link to a video by Diamanda Galas. I did. It was so hellishly godawful that I have already scheduled a visit with my surgeon to have my ear drums removed so that not even by accident will I have to listen to this woman again.

Now the Bernstein-von Mises theorem says, loosely and with a list of caveats given by Simpson, that for a parameterized probability model and a given prior, the posterior on the parameter converges (in probability) to a multivariate normal with a covariance matrix that is an inverse function of n and of the Fisher Information Matrix, centered around the “true” parameter.

It doesn’t matter here about the mathematical details. The rough idea is that, regardless of the prior used but supposing the caveats are met, the uncertainty in the parameter becomes like the uncertainty a frequentist would assess of the parameter. Meaning Bayesians shouldn’t feel too apologetic around frequentists frightened that priors convey information. It’s all one big happy family out at The Limit.

There is no Limit, though. It doesn’t exist for actual measures.

You know what else doesn’t exist? Probability. Things do not “have” probabilities or probability distributions (we should never say “This has a normal distribution”). It is only our uncertainty that can be characterized using probability (we should say “Our uncertainty in this is quantified by a normal”). And also non-existent, as a deduction from that truth, are parameters. Since parameters don’t exist, there can’t be a “true” value of them. Yet parameters are everywhere used and they are (or seem to be) useful. So what’s going on?

Recall our probabilistic golden rule: all probability is conditional on the assumptions made, believed, deduced or measured.

Probability can often be deduced. The ubiquitous urn models are good examples. In an urn are n_0 0s and n_1 1s. Given this information (and knowledge of English grammar and logic), the chance of drawing a 1 is n_1/(n_1+n_0). In notation:

     (1) Pr(1 | D) = n_1/(n_1+n_0),

where D is a joint proposition containing the information just noted (the deduction is sound and is based on the symmetry of logical constants; there is no need to talk of drawing mechanisms, randomness, fairness, or whatever; see Uncertainty for details).

If all (take the word seriously) we know of the urn are its constituents, we have all the probability we need in (1). We are done. Oh, we can also deduce answers to questions like, “What are the chances of seeing 7 1s given we have already taken such-and-such from the urn.” But the key is that all is deduced.

So what if we don’t know how many 0s and 1s there are, but we still want:

     (2) Pr(1 | U) = ?,

where U means we know there are 1s and 0s but we don’t know the proportion (plus, as usual and forever, we also in U know grammar, logic, etc.). Well, it turns out the answer is still deducible as long as we assume a value n = n_1 + n_0 exists. We don’t even need to know it, really; we just need to assume it is less than infinity. Which it will be. No urn contains an infinite number of anything. Intuitively, since we have no information on n_1 or n_0, except that they must be finite, we can solve (2). The answer is 1/2. (Take a googol to the googol-th power a googol times; this number will be finite and bigger than you ever need.)
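A sketch of that deduction: give each possible split (n_1, n_0), with n_1 = 0, 1, …, n, equal weight, per the symmetry just mentioned. Then

     Pr(1 | U) = [1/(n+1)] x [0/n + 1/n + … + n/n] = 1/2

for any finite n, because the average of 0/n through n/n is always 1/2.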

As above, we can ask any kind of question, e.g. “Given I’ve removed 18 1s and 12 0s, and I next grab out 6 balls, what are the chance at least 3 will be 1s?” The answer is deducible; no parameter is needed.
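To illustrate that deducibility in R (my sketch; under the uniform-over-splits deduction the predictive for further draws takes the beta-binomial form, with a = 18 + 1 and b = 12 + 1, the same form the “flat prior” discussed below produces):

# chance at least 3 of the next 6 balls are 1s, having already
# removed 18 1s and 12 0s; predictive is beta-binomial(a, b)
a = 18 + 1; b = 12 + 1
pr.k = function(k) choose(6, k) * beta(a + k, b + 6 - k) / beta(a, b)
sum(sapply(3:6, pr.k))  # about 0.79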

So what if we don’t know n? It turns out not to be too important after all. We can still deduce all the probabilities we want, as long as n is finite. What if n is infinite, though? Well, it can’t be. But what if we assume it is?

We have left physical reality and entered the land of math. We could have solved the problem above for any value of n short of infinity, and since we can let n be very large indeed, this is no limitation whatsoever. Still, as Simpson rightly says, asymptotic math is much easier than finite, so that if we’re willing to close our eyes to the problem of infinitely sized urns, maybe we can make our lives computationally easier.

Population increase

Statistics revolves around the idea of “samples” taken from “populations”. Above, when n was finite, our population was finite, and we could deduce probabilities for the remaining members of the population given we’ve removed so much sample. (Survey statistics is careful about this.)

But if we assume an infinite population, no matter how big a sample we remove, we always have an infinite number left. The deductions we produced above won’t work. But we still want to do probability—without the hard finite population math (and it is harder). So we can try this:

     (3) Pr(1 | Inf) = θ,

where Inf indicates we have an infinite urn and θ is a parameter. The value of this parameter is unknown. What exactly is this θ? It isn’t a probability in any ontological sense, since probabilities don’t exist. It’s not a physical measure as n_1/(n_1+n_0) was, because we don’t know what n_1 and n_0 are except that there are an infinite number of each of them and, anyway, we can’t divide infinities so glibly. (The probability in (1) is not the right hand side; it is the left hand side. The right hand side is just a number!)

The answer is that θ isn’t anything. It’s just a parameter, a placeholder. It’s a blank spot waiting to be filled. We cannot provide any answers to (3) (or questions like those above based on it) until we make some kind of statement about θ. If you have understood this last sentence, you have understood all. We are stuck, the problem is at a dead end. There is nowhere to go. If somebody asks, “Given Inf, what is the probability of a 1?” all you can say is “I do not know” because you understand saying “θ” is saying nothing.

Bayesians of course know they have to make some kind of statement about θ, or the problem stops. But there is no information about θ to be had. In the finite-population case, we were able to deduce the probability because we knew n_1 could equal 0, 1, …, n, with the corresponding adjustments made to n_0, i.e. n, n-1, …, 0. No combination (this can be made more rigorous) was privileged over any other, and the deduction followed. But when the population is infinite, it is not at all clear how to specify the breakdowns of n_1s and n_0s in the infinite urn; indeed, there are an infinite number of ways to do this. Infinities are weird!

The only possible way out of this problem is to do what the serial writer of old did: with a mighty leap, Jack was free of the pit! An ad hoc judgment is made. The Bayesian simply makes up a guess about θ and places it in (3). Or not quite, but something like that would work, and would give us

     (4) Pr(1 | Inf; θ = 0.5) = θ (= 0.5).

Hey, why not? If probability is subjective, which it isn’t, then probability can equal anything you feel. Feelings…whoa-oh-a feelings.

No, what the Bayesian does is invoke outside evidence, call it E, which sounds more or less scientific or mathematical, and information about θ, now called the prior, is given. The problem is then solved, or rather it is solvable. But it’s almost never solved.

The posterior is not the end

Having become so fascinated by θ, the statistician cannot stop thinking of it, and so after some data is taken, he updates his belief about θ and produces the posterior. That’s where we came in: at the end.

This posterior will, given a sufficient sample and some other caveats, look like the frequentist point estimate and its confidence interval. Frequentists are not only big believers in infinity, they insist on it. No probability can be defined in frequentist theory unless infinite samples are available. Never mind. (Frequentism always fails in finite reality.)

You know what happens next? Nothing.

We have the posterior in hand, but so what? Does that say anything about (3)? No. (3) was what we wanted all along, but we forgot about it! In the rush to do the lovely (and it is) math about priors and posteriors we mislaid our question. Instead, we speak solely about the posterior (or point estimate). How embarrassing. (Come back Monday for more on this subject.)

Well, not all Bayesians forget. Some take the posterior and use it to produce the answer to (3), or rather to a modification of (3), which is called the posterior predictive distribution.

     (5) Pr(1 | Inf; belief about θ) = some number.

Here is the funny part, at least for this problem. If we say, as many Bayesians do say, that θ is equally likely to be any number between 0 and 1 (a “flat” prior), then the posterior predictive distribution is exactly the same as the answer for (1).
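Concretely: with the “flat” prior, after seeing k 1s in m draws the posterior predictive is

     Pr(1 | k 1s in m draws; Inf; flat prior) = (k + 1)/(m + 2),

which is Laplace’s old rule of succession, and which is just what the finite deduction gives.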

That’s looking at it the wrong way around, though. What happens is that if you take (1) (take it mathematically, I mean) and let n go to infinity in a straightforward way, you get the posterior predictive distribution of (3) (but only with a “flat” prior).

So, at least in this case, we needn’t have gone to the bother of assuming an infinite urn, since we had the right answer before. Other problems are more complex, and insufficient attention has been paid to the finite math, so we don’t have answers in every problem. Besides, it’s easier to assume an infinite, parameter-based model and work out that math.


Assuming there is an infinite population, not only of 1s and 0s, but for any statistical problem, is what leads to the false belief that “true” values of parameters exist. This is why people will say “X has a distribution”. Since folks believe true values of parameters exist, they want to be careful to guess what they might be. That’s where the frequentist-Bayesian interpretation wars enter. Even Bayesians joust with each other over their differing ad hoc priors.

It should be obvious that, just as assuming a model changes the probability of the observable of interest (like balls in urns), so does changing the prior for a fixed model change the probability of the observable. Of course it does! And should. Because all probability is conditional on the assumptions made; our golden rule. Change the assumptions, change the probability.

There is no sense whatsoever in which a “noninformative” prior can exist. All priors by design convey information. To say the influence of the prior should be unfelt is like saying there should be married bachelors. It makes no logical sense. There isn’t even any sense in which a prior can be “minimally” informative. To be minimally informative is to keep utterly quiet and say nothing about the parameter.

If there is any sense in which a “correct” prior exists, or a “correct” model for that matter, it is in the finite-deducible sense. We start with an observable that has known finite and discrete measurement qualities, as all real observables do, and we deduce the probabilities from there. We then imagine we have an infinite population, as an augmentation of finite reality, and we let the sample go to infinity. This will give an implied prior and posterior and predictive distribution which we can compare against the correct finite-sample answer.

But if we had the correct finite sample answer, why use the infinite approximation? Good question. The only answer is computational ease. Good answer, too.

The answer

Even though it might not look it, this little essay is in answer to Simpson. I’m answering the meta-question behind the details of the Bernstein-von Mises theorem, the math of which nobody disputes. As always, it’s the interpretation that matters. In this case, we can invert the BvM theorem and use it to show how far wrong frequentist point estimates are. After all, frequentist theory can be seen as the infinite-approximation method to Bayesian problems, which themselves, when using parameters, are infinite-population approximations to finite reality. Frequentist methods are therefore a double approximation, which is another reason they tend to produce so much over-certainty.

What I haven’t talked about, and what there isn’t space for, are these so-called infinite dimensional models, where there are an infinity of parameters. I’ll just repeat: infinity is weird.