
Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

December 13, 2017 | 7 Comments

What Is The Probability Of COVFEFE?

From a tweet from Taleb, who informs us the following question is part of the Indian Statistical Institute examination.

(5) Mr.Trump decides to post a random message on Facebook and he starts typing a random sequence of letters {U_k}_{k≥1} such that they are chosen independently and uniformly from the 26 possible english alphabets. Find out the expected time of the first appearance of the word COVFEFE.

Now it is too good to check whether this is really used by the ISI, but I hope it is. It is too delicious. (Yes, it was Twitter, not Facebook.)

Regular readers will recall we had a Covfefe Sing-along after Trump’s masterly tweet.

The night Donald Trump took to Twitter
Elites had a terrible fit
Trump warned the world of covfefe
And Tweet streams were filled up with sh—

—Shaving cream.
Be nice and clean.
Shave everyday and
you’ll always look keen…

The ISI’s COVFEFE problem has much to recommend it, because it is chock full of the language of modern probability that is so confusing. (Even my title misleads! Nothing “has” a probability!)

Now I learned my math from physicists, who do things to equations that make mathematicians shudder, but which are moves that are at least an attempt to hew to reality. There isn’t anything wrong with mathematician math, but the temptation to the Deadly Sin of Reification can be overwhelming. And why all those curly brackets? They intimidate.

I still recall in a math course struggling with some higher-order proofs from Billingsley (a standard work on mathematical probability) when a Russian mathematician made everything snap into clarity when he told me X, the standard notation for a “random variable” which all the books said “had” a distribution, “was a function”, whereas as a physicist I always saw it as an observable or proposition. It can, of course, be both, but if you ever want to apply the math, it is a proposition.

So here is Trump typing. What does it mean—think like a physicist and not a mathematician—to “independently and uniformly” choose letters? To choose requires a method of choosing. Some thing or things are causing the characters to appear on the screen. What? Trump closing his eyes and smacking his hands into the keys? Maybe. But, if so, then we have no hope of identifying the causes of what appears. If we don’t know the causes, we can’t answer how long it will take. We can’t solve the problem.

Enter probability, which can’t answer the question, but can answer similar ones, like “Given certain assumptions, what are the chances it takes X seconds?”

Since all probability is conditional on the assumptions made, the assumptions matter. What are they?

Choosing letters “independently” is causal language. “Uniformly” insists the probability of every letter being typed is equal, a circular definition, since what we want to know is the probability. Say instead “There are 26 letters, one of which must be typed once per time unit t, where knowledge of the letters typed previously tells us nothing about letters to be typed.”

Since COVFEFE (we’re working with all caps via the information given) is 7 letters, we want to characterize the uncertainty in the total time it takes to type this sequence.

Do we have all we need? Not quite. Again, think like a physicist and not a mathematician. How long is Trump going to sit at the computer? (Or play with his Portable Thinking Suppression Device (PTSD)?) It can’t be forever. That means there should be a chance we never see COVFEFE. On the other hand, if we assume Trump types forever, then it is obvious that not only must COVFEFE appear, but it must appear an infinite number of times!

Indeed, if we allow the mathematical possibility of eternal typing, not only will COVFEFE appear in infinite plenitude, Trump will also type the entire works of Shakespeare, not just once, but also an infinite number of times. And the entire corpus of all works that can be typed in 26 letters sans spacing. Trump’s a genius!

Well that escalated quickly. That’s because The Limit is a bizarre place. Our intuition breaks down.

We still have to decide how fast Trump can type. Maybe two to five letters per second, but not faster than that. But that’s the physicist in me speaking. Keyboards and fingers can’t be engineered for infinitely fast typing. A mathematician might allow one character per infinitesimal time unit. If so, we have another infinity that has crept in. If one infinity was weird, try mixing two.

Point is, since probability needs assumptions, we need to make explicit all of them. The problem doesn’t do that. We have to bring our knowledge of English grammar to bear, which we always do, and which is part of the conditions. It will be no surprise people can come to different answers.

Homework: Assume finite time in which to type, and discrete positive real time to type each letter; assume also the simple characters proposition I gave and then calculate the probability of COVFEFE at t = 0, 1, 2, … n typing time units (notice this adds the assumption that letters come regularly with no variation, another mathematical, non-physical assumption). And then calculate the probability of the first appearance by t = 0, 1, 2, … n. Then calculate the expected value (is it even interesting?). After you have that, what happens if n goes to infinity? (Is that even interesting?) And can you also have the time unit decrease to the infinitesimal?

Hint. The probability of seeing COVFEFE and not seeing COVFEFE must sum to 1. If n = 1, the (conditional on all these assumptions) probability of COVFEFE is 0, and not-COVFEFE is 1. Same with n = 2, 3, 4, 5, and 6. What about n = 7? And so on?
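One way to attack the finite-n part of the homework is to track how much of the word has been typed so far and push that forward one letter at a time. The sketch below does this in R under the assumptions stated above (26 letters, one per time unit, earlier letters telling us nothing about later ones). The function name first_seen_by and its arguments are made up for this illustration.

```r
# Sketch: probability that `word` has appeared at least once by time 1..n,
# assuming each letter in `letters_set` is equally likely at every time unit
# and earlier letters tell us nothing about later ones.
first_seen_by <- function(word, n, letters_set = LETTERS) {
  w <- strsplit(word, "")[[1]]
  m <- length(w)
  k <- length(letters_set)

  # After matching the first s letters of the word and then typing ch,
  # how many letters of the word are now matched?
  next_state <- function(s, ch) {
    cand <- c(w[seq_len(s)], ch)
    for (len in min(m, length(cand)):1) {
      if (all(tail(cand, len) == w[seq_len(len)])) return(len)
    }
    0
  }

  # Transition matrix over states 0..m; state m means the word has been seen
  P <- matrix(0, m + 1, m + 1)
  for (s in 0:(m - 1)) {
    for (ch in letters_set) {
      nxt <- next_state(s, ch)
      P[s + 1, nxt + 1] <- P[s + 1, nxt + 1] + 1 / k
    }
  }
  P[m + 1, m + 1] <- 1   # once seen, always seen

  state <- c(1, rep(0, m))   # nothing typed yet
  out <- numeric(n)
  for (i in seq_len(n)) {
    state <- state %*% P
    out[i] <- state[m + 1]
  }
  out
}

p <- first_seen_by("COVFEFE", 10)
```

The first six entries of p are 0, as in the hint, and the seventh is (1/26)^7, about 1.2 in ten billion. Swap in a shorter word and a smaller alphabet to see numbers that are easier to check by hand.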

December 12, 2017 | 18 Comments

Free Data Science Class: Predictive Case Study 1, Part V


This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!

We have all we need if we want to characterize our uncertainty in future CGPAs given only the grading rules, the old observations, and the simple math notions of the multinomial model. I.e., this:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Given a set of observations of the number in each bucket, we can predict the probability of a new person having each possible CGPA. We are using the data found in Uncertainty, which is in a CSV file here.

CGPA comes with too much precision (I found the file online many, many years ago, and cannot rediscover its origins), with measurements to the hundredth place. It is therefore useful to have a function that rounds to the nearest specified decisionable fraction. I modified this roundTo function to do the job (we’re using R, obviously).

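roundTo <- function(y, num, down = FALSE) {
    # remainder of each value after dividing by the bucket width
    resto = y%%num
    # to round down use '<=' 
    if(down){
        i = which(resto <= (num/2))
    } else {
        i = which(resto < (num/2))
    }
    # if you don't think you need binary subtract, try these,
    # which should both give 0; try other numbers in (0,1)
    # a=.88; a + 1 - a%%1 - 1
    # a=.89; a + 1 - a%%1 - 1
    # push every value up to the next bucket...
    y = y + `-`(num , resto)
    # ...then pull back down the ones that were closer to the lower bucket
    if(length(i)) y[i] = `-`(y[i] , num) 
    return(y)
}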

The reason for the back ticks is given in the comments. Since we're classifying into buckets, floating point math can make buckets which should be 0 into something like 10^-16, which is not 0, and which is also not interesting. Use of the binary subtract function fixes this. If you don't understand the code, don't worry about it, just use it.

Read the data into R (put your path to the csv file into path):

# path = 'C:/mydrive/mypath/' #windows; note direction of slashes
# path = '/home/me/mypath/' # unix, mac
x = read.csv(paste(path,'cgpa.csv',sep=''))

Then apply our function:


 0  1  2  3  4 
 4 17 59 16  4 

I'm mixing code and output here, but you should be able to get the idea. There are n = 100 observations, most of which are CGPA = 2. The model (5) in code (mpp for multinomial posterior predictive):
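The call that builds the table does not appear in the listing above. Here is a minimal sketch of that step, using a handful of made-up CGPA values rather than the CSV. The vector cgpa is hypothetical, since the column name in the real file is not shown here.

```r
# roundTo as defined above, repeated so this block runs on its own
roundTo <- function(y, num, down = FALSE) {
    resto = y %% num
    if (down) {
        i = which(resto <= (num/2))
    } else {
        i = which(resto < (num/2))
    }
    y = y + `-`(num, resto)
    if (length(i)) y[i] = `-`(y[i], num)
    return(y)
}

# hypothetical CGPAs measured to the hundredth place (NOT the course data)
cgpa = c(0.32, 1.12, 1.74, 1.98, 2.26, 2.51, 3.49, 3.88)

# bucket to the nearest whole grade point, then count each bucket
counts1 = table(roundTo(cgpa, 1))

# same values, bucketed to the nearest 0.5
counts05 = table(roundTo(cgpa, 0.5))
```

With the real data, replace cgpa with the CGPA column of x. Printing counts1 gives the same kind of two-row display shown above, bucket labels on top and counts beneath.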

mpp <- function(x, nd = 3){
  # x = table of counts, one entry per bucket
  # nd = number of significant digits 
  x = (1 + x)/(sum(x)+dim(x))
  return(signif(x,nd))
}

This is model (5) in all its glory! Note that this is a bare-bones function. All code in this class is for illustration only, for ease of reading; nothing is optimized. This code does no error checking, doesn't handle missing values; it only spits out the answer given a table as input, like this (the signif rounds to significant digits):


     0      1      2      3      4 
0.0476 0.1710 0.5710 0.1620 0.0476 

Notice there is less than a 59% chance of a new CGPA = 2, but more than a 4/100 chance of a CGPA = 4. The future is less certain than the past! Suppose we wanted finer gradations of CGPA, say to the nearest 0.5:


  0 0.5   1 1.5   2 2.5   3 3.5   4 
  2   3   8  21  33  20   7   4   2  


     0    0.5      1    1.5      2    2.5      3    3.5      4 
0.0275 0.0367 0.0826 0.2020 0.3120 0.1930 0.0734 0.0459 0.0275 
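The calls that produced the two sets of probabilities are not shown in the listings above. As a sketch that needs no CSV, the printed bucket counts can be fed straight to mpp. The names counts1 and counts05 are made up for this illustration.

```r
# mpp as defined above, repeated so this block runs on its own
mpp <- function(x, nd = 3){
  x = (1 + x)/(sum(x) + dim(x))
  return(signif(x, nd))
}

# bucket counts copied from the two tables of counts printed above
counts1  = as.table(c("0" = 4, "1" = 17, "2" = 59, "3" = 16, "4" = 4))
counts05 = as.table(c("0" = 2, "0.5" = 3, "1" = 8, "1.5" = 21, "2" = 33,
                      "2.5" = 20, "3" = 7, "3.5" = 4, "4" = 2))

p1  = mpp(counts1)    # nearest whole grade point
p05 = mpp(counts05)   # nearest 0.5
```

Each probability is just (count + 1) divided by (100 + number of buckets), so with nine buckets the denominator is 109 instead of 105. That is why finer buckets pull every probability a little further from the raw frequency.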

Play with other values of num in roundTo(). We're done, really, with what we can do with (5), except, of course, for checking it on real new measurements. Which I don't have. And which brings up an important point.

The point of the predictive method is to make testable predictions, which we have just done. But we can't test them until we get new measurements. Yes, we can and will check the old data as if it were new, but this is always cheating, because as everybody does or should know, it is always possible to derive a model which fits data arbitrarily well. Schemes which split data into "training" and "testing" sets cheat too if they ever in any way use the results of the testing data to tweak the model. That just is to use all the data in fitting/training. Though there are attempts and supposed techniques to reuse data, the only way to assess the performance of any model is to compare it against data that has never before been seen (by the model).

Model (5) can't be pushed further. But we do have other formal, measured information at hand, about which more in a moment. Of informal, non-quantifiable evidence, we are loaded. We can easily do this:

    (6) Pr(CGPA = 4 | grading rules, old observation, fixed math notions, E),

where E is a joint proposition carrying what you know about CGPA; things like, say, majors, schools, age, etc. Things which are not formally measured and even unmeasurable. After all, to what schools, times, places, people does (5) apply? Pay attention: this is the big question! It by itself says all schools, all times, all places, all peoples---as long as they conform to the formal grading rules.

Pause and consider this. (5) is universal. If the old observations came from, say, Sacred Heart Institute of Technology and we knew that, which we don't (recall I found this data maybe twenty years ago from a place only the Lord knows), then we might insist E = "The school is Sacred Heart only". Or E = "The school is like Sacred Heart." Like is not quantifiable, and will differ widely in conception between people. Well, and that means (6) will be different for each different E. Each conception gives a different model!

Again, this is not a bug; it is a feature.

Notice that (6) is not (5), a trivial point, perhaps, but one that can be forgotten if it is believed there is a "true" model out there somewhere, where "true" is used in the sense that probability is real or that we can identify cause. We've already discussed this, so if you don't have it clear in your mind, review!

Next time we introduce SAT, HGPA, and time spent studying, and see what we can do with these formal measurements.

Homework: Using (5) and the data at hand, suppose there are n = 20 new students. What can you say about the predictions of the numbers of new students having CGPA = 4, etc.?
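One possible starting point for the homework, offered as a sketch and not as the intended answer: if (5) is read as a Dirichlet-multinomial with one extra count per bucket (an assumption of this sketch, though it reproduces the mpp numbers for a single new student), then the number of the m new students landing in any one bucket follows a beta-binomial. The function pred_count is made up for this illustration.

```r
# Sketch: predictive distribution of how many of m new students land in one
# bucket, ASSUMING model (5) is read as a Dirichlet-multinomial with one
# extra count per bucket.
pred_count <- function(counts, bucket, m) {
  a = 1 + unname(counts[bucket])   # weight on the bucket of interest
  b = sum(1 + counts) - a          # weight on all the other buckets
  y = 0:m
  # beta-binomial probabilities, computed on the log scale for stability
  p = choose(m, y) * exp(lbeta(y + a, m - y + b) - lbeta(a, b))
  names(p) = y
  p
}

# old observations, rounded to the nearest whole grade point
counts = c("0" = 4, "1" = 17, "2" = 59, "3" = 16, "4" = 4)

# probabilities that 0, 1, ..., 20 of the 20 new students have CGPA = 4
p4 = pred_count(counts, "4", m = 20)
```

With m = 1 the probability of one student having CGPA = 4 is 5/105, the same 0.0476 that mpp gives. With m = 20 the whole vector p4 is the prediction, which can be checked against new students when they arrive.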

December 10, 2017 | 6 Comments

Statistical Consulting

Hi, gang. This is a placeholder page advertising my services as a consultant, speaker, and teacher.

I’m moving things hither and thither, making a place for featured posts, most of which will be of temporary importance, and others will be permanent fixtures. I’ll be shifting the pages about, editing them to make more sense.

Comments are open. The new theme is disconcerting, as all change is, but I think we’ll grow used to it. Suggestions for tweaks are welcome. Recall that the main purpose of this site is to feed me. I don’t have any formal position anywhere, and use this blog as advertising for myself (feel free to insert your jokes here).

The News sticky post will move to a page, and be updated when necessary.

I tried making most of the changes over the slow weekend, in the mornings and evenings. There were some unavoidable interruptions. Apologies for that. Every theme looks great on the samples, until you try them out on your own material, where suddenly many multiples of tweaks are discovered to be needed. I am still making these.

Our Summa Contra Gentiles series resumes next week.

I’ll update this post when and if necessary.

Update: One thing that’s possible is all posts can be displayed in toto instead of in excerpt, but this would mean two columns of narrow, full posts. It would save people from having to click into an article if all they want is to read it.

December 8, 2017 | 6 Comments

The Substitute For P-values Paper is Popular

The Substitute For P-values paper is popular. Received an email from the American Statistical Association informing me of the unusual viewing activity. The email copies this earlier email (I’m cutting out the names):

No problem! I also wanted to let you know of another article that appeared as one of “Taylor & Francis’ top ten Altmetrics articles” last week (and is still doing well). It’s “The Substitute for p-Values,” by William M. Briggs (Vol 112, Issue 519 of JASA). So far, it’s seen 149 tweets from 143 users, with an upper bound of 150,371 followers! Below is the Altmetric score:

All the best,


I had never heard of Altmetric, but on looking at their list of the top 100 papers of 2015, paper number 100 had a score of 854 (top had 2782). Fame still awaits.

Paper 100, incidentally, was “Human language reveals a universal positivity bias.” Not at this blog, buster.

The main email said this:

Dear Dr. Briggs, I just thought I would make you aware that your comment “The Substitute for p-Values” (

Has been viewed more than 3,000 times and is still very popular on social media (see below).

Thank you so much for your contribution to JASA! [E], ASA Journals Manager

The link to the official paper is above (here too). The original post about it is here. The book page for Uncertainty, which contains all the meat and proofs of contentions in the paper, is here. Uncertainty can be bought here.

Don’t miss the free Data Science course, which puts all the ideas of the paper into action. This course is neither frequentist nor Bayesian nor machine learning/artificial intelligence, but pure probability.

Bonus correlation!

Just look at that! The editors’ “best books” next to readers’ favorite book. The p-value measuring this correlation must be mighty wee! Weer than wee! Wee wee. All the way home!