The Lady Tasting Tea: Bayes Versus Frequentism; Part III (update)

Read Part I, Part II. The text has again been expanded and corrected.

We have our model in hand. Our model, “has the ability”, says (see Part II) that the lady can guess any number of the N cups correctly. All the lady knows is that N is divisible by 2 and that she will see an equal number of milk-first and tea-first cups. She will receive no feedback on her guesses. Thus, we do not assume (initially) that she will employ an optimal guessing strategy.

What is an optimal guessing strategy? Suppose we gave the lady feedback and told her whether her guesses were right or wrong as the experiment progressed. If, say, with N = 8, the first four cups were all milk-first and she knew she got these all correct, even if she has no ability and did so just by guessing, then (if she was paying attention) she ought to get the last four correct, too (even before tasting!). My experience with ESP testing suggests most people do not use optimal guessing strategies, but if they did we could account for it, though it is not easy to do so. So for ease, we’ll forbid feedback.

Recall that, in Bayes, all probabilities are conditional, so that we need to be clear about what premises we are conditioning on. All probabilities are conditional in frequentism, too, but this is not acknowledged, so the premises are often hidden (which is one path to over-certainty).

Question 1 Given this model (and only our other premises), and before running the experiment, what is the probability the lady guesses 0 right, 1 right, 2 right, up to N right? This question is equivalent to asking what fraction of cups she will guess correctly: 0/N, 1/N, up to N/N. It is not equivalent to asking what sequence of correct and incorrect guesses she will evince. The fraction of correct guesses is easily answered, for 0, 1, …N is 1 / (N+1), 1 / (N+1), …; that is, the probability that she guesses j cups correctly is 1 / (N+1) for j = 0, 1, …, N.

Stated yet one more way, since we have assumed as a premise the model that she may guess any number of cups correctly, the probability that she does so is 1 divided by the number of possibilities. (That last statement is not assumed, but is derived: those who want the full-blown mathematical details may download this paper, which itself relies on this paper.)
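
To make the claim concrete, here is a small sketch (the choice N = 8 is just an illustration, matching the eight-cup example used later) that writes down this uniform distribution over the number of correct guesses and checks that it sums to one:

```python
# Under the "has the ability" model, the number of correct guesses j can be
# anything from 0 to N, and each count is a priori equally likely: 1/(N+1).
from fractions import Fraction

N = 8  # an illustrative choice, not dictated by the model
prior = {j: Fraction(1, N + 1) for j in range(N + 1)}

assert sum(prior.values()) == 1   # the probabilities sum to one
print(prior[0], prior[N])         # each count has probability 1/9
```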

Question 2 Suppose we run our experiment and are interrupted after N − 2 cups (in the paper linked above, we use testing nuclear reactors instead of cups of tea, where interruptions are common). Given our model and premises, but also given her guesses up to this point, what is the probability that she guesses 0 cups right, 1 right, up to N − 2 cups right? The exact answer has a simple mathematical form (given in the first paper linked). But the real point of interest for us is that this answer exists naturally in Bayes, but not in frequentism: another major criticism.

Question 3 The experiment is finished! She has guessed M correct out of N (M is a sum of the correct milk-first and correct tea-first cups). Here is a non-trick question: Given our model and given M, what is the probability that she guessed a fraction K / N correct, where K does not equal M? It is 0, or 0%. A silly question to ask, yes, but let’s expand it. Same premises: what is the probability she guessed a fraction M / N correct? It is 1, or 100%. Another silly question, trivially answered. So why bother?

Frequentist theory would have us ask something like this: what is the probability that she guessed (M + 1) / N correct, and the probability she guessed (M + 2) / N correct, and (M + 3) / N correct, up to N / N correct? In Bayes, the sum of these probabilities is 0, as we just agreed. But not in frequentism, where the meaning of the word “guessed” is changed. It no longer means “guessed” but “Might be guessed were we to embed the experiment in an infinite series of experiments, each ‘identical’ with the first but ‘randomly’ different; we also hypothesize that if we were to average the correct guesses of this infinite stream, the result would be precisely N / 2 correct guesses.”

In other words, frequentist theory demands we calculate a probability of what could have, but did not, happen in “repeated trials” (where “repeated trials” is shorthand for “embedded in a sequence of infinite repetitions”). The theory must also hypothesize a baseline, a belief that the infinite sequence converges to some precise average (here, N / 2 correct guesses). Stated differently, frequentist theory asks the probability of seeing results “better” or “worse” than what we actually saw, given the model is true, a value for the baseline, and M.
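
As a concrete sketch of the quantity frequentist theory computes, here is the one-sided binomial tail probability under the hypothesized baseline (each cup effectively a fair coin, so the long-run average is N / 2); the values N = 8 and M = 7 are illustrative assumptions only:

```python
# The frequentist tail probability: chance, under the baseline p = 1/2,
# of M *or more* correct guesses out of N -- results as good or "better"
# than what was actually seen.
from math import comb

def p_value(M, N):
    """One-sided binomial tail: P(M or more correct | baseline p = 1/2)."""
    return sum(comb(N, k) for k in range(M, N + 1)) / 2**N

# e.g. the lady gets 7 of 8 cups right:
print(p_value(7, 8))   # (8 + 1) / 256 = 0.03515625
```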

This violates our agreement that we should use only the evidence from the experiment (and knowledge of the experimental set up) to judge the truth of our model. Frequentism does not make statements about what happened, but about what might have happened but did not, in experiments that will never be conducted.

This probability is the P-value. If the P-value is “small”, the hypothesis that the baseline is N / 2 is “rejected”, i.e., it is believed to be certainly false. I mean certainly in the sense of certainly. The P-value does not give a probability that the baseline is false: it instead asks you to believe absolutely in the truth or falsity of some contingent hypothesis (i.e. that the “baseline = N / 2”). In other words, a decision based on the P-value implies that the probability of “baseline = N / 2” is 1 or 0 and no other number. A subtle, but damning, criticism is that (except in circular arguments) no contingent hypothesis can be certainly true or false, so the use of the P-value is immediately unsound.

Harold Jeffreys (homework from Part I) said, “What the use of P [values] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” Your homework this time is to explain this quote in the tea-tasting context. Why ask for probabilities of events that did not occur? Why is the P-value (see Parts I and II) not the answer to the question, “What is the probability she has the ability?”

In Part IV: “Hey, what about Fisher’s exact test! Surely that fixes frequentism?” It does not, and don’t call me Shirley.


  1. Devil’s Advocate Says:

    Let R be the subset of the N cups she gets right, so what you are calling j is the number of cups in R. Note in passing that we can obtain j from R, but we cannot obtain R from j alone. There are 2^N possibilities for R; we derive that the a priori probability of each is 2^(-N). Since j is a function of R, the probability distribution for j is thus determined: it is binomial with parameters N and p=1/2.
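
    The claim above is easy to verify by brute force; a sketch (N = 6 is an arbitrary small choice so the enumeration stays tiny):

```python
# Enumerate all 2^N equally likely right/wrong patterns R over N cups and
# tally j = |R|, the number correct; the counts should match C(N, j),
# i.e. j is binomial with parameters N and p = 1/2.
from itertools import product
from math import comb
from collections import Counter

N = 6
counts = Counter(sum(r) for r in product((0, 1), repeat=N))

for j in range(N + 1):
    assert counts[j] == comb(N, j)   # so P(j) = C(N, j) / 2^N
print(dict(counts))
```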

  2. I hinted at this earlier but so far I don’t see any progress toward answering the question of the lady “possessing the ability”, which was the point (I thought, anyway — perhaps “Can we use this experiment to discover whether the lady has the ability she claims?” may have misled me). In systems engineering it would be necessary to state a minimum performance. IOW: you need to clearly define what is meant by ability. I hope you aren’t going to say that she possesses k/N probability of “having the ability” (which you discovered during the experiment) because that’s the answer to a different question. Worse, you appear to be leaning toward “anything goes”. Now you did list a number of possible meanings of “has the ability” but if you were writing requirements, you gotta pick one. You never did that IMO. “I’ll know it when I see it” doesn’t cut it.

  3. Can we use this experiment to discover whether the lady has the ability she claims?

    Seems to me William is moving toward a philosophical position where the answer to this question will be, “There is no possible experiment that will answer this question” because all knowledge is probabilistic and no probabilistic statements are either true or false. Or, equivalently, everything is uncertain and uncertainty is expressed through probabilities.

    And, much like solipsism, it is a position unassailable by logic but one which no philosopher has ever been known to adopt outside his study.

  4. Rich,

    Not at all; in fact, it is the opposite of solipsism. I adopt the same position that Aristotle did, and that nearly all people use “outside their studies.” All contingent statements can be known to be true only probabilistically. Now, these probabilities can be as close to one as you like, but they are still not one. Outside your study there might be a car in the parking lot. You see it and, conditional on that evidence, and on the evidence of your memory, you conclude, “That car is mine.” For most actions in life, this is close enough, even though this is a contingent statement.

    But since it is possible for both your eyes and your memory to go on the fritz, you would have done better saying (e.g.), “The probability that car is mine is nearly 1.” You usually then condition your future actions with the new premise “That car is mine” and you will only rarely (and maybe never) make a mistake. (Remember, all probability and all knowledge is conditional, even the a priori, which is conditional on our intuitions.)

    Now, if you can remember, I often make statements where I assert that we know very many things with certainty: all a priori knowledge fits this category, but so do many, many conditional statements. All of our proved theorems based upon a priori axioms fall into this category. All of our knowledge of ordinary outside-the-study logic fall into this category. And there are many more. I am as far from solipsism as you can get.

    As DAV suggests, with Bayes we are moving toward a position where we will be able to say that the statement “the lady has the ability” will only be true with a certain probability. But since all our contingent hypotheses are like this, this is hardly controversial. This position has the decided benefit that it answers the very question we want to know, albeit with a probability. It has the additional (priceless) property that our model can be put to the test and its value decided (even though we only know this model via probability).

    But frequentism answers a question nobody wants to know, or at least one which is not the question before us. Plus, all the other arguments I listed against frequentism in these three posts.

    I notice that, for the second post on this subject in a row, we have had no defenders of frequentism attempt to refute the criticisms against it.

  5. Been out of pocket, so I haven’t caught up yet.

    “But frequentism answers a question nobody wants to know, or at least one which is not the question before us. ”

    Point one is your opinion, not a fact. Lots of people want to answer the questions that frequentism can. With regard to your second point, you purposely posed a question that frequentism is not designed to answer, and I completely disagree that the question you pose is better or more important than the ones that frequentism can answer.

    I have no problem with Bayesian methods and the insights they can provide. But the wholesale dismissal of frequentism that you seem to advocate is utter silliness.

  6. Mike B,

    I do advocate abandoning frequentism. For one, it is incompatible with Bayes; a seemingly small criticism. But all science, math, and philosophy is the search for dogma (the ultimate truths). And once we find it (which is rare enough) we should use it, and not something else merely because it is familiar or its alternative is not well understood (I mean the philosophy and not the mechanics). Now, I won’t debate this here, because it isn’t the place. We could go on and on about this, but we will have ignored what is of interest: the difference in the two theories of probability.

    Of course what you say is true: frequentists do want to answer the question, “What is the probability of events that did not happen?” instead of “What is the probability my theory is true.” But since the former is logically unrelated to the latter, I should have said, “Nobody should want to ask this question.”

    I invite you to answer the criticisms made in Parts II and III, particularly those related to Jeffreys’s well-known objection.

  7. I won’t debate… the difference in the two theories of probability.

    As far as I know, there is only one theory of probability.

    Question 3:… Another silly question, trivially answered. So why bother? Only to illustrate that the question is trivial in Bayes, but not in frequentism.

    It’s trivial to any statistician. I don’t see why the question has anything to do with being a frequentist or Bayesian.

    Both Frequentists and Bayesians want to make inferences about some unknown quantities based on collected data. If one wants to describe or summarize the already-observed data, which is what you are doing in Question 3, simple graphical/numerical descriptive statistics will suffice… no need to ask the probability question, no inferences to be made, hence no frequentism or Bayesianism involved. And yes, it’s a silly question.

  8. JH,

    There are, in fact, many theories of probability. A decent book, now somewhat out of date and out of print, is Terry Fine’s Theories of Probability. Fine is another Cornellian; I’ve had discussions with him about this (now years ago) and it is his opinion that there are now three main rivals: Bayes (in its subjective form), frequentism (not exactly von Mises, but via Kolmogorov), and (what I’ll call) “computer science”, which is a “whatever works” kind of pragmatism. The statistician Glenn Shafer advocates something like the last, but in a sophisticated way. There is also a book by Louis Narens, Theories of Probability, which I haven’t read. Also look at the book that was last donated to me (link to the left). Of course, I advocate something very similar to Jaynes, Cox, and Stove.

    Had a go at Jeffreys’s criticism yet?

  9. Yes, Bayes employs “probability”, but I don’t think it is a probability theory. This is the probability theory that I know. And I am not talking about all the theorems proved within the book or any book.

    Jeffreys’s criticism? Been a bit stressed out to think about it… Please post something about it again after April 15. ^_^

  10. JH,

    Billingsley (great book) is, of course, based on Kolmogorov. Let’s also not forget de Finetti.

  11. I came upon an article about a paper (pdf) that claims that people can discriminate between criminals and non-criminals just by looking at them. The authors are from Cornell, too.

    The most detail we get from the article is this:

    Valla et al. simply ask their experimental participants to indicate how likely they think it is that each man is a certain type of criminal (murderer, rapist, thief, forgerer, assailant, arsonist, and drug dealer) on a 7-point Likert scale from 1 = extremely unlikely to 7 = extremely likely. Their results from two experiments consistently show that individuals can tell who is a criminal and who is not, by indicating that they believe the actual criminals have higher probability of being a criminal than actual noncriminals.

    They do have a link to the paper, which is not behind a paywall. And they appear to have quoted the scales wrong (1-9, not 1-7). But I did find the money quote from the paper:

    A series of mixed effects ANOVAs, with Subject entered as a random effect, and
    Photo Category (Criminal/Non-Criminal) and Attractiveness entered as fixed effects,
    revealed that criminals were rated as significantly more likely than non-criminals to have
    committed murder, rape, theft (p’s < .0001), and forgery (p = .04).

    So you know they’re right. Just look at the p values!

  12. I have the strongest feeling that the meaning of some of these words is changing as we watch. For example: As DAV suggests, with Bayes we are moving toward a position where we will be able to say that the statement “the lady has the ability” will only be true with a certain probability. But since all our contingent hypotheses are like this, this is hardly controversial. This position has the decided benefit that it answers the very question we want to know, albeit with a probability.

    “Know” and “answers” seem a little fluid. It seems clear to me – alone on my island maybe – that outside the study a statement like, “the lady has the ability with probability p” would not be taken as an answer. A likely response would, I think be, “Well has she or hasn’t she?” or maybe if p=0.9, “so you’re saying she does have the ability?” and with p=0.1, “So are you trying to tell me she doesn’t?”

    In the study the niceties of probabilistic statements are fine but outside we have to make a decision (maybe there’s a prize). Yes, I know non-statisticians don’t think clearly and they should but I have to live with them.

    Perhaps I’m a bit sensitive to these issues. I fell out with a manager once who asked me, “How many incoming phone lines do we need so no-one gets a busy signal?” I responded with, “What probability of a busy signal is acceptable? 1 in a hundred? 1 in a thousand?” He replied, “Zero”. When I said, “In that case, one line for each person who wants to call us” he told me I was stupid and to give him a sensible answer.

    We’re outside the office and it’s a jungle out here!

  13. JH,

    Amen, sister. I am in Rome, waiting for you.


    That paper reminds me of a George Carlin joke about how to get out of jury duty. Just tell the judge, “I can spot a guilty person just like that!”


    The phone line example is perfect (and Bayes!).

  14. Rich,

    Your answer regarding the number of phone lines was indeterminate and not actionable. What could a manager possibly do with such an answer? Clearly, the correct answer was zero.


    Why would they arrest the guy if he were innocent?

  15. Let’s say that we set the “has the ability” test to be that the lady calls each cup of tea correctly. And let’s say that we ran the test with 8 cups of tea, with the assumptions you listed before, and she correctly called each cup of tea. Under Bayesian statistics, would we say that she has the ability with a probability of one, dependent on our assumptions, or would we say that she has the ability with a probability of 1-(1/256), allowing that maybe she was just lucky?
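
    One illustrative way a Bayesian might answer this (every number here is an assumption of the sketch, not the post’s own calculation): compare the post’s “has the ability” model, under which the chance of 8 of 8 correct is 1/9, against pure cup-by-cup guessing, under which it is 1/256, starting from equal prior odds:

```python
from fractions import Fraction

p_data_ability = Fraction(1, 9)    # "has the ability": uniform over j = 0..8
p_data_guess   = Fraction(1, 256)  # pure guessing: each cup a fair coin
prior_ability  = Fraction(1, 2)    # equal prior odds (an assumption)

# Bayes' theorem: P(ability | 8 of 8 correct)
posterior = (p_data_ability * prior_ability) / (
    p_data_ability * prior_ability + p_data_guess * (1 - prior_ability)
)
print(posterior, float(posterior))   # 256/265, roughly 0.966
```

    On this setup the answer is neither of the two options in the comment, which is part of the point: the posterior depends on the models compared and the priors assumed.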

  16. I am puzzled. Interested (which is what leads me to reading) but puzzled.
    While I am reasonably skilled with operations on measurable sets, even in the case when the chosen measure is called a probability, I apparently am not a statistician.
    I even admit that, as a physicist, I can’t really understand where these distinctions between frequentists and Bayesians actually matter.
    Questions that are supposed to be important aren’t to me.
    The right answer to the phone line question?
    We need 2356 lines. How did I find out?
    I checked that the maximum number of incoming simultaneous calls registered at that location since the company has existed was 1178. Then I applied the rule that we use in the design of nuclear safety systems: I multiplied the above number by 2.
    I could round up to some nice number (like 3000) if the pricing was aggressively degressive with quantity.
    Do you want to falsify this number? Just try. You will see that your whole life won’t be enough to do so. Therefore it is the right answer.
    Yes, there are priors that I am well aware of, and I would state them to the boss. The main one being that the answer is valid as long as the company’s business and organisation don’t change by more than some number that I can give (e.g., the turnover doesn’t increase by more than a factor of 2).
    Clearly the answer given in the original example was just (provocatively) lazy and the boss was right to be angry; I would have been too.
    Etc.
    Does the lady have the ability?
    Obviously the answer can only be “yes” or “no” to THIS question.
    “Perhaps yes” or “perhaps no” are not options because they would just mean that I have not a clue.
    By definition the possession of an ability is binary: “yes” or “no”.
    The very fine point being that if the right answer is “yes”, there is no way to be sure by running tests in finite time.
    If the right answer is “no”, then a single test would be enough to prove that “no” is the answer and we could send the lady home; with “yes” there will always be doubt.
    But sure, the intensity of my doubts would decrease after decades and decades of successful (e.g., N right answers on N cups) tests.

  17. Ooh Tom, you certainly know about provocative. My manager got the sensible answer first time. The provocative answer was supposed to provoke a realization about the problem but, in my judgment, he didn’t want to think. The standard engineering analysis of your solution says there will be a 1.9366e-200 probability that someone will get a busy line. Small, true, but infinitely larger than zero which was the design spec.
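
    For the record, the standard engineering tool behind that kind of busy-line number is the Erlang B formula; a minimal sketch (the offered load of 10 erlangs and the 15 lines below are made-up illustrations, not numbers from this thread):

```python
def erlang_b(E, m):
    """Blocking probability for m lines offered E erlangs of traffic,
    via the stable recurrence B(E, k) = E*B(E, k-1) / (k + E*B(E, k-1))."""
    B = 1.0                      # B(E, 0) = 1: with no lines, every call blocks
    for k in range(1, m + 1):
        B = E * B / (k + E * B)
    return B

print(erlang_b(10.0, 15))   # small, but never the exact zero the manager wanted
```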

  18. JH and Rich

    I didn’t say that I had an economical answer 🙂
    But I had a technical one, which was what was asked for.
    Btw Rich, I would add to any such design spec a definition (like the fine print, you know) saying:
    For design purposes, and under the full responsibility of the design engineer, if any X < 10^-100 then X = 0.

  19. Well, if the tea lady understood Fourier’s heat transfer equation she would know with 100 percent certainty the difference between early and late milked tea simply by the temperature/time relationship. She would not need statistical bafflegab to obfuscate the obvious. Science is a great tool. Junk science just adds layers of confusion…

  20. So what I’d like to know is this (and please forgive me if I should have deduced the answer from the article, but it’s beyond my ken):

    Irrespective of Granny’s grasp of heat transfer equations, let’s say that things start to take a serious turn and we are to bet on whether or not Granny can repeat her performance of correctly assessing 10 cups of tea out of the 10 she tried.

    Granny is seated, 10 cups of tea are poured — what price does the Bayesian bookie place on the proposition that Granny goes 10 for 10 in the next round? The frequentist?

  21. Big Mike,

    Just the right question. See Part IV.


    Fourier forsooth! Knowing the temperature of the milk and tea would not tell you what was poured into the cup first. Unless you claim that the microscopic change in temperature in the jug or urn, as it sits waiting while the first one is poured out, can be discerned by any human tongue.

  22. Hi,

    Lotteries and casinos are great verifications of frequentism. The hypothetical limits do not seem to be in any danger of suddenly changing overnight.

    I’m all for using frequentism to evaluate the long-term properties of Bayesian methods, and retaining any that have good properties.

    Ideally your n is large, and both “competing” methods basically agree anyway.

    Justin Z. Smith

  23. Justin,

    Lotteries and casinos are great verifications of logical probability, i.e., Bayesian probability. The long run works out because the short run does. Remember, you’re starting any casino problem with deduced, that is, logical, probabilities, not frequencies.

  24. Hi,

    Not being argumentative here (tone is lost on the ‘net) but how would you go about something like this: Say we create our own casino game. We haven’t decided on all the rules, but it will use a coin that has two sides and an edge. We have not flipped it yet. We try to ascertain


    Justin Z Smith

  25. Hi,

    Another question I’ll throw out for discussion is, do Bayesians have any philosophical issues with using MCMC?

    I’ve only used it a few times in grad. school, but I ask because it has always seemed frequentist-y to me. I see it as somewhat analogous to the frequentist issue of when to stop for n in

    P(A) = lim (n → ∞) #A / n

    to get a good approximation of P(A). Say A is the outcome of Heads in a coin flip.

    For example:

    -Use a burn-in period?
    (make coin flips > some small number, since relative frequency is “rough” for a small number of flips)

    -Use more iterations?
    (flip the coin more times, you know it will have a better chance of convergence)

    -Use more chains?
    (flip more coins, multiple evidence of converging is better evidence of existence)

    -Starting with a different seed?
    (and if still converges with different seeds, this is like entering a ‘collective’ randomly and still getting the same relative frequency)

    Justin Z. Smith
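
    A minimal random-walk Metropolis sketch of the kind of MCMC the comment asks about, with a burn-in discarded exactly as described (the coin data, step size, seed, and iteration counts are all illustrative assumptions):

```python
import random
from math import log, exp

random.seed(0)
heads, flips = 7, 10            # suppose we saw 7 heads in 10 flips

def log_post(p):
    """Log posterior for the coin's bias p: uniform prior, binomial likelihood."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    return heads * log(p) + (flips - heads) * log(1.0 - p)

p, samples = 0.5, []
for i in range(20_000):
    prop = p + random.gauss(0.0, 0.1)   # random-walk proposal
    if random.random() < exp(min(0.0, log_post(prop) - log_post(p))):
        p = prop                        # accept; otherwise keep the old p
    if i >= 2_000:                      # discard the burn-in period
        samples.append(p)

est = sum(samples) / len(samples)
print(round(est, 3))   # should land near the exact posterior mean 8/12
```

    Running it with different seeds or more chains, as the comment suggests, is a convergence check; but the posterior being approximated is fixed by the model, not by sampling frequencies.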

  26. Can I ask an idiot’s question!

    Prof Briggs says:
    “The fraction of correct guesses is easily answered, for 0, 1, …N is 1 / (N+1), 1 / (N+1), …; that is, the probability that she guesses j cups correctly is 1 / (N+1) for j = 0, 1, …, N. ”

    There is no j in the equation 1 / (N+1) to alter for j = 0, 1, … , N.

    I assume the equation should be j / (N+1), but am concerned that saying this will show just how ignorant I am being.

    But if you never ask you never know!

    What should be changing as j changes from 0 through to N?
