On The Evidence From Experiments: Part I

Bob spelled backwards is Bob.
This is going to start slow, unwind at a leisurely pace, and finish, for some, with a depressing conclusion, though we won’t get there today.

Here’s the setup. You’ve invented a new treatment to cure cancer of the albondigas and want to compare it to a placebo. Forget the ethics: this is a gedanken experiment.

Two groups of people: one gets your treatment, the other Obamacare. Kidding! The second is given placebos. Since this experiment is taking place in our imaginations, suppose there are no differences between the two groups, except one: whether a person gets the treatment or a placebo.

Don’t gloss over no differences, because this is the key. It means what it says: no differences. The people are therefore identical in every possible way. They are all men (say), of the same age, born in the same town of the same parents and have eaten the same food down to the last morsel over their perfectly interchangeable lives. Their burden of disease is the same.

They have all said and thought the same things and have had the same things said to each of them at the same moments. They have breathed the same air, read the same books, drank the same whiskey. They are all named Bob. In short, there is nothing to differentiate one Bob from another. As in nothing. Get it? Nothing.

Half the Bobs get the treatment and half the placebo. While they’re getting their meds, their lives are as identical as before. The pills are popped simultaneously, administered by the same nurse while sitting in the same doctor’s office. With a proviso, there are no differences but the form of treatment, up to and including the specific time and date (fixed in advance) at which they are measured for a cure. There is no chance of measurement error.

Only one of four things can happen. (1) Every Bob is cured, (2) no Bob is cured, (3) every treatment Bob is cured and no placebo Bobs are, (4) no treatment Bob is cured and every placebo Bob is. It’s always all or nothing because the patients are identical down to the quark, or string, or whatever is at base. It cannot be that only some treatment Bobs are cured while other treatment Bobs languish because we assumed that there was no difference in treatment Bobs, not only before they received their meds, but all the way up to the time they were measured for a cure. Thus if the treatment “works” it must do so on all treatment Bobs or none of them. The same goes for placebo Bobs. The subject of this essay is to describe what “works” means and the evidence we have for it.
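For readers who like to see the all-or-nothing claim spelled out, here is a minimal sketch in Python. It assumes, as the thought experiment does, that the result is a deterministic function of who the patient is and which arm he is in. The names `BOB`, `cured`, `run_trial` and `possible_outcomes` are hypothetical, invented only for this illustration.

```python
from itertools import product

# Hypothetical stand-in for "identical in every possible way":
# every patient in the trial is this exact same value.
BOB = ("Bob", "same town", "same whiskey")


def cured(bob, arm, rule):
    # Assumption of the thought experiment: the result depends only on
    # who the patient is and which arm he is in. All patients are BOB,
    # so only the arm can matter.
    return rule[arm]


def run_trial(rule, n_per_arm=10):
    # Give n identical Bobs the treatment and n identical Bobs the placebo.
    return {
        arm: [cured(BOB, arm, rule) for _ in range(n_per_arm)]
        for arm in ("treatment", "placebo")
    }


def possible_outcomes(n_per_arm=10):
    # Numbering follows the four outcomes listed in the text.
    labels = {(True, True): 1, (False, False): 2,
              (True, False): 3, (False, True): 4}
    found = {}
    # Try every deterministic rule: cured or not, separately for each arm.
    for t, p in product([True, False], repeat=2):
        results = run_trial({"treatment": t, "placebo": p}, n_per_arm)
        # Within an arm, every Bob must agree with every other Bob.
        assert len(set(results["treatment"])) == 1
        assert len(set(results["placebo"])) == 1
        found[labels[(t, p)]] = (results["treatment"][0], results["placebo"][0])
    return found


if __name__ == "__main__":
    for number, (t, p) in sorted(possible_outcomes().items()):
        print(number, "treatment Bobs cured:", t, "| placebo Bobs cured:", p)
```

The sketch only restates the assumption; it proves nothing about real patients. Drop the determinism (let `cured` depend on anything other than the arm) and mixed results within an arm become possible.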

Outcome 1. What cured the Bobs? It can’t have been the treatment per se, because all Bobs got better. Was it the power of their minds via the placebo effect, since the treatment was as if it were a placebo? Or were they getting better anyway and would have been cured regardless of the experiment? We don’t and cannot know. Both hypotheses are possible.

The proviso hinted at above is now important. The treatment, since it is not the placebo, must have caused some difference between the Bobs, unless the Bobs’ bodies handled the chemicals in the treatment in precisely the same way as the placebo. This is not an impossibility. We can’t know if it’s likely, though, not with the evidence we have.

Perhaps the treatment hastened the cure in the Bobs that got it. We’d never know because we measured all Bobs for a cure at the same time. Or maybe the treatment impeded the cure, but not so much that it was noticeable. Point is, we cannot say anything about the treatment, for good or bad; not with the evidence we have.

Outcome 2. We can say with certainty that the treatment cannot cure cancer of the albondigas in the given time frame for people who are exactly like the Bobs. But that’s not an especially important thing to say. Why? Because how many people (beside the Bobs) are just like the Bobs? None. The treatment might cure not-Bobs.

Don’t be trapped into thinking that the “degree” to which not-Bobs differ from the Bobs increases the chance that the treatment might effect a cure in them. For one, there is no practical way to measure all the differences between Bobs and not-Bobs; in practice we’ll always have to choose some finite number of measurable attributes and can only speak of the differences in these. That none of these are different between some Bob and some not-Bob does not mean there are no other important differences. Second reason: we have no evidence, except hope, that the treatment is of any use.

We still can’t say what changes the treatment might have caused, assuming that the Bobs processed the treatment differently than the placebo. A cure might be forthcoming at some point after the official measurement period for just the treatment Bobs or just the placebo Bobs. Or a cure might forever be blocked in the treatment but not placebo Bobs, or vice versa. Or the treatment might have worked: every albondigas of every treatment Bob might have gone into remission at some point before measurement day, only to re-cancerify once more. We can’t know; not with the evidence we have.

And these are not the only scenarios for the two outcomes. Many more can be thought of, all equally evidence free.

Part II: Outcomes 3 and 4 and more!


  1. “They have breathed the same air, read the same books, drank the same whiskey. They are all named Bob. In short, there is nothing to differentiate one Bob from another. As in nothing.”

    This is not only impractical, it is impossible. The Bobs can’t all be in the same place at the same time.

    I think I know the conclusion you are headed for and it likely applies equally well to all knowledge and hypotheses about the universe.

  2. Well… of course in real experiments all the “Bobs” aren’t exactly totally the same as each other. They are more like examples of a larger class of things whose differences are somehow typical of that class. We can get into arguments about what constitutes the class, and whether there may be identifiable subclasses, but even in the case where one is trying to do statistics to learn ‘the effect of increasing the temperature during manufacturing’ on ‘weight of widgets produced on the assembly’, no one claims that, other than temperature, absolutely everything during processing was absolutely identical for every individual widget. It is recognized that there will be variations even when temperature is held constant, and these variations are due to something which we may be able to know, identify and ultimately control, or which may be unknowable, unidentifiable and uncontrollable (or the situation may be somewhere in between).

    But what we hope (and generally assume for the purpose of analysis) is that at least these variations are typical for ‘our machinery in our physical plant’ and that they also might be typical for ‘all our widget machines installed in any arbitrary physical plant’ and so on. (And we might try to figure out whether the broader generality or the narrower one is warranted.)

  3. In this case, all we need is two Mr. Bobs, and to thank them for being such troopers. One receives the treatment and the other the placebo. No need for two troops of them.

    What does evidence free mean? “Absence of evidence is not evidence of absence,” and evidence of absence is not evidence-free.

  4. @lucia, DAV
    Variation among the Bobs will only make matters about to be discussed worse. In a gedankenexperiment it is vain to quibble over how many Bobs can dance on the head of a pin.

  5. Is the medicine identical in quality? Also, are the placebos identical in quality? It seems to me if a manufacturer is intentionally making pills which do nothing, variation in composition might not be as well controlled. Maybe I would end up with the sniffles.

  6. Ye Olde Statistician,

    “Variation among the Bobs will only make matters about to be discussed worse. In a gedankenexperiment it is vain to quibble over how many Bobs can dance on the head of a pin.”

    In what sense? It’s normal to recognize that all Bobs (or widgets) cannot be absolutely identical to each other in all ways and yet somehow react differently to something. I don’t think this is a “problem”, nor does recognizing this make matters “worse”.

    If you are alluding to the fact that some analyst may be tempted to hunt for explanatory variables, overfit, and not account for the number of different things they tried to use to “explain” when interpreting significance: I recognize that happens. But that practice is not caused by the fact that all the Bobs are not actually, literally identical in all ways. It is caused by analysts having a tendency to trick themselves.

    I do not believe that recognizing that all items in the sample are not literally, absolutely, positively identical in all ways (a truth) makes matters “worse” than thinking that they are (which is false).

  7. @lucia
    I’m pretty sure that Dr. Briggs is going to discuss the difficulties that exist EVEN IF all those variations did NOT exist. Hence, the identicalness of the Bobs is a given of the thought-experiment. What he is discussing is IN SPITE of the equivalence of Bobitude. Throw in varibobiation and it gets much more iffy.

  8. Ye Olde Statistician,
    I’m not sure what he’s going to discuss. I would not be surprised if you are incorrect. Given what he’s written in the past, I suspect he is more likely to point out that the reason the individuals in the medical study responded differently to medicine is precisely that they are not all absolutely identical and– possibly– that the notion that all the Bobs really are identical can cause problems. (At the same time, excess overfitting based on every possible identifiable difference can also cause problems in interpretation.)

    But we’ll see.

  9. Where is this heading, Briggs? Based on your other stuff recently, I suspect it is more criticism of the tools and information I use as a clinician and researcher. These kinds of posts are quite depressing to me, who spent the last few years busting my butt to finish a Master’s degree in clinical epidemiology.

    I don’t mind taking the medicine even if it is bitter, assuming it makes the patient better.



  10. Richard Feynman recounts how he taught physics to Brazilian students who simply memorized methods to solve textbook problems. They were then able to teach other students but not having any actual understanding of the subject were incapable of solving real-world problems. My impression of Mr Briggs’s endeavour is that he hopes to cure this condition in the area of statistics. At least that’s how I approach his pedagogical efforts like this one.

  11. YOS,

    Variations in experiments ALWAYS exist. The subject doesn’t matter at all. If it presents difficulties in a medical study then the same difficulties arise in physics. This also includes the assumption that past performance carries over to future performance which is one of the reasons (of course, YMMV) you don’t believe you’ll fall through the floor when you step out of bed — you never have.

    I don’t have any real idea what Matt is up to, but if it amounts to “We have ignored things, so what we claim to know isn’t quite true,” then I think it deserves a resounding “So What?”

  12. “Only one of four things can happen. (1) Every Bob is cured, (2) no Bob is cured, (3) every treatment Bob is cured and no placebo Bobs are, (4) no treatment Bob is cured and every placebo Bob is. It’s always all or nothing because the patients are identical down to the quark, or string, or whatever is at base”

    You are assuming that probability and randomness can come only from unknown facts. That is, if everything is the same then the result will always be the same.

    I think this determinism is not valid. Even if everything is the same we could still have different outcomes, and that would not be because we did not account for everything. See, not accounting for everything is akin to randomness, but the converse is not true. Knowing everything does not eliminate randomness.

  13. “Variations in experiments ALWAYS exist.”

    Except in THOUGHT-experiments. If a difficulty exists in such a thought experiment, the real-world variation will only make it worse; not magically better.

  14. “If a difficulty exists in such a thought experiment,”

    Not necessarily. It’s possible to concoct a thought experiment that contains difficulties that do not exist in real-world problems. Often people don’t do that, but sometimes they do, and there can be good reasons for either as a method of explaining limitations of methods.

  15. Despite my Gedanken Bobs being identical, I can’t seem to get all of the Placebo Bobs to be cancerous or cancer-free. Nor can I get my Treatment Bobs on the same page.

    Despite being identical at the start of the experiment they are not identical by the end of the experiment.

    I think it happens when they go home after treatment and pet their Schrodinger’s cats, and some have a live kitty to pet, and some have to bury their cats’ corpses.

  16. @laura
    But in this case, Dr. Briggs seems to have concocted a thought experiment in which he has removed certain difficulties. Add them in later to make the conclusions even less certain. The basic problem would seem to be the basic problem of all natural science: that of affirming the consequent.
    M: P→Q
    m: Q is observed,
    .: P is hmmm.

    The magic elixir is given to Bob who is suffering from cancer of the albondigas; and lo! Bob gets better.
    ● Did he get better because of the magic elixir?
    ● Because of something annexed to the magic elixir?
    ● In spite of the magic elixir?
    ● Coincidentally with the magic elixir?

    This is a logical problem lying underneath all the statistical foo-foo regarding natural and artificial variations that cloud up whether there has been an improvement at all. How do we assign an effect to a cause? Did the retrograde motion of Mars “prove” the efficacy of epicycles?

  17. Ye Olde Statistician
    Is @laura me?

    On this:

    “But in this case, Dr. Briggs seems to have concocted a thought experiment in which he has removed certain difficulties”

    How do you know? He introduced the logical difficulty that even though the Bobs are really, literally, totally completely identical in all ways, they respond differently to medicine. This introduces a difficulty that is not present in any real experiment, ever. One difficulty is that this is not actually true:

    “But that’s not an especially important thing to say. Why? Because how many people (beside the Bobs) are just like the Bobs? None. The treatment might cure not-Bobs.”

    If we found a population of ‘true Bobs’, all of whom correspond to the absolute quintessence of what it is to be a Bob, there may be jillions of Bobs. After all, we found a bunch of them in our experiments. Why would we think there are no other Bobs? In fact, maybe there are tons and tons and tons of “Bobs” out there, and the existence of Bobs is essential to the survival of the universe. Since all Bobs are identical, we can be sure all these Bobs have cancer. So, if we can cure them, that would be a good thing.

    So, it’s possibly very important to learn how to cure Bob. The only real problem is that if all Bobs are actually identical and all will respond in exactly the same way, we wasted money testing many Bobs. We could have just tested 1 Bob. Afterwards, we could just treat all the rest using any cure found to work on even one Bob. And in this Gedanken, we could be confident that, since all Bobs are identical (at least before treatment), according to the rules of this Gedanken all will be cured. (Well.. of course, my statement assumes the lapse in time between treating the first Bob and all the others is recognized to be a difference.)

    It’s true we learn little about treating not-Bob. (Though we might speculate.) But that’s not necessarily a “problem”. By definition we can’t learn everything about everything from one experiment. We merely need to know that we have only learned how to treat Bobs. And there sure seem to be plenty of them in this Gedanken because we found loads of them.

  18. Ye Olde

    “How do we assign an effect to a cause?”

    Who says one is required to do so? Anyway, inferring causes should be done purely based on statistics. This is also not a problem.

  19. YOS,

    Science proceeds by falsification, according to Popper.

    M: ¬P→Q
    m: ¬Q is observed,
    .: P.

    ¬P is your null hypothesis – the thing you’re trying to disprove.
    Q is your test interval.
    ¬Q is a significant result.
    P is that the null hypothesis is rejected.

    So in the experiment, the null is that the treatment has the same effect as a placebo. (i.e. P is that it has a different effect.) If the treatment has the same effect as a placebo, outcomes 1 or 2 should be observed. Outcomes 1 or 2 tell us nothing.

    M: ¬P→Q
    m: Q is observed,
    .: nothing about P.

    However, outcomes 3 or 4 in this story *would*. I am guessing Briggs set up the problem and covered the boring cases first, because what he really wants to talk about is whether a significant test result tells you anything either.

    And given the long-running theme of this blog, I’d say it was likely to be something to do with why a significance test doesn’t tell you the probability of the conclusion being true. Although I’m not quite sure how he’s going to do it, given that making things absolutely deterministic like this breaks most of the arguments for saying so. There may be some argument that doesn’t rely on randomness that he wants to introduce to us.

    There are of course several standard ways of breaking the “correlation implies causation” heuristic. ¬P→Q is logically equivalent to its contrapositive ¬Q→P, which suggests that the remission of the cancer caused the patient to get the treatment rather than the placebo. (Of course, logical implication is not the same relation as causation, a mistake which catches many people out. But play along…) And then there’s C→¬P and C→Q, that some common cause both cured the cancer and handed out the treatment.

    Since we haven’t been told how the treatment/placebo groups were selected, it’s not something we can rule out – at least, not in a thought experiment. But I’d be disappointed if it was anything so trivial.

    However, this is all idle speculation. We’ll have to wait and see.
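
A hedged aside for anyone who wants to check the argument forms traded in this thread by brute force. The sketch below enumerates every truth assignment of P and Q and tests whether the conclusion holds whenever the premises do. The helper names `implies` and `is_valid` are invented for this illustration, and the check covers only propositional logic; it says nothing about causation.

```python
from itertools import product


def implies(a, b):
    # Material implication: a -> b is false only when a is true and b is false.
    return (not a) or b


def is_valid(premises, conclusion):
    # An argument form is valid if the conclusion is true in every
    # truth assignment that makes all of the premises true.
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


# M: P -> Q, m: Q is observed, therefore P (affirming the consequent).
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

# M: not-P -> Q, m: not-Q is observed, therefore P (modus tollens).
modus_tollens = is_valid(
    [lambda p, q: implies(not p, q), lambda p, q: not q],
    lambda p, q: p,
)

# M: not-P -> Q, m: Q is observed, therefore P (the "nothing about P" case).
null_not_rejected = is_valid(
    [lambda p, q: implies(not p, q), lambda p, q: q],
    lambda p, q: p,
)

if __name__ == "__main__":
    print("affirming the consequent valid?", affirming_consequent)
    print("modus tollens valid?", modus_tollens)
    print("Q observed under not-P -> Q gives P?", null_not_rejected)
```

Only the modus tollens form comes out valid. In the other two forms the premises can all be true while P is false, which is the purely logical reason an observed Q settles nothing about P.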

  20. NIV,

    Only saw yours (so far), but Popper is wrong. See this series, this, this pdf, and many others. See esp. John Searle; e.g. this.

    Aside: I notice my WordPress search has a recency bias and didn’t catalog posts older than a certain date where I certainly used these keywords in posts. Have to get a better search. For instance, I have a nice quotation from Searle on Popper which I cannot find. I haven’t kept up my Classic Posts page, to my shame.


    It is fun, but odd and beside the main point, to say of a thought experiment, “I’m going to substitute my thought experiment for the host’s and discuss that instead.”

  21. @Nullus
    It’s kinda funny to see Pope Urban’s objection to the physical reality of the Copernican model, the very argument Galileo ridiculed at the end of the Dialogue, show up today as orthodox science.

    Kepler and Newton and them were a-trying to show what does cause planetary motion, not what does not. Popper was a mathematician, and his championship of modus tollens was part of a broader program to undermine the scientific certainty of the old positivist approach. (He also liked to put “success” words in scare quotes.) Much easier to accept that physical science never achieves the certainty of mathematics, even when it is modeled by mathematics.

    M: ¬P→Q
    m: ¬Q is observed,
    .: P.

    M: Heliocentrism → visible stellar parallax
    m: No stellar parallax is observed
    .: Therefore, heliocentrism is falsified.

    Might could be there is more to it than that, as Duhem and Quine showed. There is never just one P, and when Q is negated, how does one know which P has been falsified?
