Sampling Variability Is A Screwy, Misleading Concept

A statistician about to draw snakes from an urn.

Because of travel and jet lag, exacerbated by “springing forward”, we continue our tour of Summa Contra Gentiles next week.

If you can’t read the tweet above, it says “Pure data analysis cannot kill inference. Sampling variability cannot be hidden!!!” This was in response to my “Journal Bans Wee P-values—And Confidence Intervals! Break Out The Champagne!” post.

Sampling variability is a classical concept, common to both Bayesian and frequentist statistics, a concept which is, at base, a mistake. It is a symptom of the parameter-focused (obsessed?) view of statistics and causes a mixed up view of causation.

There are 300-some million citizens living in these once United States. Suppose you were to take a measure of some characteristic of fifty of them. Doesn’t matter what, just so long as you can quantify it. Step two is to fit some statistical model to this measurement. Don’t ask why, do it. Since we love parameters, make this a parameterized probability model: regression, normal or time series model, whatever. Form an estimate (using whichever method you prefer) for the parameters of this model.

Now go out and get another fifty folks and repeat the procedure. You’ll probably get a different estimate for the model parameters, as you would if you repeated this procedure yet again. Et cetera. These differences are called “sampling variability.” There is no problem thus far.
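For those who like to see it happen, here is a minimal sketch of the procedure just described. The "population" is a made-up list of 10,000 numbers (any list of measurements would do; the values, the sample mean as the "estimate", and the function name are all assumptions for illustration only).

```python
import random
import statistics

def repeated_sample_means(population, sample_size=50, repeats=5, seed=0):
    """Take `repeats` separate samples (each without replacement)
    and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]

# Hypothetical finite population: invented numbers standing in for
# whatever characteristic was measured on each citizen.
_rng = random.Random(42)
population = [_rng.gauss(100, 15) for _ in range(10_000)]

means = repeated_sample_means(population)
for i, m in enumerate(means, 1):
    print(f"sample {i}: estimate = {m:.2f}")

# With every member measured there is nothing left to estimate.
print(f"whole population: {statistics.mean(population):.2f}")
```

Each group of fifty gives a somewhat different estimate; the last line needs no estimate at all, because every member has been measured.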

Next step is to imagine collecting our measurement on all citizens. At this point there would be no need for any statistical model or probability. Our interest was this group of citizens and none other. And we now know everything about them, with respect to the measurement of interest. Of course it depends on the measurement, but it’s not likely that every citizen has the same measurement (an exception is “Is this citizen alive?” which can only be answered yes for members of the group now living). The inequality of measurement, if it exists, is no matter; the entire range of measurements is available and anything we like can be done with it. Probability is not needed.

So why do I say sampling variability is screwy?

Why did we take the measurements in the first place? Was it to learn only about the fifty citizens polled? If that’s true, then again we don’t need any statistical models or probability, because we would then know everything there was to know about these fifty folks with respect to the measurement. There is no need to invoke sampling variability, and no need for probability.

If our goal wasn’t to say something about only these fifty, then the measurements and models must have been to say something about the rest of the citizenry, n’est-ce pas? If you agree with this, then you must agree that sampling variability is not the real interest.

To emphasize: the models are created to say something about those citizens not yet seen. There is information in the parameters of the model about those citizens, but it is only indirect and vague. There can be information in the internal metrics like p-values, Bayes factors, or other model fit appraisals, but these are either useless for our stated purpose or they overstate, sometimes quite wildly, the uncertainty we have in the measurement for unseen citizens.

That means we don’t really care about the parameters, or the uncertainty we have in them, not if our true interest is the remaining citizens. So why so much passionate focus on them, then? Because of the mistaken view that the measures (of the citizens) are “drawn” from a probability distribution. It is these “draws” that produce, it is said, the sampling variability.

The classical (frequentist, Bayesian) idea is that the measures are “drawn” from a probability distribution (the same one used in the model), that the measures are “distributed” according to the probability distribution, that they “follow” this distribution, that they are therefore caused, somehow, by this distribution. This distribution is what creates the sampling variability (in the parameters and other metrics) on repeated measures (should there be any).

And now we recall de Finetti’s important words: “Probability does not exist.”


If this is so, and it is, how can something which does not exist cause anything? Answer: it cannot.

The reality is that some thing or things, we know not what, caused each of the citizens’ measures to take the values they do. This cannot be a probability. Probability is a measure of uncertainty, the measure between sets of propositions, and is not physical. Probability is not causality. If we knew what the causes were we would not need a probability model, we would simply state what the measurements would be because of this and that cause.

Since we don’t know the causes completely, what should happen is that whatever evidence we have about the measurements leads us to adopt or deduce a probability model which says, “Given this evidence, the possible values of the measure have these probabilities.” This model is updated (not necessarily in the sense of using Bayes’s theorem, but not excluding it either) to include the set of fifty measures, and then the model can and should be used to say something about the citizens not yet measured.

Since I know some cannot think about these things sans notation, I mean the following. We start with this:

     [A] Pr( Measure take these values in the 300+ million citizens | Probative Evidence),

where the “probative evidence” is what leads us to the probability model; i.e. [A] is the model which tells us what probabilities the measures might take given whatever probative evidence we assume. After observations we want this:

     [B] Pr( Measure take these values in remaining citizens | Observations & Probative Evidence).

This gives the complete picture of our uncertainty given all the evidence we decided to accept. Everybody accepts observations, unless doubt can be cast upon them, but the “Probative evidence” is subject to more argument. Why? Usually the model is decided by tradition or in some other non-rigorous manner; but whatever method of deciding the initial premises is used, it produces the “Probative evidence.”

There is thus no reason to ever speak of “sampling variability.” If we do happen upon another set of measurements (no matter the size: only theory insists on equal “n”s each time), then we move to this:

     [C] Pr( Measure take these values in remaining citizens | All Observations thus far & Probative Evidence).

Once we measure all citizens, this probability “collapses” to probability 1 for each of the measures: e.g., “Given we measured all citizens, there is a 100% chance exactly 342 of them have the value 14.3,” etc.

Sampling variability never enters into the discussion because we always make full use of the evidence we have to say something about the propositions of interest (here, the measurement on all citizens). We don’t care about the interior of the models per se, the parameters, because they don’t exist in [C] (either they never exist, which is ideal, or they do as an approximation and they are “integrated out”). Neither does [C] say what caused the measures; it only mentions our uncertainty in unseen citizens.

The measure is not “distributed” by or as our model; instead, our model quantifies the uncertainty we have in the measure (given our probative premises and observations).
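A minimal sketch of [A], [B] and [C] in the simplest case, a yes/no measure on a finite group of N citizens. The probative evidence assumed here is my own illustrative choice, not a claim about any real problem: we know only N, and we give equal probability to every possible total count of "yes". The function name `predictive` is hypothetical. No parameters appear anywhere.

```python
from fractions import Fraction
from math import comb

def predictive(N, n, k):
    """Pr(j of the N - n unmeasured citizens are "yes"
          | k of the n measured are "yes", assumed evidence).

    Assumed probative evidence: every total count of "yes" from 0 to N
    is equally probable before any observation. Returns a list indexed
    by j = 0, ..., N - n, with exact fractions.
    """
    # Number of ways to see k "yes" among the n measured when the
    # total number of "yes" in the whole group is k + j.
    weights = [
        comb(k + j, k) * comb(N - k - j, n - k)
        for j in range(N - n + 1)
    ]
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

before = predictive(10, 0, 0)    # [A]: nothing measured yet
after = predictive(10, 4, 1)     # [B]: 4 measured, 1 "yes"
complete = predictive(10, 10, 3) # everyone measured
```

With nobody measured, `before` spreads probability evenly over the eleven possible counts; with everybody measured, `complete` is the single value 1, the "collapse" described above. At no point is anything "drawn" from a distribution.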


The incorrect idea of “drawing from” probability distributions began with “urn” models, an example of which is this. Our evidence is that we have an urn from which marbles are to be drawn. Inside the urn are 10 black and 15 white marbles. Given this evidence, the probability of drawing a white marble is 15/25.

Suppose we drew out a black; the 10/25 probability did not cause us to select the black. What caused the draw were the physical forces mixing the marbles from their initial condition (however they were put there), the constituents of the marbles themselves, and the manner of our drawing. This is why we do not need superfluous and unduly mystical words about “randomness.”

We don’t need sampling variability here either. If we draw more than one marble, we can deduce the exact probability of drawing so-many whites and so-many blacks, whether or not we replace the marbles after each draw. This isn’t sampling variability, merely the observational probability [C]. And, of course, there are no parameters (and never were).
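The deduction can be written out directly. This is a sketch using the urn's stated contents (15 white, 10 black); the function names are my own, and the formulas are the standard counting results for draws without and with replacement.

```python
from fractions import Fraction
from math import comb

def prob_whites_without_replacement(whites, blacks, draws, k):
    """Pr(exactly k whites in `draws` marbles | urn contents),
    marbles not returned to the urn (assumes 0 <= k <= draws)."""
    total = whites + blacks
    return Fraction(
        comb(whites, k) * comb(blacks, draws - k),
        comb(total, draws),
    )

def prob_whites_with_replacement(whites, blacks, draws, k):
    """Pr(exactly k whites in `draws` marbles | urn contents),
    each marble returned before the next draw (assumes 0 <= k <= draws)."""
    p = Fraction(whites, whites + blacks)
    return comb(draws, k) * p**k * (1 - p) ** (draws - k)

# One draw: 15/25 either way, as stated in the text.
one_white = prob_whites_without_replacement(15, 10, 1, 1)

# Three draws, exactly two whites.
two_of_three_no_repl = prob_whites_without_replacement(15, 10, 3, 2)
two_of_three_repl = prob_whites_with_replacement(15, 10, 3, 2)
```

Everything here follows from the evidence about the urn's contents; nothing is estimated.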

If you get stuck, as many do, thinking about “randomness” and causality, change the urn to interocitors which can only take two states, S1 or S2, with 10 possibilities for the fictional device to take S1 and 15 for S2. Probability still gives us the (same) answer because probability is the study of the relations between propositions, just like logic, even though interocitors don’t exist. Think of the syllogism: All interocitors are marvelous and This is an interocitor; therefore, This is marvelous. The conclusion given the premises is true, even though there are no such things as interocitors. See these posts for more.


  1. That should be “n’est-ce pas” and not “n’cest pas”

    Also, it must be that I was taught a totally different kind of statistics. The definition of “sample variability” I had to retain was simply “a description of how large the sampling error would be over hypothetical repeated samples from the same population”, with “sampling error” being the difference between the value of the quantity of interest in the population and the estimate we get from a sample.

  2. Matt,

    You may have a small text error in your penultimate paragraph.

    The third sentence ends “with our without considering we replace the marbles after each draw.”

    Should this perhaps be “with or without considering whether we replace the marbles after each draw”?

    Me no know. Me but wonder.

    -=- Charlie

  3. caused each of the citizens’ measures to tale the values they do.
    Your enemies strike again.

  4. Well, I don’t care about your (putative) typos. You are continuing to sharpen your arguments. You are saying more, better, with fewer, more concise words.

    I never thought I’d see it, but you are becoming a word craftsman.

    In the kingdom of the blind, the one-eyed man is not king; more likely, his outraged fellows will banish him.

    But in the kingdom of the one-eyed, the two-eyed man may indeed be king. Seeing partially, some may indeed open both eyes.

    If there are enough to see, they will see what you see — If….

    A masterful treatment today. Let he who has eyes, let him read.

  5. All,

    My enemies have conspired with jet lag! Typos abound! Undoubtedly a government conspiracy.



  6. What does this say about CMIP? They’re not even “measuring” the same thing, are they? Yet, don’t they claim that this “averaging” of computer runs on climate models somehow increases confidence?

  7. Will & Phil, I believe that one such name applicable here is “mumpsimus,” a word with a very curious etymology.
