What Are Empirical P-values?

This is part II of contributor William Raynor’s request to define p-values without using the word probability. This assumes you read and assimilated Part I. First, a clarification from Raynor:

A context if you want it: Product Development. I (We) have a product that works as is, but we’d like to improve it if possible (like Fisher at Rothamsted.) So the “null” is a real, working product, not some flight of academic fantasy. In my case, it was a real profitable working product that had been continuously optimized for decades. We do not want to mess it up. Product Developers are trying to find improvements, in an intensely repetitive cycle. (Tinker, Test, Repeat.) The test subjects are not “random” samples from anywhere, so the designs are usually blocked, balanced, and blinded before the test products go out the door.

This is a terrific example because the causes, most but not all of them, are known in a manufacturing process. The widget is made from certain materials, put together in known ways, packaged according to set rules, so that the main causes are not a mystery.

The small causes that are responsible for the small widget-to-widget variations are not as well known, or are unknown altogether. Perhaps the weather influences the assembly line in a more-or-less known way, but one that can’t be tracked perfectly.

Measures will be taken on the widget. For the purposes of example, suppose it’s weight. (It doesn’t matter what it is.) The known causes make the widget what it is, are responsible for its nature, its expected weight. The small untracked, or rather unmanipulatable, causes are responsible for the variations in weight. If it weren’t for these small causes, every widget would have identical weights, because of the known major causes.

(For the record, I don’t know what Raynor’s product is or what measure(s) he tracks.)

Anyway, there will be a characteristic weight due to the major causes and small departures due to the small causes. This characteristic weight is easily itself tracked or measured.

One fine day somebody says, “Why don’t we try X?” in an effort to improve the widget. Somebody in charge says okay. It costs something to do X, which may or may not bring out some benefit.

X introduces new causes: if it did not, it would be null. Some of the effects of X can be known in advance, deduced via external evidence. Suppose a new paint will be used, which has known properties. These will change the weight in mostly predictable ways. Still, surprises are possible. Or perhaps the new effects aren’t really well delineated, so experiments are performed. Weight is measured with and without X.

Has X caused changes in the characteristic weight?

This should be easy to tell. Check the characteristic weight before and after X. If these differ, X is responsible. Assuming no other causes intervened. This is not a shifty or weird assumption. You use it constantly in judging how the world works. It wasn’t gremlins that caused your car to start this time apart from all the other times, though it might have been, if you allow for the possibility. We don’t allow for that possibility most times, which is sane.

If the characteristic weight under X is the same as under no-X, then X is no better than an unmeasurable minor cause. If the characteristic weights are different, then X is a major cause. We deduce this on the assumption there were no other causes besides the known major and usual small causes, and X. Of course, there may be the possibility that the assumed mechanism of X is not right, and something else is causing the changes under the “X regime” which is not the assumed X but only something associated with it. That’s not important for us, because either the characteristic weight changed or it didn’t because of X or the X regime.

Now how much change is change enough? There is no answer to that, no general one. That depends on the cost and benefit of the weight changes, which are not statistical questions. The same is true for the measured changes caused by X (or the X regime). How much is enough is not a question any statistical model can answer. The answer is: it depends.

What about the empirical p-value? The reasoning is like that in the first part. The “null hypothesis” is that X is not a cause, big or small. If that’s so, then all the measures of the widget are due to the old known causes. So far, so good; no flaws in logic yet.

Second step in empirical p-values: some aspect of the characteristic weight will be singled out, like the mean. We can take the mean of the widgets with known causes, and the mean of the widgets under X. There will be some difference in weight (which may even be 0). Memorize this difference.

Third step: many do something like this. They’ll lump all the widget measures together, non-X and X, and then grab out samples from this mixture of the size of non-X and X, compute the means of both of these, and the difference between those means. This will be done repeatedly, the difference in these means being saved each time. The justification for this sampling is the idea that the “distribution” of actual means inside non-X and X are real things, and the picking mechanism is supposed to make this distribution come alive. Seriously. See the gremlins link above. This sampling makes no reference whatsoever to causes per se.

After a while, the observed difference in actual means (which you memorized) is compared to the distribution of differences you got in the sampling. The fraction of differences greater than your own (in absolute value, usually) is the empirical p-value.
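The three steps above can be sketched in Python. This is only a minimal illustration under stated assumptions: the function name `empirical_p_value`, the widget weights, the number of resamples, and the seed are all hypothetical, and the sketch counts resampled differences at least as large as the observed one (ties included), where the text says "greater than".

```python
import random
from statistics import mean


def empirical_p_value(control, treated, n_resamples=5000, seed=0):
    """Fraction of shuffled mean differences at least as large, in absolute
    value, as the observed difference between treated and control."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Step two: the observed difference in means ("memorize this").
    observed = abs(mean(treated) - mean(control))
    # Step three: lump everything together and repeatedly split it again.
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        fake_control = pooled[:n_control]
        fake_treated = pooled[n_control:]
        if abs(mean(fake_treated) - mean(fake_control)) >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical widget weights (grams): without X and with X.
weights_without_x = [100.1, 99.8, 100.3, 99.9, 100.0, 100.2]
weights_with_x = [100.9, 101.2, 100.8, 101.1, 101.0, 100.7]
print(empirical_p_value(weights_without_x, weights_with_x))
```

Note that nothing in the function refers to paint, materials, or any other cause; it only shuffles labels on numbers already in hand.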

The idea is that if this is wee, then the “null” is false, and X has been proved to be a cause. If the p is not wee, then X has been proved to not be a cause.

Talk about doing it the hard way!

Of course, we don’t have to use the mean. We could have used the, say, interquartile range. This will give a different empirical p-value. We could have also used the standard deviation. A different empirical p-value. And so on. None of these are “the” correct measure, unless one of these is the main or sole measure that plays in the cost and benefit.
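To see how the choice of summary changes the answer, here is a hedged sketch that runs the same shuffling procedure with three different statistics. The data are invented so that the two groups share (nearly) the same mean but differ in spread; the names `empirical_p_value` and `iqr`, the resample count, and the seed are assumptions for illustration only.

```python
import random
from statistics import mean, quantiles, stdev


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def empirical_p_value(control, treated, statistic, n_resamples=2000, seed=0):
    """Shuffle-based empirical p-value for an arbitrary summary statistic."""
    rng = random.Random(seed)
    observed = abs(statistic(treated) - statistic(control))
    pooled = list(control) + list(treated)
    k = len(control)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(statistic(pooled[k:]) - statistic(pooled[:k])) >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical weights: same centre, very different spread.
without_x = [99.9, 100.0, 100.1, 100.0, 99.9, 100.1, 100.0, 100.0]
with_x = [99.0, 101.0, 100.5, 99.5, 98.8, 101.2, 100.0, 100.0]

for name, statistic in [("mean", mean), ("iqr", iqr), ("stdev", stdev)]:
    print(name, empirical_p_value(without_x, with_x, statistic))
```

Same data, same shuffling, different numbers: which one matters, if any, is settled by the cost and benefit, not by the procedure.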

As in the first part, we use a part of the data that did not happen, i.e. the fraction of differences larger than we observed in that odd sample, to say something about the causes that were actually in play. This is bizarre.

We could have bypassed all of this by just comparing the characteristic weights: any change is assumed due to X (or X regime), a good assumption. The size of the change that’s important depends on the uncertainty we have in the characteristic weight, and in the measured difference between X and non-X. It also depends on what the weight means to the cost and benefit. The forbidden word helps us with the uncertainty. It does not help with the cost and benefit, which are unique to your situation.


25 Thoughts

  1. Dr. Briggs, I’m enjoying these posts on p values. There is an important context that is being left out and one that comes from metrology: The measurement system that is supplying all of the product improvement data in this, or any other example, has an uncertainty and a limit to its resolution. Ignoring this limit is just that: ignorance.

    There is an international, very extensive body of knowledge realized in the GUM (Guide to the Expression of Uncertainty in Measurement) which is significantly older and more solid than most attempts to slap p values onto a decision problem.

    The message that you consistently promote that I appreciate the most is that all decisions belong to humans regardless of the analytical tools that we use to support ourselves.

  2. Puryear,

    Exactly so!

    I go into this in Uncertainty in some detail. Also see this paper, and those linked within.

    Also see the Book & Class page (at the menu) for the first couple of lessons in predictive statistics which has an extended discussion of this.

  3. OK, here’s the take-home I get (no doubt I missed something(s)?)

    Null(T) = NOT X => No effect because differences in means between widgets (+X) and (-X) will be close to zero.
    …we find…
    differences are small
    …wee conclude…
    Null(F)

    huh?

    So, what did I miss/misread?

  4. The randomization distribution or a bootstrap can allow one to extrapolate from sample to population. If I want to know something about all companies/people in the US, I take a sample of them and use that data to provide evidence (again, not “prove”) about the population of all companies/people.

    Yes, one can measure things besides means, like IQR, standard deviation and get different p-values. I don’t see the point of your point. No one has claimed that p-values are unique no matter what you measure. If you want to test for something else you can pretty easily. There are tests that do location and spread at the same time. In fact, that is one of the strengths of the error statistics approach, that you can do things piecemeal to find a good path to proceed through solving problems.
    (interesting that you don’t like counterfactuals (“data you didn’t observe”), but focus here on tests you could do but didn’t?)

    “We could have bypassed all of this by just comparing the characteristic weights: any change is assumed due to X (or X regime), a good assumption. The size of the change that’s important depends on the uncertainty we have in the characteristic weight, and in the measured difference between X and non-X. It also depends on what the weight means to the cost and benefit.”

    This does not get you evidence about the population from which your sample was drawn. This also just looks at location and not spread.

    How I talk about p-values without mentioning probability, is: the p-value is just a rescaling of the distance an observed test statistic is from what one would expect under a model.

    Justin
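The "rescaling of a distance" description in the comment above can be made concrete with a small sketch. It assumes a normal model for the test statistic, which is an assumption added here and not something the comment specifies; the function names are hypothetical.

```python
import math


def distance_in_standard_errors(observed, expected, standard_error):
    """How far the observed statistic sits from the model's expectation."""
    return (observed - expected) / standard_error


def two_sided_p_from_distance(z):
    """Rescale a distance z onto the 0-to-1 scale, assuming a standard
    normal model: the two-sided tail area equals erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical numbers: observed mean 100.9, model expects 100.0, SE 0.4.
z = distance_in_standard_errors(100.9, 100.0, 0.4)
print(z, two_sided_p_from_distance(z))
```

The mapping is monotone: a larger distance always gives a smaller p, so the p-value carries no information beyond the distance and the assumed model.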

  5. Smith’s comment is exactly of the sort I meant when I said that the sampling makes the statistics come alive. Randomization does nothing. This is proved in that gremlins link above.

  6. “Can you rephrase your question” – Briggs

    If the Null Hypothesis is that X has no effect on the measured parameter, then I expect to see only minor random differences between the measured parameters (the zero you told us to memorize?) if the effect is minimal to nonexistent. Why then, does what I expect to confirm the NH cause us instead to reject it, just because they are small? It appears to be a contradiction. What am I missing?

  7. Yoanason,

    Forgive me, but I’m still not following you. If there is no or only very small and non-actionable differences in the characteristic measurements between pre-X (a better way to put it) and X, then X is not useful or causative. If the differences are large, then X is useful and causative, given our premises.

    The p-value, empirical or modeled, says nothing about this, which is why it should not be used.

  8. “The p-value, empirical or modeled, says nothing about this, which is why it should not be used.” – Briggs

    OK. I was just observing that it did appear to say something, but it was exactly the wrong thing. Was that your point?

    I hope that’s clearer.

  9. “Smith’s comment is exactly of the sort I meant when I said that the sampling makes the statistics come alive. Randomization does nothing. This is proved in that gremlins link above.”

    Alive, magic, etc. are your words, but if randomization does nothing, why does using it work so well i.e. results from at least 90 years of statistics in survey sampling and quality control and experimental design. Why is your method (just calculate descriptive statistics of the sample?) not the gold standard?

    Justin

  10. For the record, I worked with field tests, so the lab results were not very predictive.

    1. Sampling isn’t necessary and doesn’t make anything come alive. It’s a computational shortcut for the actual array of “characteristic” values when the control (or other shifted) situation holds. For small samples I can evaluate all the “characteristic values,” aka “exact distribution” or “sets of typical values.” Further I can apply designs (e.g. fractional factorials) to get the null distribution.

    2. In a field test, the values vary with the subjects and with the material lots used. There is no common distribution. There are specification limits of course, but that merely reflects what the engineers can control. Can’t do that with humans in the field. The test product changes from test to test, as the product developers fiddle with the design, based on the results of the previous tests and new materials.

    3. The p-value measures how unusual the results are given the other uncontrollable effects. That’s it. Says nothing about unique or root causes. In this context it only assumes exchangeability/permutability within a block. That’s it. As both @justin and I have previously mentioned, repeatable effects under similar and varying situations are key.
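Point 1 of the comment above, that sampling is only a computational shortcut, can be illustrated by enumerating every possible split for a small data set. This is a sketch under simplifying assumptions: a single block, two groups, and the difference in means as the statistic. The function name and data are hypothetical, and the blocked designs described in the comment would need the enumeration restricted to within-block swaps.

```python
from itertools import combinations
from statistics import mean


def exact_p_value(control, treated):
    """Enumerate every way of relabelling the pooled values and return the
    fraction of splits whose absolute mean difference is at least as large
    as the observed one. No random draws are involved."""
    pooled = list(control) + list(treated)
    indices = range(len(pooled))
    observed = abs(mean(treated) - mean(control))
    n_splits = 0
    hits = 0
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        n_splits += 1
        if abs(mean(group_a) - mean(group_b)) >= observed:
            hits += 1
    return hits / n_splits


# Tiny hypothetical example: 20 possible splits of six values into 3 and 3.
print(exact_p_value([1, 2, 3], [4, 5, 6]))
```

For larger samples the number of splits grows too fast to enumerate, which is where the resampling shortcut comes in.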

  11. Bill,

    Perhaps you could outline the simplest possible experiment you have in mind. Then I’ll prove to you the ideas above fit it.

    Justin’s idea about randomness making things come alive is all too true. See the gremlins example.

  12. Justin,

    bootstrap can allow one to extrapolate from sample to population

    Not really. All it does is magnify your sample. Any peculiarities in your sample are magnified as well.

  13. Dav

    Re:


    Justin,

    [ I wrote] bootstrap can allow one to extrapolate from sample to population

    Not really. All it does is magnify your sample. Any peculiarities in your sample are magnified as well.

    That seems like it would be true, but it is unlikely in my experience. Though like any model, you also need some assumptions:

    1) sample:population :: bootstrap sample:sample
    2) the sampling is done well
    3) There could be peculiarities as you say, say you’ve already identified outliers, maybe downweighted them if needed, or whatever. In any case, you’d be taking nCr(2n-1,n) possible bootstrap samples, which is very large – the peculiarities would most likely be negligible in the distribution of all bootstrap samples. For very large n, you’d just take a sample of the nCr(2n-1,n) possible bootstrap samples, typically 2000 or 5000.
    4) need slightly more assumptions if you are doing a parametric bootstrap
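The count nCr(2n-1,n) quoted in point 3 is the number of distinct bootstrap samples when order is ignored (multisets of size n drawn from n observations). A short sketch, with a hypothetical function name, checks the formula against brute-force enumeration for a small n and shows how quickly the count grows:

```python
from itertools import combinations_with_replacement
from math import comb


def distinct_bootstrap_samples(n):
    """Number of distinct (order-ignored) bootstrap samples of size n
    drawn with replacement from n observations: C(2n - 1, n)."""
    return comb(2 * n - 1, n)


# Brute-force check for a small n: list every multiset of size n.
n = 4
enumerated = sum(1 for _ in combinations_with_replacement(range(n), n))
print(enumerated, distinct_bootstrap_samples(n))

# The count grows very fast with n.
for size in (5, 10, 20):
    print(size, distinct_bootstrap_samples(size))
```

These distinct samples are not equally likely under resampling, so the count is a statement about how many exist, not about how often each one turns up.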

    But you know, I’m still left wondering about Briggs’ and others “randomization doesn’t work” and “sampling isn’t necessary” in problems like say estimating things for populations. Are you going to take a literal census for every research question? See any problems with feasibility, costs, and nonsampling errors?

    In testing medicines, how are treatments assigned to units (people, animals)? If not randomly, I’m going to ask, how?

    In “The Algorithmic Foundations of Differential Privacy”, they write about their industry standard method for disclosure avoidance:

    “Randomization is essential; more precisely, any non-trivial privacy guarantee that holds regardless of all present or even future sources of auxiliary information, including other databases, studies, Web sites, on-line communities, gossip, newspapers, government statistics, and so on, requires randomization.”

    Justin

  14. Briggs

    “Justin’s idea about randomness making things come alive is all too true. See the gremlins example.”

    Again, “alive”, etc., are your words, not mine.

    From the Gremlins post:

    “Random draws imbued the structure of the MCMC “process” with a kind of mystical life. If the draws weren’t random—and never mind defining what random really means—the approximation would be off, somehow, like in a pagan ceremony where somebody forgot to light the black randomness candle.”

    Well yeah, the statistic could be off, that is one of the main points of sampling (besides matters like cost and nonsampling errors). You could be biasing (in a bad way) how you are selecting the sample, leading to inaccurate estimates of a population total.

    It is not magical or mystical to take measurements of peoples’ heights, weights, blood pressure, notice they make a rough bell curve, and fit it with a mathematical bell curve, and therefore use bell curves in simulations since we know from measurement and observation and theory this to be the case. For example.

    That PRNGs are not really random has been known for generations. That doesn’t take away their usefulness. See my domain slash randomnumbers dot htm, for my equivalent of your Gremlins post.

    The larger philosophical discussion of is there real randomness anywhere in the universe, or the religious question of did your favorite god(s) cause things, is orthogonal to the obvious I hope practicality of PRNGs for solving problems.

    Justin

  15. the sampling is done well

    Bootstrapping magnifies your sample giving more precision to your statistics all of which are unobservable and effectively made-up numbers. Precision is not accuracy. Besides, does it really matter that the guess of the mean is 3.516 +/- 0.02 instead of merely 3.5 or that the slope of your regression line (something you impress upon the data) is 0.0771 +/- 0.0003 instead of 0.077? Why?

    If the sample was “done well”, then you shouldn’t need to magnify it. Also, when do you know your sampling was “done well”? The short answer is: hardly ever.

    When using it during the so-called Hypothesis Test one runs into the dirty little secret that larger samples tend toward lower p-values. Yet another reason to avoid them.
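The sample-size point can be checked with simple arithmetic. The sketch below holds a small difference fixed and only increases the number of measurements per group. It assumes two equal-size groups, a known common standard deviation, and a normal approximation; those assumptions, the function name, and the numbers are illustrative and not taken from the comment.

```python
import math


def two_sided_p(difference, sigma, n_per_group):
    """Two-sided p-value for a fixed difference in means between two groups
    of equal size, assuming a known common standard deviation sigma and a
    normal approximation."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = abs(difference) / standard_error
    return math.erfc(z / math.sqrt(2.0))


# Same small difference (0.1 with sigma = 1), growing sample sizes.
for n in (10, 100, 1000, 10000):
    print(n, two_sided_p(0.1, 1.0, n))
```

The difference itself never changes; only the count of measurements does, yet the p-value falls toward zero.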

    There could be peculiarities as you say, say you’ve already identified outliers, maybe downweighted them if needed

    So the cure is to alter the sample before extrapolating to the general population? Why not just make it up? How would you know how to weight them without already knowing the distributions in the general population? If you already know it then why are you extrapolating? What about unknown outliers? As an extreme example, if you naively sample only the populations of East and West coast cities you likely would think the entire country is 99+% Democrat. Bootstrapping won’t fix that no matter how many resamples you try.

    I’m still left wondering about Briggs’[s] and others “randomization doesn’t work” and “sampling isn’t necessary” in problems like say estimating things for populations.

    Two different things, actually. Randomization is not sampling.

    When experimenting, you need to take deliberate actions. “Randomization” is supposed to remove bias arising from expectation of results by experiment conductors and subjects but taking random actions won’t help and will do little to remove the effects of other variables (like sex or age). If you don’t want expectations to creep into the results, just don’t tell anybody when the dose is a placebo.

  16. Matt,

    Here’s an example for you:

    Some context first: This occurs in an industrial product development environment (>1000 scientists/engineers, 10 statisticians, >1000 studies/year ranging from small 10-15 trained tester “sensory” studies to several hundred person, multi-week studies.) There are multiple teams working, some in competition and some in succession, on each product form (all consumer use). Given the volume involved and time constraints, there is little time for artisanal craft analyses. There is a high incentive for the developers to get the results dictated by their objectives (e.g. $$ and promotions).

    The panelists used in a given field trial are single-study, untrained, paid volunteers who would normally be using the particular type of product. The locations vary, spanning the United States, Europe and Asia.

    For a particular field trial there might be 3 or more experimental codes, a control code (a production product) and, say, 100 untrained panelists. Each panelist will test multiple products within multiple codes in a balanced RCB or a BIBD. There can be multiple responses of interest, including binary responses, counts, ratings, and censored “time to event” responses. The products are commonly collected after use and subjected to further measurements.

    Each test typically has multiple objectives (improve Attributes A, B, and C and do not degrade D, E, and F). The results of the analyses (including p-values) are used to flag deficiencies and modify the next round of testing. Since the control product is the result of decades of optimization, there are lots of failures. If a test fails, the developers try something else. There are rarely any “hard” loss functions, since the eventual production costs are unknown in the middle stages of development, and when there is one it’s closely guarded. (Note the internal competition for research dollars.) Improvements can be shelved at a late stage because of production/material costs.

    The timing needs to fit within a short time frame from test completion, so the next tests can be queued (e.g. several days).

  17. @DAV,
    The whole random sampling argument is a mismatch with the use of statistics for experimental decision making. Random (and other) sampling is useful when you are trying to go from a sample to some targeted population with a sampling frame and the rest of that hoo-hah.

    Bootstraps, jackknifes and other resampling/subsampling techniques give you a means to easily estimate the sampling distribution of some statistics from the sample distribution. Exact distributions, Tukey’s jackknife and Hartigan’s sets of typical values don’t require any sampling at all, given the data, which aren’t required to be a random sample of any empirical population (e.g the fields of Rothamsted). Conformal analyses go even further, dropping probability totally and regarding everything as a sequence.

    If you’ve ever looked at the outcome of a randomization, you’ll see that they do, in fact, tend to balance out the possible effect modifiers (in comparison to other non-randomized approaches.) Stephen Senn has written eloquently on this, but a useful exercise (which many applied statisticians do in grad school) is to actually do it. There is also a little theorem about central limits….

    Matt does have a thing about randomization, but I’ve ascribed that to his Bayesian bias. That seems to be popular with academic types. They like to assume there is a known likelihood and prior for everything. It is very elegant in theory. Dealing with human data quickly disabuses you of that notion. Particularly when you watch the motivated contortions that people will use to force the data into their objectives.

  18. Bill_R,
    Matt does have a thing about randomization, but I’ve ascribed that to his Bayesian bias. That seems to be popular with academic types. They like to assume there is a known likelihood and prior for everything.

    Then you don’t understand what he means when he says P(X|E). It’s a level of certainty in an outcome and always is based on given knowledge. There is no such thing as an unconditional probability. In other words: NO P(X) sans evidence (E). (E) must always be present to have a probability even if it is only the number of possible outcomes. P(BLUE) is nonsensical because there is no basis for a level of certainty. However, P(BLUE | 3 possible outcomes: RED, BLUE, GREEN) is perforce 1/3. Anything else assumes facts not in evidence. Yes, there are those claiming to be Bayesians who don’t understand this themselves.

    That said, you will at times see P(X) instead of P(X|E) for legibility in the literature. The given part should always be understood to be present even when unstated.

    Clearly, you think probability is a ratio of counts. Your comment on CLT indicates this. Apples and oranges.

  19. Quick note; more next week.

    My brief against randomization has nothing to do with Bayes. I am not a “Bayesian” in the traditional sense. More of a logician.

    Once more I will beg everybody read the gremlins link. At least that.

  20. Matt,
    Your logician tendencies come through loud and clear. I expect my casuist/ultrafinitist tendencies are somewhat apparent, too. We agree that

    RANDOMNESS IS NOT NECESSARY, for the simple reason randomness is merely a state of knowledge.

    and that Probability does not exist. Randomization is and, as far as I know, always has been a substitute used when knowledge does not yet exist, insurance if you will. In practice, do you shuffle and cut the deck when you play cards?

    @DAV, For empirical work with data, yes, I look at proportions. Probability is the future tense of proportion, and is useful when making predictions and gambles. Not so much for dealing with observed. P-values are for abduction/retroduction: Are the data surprising? Probabilities are for induction. The reference set is deduced.

  21. Hi Dav,

    You can get the last one in on this 🙂 I’m not much for a lot of back and forth, especially about noncontroversial (except here?) things like adjusting, imputing, and bootstrapping and sampling.

    “Bootstrapping magnifies your sample giving more precision to your statistics all of which are unobservable and effectively made-up numbers.”

    I call “BS” on the “effectively made-up”, however, as it is a charged phrase. It is just sampled or modeled, a very standard and accepted method used since about 1980 in many different sciences.

    “…or that the slope of your regression line (something you impress upon the data)…”

    Again, you say “impress on” I say “calculated from”. Are you really arguing against calculating slopes? *facepalm*

    “Also, when do you know your sampling was “done well”? The short answer is: hardly ever.”

    I would say almost always, actually. You can follow standard sampling methodology, sample with rates known or estimated to be present in the population, post stratification, low nonresponse, coefficient of variation targets, large sample/population ratio, minimize total survey error, design similar to previous successful studies, several samples and compare, and so on. Sampling theory has been around what, since late 1930s or 40s?

    “When using it during the so-called Hypothesis Test one runs into the dirty little secret that larger samples tend toward lower p-values. Yet another reason to avoid them.”

    All other things constant though, yes, and not a secret but well known. Of course, you may also know that there are methods to adjust alpha for large sample sizes. You control alpha, you control n, you control the experiment or sample. Hypothesis tests are the inference method most used the world over.

    Some news, the 2019 Nobel Prize for Medicine was just announced. They use a lot of p-values in their papers. Despite the heavy anti-p-value schtick here, it seems like their use is a way to do good science.

    “So the cure is to alter the sample before extrapolating to the general population? Why not just make it up? How would you know how to weight them without already knowing the distributions in the general population?”

    Because obviously adjusting for outliers is different from the charged phrase “make it up”. You seem to have an issue with any adjustment of observed data. If a city has, say, incomes in the 40,000 to 100,000 range, but one of the people you sampled wrote an income of 100,000,000, perhaps, just maybe, there was a data entry error. You can compare internally as well as compare to outside datasets and knowledge. I’m just saying identify outliers and correct if needed. This can be by contacting the person/company or using a model to fill it in (impute). This is very standard with any data collection and analysis. Of course, one can also use methods that are robust to outliers being present. The point was that your outliers do not really damage bootstrapping in practice because there are so many possible samples, as I stated and have experienced.

    “As an extreme example, if you naively sample only the populations of East and West coast cities you likely would think the entire country is 99+% Democrat. Bootstrapping won’t fix that no matter how many resamples you try.”

    Well yes, bootstrapping isn’t a cure-all, never said it was. Of course, it has to be done intelligently like anything else.

    Justin

  22. @justin,
    Channeling my inner stats professor: bootstrapping, jackknifing, and the whole resampling/subsampling/exact suite of techniques do just the opposite of making things up. They focus directly on the observed quantities. Efron wrote a monograph on this in the ’80s, and a few books since then.

    Adding unobservable parametric distributions, like the gaussian and friends, is “making things up.” They allow really convenient approximations to observables, though.

    So I agree with your BS call. @DAV appears to be repeating things heard elsewhere. Bayesians tend to confuse empirical data with the really slick theoretical solutions. (e.g. if you want likelihood, google “Empirical likelihood”. if you want a “prior” google “data fusion” and “data augmentation,” etc. )

    “In theory, theory and practice are the same. In practice, they are not.” — attributed to lots of people.

  23. Briggs, just leaving this here since you do not like randomization. 🙂
    https://www.newyorker.com/magazine/2010/05/17/the-poverty-lab

    They (Duflo, Banerjee, and Kremer) also just won the 2019 Nobel in Economics for their work.

    “Within economics, Duflo and her colleagues are sometimes referred to as the randomistas. They have borrowed, from medicine, what Duflo calls a “very robust and very simple tool”: they subject social-policy ideas to randomized control trials, as one would use in testing a drug. This approach filters out statistical noise; it connects cause and effect. The policy question might be: Does microfinance work? Or: Can you incentivize teachers to turn up to class? Or: When trying to prevent very poor people from contracting malaria, is it more effective to give them protective bed nets, or to sell the nets at a low price, on the presumption that people are more likely to use something that they’ve paid for? (A colleague of Duflo’s did this study, in Kenya.) As in medicine, a J-PAL trial, at its simplest, will randomly divide a population into two groups, and administer a “treatment”—a textbook, access to a microfinance loan—to one group but not to the other. Because of the randomness, both groups, if large enough, will have the same complexion: the same mixture of old and young, happy and sad, and every other possible source of experimental confusion. If, at the end of the study, one group turns out to have changed—become wealthier, say—then you can be certain that the change is a result of the treatment. A researcher needs to ask the right question in the right way, and this is not easy, but then the trial takes over and a number drops into view. There are other statistical ways to connect cause and effect, but none so transparent, in Duflo’s view, or so adept at upsetting expectations. Randomization “takes the guesswork, the wizardry, the technical prowess, the intuition, out of finding out whether something makes a difference,” she told me. And so: in the Kenya trial, the best price for bed nets was free.”

    No magic involved, just logic and science. Cheers,

    Justin
