In a recent study, a greater fraction of Whites than Blacks were found to have a trait thought desirable (or undesirable, or a trait thought worth tracking). Something caused this disparity to occur. It cannot be that nothing caused it to occur. “Chance” or “randomness” are not operative agents and thus cannot cause anything to occur. It might be that we cannot know what caused it to occur, or that we guess incorrectly about what caused it to occur. But, I repeat, *something caused this difference*.

If you like, substitute “Pill A” and “Pill B”, or “Study 1” and “Study 2”, etc. for White and Black.

I observed a greater fraction of Whites than Blacks possessing some trait. Given this observation, what is the probability that a greater fraction of Whites than Blacks in my study possessed this trait? It is 1, or 100%. If you do not believe this, you might be a frequentist.

What is the probability that the proportion of trait-possessing Whites is twice—or thrice, or *whatever*—as high as Blacks in my study? It is either 1 or 0, depending on whether the proportion of trait-possessing Whites is twice (or whatever) as high as Blacks. *All I have to do is look.* No models are needed, no bizarre concepts of “statistical significance.” All we need do is count. We are done: any empirical question we have about the difference (or similarities) of Whites and Blacks in our study has probability 1 or 0. It is as simple as that.
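A minimal sketch of "all we need do is count". The counts are illustrative only, `observed_probability` is a hypothetical helper, and "twice as high" is read here as "at least twice as high", which is one possible reading:

```python
from fractions import Fraction

def observed_probability(statement_is_true):
    """Probability of a statement about data already in hand: 1 if true, else 0."""
    return 1 if statement_is_true else 0

# Illustrative counts (an assumption): 30 of 500 in group A and 20 of 500 in
# group B possess the trait.
frac_a = Fraction(30, 500)
frac_b = Fraction(20, 500)

# Is the observed fraction in A greater than in B? We only have to look.
p_greater = observed_probability(frac_a > frac_b)

# Is the observed fraction in A at least twice that in B? Again, just look.
p_twice = observed_probability(frac_a >= 2 * frac_b)

print(p_greater, p_twice)
```

No model enters anywhere: both answers follow from comparing two counted fractions.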

Now suppose that we will see a certain number of Whites we have not seen before; likewise Blacks (they could even be the same Whites and Blacks if we believed the thing or things that caused the trait was non-constant). We have not yet measured this new group of Whites and Blacks so that we do not know whether a greater proportion of Whites than Blacks will be found to possess the trait. Intuition suggests that since we have already observed a group in which a greater proportion of Whites than Blacks possessed the trait, the new group will display the same disparity.

We can quantify this intuition with a model. There are many—*many*—to choose from. The choice of which one to use is ours. All the results derived from it assume that the model we have chosen is true.

One model simply says, “In any group of Whites and Blacks, a greater proportion of Whites than Blacks will be found to possess the trait.” Conditional on this model—that is, assuming this model is true—the probability there will be a greater proportion of trait-possessing Whites than Blacks in our new group is 1, or 100%. This simple model only makes a statement about Whites possessing the trait in higher frequency than Blacks. Thus, we cannot say what is the probability the proportion of trait-possessing Whites is twice (or whatever) as high as Blacks in my study.

Some models do not let you answer all possible questions.

We could create a model that dictates the probability of each ratio (from some set) of the fraction of Whites to the fraction of Blacks (e.g. twice, thrice, 1/2, 1/3, etc.), and then use this model to make probability statements about our new group. Since that would be difficult (and somewhat capricious), we could instead *parameterize* the differences in proportion.

We could use this model to answer the question, “Given this model is true, and given the observations we have made thus far, what is the probability that the parameters take a certain value?” This question is not terribly interesting and it does not answer what we really want to know, which is about the differences between Whites and Blacks in our new group. Why ask about some unobservable parameter? (The right answer is not, “Because everybody else does.”)

But given a *fixed value* of the parameters, we could answer the question, “Given this parameterized model is true, and given a fixed value of its parameters, and given the observations we have made thus far, what is the probability a greater fraction of Whites than Blacks will possess the trait?” This is almost what we want to know, but not quite, because it fixes the values of the unobservable parameters.

Simple mathematics allows us to answer this question for each possible value of the parameters, and then to weight the answers by the probability that the parameters take those values (this comes from the parameter posterior distribution, which is conditional on the model being true and on the observations we have made thus far). The final number is the probability that the fraction of Whites possessing the trait is larger than the fraction of Blacks in our new group. Which is what we wanted to know. (This is called the posterior predictive distribution.)
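The weighting just described can be sketched numerically. This is only one of the many possible models: it assumes each group has its own unknown rate, a flat prior over a grid of candidate rates, and binomial sampling. The function names and the counts are illustrative, not taken from any real study.

```python
from math import comb

def posterior_on_grid(successes, trials, grid):
    """Posterior weight of each candidate rate, assuming a flat prior (an assumption)."""
    weights = [t**successes * (1 - t)**(trials - successes) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

def predictive_pmf(successes, trials, n_new, grid):
    """P(k of n_new new people have the trait), averaged over the posterior."""
    post = posterior_on_grid(successes, trials, grid)
    pmf = []
    for k in range(n_new + 1):
        # The answer for each fixed rate, weighted by that rate's posterior probability.
        pmf.append(sum(w * comb(n_new, k) * t**k * (1 - t)**(n_new - k)
                       for w, t in zip(post, grid)))
    return pmf

def prob_greater_fraction(obs_a, obs_b, n_new_a, n_new_b, grid_size=200):
    """P(fraction with the trait in new group A > fraction in new group B)."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    pmf_a = predictive_pmf(*obs_a, n_new_a, grid)
    pmf_b = predictive_pmf(*obs_b, n_new_b, grid)
    return sum(pa * pb
               for i, pa in enumerate(pmf_a)
               for j, pb in enumerate(pmf_b)
               if i * n_new_b > j * n_new_a)  # i/n_new_a > j/n_new_b, no division

# Illustrative counts: 30 of 500 in group A and 20 of 500 in group B had the trait.
p = prob_greater_fraction((30, 500), (20, 500), n_new_a=100, n_new_b=100)
print(f"P(greater fraction in new A than in new B) = {p:.3f}")
```

The result is a statement about observables (counts in the new group), conditional on the assumed model and the old data. A different prior or a different model would give a different number.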

“Statistical significance” never once enters into this or any real decision. When you hear this term, it is always a dodge. It is an answer to a question nobody asks and nobody wants to know. It always relies, as we do, on the truth of a model (though it remains silent about this, hoping by this silence to convince that no other models are possible). It tells us the probabilities of events that *did not happen*, and asks us to make decisions based on probabilities of these never-happened events. If you want to be mischievous, ask a frequentist why this makes sense. Homework: Locate Jeffreys’s relevant quote.

See the first in this series to discover what to do if we suspect our model is not true.


Why not demonstrate the differences in the frequentist and Bayesian approaches using real data (e.g., GAT)?

The issue you have with statistical significance – is it more an issue with the bastardization that it has undergone through modern science/research/statistics, or is it more of a philosophical/technical issue?

Let’s say you found twice as many whites had the trait as blacks. Like you say, all you have to do is look – there is no modeling necessary. But isn’t there a huge difference between a sample size of 4 and a sample size of 4 million? Compare finding 2 whites with the trait and 1 black with finding 2 million whites with the trait and 1 million blacks. Is there any point in drawing conclusions from 2 whites with the trait, 1 black without and 1 black with? Statistical significance tells us no.

Basically…what’s so bad about this statement: it is highly unlikely that the observed effects can be due to sampling error.

Sometimes a researcher might not have the funds, ability, etc. to find new observations, so can it not be a consolation effort to at least see how unlikely the variation between the two groups was assuming a null hypothesis? I know stat. significance only increases with more observations, but shouldn’t that be expected, even with very, very small effects/correlations?

I’m not a disagreeable frequentist by any means…I’m merely posing this question because I don’t know the answer and I’m assuming you do 🙂

Tom M,

Great question. Your intuition is right, but your conclusion wrong.

Suppose we observed 4 Whites and 4 Blacks and found that more Whites than Blacks had the trait. If we want to say whether new Whites and Blacks will share this disparity, we must first posit a model of the disparity (parameterized, as explained above). We would then still produce the posterior predictive distribution, from which we can ask, e.g., What is the probability that more (new) Whites will have the trait than (new) Blacks? This probability will be conditional on the truth of the model and on the prior observations.

Because we only saw 8 people, the probability of this particular disparity will be very, very close to 0.5, or 50%. And that is just what we would guess knowing nothing except that we have two groups of people. In other words, the Bayesian approach gives the right answer in terms of actual observables.

Second: increase the sample to millions (as above); then the probability that more new Whites than new Blacks have the trait will be very, very high, close to 1 (but never reaching it).
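A rough sketch of the sample-size effect, using the counts from the question (2 of 2 versus 1 of 2, and then the same proportions in the millions). It assumes a flat Beta(1, 1) prior on each group's rate and 20 new people per group; neither choice comes from the discussion above, and how close the small-sample figure lands to 0.5 depends on the counts and prior assumed.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """P(k of n new people have the trait) when the rate has a Beta(a, b) posterior."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

def prob_more_in_first(with_1, total_1, with_2, total_2, n_new=20):
    """P(more of n_new new group-1 members have the trait than of n_new new group-2 members)."""
    # Flat Beta(1, 1) prior on each rate: an assumption made for this sketch.
    a1, b1 = with_1 + 1, total_1 - with_1 + 1
    a2, b2 = with_2 + 1, total_2 - with_2 + 1
    pmf_1 = [beta_binomial_pmf(k, n_new, a1, b1) for k in range(n_new + 1)]
    pmf_2 = [beta_binomial_pmf(k, n_new, a2, b2) for k in range(n_new + 1)]
    # Sum over all outcomes where the group-1 count strictly exceeds the group-2 count.
    return sum(pmf_1[i] * pmf_2[j] for i in range(n_new + 1) for j in range(i))

small = prob_more_in_first(2, 2, 1, 2)                                   # 4 people
large = prob_more_in_first(2_000_000, 2_000_000, 1_000_000, 2_000_000)   # 4 million
print(f"small sample: {small:.3f}, large sample: {large:.6f}")
```

Under these assumptions the large-sample probability sits very near 1 without reaching it, while the small-sample probability is noticeably lower.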

Now, “statistical significance” does not answer any of these questions. It is always the probability of seeing more extreme data than we actually did see. In other words, it is the probability of data that did not occur. It is the chance of something that did not happen. It tells us the probability of an even greater disparity between the fractions of Whites and Blacks than the one we observed.

Actually, it’s worse than that. “Statistical significance” (via p-values) actually gives us the probability of seeing a statistic (some function of the data) larger than the one we actually did see, assuming both the model is true and something called a “null hypothesis” is true (the “null” is also conditional on the truth of the model).
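To make that definition concrete, here is a sketch of one common p-value, a one-sided Fisher exact test. The choice of test, the function name, and the counts (30 of 500 versus 20 of 500) are assumptions for illustration. By convention the sum runs over counts at least as large as the observed one.

```python
from math import comb

def fisher_one_sided_p(a_with, a_total, b_with, b_total):
    """P(count in group A >= observed count), assuming the null of one shared rate.

    Conditions on the total number of trait-holders, so the count in group A
    follows a hypergeometric distribution under the null.
    """
    total_with = a_with + b_with
    n = a_total + b_total
    largest = min(a_total, total_with)
    # Ways to place k trait-holders in group A and fill the rest with non-holders.
    favourable = sum(comb(total_with, k) * comb(n - total_with, a_total - k)
                     for k in range(a_with, largest + 1))
    return favourable / comb(n, a_total)

p_value = fisher_one_sided_p(30, 500, 20, 500)
print(f"one-sided p-value = {p_value:.3f}")
```

Every term in the sum except the first refers to a table of counts that was not observed, and the whole calculation is conditional on the null being true.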

Please let me know if this makes sense, or if I should say more.

You’re not making much sense.

I’ve got two toolboxes. One contains the evil out-dated Neyman/Pearson frequentist methods. The second contains the modern, sensible, correct Bayesian methods.

In the hands of the same statistician,

1) What can you do with the Bayesian toolbox that you can’t with the frequentist? What can you do with the frequentist box that you can’t do with the Bayesian box?

2) What are the things in each box that just simply get things wrong, even when used correctly?

And please, can we drop the pretense that classical/frequentist methods are completely and totally characterized by uses (and misuses) of hypothesis testing?

Mike B,

*All* uses of hypothesis testing are wrong. *Unless* the question you must have answered (using our example) is this and this only: “Given my observations and the truth of my model, and assuming that a certain statement about the parameter or parameters of that model taking a fixed value is true, what is the probability of seeing a test statistic larger than the one I actually got?” I know of nobody who wants to know that, except statisticians (whose minds are turned toward theory).

What people want to know, and what is of real interest, are questions like this:

1) Given the truth of my model and the past observations, what are the chances I’ll see new data that look like such and such?

2) Given the past observations, and given evidence that says that one of this set of models is true, what is the probability that my model (or hypothesis) is true?

Hypothesis testing cannot answer either question. You can, of course, use it (draw it from your toolbox, if you will), but it is a blunt instrument, a hammer, and all problems will look like a nail.

I will convince you yet, just wait and see.

I seem to be missing the point.

Suppose I believe that Population A will have a higher incidence of X than Population B. To “prove” my hypothesis, I round up as many A’s and B’s as I can find and test for X.

I sample 1000 individuals, 500 from each population. 6% of A have X and 4% of B have X.

While this evidence would seem to suggest my proposition is true, I would also say that the result is “not statistically significant.” The null hypothesis, that the incidence of X is the same in both populations, is still quite plausible. When I write up my results for publication, I would have to conclude that my hypothesis is still uncertain and more research isn’t required.

This is the “frequentist” approach that I learned in school. I am not sure what you think is inadequate with this, or how you would interpret the results.

Sorry, to conclude my penultimate paragraph: more research IS required….

Corner your favorite PhD Statistician and ask him or her if 0.06 is significantly different from 0.00. I’d bet dollars to donuts he will answer “yes, by convention.” Of course, with today’s prices I’d be better off losing that bet.

Doug M,

Speaking with strict accuracy, your description is inadequate. Read my comments to Mike B and Tom M. And then ask yourself what *exactly* you mean by your “hypothesis” and “null hypothesis.” Put your answer in quantitative terms.

Thanks for the quick response Briggs.

I second Chris D’s suggestion to demonstrate the differences using real data. I think it would really help to be able to delineate the differences in the two approaches by seeing them step-by-step with some actual numbers.

When the differences are explained via text, it all seems a bit too abstract for me!

Thanks for the response, Matt. I’m not defending hypothesis testing (especially in the way it is often used); rather, I object to your continued straw-man characterization of frequentist methods as nothing more than a bunch of dopes blindly doing hypothesis tests.

Many of the questions you’ve asked CAN be answered using frequentist methods, and have been for nearly a century.

And even in the narrow case of hypothesis testing, I fail to see how Bayesian methods will in any meaningful sense improve on the answer to the famous problem of The Lady Tasting Tea.

http://en.wikipedia.org/wiki/Lady_tasting_tea

Mike B,

Nobody said “dopes.” I do not say it. Almost everybody, with rare exception, is taught to use it. Most practitioners recognize its limitations almost immediately, and invent rule-of-thumb workarounds to circumvent these limitations. How much better, then, to have just started with the right thing? Frequentist statistics should no longer be taught to any but PhD students. Bayesian statistics should replace all classical instruction through the Masters level.

Hypothesis testing, as I explained in my first answer to you, is never what anybody wants to know. You CAN answer a question, like the lady tasting tea, using hypothesis testing, but you will not be answering (a) correctly, or (b) the question we want to know. I will take up the Lady Tasting Tea as a direct example in a separate post.

Hypothesis testing is, I do say, largely responsible for the epidemic of over-certainty we see in the sciences (those that rely primarily on statistics, I mean).

From my earlier example —

Suppose that A represents exposure to some contaminant, B is non-exposure, and X is some illness.

We want to know how much greater P(X | A) is than P(X | B).

But first we must reject the null hypothesis:

P(X | A) = P(X | B)

We cannot definitively prove that the null hypothesis is false, but we can show that it is “highly unlikely” to be true.

Consider Doug M’s example above. I sample 1000 individuals, 500 from each population. 6% of A have X and 4% of B have X, so 5% of all people sampled.

I calculate (.06^30*.94^470)*(.04^20*.96^480)/(.05^50*.95^950)=2.8847, and so the probability of the data given the two separate rates is 2.8847 times the probability given a single rate. Unless I have some other a priori reason, I would not reject the single rate hypothesis, 2.8847 isn’t that much I think.

With 10000 individuals, and the same rates, I get (.06^300*.94^4700)*(.04^200*.96^4800) / (.05^500*.95^9500)=39898.03, and now I would reject the single rate in favor of the split rate.
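The two ratios above can be checked with a few lines of code. This sketch works on the log scale to avoid underflow; the function names are illustrative, and it assumes every count is strictly between 0 and its group size so that the logarithms are defined.

```python
from math import exp, log

def log_binomial_kernel(k, n, p):
    """log of p^k * (1 - p)^(n - k); assumes 0 < p < 1."""
    return k * log(p) + (n - k) * log(1 - p)

def likelihood_ratio(k_a, n_a, k_b, n_b):
    """Likelihood of the data under two separate rates over that under one pooled rate."""
    p_a, p_b = k_a / n_a, k_b / n_b
    p_pooled = (k_a + k_b) / (n_a + n_b)
    split = log_binomial_kernel(k_a, n_a, p_a) + log_binomial_kernel(k_b, n_b, p_b)
    single = log_binomial_kernel(k_a + k_b, n_a + n_b, p_pooled)
    return exp(split - single)

lr_1000 = likelihood_ratio(30, 500, 20, 500)       # 1000 individuals
lr_10000 = likelihood_ratio(300, 5000, 200, 5000)  # 10000 individuals, same rates
print(f"{lr_1000:.4f}  {lr_10000:.2f}")
```

Up to rounding, this should reproduce the 2.8847 and 39898.03 figures quoted above. Multiplying every count by ten raises the ratio to the tenth power, which is why the same rates look so much more decisive with the larger sample.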

I don’t see why I need a prior probability on *all* possible parameters to do this sort of analysis. I only care about the two sets of parameters I am considering. So I don’t buy into that sort of Bayesian processing. Nor do I need Fisher-style silliness about “more extreme” things that didn’t occur. In practice though, I’m not sure that what I would do is functionally much different from standard hypothesis testing. Eventually you get some statistic and base your decision on that, according to some arbitrary cutoff (e.g. .05). My cutoff above might be 1/.05=20. Is there really any difference? Both thresholds are made up.

So 1) What would be the parameterized model?

I don’t have an answer. To form an appropriate one, I would need to know the data structure to make reasonable theoretical assumptions. Identification of a model with the data is the starting point for both classical and Bayesian methods, and is one of the most interesting and essential components in the science of statistics. And I don’t think that “there are many—*many*—to choose from” here.

I’ll ask one question at a time.

All,

The Lady Tasting Tea, with practicum, will appear beginning Sunday, 13 March.

SteveBrooklineMA,

“I don’t see why I need a prior probability on *all* possible parameters to do this sort of analysis.” If you mean a continuum of parameters, then neither do I (but we’re in the minority). If you mean a prior on all possible outcomes, then I disagree. See this paper for an explanation.

I agree that if I have a population of N, then the only possible fractions p of people having a particular trait are p=0, 1/N, 2/N, …, N/N. If I’m only concerned about P(p=M/N|data) vs P(p=1/2|data) for some particular M, then it seems to me I only need a prior for P(p=1/2) and P(p=M/N). I agree that if I want to do more, say look into P(p>1/2|data) then I would need more.