# The strange insignificance of statistical significance

Who is more likely to support the death penalty: college undergraduates from a “nationally-ranked Midwestern university with an enrollment of slightly more than 20,000” majoring in social work, or those majoring in something else?

This question was put by Sudershan Pasupuleti, Eric Lambert, and Terry Cluse-Tolar at the University of Toledo to 406 students, 234 of whom were social work undergraduates. The answer was published in the Journal of Social Work Values and Ethics.

“58% of the non-social work majors favored to some degree capital punishment” and only “36% of social work students” did. They report that these percentages (58% vs. 36%) represent a statistically “significant difference in death penalty support between social work and non-social work majors.” The p-value (see below) was 0.001.

What does statistically significant mean? Before I tell you, let me ask you a non-trick question. What is the probability that, for this study, a greater percentage of non-social work majors favored the death penalty? The probability is 1: it is certain that a greater percentage of non-social work majors favored the death penalty, because 58% is greater than 36%. The answer would be the same if the observed percentages were 37% and 36%, right? The size of the difference does not matter: different is different. Significance is not a measure of the size of the difference. Further, the data we observed tells us nothing directly about other groups of students (who were not polled and whose opinions remain unknown). Neither does significance say anything about new data: significance is not a prediction.

Since significance is not a direct statement about data we observed nor is it a statement about new data, it must measure something that cannot be observed about our current data. This occultism of significance begins with a mathematical model of the students’ opinions; a formalism that we say explains the opinions we observed: not how the students formed their opinions, only what they would be. Attached to the model are unobservable objects called parameters, and attached to them are notions of infinity which are so peculiar that we’ll ignore them (for now).

A thing that cannot be observed is metaphysical. Be careful! I use these words in their strict, logical sense. By saying that some thing “cannot be observed”, I mean just that. It is impossible—not just unlikely—to measure its value or verify its existence. We need, however, to specify values for the unobservable parameters or the models won’t work, but we can never justify the values we specify because the parameters cannot be seen. This predicament is sidestepped—not solved—by saying, in a sense, we don’t care what values the parameters take, but they are equal in all parts of our model. For this data, there are two parts: one for the non-social work majors and one for the social work majors.

Significance takes as its starting point the model and the claim of equality of its parameters. It then calculates a probability statement about data we could have seen but did not, assuming that the model is true and its parametric equalities are certain: this probability is the p-value (see above) which has to be less than 0.05 to be declared “significant.”

Remember! This probability says nothing directly about the actual, observed data (nor does it need to, because we have complete knowledge of that data), nor does it say anything about data we have not yet seen. It is a statement about and conditional on metaphysical assumptions—such as that the model we picked is true and its parameters equal—assumptions which, because they are metaphysical, can never be checked.

Pasupuleti et al. intimated they expected their results, which implies they were thinking causally about their findings, about what causes a person to be for or against capital punishment. But significance cannot answer the causal question: is it because the student was a social work major that she is against capital punishment? Significance cannot say why there was a difference in the data, even though that is the central question, and is why the study was conducted. We do not need significance to say if there was a difference in the data, because one was directly observed. And significance cannot say if we’d see a difference (or calculate the probability of a difference) in a new group of students.

There was nothing special about Pasupuleti’s study (except that it was easy to understand): any research that invokes statistical significance suffers from the same limitations, the biggest of which is that significance cannot do what people want it to, which is to give assurance that the differences observed will persist when measured in new data.

Statistical significance, then, is insignificant, or insufficient, for use in answering any practical question or in making any real decision. Why, then, is significance used? What can be done instead? Stick around.

Update: Dan Hughes sends this story, a criticism of a study that purports the “statistical significance” of increased NIH funding and decreased death rates.

1. Doug M says:

I think I am missing the point.

Without actually reading the study, what does Pasupuleti et al. say? It says to me:

We have a hunch that social work majors tend to be anti-death penalty. Perhaps anti-death penalty people are predisposed to choose social work, or perhaps the curriculum pushes students into the anti-death penalty camp — at this point we don’t know. Before we investigate the causes or the implications, we conducted a survey to verify if our hunch is in fact accurate. In our survey, non-social work majors were 50 percent more likely to say they were pro death penalty than social work majors. We think this is sufficiently compelling to suggest we continue these investigations.

Is there anything wrong with that?

2. Briggs says:

Doug M,

The fault is mine. I’m trying to cram an incredibly difficult topic into 800 words (most people will not read beyond that; an empirical fact; also the length of newspaper columns).

There is nothing wrong with their study: it shows exactly what most of us would have expected. The fault lies in misunderstanding what “statistical significance” tells us. It does not tell us why the difference is there, nor if that difference will hold up for new data.

I’m writing a piece on models and theories that will attempt to tie this post in with the smoothing posts.

3. DAV says:

Doesn’t the significance test in this particular case simply give weight to the idea that the results would still show non-social-in-favor greater than social-in-favor if the study were repeated? IOW: it answers the question about the reliability of the answer obtained through the survey. So I tend to disagree with the statement: “Neither does significance say anything about new data: significance is not a prediction.” It’s just that the prediction on new data is about repeatability of the A>B result. It still doesn’t actually PROVE that prediction but instead gives it more weight.

None of this, of course, says anything about the why of the result.

4. Briggs says:

No, DAV, it does not. Significance says nothing about new, observable data. It does make a claim about data different than ours in the following sense: if the researchers were to repeat their study an infinite number of times—each time being exactly the same except for “random” differences—then if the model is true and its parameters are equal then they would see a statistic as large or larger than the one they did see with a frequency equal to the p-value.

The statistic, incidentally, is not the difference in percentages (it is something called a chi-squared, in this case).
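To make that definition concrete, here is a minimal sketch of a chi-squared calculation on a 2x2 table. The counts are an assumption: the post gives only percentages and group sizes, so 84 of 234 and 100 of 172 are back-calculated from 36% and 58%. The paper's own test may have been set up differently, and plugging in the pooled fraction as the shared probability is also an assumption. The simulation mimics "repeating the study" under a model in which both groups share one probability of support.

```python
import math
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an entire row or column is empty
    return n * (a * d - b * c) ** 2 / denom

def chi2_tail_1df(x):
    """P(chi-squared with 1 degree of freedom >= x), via the normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

# ASSUMED counts, back-calculated from the reported percentages.
N_SW, N_OTHER = 234, 406 - 234        # 234 social work majors, 172 others
FAV_SW = round(0.36 * N_SW)           # 84 in favor
FAV_OTHER = round(0.58 * N_OTHER)     # 100 in favor

observed = chi2_2x2(FAV_SW, N_SW - FAV_SW, FAV_OTHER, N_OTHER - FAV_OTHER)
p_analytic = chi2_tail_1df(observed)

def simulated_p(reps=2000, seed=1):
    """Fraction of simulated studies with a statistic >= the observed one,
    when both groups share a single probability of support."""
    rng = random.Random(seed)
    pooled = (FAV_SW + FAV_OTHER) / (N_SW + N_OTHER)  # plug-in value, an assumption
    hits = 0
    for _ in range(reps):
        sw = sum(rng.random() < pooled for _ in range(N_SW))
        other = sum(rng.random() < pooled for _ in range(N_OTHER))
        if chi2_2x2(sw, N_SW - sw, other, N_OTHER - other) >= observed:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    print(f"chi-squared = {observed:.2f}, analytic p = {p_analytic:.2g}")
    print(f"simulated p = {simulated_p():.4f}")
```

With these assumed counts the p-value falls well below 0.001. Note what the simulation does and does not touch: every simulated table is data that was never collected, generated from a model in which the two groups are identical. Nothing in it refers to a future batch of real students.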

Significance only says that and nothing more. Of course, using our background knowledge etc. we can say more about this (or other) actual studies. But significance must remain silent on these points.

5. DAV says:

Briggs,

I should have read the whole thing. I obviously disagree with “the biggest of which is that significance cannot do what people want it to, which is to give assurance that the differences observed will persist when measured in new data.”

Example, I conduct a study (of say 10 people) and ask if they prefer A over B. 6 say yes. So 60% say A>B because 6>4. Without running any numbers I suspect this result is far less reliable than if I had asked 100 persons and got the same result. I’m not exactly sure what you call “assurance” but I would have more faith in the 100-person survey than the 10-person one.
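The 10-person versus 100-person comparison can be worked out exactly. The sketch below assumes a "no preference" model in which each respondent picks A with probability 0.5, and computes the chance of a result at least as lopsided as 60% in each survey. The model and the one-sided tail are assumptions made for illustration.

```python
from math import comb

def upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

small = upper_tail(6, 10)     # 6 of 10 prefer A
large = upper_tail(60, 100)   # 60 of 100 prefer A

print(f"P(at least 6 of 10   | no preference) = {small:.3f}")
print(f"P(at least 60 of 100 | no preference) = {large:.3f}")
```

The tail probability shrinks as the sample grows, which matches the intuition that the larger survey deserves more trust. Both numbers, though, are still statements about hypothetical surveys under the no-preference model, not probabilities that A beats B among people not yet asked.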

Are you really saying that is only illusion? How do you assess the reliability of survey results?

Maybe my problem is in that I don’t see chi-squared as “The statistic is not the difference in percentages (it is something called a chi-squared, in this case)” but instead have assumed the chi-square gives me an idea how closely my observed distribution compares to equally distributed, which effectively decides between A=B and A!=B. I supply the built in assumption that, if I’m far from A=B, the one that I observed (i.e., A>B) is the more likely result.

Am I all wet?

6. Briggs says:

Nope, DAV, you’re not. Your thinking about the real data and the actual situation is intuitive and inductive. I do not fault it; further, I agree with it.

Unfortunately, however, if you want to think like a statistician you have to give up on your intuition (with regard to significance). Significance really is exactly what I say it is: you must not use it to infer what future observable data will be. It really is a statement about statistics conditional on beliefs about metaphysical parameters.

The only point at which a classical statistician might disagree with me is that some of them (by no means all) would claim that the parameters are the same for future observable data as they are for the past, already observed data. But—and here’s where it gets tricky—it does not follow that the fractions of supporters in the two groups will be the same as the old in any new collection of data, nor does this belief about parameters allow us to even infer the probability that the actual fractions will take any particular value.

I’m hoping JH will ring in on this.

Incidentally, my fix for the situation is not to implement a Bayesian parametric scheme. Statistics can be done without parameters (right, now-deceased Mr Geisser?).

7. DAV says:

Briggs,

“But—and here’s where it gets tricky—it does not follow that the fractions of supporters in the two groups will be the same as the old in any new collection of data, nor does this belief about parameters allow us to even infer the probability that the actual fractions will take any particular value.”

Well, yeah, but does anyone really care that 60% X favor A against the 40% Y who also do? Or is “More X favor A than those in group Y do” the real point?

Outside of that why again doesn’t a chi-squared p-value give a hint about result reliability?
If I’m asking “did I really see A!=B instead of A==B?” and get a p-value of 0.001, I see that as saying “WOW! If A=B is really true, that is a truly unexpected result! Put your bets on A!=B” I see the definition of p-value you provided as no more than a particularly obtuse way of saying that.

You’re right. It doesn’t actually predict future results, per se. It is instead an indication that the observation really IS a difference instead of an apparent one. One definition of “reliable” is “true”. The chi-squared test answers that, doesn’t it? And if it IS reliable in the sense of “true” (and the situation is unlike hitting a target with darts) then it gives weight to the idea that it is a repeatable result.

Reliability IS a parameter even if not explicitly stated. What could be more metaphysical than the concept of reliability?

8. Briggs says:

DAV,

The questions “Does 60% X favor A against the 40% Y?” and “Do more X favor A than those in group Y?” have the same logical status. They are both questions about observables. Significance cannot answer either question.

The p-value does not, and cannot, and was not designed, and is epistemically unsuited, to give a hint about a result’s reliability (in the ordinary, English usage of that word). The reasons for this are many, historical, and fascinating (search for p-values on this site). It is difficult to keep this in mind when significance is in line with what you expected; it is usually interpreted, as you are interpreting it now, as evidence that significance is doing the right thing, or at least approximating it. But this is false—and is so by design (also look up Fisher and Popper to see why).

Significance is not “an indication that the observation really IS a difference instead of an apparent one.” It is a statement conditional on a model and beliefs about that model’s parameters.

What you’re trying to answer is a question like this, “What is the probability that group A supports X more than group B, given the evidence I have?” Further, you are obviously asking that question about people who were not already polled because we already know more of A did so. Your question only makes sense when asked about future observable people. Significance cannot answer this question, and is lousy even at hinting at its answer.
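A question about future observable people can be posed directly as a predictive probability. The sketch below is one possible way to do that, not the approach the post advocates: it uses a parametric Beta-Binomial scheme with uniform priors, which is an assumption, as are the back-calculated counts (84 of 234 and 100 of 172) and the size of the hypothetical new sample. It estimates the chance that, among 50 new students from each group, more non-social work majors than social work majors favor the death penalty.

```python
import random

def prob_gap_persists(fav_a, n_a, fav_b, n_b, new_n=50, draws=20000, seed=0):
    """Monte Carlo estimate of P(new count in group A > new count in group B)
    for new_n fresh respondents per group, under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw plausible support rates given the observed counts.
        p_a = rng.betavariate(1 + fav_a, 1 + n_a - fav_a)
        p_b = rng.betavariate(1 + fav_b, 1 + n_b - fav_b)
        # Simulate the answers of the new, not yet polled students.
        new_a = sum(rng.random() < p_a for _ in range(new_n))
        new_b = sum(rng.random() < p_b for _ in range(new_n))
        if new_a > new_b:
            wins += 1
    return wins / draws

if __name__ == "__main__":
    # Group A: non-social work majors; group B: social work majors (assumed counts).
    print(prob_gap_persists(100, 172, 84, 234))
```

The output is a probability about an observable event, which is a different kind of number from a p-value. It is only as good as the assumption that the new students resemble the old ones.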

9. Ian McLeod says:

Briggs,

I always understood statistical significance as saying something about future expectations, but your explanation has convinced me of the impossibility of that premise.

As an example, let us say social conditions change where the researcher was unaware of these social changes but assumed that repeating the survey would generate a similar outcome. Assume for a minute that a string of horrible child killers was in the news for six months and the consequence of the media’s ubiquitous coverage resulted in shifting the opinions of social workers on the question of capital punishment.

If the researcher then repeated her original survey, she would calculate a different p-value compared to her first analysis. Fast-forward two years, with no child killers in the news, a new sample produces a different p-value, everything else being equal. Thus, the significance of the first experiment cannot say anything about future expectations because the researcher simply does not have access to future events, which, as you say, is necessarily true for all aspects of statistical significance.

Take the above example and compare it to AGW. I think climate scientists have used statistical significance as their bully pulpit when convincing policy advisers at the IPCC (and other similar agencies), who in turn bully an unwitting public that climate models have access to future outcomes and must be believed for fear of being labelled a denialist.

Ian

10. OMS says:

I guess I am confused. The study is taking a sample of the population of social work students and presumably trying to estimate the mean value of the random variable for the entire population of social work students (and other). So, assuming the sample is truly random, you should be able to establish confidence intervals for the true mean around the sample mean.

Wouldn’t you then be able to make some claim about significance based on the confidence intervals you just found?
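For reference, the intervals in question might be computed as below. This is a minimal sketch using the simple Wald (normal approximation) interval, with counts back-calculated from the reported percentages; both choices are assumptions rather than anything taken from the paper.

```python
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

# ASSUMED counts: 36% of 234 social work majors, 58% of 172 others.
sw_low, sw_high = wald_interval(84, 234)
other_low, other_high = wald_interval(100, 172)

print(f"social work majors: {sw_low:.3f} to {sw_high:.3f}")
print(f"other majors:       {other_low:.3f} to {other_high:.3f}")
```

The two intervals do not overlap for these counts. Like the p-value, each interval is a statement about a model parameter under repeated hypothetical sampling; it does not by itself give the probability of any particular result in a new survey.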

11. OMS– I’m not sure where Briggs is going with this, but his comments contain this:

if the researchers were to repeat their study an infinite number of times—each time being exactly the same except for “random” differences—then if the model is true and its parameters are equal then they would see a statistic as large or larger than the one they did see with a frequency equal to the p-value.

Focusing on that, it’s possible to note that
a) the significance value tells us nothing about whether or not the statistical model I assumed applied to the data is true. Maybe I assumed a time series is linear and the residuals are Gaussian white noise. What if the noise isn’t Gaussian? Maybe the data are autocorrelated? Maybe there is a huge cyclical component with a period exceeding 100 years? (I could try to check, but maybe I won’t have enough data to reduce the type II error of a test for normally distributed data to a sufficiently low level. I then keep my assumption and decree that, based on that assumption, I find some answer with p = some splendidly low value.)

b) Now, let’s assume the statistical model was right.

It may not be possible to do any test that includes only “random” differences. So, for example, suppose we test a climate model against observations, examining a very specific metric “X” right now. Because of the exigencies of applying statistical tests, we must test a specific hypothesis, like: Is the observed trend consistent with the model trends during the period of observation?

We run a test and say “yes” or “no”. But suppose we get an answer of “no”, the models are off. Then what? Do we know why? No. Maybe it’s aerosols. Maybe it was the initial condition. Maybe some modeling group had a boo-boo in their forcing file. Maybe the models are biased generally.

Given all the possibilities, it becomes very difficult to say for certain why there is a mismatch in the current test. So, while we can state our findings to a certain p level, we don’t know whether the level of aerosols, rate of boo-boos in forcing files etc. will persist. So, we can’t know what will happen if a fresh batch of modelers were to start from square one, rethink, write a fresh batch of models and run those. We can’t even know for sure that the model trends will persist in being wrong in the future. (If it’s a boo-boo in the forcing file and only affects the current years, maybe the models will get back on track. Who knows?)

Either way, the ‘p’ value tells us nothing about the relative likelihood of any possible causes of the deviation. It only tells us if the specific result we obtained might have happened by random chance, and even this answer is contingent on the assumptions.

c) Finally, Briggs doesn’t mention it in the bit I quoted above, but there is the difficulty that no test perfectly answers what everyone who reads a paper or result really wants to know. Those are answers to questions like: “Are climate models trustworthy?” or “Overall, can modelers, working collectively, do a respectable job predicting the magnitude of warming during the next 30 years?” There is no statistical test that will give us the answer to those questions. This is because the questions are too vague (i.e., what’s ‘respectable’?) and because they also conceal a number of hidden assumptions. (Example: does someone mean, “Are the models trustworthy when driven by known forcings?” or do they mean “Given that we also need to predict forcings, could even the most perfect AOGCM predict the future?”)

So, this leads us to the question of why we even use significance levels. The reason is that, if we do think our assumptions about the statistical model are plausible, we wish to distinguish between different outcomes that could easily happen by random chance and those that would be very rare if our assumptions were true. If an outcome is sufficiently rare, we need to go back and consider whether one of the assumptions is false. Otherwise, we don’t waste our time checking if the assumptions we previously believed were wrong. We continue to believe them.

12. john says:

“What you’re trying to answer is a question like this, “What is the probability that group A supports X more than group B, given the evidence I have?” Further, you are obviously asking that question about people who were not already polled because we already know more of A did so. Your question only makes sense when asked about future observable people. Significance cannot answer this question, and is lousy even at hinting at its answer.”

More specifically I am asking the question, ‘If I expand groups A and B to include people not yet surveyed, meeting the same parameters defining them as group A or B, are my initial data strong (significant) enough to support the claim – The expected outcome of surveying more people of these 2 groups would yield similar results’.

Unless you are deliberately obtuse in defining groups A and B, there is a level of statistical significance at which point you can make this claim.

13. OMS says:

Lucia,
I appreciate your points about implicit assumptions in climate models and the difficulty of establishing any useful concept of “significance.” (90% likely to cause at least 51% of the observed warming due to multimodal averaging?)

In this case I was focusing on the example of the student survey. The way I interpreted the following quote (my link to the Journal isn’t working):
“…these percentages (58% vs. 36%) represent a statistically “significant difference in death penalty support between social work and non-social work majors…”
was that the authors used sample estimators for the true mean among social work (vs. other) students.

Allowing for the assumption that the students actually have opinions, and the opinion has two outcomes: 1 = yes, 0 = no to the death penalty; and call “no preference” our null hypothesis, then I would expect a random choice of student to be equivalent to a coin toss. After sampling some (but not all) students and finding that the observed mean is noticeably different from 0.50, there ought to be a point where we can reject the null hypothesis. You can extend this same line of reasoning to two groups which might or might not have the same true mean.

Now if the 234 social work undergraduates represent all social work majors, then we know one of the percentages for sure; but the 36% surely came from a subsample of the 20,000+ enrollees in the university and hence there might be a meaningful “significance” (or not).

14. OMS–
I do think we can compute significance levels for the case of the student surveys and they do mean something.

So, I’m waiting to read Briggs’s notions on why the significance level itself doesn’t permit us to predict future outcomes. As in a climate change example, I would think the reasons have to do with the assumptions. In this case, those are the assumptions related to:

1) Having obtained a random sample of social science students and other students and obtaining answers in an unbiased way. (Maybe students from one particular faculty member’s social science class were oversampled, while non-social science students were found at the local shooting range, which was chock full of gun-totin’ conservative death penalty devotees. Were the social science students surveyed in a large public group? Etc.)

2) Demographic change that might occur between the time of a first survey and a later one. Maybe, over the next decade, there will be some big “breakthrough” in social sciences that suddenly makes all those teaching social science explain that putting people to death is kinder or better than letting them live. Who knows? (This goes to his point about predicting the outcome of some later survey. An honest to goodness later survey has to be done on some group of identifiable individuals. If your goal is to predict the outcome of that survey, it would be wise to know the outcome of the previous survey, but also to know how or why people change their views. The ‘p’ value tells us nothing about the second thing.)

It does seem to me that computation of the “p” value associated with the survey result cannot tell us anything about the likelihood for things like 1 to occur. You certainly need to look at something other than p to gauge the likelihood of 2. I’d think you would need to understand why social studies students might hold different opinions than other people to even begin to guess the effect of demographic changes.

So, in all, the ‘p’ value is useful insofar as it at least tells us the results can be distinguished from other results we would have obtained had random factors affected the test. But a low ‘p’ value by itself doesn’t help us predict the outcome of a future survey. We need to scrutinize other aspects of the survey and also understand something about people.

15. Oh… another tweak of what he might mean occurred to me. I once had a friend whose understanding of probability tests was this: If an experimental result showed we could not reject the null hypothesis at some suitable significance level, that meant the null hypothesis was more likely true than false. So, for example, if we did a test and found an increase in a Nusselt number as a function of Reynolds number, but the effect was not statistically significant at p=0.05, she thought that meant the test showed that it was more likely that Nu was unaffected by Re than the converse. She then thought we might predict that we would get the same result in future tests. All this was based on her understanding of what statistical significance told us.

However, the reality is that the significance level by itself isn’t useful information. In this case, it might be we took too little data to achieve statistical significance. If you want to answer the question: Is it more likely Nusselt number increases with Reynolds number based on the data collected, you would have to go all Bayesian on the problem. I don’t think the significance level ends up being used in that.
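The "too little data" point can be illustrated with a power calculation. The sketch below uses a stand-in coin-flip problem rather than the Nusselt-Reynolds experiment: it assumes a real effect exists (a true success probability of 0.6 against a null of 0.5, both hypothetical values) and computes how often a one-sided exact binomial test at the 0.05 level would detect it.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def power_one_sided(n, true_p, null_p=0.5, alpha=0.05):
    """Chance the test rejects null_p when the true probability is true_p."""
    # Smallest count whose tail probability under the null is at most alpha.
    # k = n + 1 always qualifies (empty tail), so the search cannot fail.
    critical = next(k for k in range(n + 2) if upper_tail(k, n, null_p) <= alpha)
    return upper_tail(critical, n, true_p)

for n in (20, 50, 200):
    print(f"n = {n:>3}: chance of reaching significance = {power_one_sided(n, 0.6):.2f}")
```

With only 20 observations the test usually fails to reach significance even though the effect is real by construction, so a non-significant result in a small experiment is weak evidence that the null hypothesis is true.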

16. Briggs says:

Lucia, OMS,

It might be best, when thinking about these things, to forget what you learned in classical statistics classes. Most introductory, and even many higher-level, textbooks do not get it right when explaining the philosophical context.

P-values mean what I said they did: they are statements about data we did not get, conditional on believing a model and the equality of its parameters. Strange, yes. That you might have found p-values useful in the past in certain problems does not change its interpretation. It still tells you nothing—not one thing; I have to insist on this—about data you have not yet observed.

Significance etc. tells you something about the parameters of mathematical model you posit to be true. It tells you nothing about any observable, real-life thing. This is just the way it is.

A great book on this is Howson and Urbach: Scientific Reasoning (found here). I urge you to read it if you want to learn more about these subjects. 800 words is not going to be enough, though I will be talking about this more in the future.

Oh, yes, re: “Random sample.” Since random means unknown, the classical concept of an unknown sample must be talking about something other than what we mean in plain English. Classical statisticians only wanted “random” samples because they thought this, in a sense, blessed them with randomness—the randomness used in picking the samples was supposed to be carried over into the sample—and randomness is a necessity for their models to work. Without it, you cannot calculate one. Oh, sure, you could. But it wouldn’t have that randomness tang that was required.

Please stick around to read what I’ll say on models and truth.

17. Significance etc. tells you something about the parameters of mathematical model you posit to be true.

Ok.. but you are being mysterious.

Either the mathematical model is true… or it’s not.

Let’s say I posit the physical model F = k(ma) + b and suggest we can find “k” and “b” from a suitable experiment. I do the experiment and find the best estimates are k = 1.05 and b = -0.03 newtons. I can compute confidence intervals for the parameters “k” and “b” based on my statistical model for the measurement uncertainty in my experiment. I can say that if my physical model is correct, the “true” value of k must fall in some range (based on the sample estimate of k and its confidence intervals). I also test k and b for statistical significance. I could also test a number of other things. (Is there a quadratic term? Etc.)

So…I know that “p” only tells me something about the probability distribution for the “true” values of parameters k and b, contingent on my physical model being true. I know that if I proposed a quadratic term, all “p” would tell me is something about that quadratic parameter. Etc.

I know that “p” can’t tell me any more than the probability distribution on the parameters “k” and “b”, contingent on my physical model being true. But is this what you are getting at so mysteriously? Or something else?

I guess I just have to wait until you reveal what misconception you are trying to correct. (I have no doubt I have many misconceptions, including what “p” might mean.)

18. “P-values mean what I said they did: they are statements about data we did not get, conditional on believing a model and the equality of its parameters. Strange, yes. That you might have found p-values useful in the past in certain problems does not change its interpretation. It still tells you nothing—not one thing; I have to insist on this—about data you have not yet observed.”

William, should this be read literally, or have you jumbled your terminology? I refer to the two statements:

“they [p-values] are statements about data we did not get”

“It still tells you nothing—not one thing; I have to insist on this—about data you have not yet observed”

Can P-values simultaneously be “statements about data we did not get”, conditional or otherwise, AND “tell you nothing about data you have not observed”?

Is there a technical distinction you are trying to draw between “not yet observed” and “did not get”?

19. Nick says:

Concerning the usefulness (or, better, nonusefulness ;-)) of p values, it has always been my understanding that the p value gives you a probability that is just unimportant for the question. It gives you the probability of obtaining data at least as extreme as the actually observed ones, assuming that there is no effect (or, generally speaking, that the null hypothesis is true). But this probability is certainly *not* what we are interested in; we are interested in the probability that there is an effect, provided the given data.

If I’m not mistaken, in classical statistics the last statement is rather meaningless because either the effect is there or not, so the only probabilities that can possibly occur are 1 and 0. In Bayesian statistics, you would still have a problem calculating the probability you’re looking for, because this probability depends on the “prevalence”. Just like in medical testing, where the probability of having contracted, say, multiple climate alarmism sickness, under the condition that your medical test is positive, depends on the number of people in the population suffering from that disease.
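The dependence on prevalence follows from Bayes' theorem, and a few lines are enough to see it. The sensitivity and false positive rate below are hypothetical values chosen only for illustration.

```python
def prob_condition_given_positive(prevalence, sensitivity, false_positive_rate):
    """Bayes' theorem: P(condition | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical test characteristics, not taken from any real test.
SENSITIVITY = 0.99
FALSE_POSITIVE_RATE = 0.05

for prevalence in (0.001, 0.01, 0.1, 0.5):
    posterior = prob_condition_given_positive(
        prevalence, SENSITIVITY, FALSE_POSITIVE_RATE
    )
    print(f"prevalence {prevalence:>5}: P(condition | positive) = {posterior:.3f}")
```

The same test yields very different probabilities of having the condition as the prevalence changes, which is why the probability of an effect given the data cannot be read off from the probability of the data given no effect.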

So, the p value is in fact a statement about data we did not get (all data that are at least as extreme as the observed ones), conditional on a model (some probability distribution describing the outcome of the experiment if there is no effect present), and it does not tell us anything.

Is my take on that matter correct?

20. Briggs says:

Nick,

Bravo! Everything you say is spot on.

But there are ways to do things without parameters. We can never do without models, however, in the sense that models, while sometimes being deducible, are still conditional on the premises that led to their deductions. If those premises are wrong, so are the models.

21. But this probability is certainly *not* what we are interested in; we are interested in the probability that there is an effect, provided the given data.

Actually, sometimes we are interested in the probability of extreme events even if there is no effect.

22. OMS says:

Hmm, Dr. Briggs has an interesting point.

Thinking through the coin toss analogy again, the usual approach would say, okay null hypothesis = fair coin, toss it 6 times. Now I found 5/6 heads. What’s the chance that a fair coin would give at least 5/6 or 6/6 heads -> get your p-value. Low p-value, hence reject the null hypothesis. However, the outcome was also quite “extreme” for a coin which was 51/49 (for example), so you really want some idea how likely this outcome is in general over all possible coins…
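The coin example is small enough to compute exactly. The sketch below evaluates the chance of at least 5 heads in 6 tosses for a fair coin and for the 51/49 coin mentioned above; the one-sided tail is an assumption about which test is intended.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

fair = upper_tail(5, 6, 0.50)             # equals 7/64
slightly_biased = upper_tail(5, 6, 0.51)

print(f"P(at least 5 of 6 heads | p = 0.50) = {fair:.4f}")
print(f"P(at least 5 of 6 heads | p = 0.51) = {slightly_biased:.4f}")
```

Two things stand out. The fair-coin tail is 7/64, roughly 0.11, so six tosses would not actually reach the conventional 0.05 threshold. And the two probabilities are nearly identical, so the outcome is almost as "extreme" for a barely biased coin as for a fair one.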

Will think through this some more… 🙂

23. George says:

Hmm, so significance values don’t say anything about the model the author wants to fit, because that model is not an input – they’re calculated purely from observed data and a null model which the author doesn’t want to fit.

In effect, the only thing p<0.05 states is that a given model (usually the null model) is far from competent at explaining the observed results. It can never say anything strongly in favour of any model, null or not.

24. Briggs says:

By George, I think you’ve got it (sorry).