*This article is nothing but an extended link to a must-read piece in* City Journal. *Internet access still limited to once daily. Thanks to reader I. for suggesting this topic.*

If you haven’t already, you must read Jim Manzi’s *City Journal* article, “What Social Science Does—and Doesn’t—Know: Our scientific ignorance of the human condition remains profound.”

This man is my brother. Except for the common mistake in describing the powers of “randomization”, Manzi expounds a view with which regular readers will be in agreement:

[I]t is certain that we do not have anything remotely approaching a scientific understanding of human society. And the methods of experimental social science are not close to providing one within the foreseeable future.

His article will—I hope—dampen the spirits of the most ardent sociologist, economist, or clinician. For example, I cannot think of a better way of describing our uncertainty in the outcome of any experiment on humans than this:

In medicine, for example, what we really know from a given clinical trial is that *this* particular list of patients who received *this* exact treatment delivered in *these* specific clinics on *these* dates by *these* doctors had *these* outcomes, as compared with a specific control group. But when we want to use the trial’s results to guide future action, we must generalize them into a reliable predictive rule for as-yet-unseen situations. Even if the experiment was correctly executed, how do we know that our generalization is correct?

Amen and amen.

Manzi did a sort of meta-analysis, in which he examined the outcomes of 122 sociologist-driven experiments. Twenty percent of these had “statistically significant” outcomes; that is, they had p-values that were publishable, meaning less than the magic 0.05 level.

Only four of that twenty percent were replicated, and *none* provided joy. That is, there were no more magic p-values.

This is the problem with classical statistics: it lets in too much riffraff, wolves in statistically significant clothing. It is far too easy to claim “statistical significance.”

Classical statistics has the idea of “Type I” and “Type II” errors. These names were not chosen for their memorability. Anyway, they have something to do with the decisions you make about the p-values (which I’ll assume you know how to calculate).

Suppose you have a non-publishable p-value, i.e., one that is (dismally) above the acceptable level required by a journal editor. You would then, in the tangled syntax of frequentism, “fail to reject the null hypothesis.” (Never, thanks to Popper and Fisher, will you “accept” it!)
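To make the “fail to reject” decision concrete, here is a minimal sketch with entirely made-up group data (not from any trial Manzi discusses), using a permutation test: under the null hypothesis the group labels are exchangeable, so we reshuffle them and ask how often a difference at least as large as the observed one turns up “by chance.”

```python
import random

random.seed(1)

# Hypothetical outcomes for a treatment and a control group (invented numbers).
treatment = [5.1, 4.8, 5.3, 4.6, 5.0, 4.9]
control   = [4.9, 5.0, 4.7, 5.2, 4.6, 4.8]

n = len(treatment)
observed = sum(treatment)/n - sum(control)/n

# Permutation test: shuffle the pooled labels many times and count how
# often the shuffled difference is at least as extreme as the observed one.
pooled = treatment + control
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:n])/n - sum(pooled[n:])/n
    count += abs(diff) >= abs(observed)

p_value = count / trials
print(f"p-value = {p_value:.3f}")
if p_value > 0.05:
    print("fail to reject the null hypothesis")
```

With these (deliberately overlapping) numbers the p-value lands well above 0.05, so the frequentist verdict is the journal-editor-displeasing “fail to reject.”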

The “null hypothesis” is a statement that equates one or more of the parameters of the probability models of the observable responses in the different groups (for Manzi, an experimental group and a control group).

Now, you could “fail to reject” the hypothesis that they are equal when you should have rejected it; that is, when they truly are unequal. That’s the “Type II” error. Re-read this small section until that sinks in.
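To see how routinely this happens, here is a small simulation (hypothetical numbers, a simple two-sample z-style test, nothing to do with Manzi’s data) in which the groups *truly* differ, yet the test usually fails to reject: each miss is a Type II error.

```python
import random

random.seed(7)

def misses_real_effect(n=20, effect=0.3):
    """One experiment where the null is FALSE: group means differ by `effect`.
    Returns True when the test still fails to reject (a Type II error)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(effect, 1) for _ in range(n)]
    mean_a, mean_b = sum(a)/n, sum(b)/n
    var_a = sum((x - mean_a)**2 for x in a) / (n - 1)
    var_b = sum((x - mean_b)**2 for x in b) / (n - 1)
    z = (mean_b - mean_a) / ((var_a/n + var_b/n) ** 0.5)
    return abs(z) <= 1.96          # not "significant" at the 0.05 level

trials = 5000
rate = sum(misses_real_effect() for _ in range(trials)) / trials
print(f"Type II error rate: {rate:.2f}")
```

With only 20 subjects per group and a modest real effect, the test misses the true difference most of the time.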

But you could also see a publishable p-value—joy!—“by chance”. That is, merely because of good luck (for your publishing prospects), a small p-value comes trippingly out of your statistical software. This is when you declare “statistical significance.”

However, just because you see a small p-value does not mean the null hypothesis is false. It could be true, and yet you incorrectly reject it. When you do, this is a Type I error.

Theory says that these Type I errors should come at you at the rate at which you set the highest allowable p-value, which is everywhere 0.05. That is, of experiments in which the null hypothesis is true, on average 1 in every 20 will be falsely declared a success.

Manzi found that the 122 experiments represented about 40 “program concepts”, and of these, only 22 had more than one trial. And only *one* of these had repeated success: “nuisance abatement,” i.e., the “broken windows” theory. Which, it must be added, hardly needed experimental proof, its truth being known to anybody who is a parent.

The problem, as I have said, is that statistical “significance” is such a weak criterion of success that practically any experiment can claim it. Statistical software is now so easy to use that only a lazy person cannot find a small p-value somewhere in his data.
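The arithmetic behind that laziness is simple. If a researcher tests k independent true-null hypotheses at the 0.05 level, the chance that at least one comes up “significant” is 1 − 0.95^k, which grows alarmingly fast:

```python
# Chance of at least one "significant" result among k independent
# tests of true null hypotheses, each at the 0.05 level.
for k in (1, 5, 20, 100):
    print(f"k = {k:3d}: P(at least one success) = {1 - 0.95**k:.3f}")
```

With twenty comparisons the odds are roughly two in three that *something* crosses the magic line; with a hundred, near certainty.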

The solution is that there is no solution: there will *always* be uncertainty. But we can do a better job quantifying uncertainty by stating our models in terms of their *predictive* ability, and not their success in fitting data.

This is echoed in Manzi:

How do we know that our physical theories concerning the wing are true? In the end, not because of equations on blackboards or compelling speeches by famous physicists but because airplanes stay up.