Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

January 22, 2018 | 9 Comments

Another Proof of the Uselessness of P-values

Susan Holmes has done us a service by writing clearly the philosophy of the p-value in her new paper “Statistical Proof? The Problem of Irreproducibility” in the Bulletin of the American Mathematical Society (Volume 55, Number 1, January 2018, Pages 31–55).

The thesis of her paper, about which I am in the fullest possible support, is this: “Data currently generated in the fields of ecology, medicine, climatology, and neuroscience often contain tens of thousands of measured variables. If special care is not taken, the complexity associated with statistical analysis of such data can lead to publication of results that prove to be irreproducible.”

About how to fix the problem we disagree. I say it won’t be any kind of p-value, or p-value-like creation.

Here from the opening are the clear words:

Statisticians are willing to pay “some chance of error to extract knowledge” (J. W. Tukey [87]) using induction as follows.

If, given (A => B), then the existence of a small ε such that P(B) < ε tells us that A is probably not true.

This translates into an inference which suggests that if we observe data X, which is very unlikely if A is true (written P(X|A) < ε), then A is not plausible. [A footnote to this sentence is pasted next.]

We do not say here that the probability of A is low; as we will see in a standard frequentist setting, either A is true or not and fixed events do not have probabilities. In the Bayesian setting we would be able to state a probability for A.

I agree with her definition of the p-value. In notation, the words (of the third paragraph) translate to this:

    (1) Pr(A|X & Pr(X|A) = small) = small.

The argument behind this equation is fallacious. To see why, first convince yourself the notation is correct.

I also agree—with a loud yes!—that under the theory of frequentism “fixed events do not have probabilities.”

But in reality, of course they do. Every frequentist acts as if they do when they say things like “A is not plausible”. Not plausible is a synonym for not likely, which is a synonym for of low probability. In other words, every time a frequentist uses a p-value, he makes a probability judgement, which is forbidden by the theory he claims to hold.

Limiting relative frequency, as we have discussed many times, and often in Uncertainty, is an incorrect theory of probability. But let that pass. Believe it if you like; say that singular events like A cannot have probabilities (which does follow from the theory), and then give A a (non-quantified) probability after all. Let’s pretend we do not see the glaring, throbbing inconsistency.

Let’s instead examine (1). It helps to have an example. Let A be the theory “There is a six-sided object that when activated must show one of the six sides, just one of which is labeled 6.” And, for fun, let X = “6 6s in a row.” Then Pr(X|A) = small, where “small” (about 2×10^-5) is much weer than the magic number. So we want to calculate

    (1) Pr(A|6 6s on six-sided device & Pr(6 6s|A) = 2×10^-5) = ?
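The 2×10^-5 is nothing but arithmetic. Here is a minimal check in R, assuming that given A each activation shows a 6 with probability 1/6 and that activations do not influence one another, and taking the magic number to be the customary 0.05 (both are my assumptions for the sketch).

```r
# Pr(6 6s in a row | A), taking each activation to show a 6 with probability 1/6
p_one <- 1 / 6
p_six <- p_one^6

p_six          # 1/46656, about 2.1e-05
p_six < 0.05   # TRUE: far weer than the customary 0.05
```

Notice the calculation uses only A. It says nothing yet about Pr(A | X).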

Well, it should be obvious there is no (direct) answer to (1). Unless we magnify some implicit premises, or add new ones entirely.

The right-hand-side (the givens) tells us that if we accept A as true, then 6 6s are a possibility; and so when we see 6 6s, if anything, it is evidence in favor of A’s truth. After all, something A said could happen did happen!

Another implicit premise might be that in noticing we just rolled 6 6s in a row, there were other possibilities. We also notice we can’t identify the precise causes of the 6s showing, but understand the causes are related to standard physics. These implicit premises can be used to infer A.

We now come to the classic objection, which is that no alternative to A is given. A is the only thing going. Unless we add new implicit premises that give us a hint about something beside A. Whatever this premise is, it cannot be “Either A is true or something else is”, because that is a tautology, and in logic adding a tautology to the premises is like multiplying an equation by 1. It changes nothing.

Not only that, if you told a frequentist that you were rejecting A because you just saw 6 6s in a row, and that therefore “another number is due”, he’d probably accuse you of falling prey to the gambler’s fallacy. Again, we cannot expect consistency in any limiting relative frequency argument.

But what’s this about the gambler’s fallacy? That can only be judged were we to add more information to the right hand side of (1). This is the key. Everything we are using as evidence for or against A goes on the right hand side of (1). Even if it is not written, it is there. This is often forgotten in the rush to make everything mathematical.

In our case, to have any evidence of the gambler’s fallacy would entail adding evidence to the RHS of (1) that is similar to, “We’re in a casino, where I’m sure they’re real careful about the dice, replacing worn and even ‘lucky’ ones, and the way they make you throw the dice makes it next to impossible to control the outcome”. That’s only a small summary of a large thought. All evidence that points to A.

But what if we’re over on 34th street at Tannen’s Magic Store and we’ve just seen the 6 6s, or even 20 6s, or however many you like? The RHS of (1), for you in that situation, changes dramatically, adding possibilities other than A.

In short, it is not the observations alone in (1) that get you anywhere. It is the extra information you add that works the magic, as it were. And whatever you add to (1), (1) is no longer (1), but something else. If you understand that, you understand all. P-values are a dead end.
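To see how the added premises do the work, here is a rough sketch in R. Everything in it beyond X and Pr(X|A) is made up for illustration: the alternative B (a gimmicked device that always shows a 6) and the two prior weights standing in for “casino” and “magic store” are my assumptions, not anything deduced from (1).

```r
# Pr(A | X, extra premises), once an alternative B is put on the right hand side.
# A: each activation shows a 6 with probability 1/6.
# B (hypothetical): a gimmicked device that always shows a 6.
posterior_A <- function(prior_A, n_sixes = 6) {
  like_A <- (1 / 6)^n_sixes   # Pr(X | A)
  like_B <- 1                 # Pr(X | B), by assumption
  prior_A * like_A / (prior_A * like_A + (1 - prior_A) * like_B)
}

# Made-up weights standing in for the extra information
casino      <- posterior_A(prior_A = 1 - 1e-9)  # dice policed, B all but ruled out
magic_store <- posterior_A(prior_A = 0.5)       # gimmicked dice are on the shelves

casino        # close to 1
magic_store   # close to 0
```

Same X, same Pr(X|A), wildly different answers. The observations did not change; the right hand side did.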

Bonus argument: This similar argument I wrote appears in many places, including in a new paper about which more another day:

Fisher said: “Belief in null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null is false, or the p-value has attained by chance an exceptionally low value.” Something like this is repeated in every elementary textbook.

Yet Fisher’s “logical disjunction” is evidently not one, since his either-or describes different propositions, i.e. the null and p-values. A real disjunction can however be found. Re-writing Fisher gives: Either the null is false and we see a small p-value, or the null is true and we see a small p-value. Or just: Either the null is true or it is false and we see a small p-value. Since “Either the null is true or it is false” is a tautology, and is therefore necessarily true no matter what, and because prefixing any argument with a tautology does not change that argument’s logical status, we are left with, “We see a small p-value.” The p-value thus casts no light on the truth or falsity of the null. Everybody knows this, but this is the formal proof of it.

Frequentist theory claims, assuming the truth of the null, we can equally likely see any p-value whatsoever, i.e. the p-value under the null is uniformly distributed. To emphasize: assuming the truth of the null, we deduce we can see any p-value between 0 and 1. And since we always do see any value, all p-values are logically evidence for the null and not against it. Yet practice insists small p-values are evidence the null is (likely) false. That is because people argue: For most small p-values I have seen in the past, I believe the null has been false; I now see a new small p-value, therefore the null hypothesis in this new problem is likely false. That argument works, but it has no place in frequentist theory (which anyway has innumerable other difficulties). It is the Bayesian-like interpretation.
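The uniformity claim is easy to check by simulation. A minimal sketch in R, assuming a one-sample t-test on normal data whose true mean is 0, so that the null is true by construction (the sample size, seed, and number of repetitions are arbitrary choices of mine):

```r
set.seed(1)
n_sims <- 5000

# The null is true in every simulated data set: the true mean is 0
p <- replicate(n_sims, t.test(rnorm(20))$p.value)

mean(p < 0.05)   # roughly 0.05
mean(p < 0.50)   # roughly 0.50

# Counts in ten equal-width bins come out roughly equal
table(cut(p, breaks = seq(0, 1, by = 0.1)))
```

Small p-values turn up under a true null exactly as often as the theory says they should.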

The decisions made using p-values are thus an “act of will”, as Neyman criticized, not realizing his own method of not-rejecting and rejecting nulls had the same flaw.

What to use instead? Pure probability, baby. See our class for examples. Or read all about it in Uncertainty.

January 18, 2018 | 8 Comments

Pay More Taxes And Live Longer. Wee P-value Alert!

It’s bottom-of-the-barrel trolling to cite anything from Daily Kos, but this one is particularly asinine and worth showing to demonstrate the futility of p-values. Thanks to Al Perrella for the discovery.

The article is “Pay more taxes and live longer, pay less taxes and die sooner, the choice is yours”. A snapshot of the “finding” is above, but go to their site for the full gory.

One of the things that our taxes should provide is a healthy populace, which should be reflected in the life expectancy of its citizens. And fortunately, mortality is something that is easy to measure. We have been running a set of economic experiments in our states for decades, and the outcomes should be discernible. If low taxes are better than high taxes, then that should be clearly articulated in the life expectancy data.

In fact, the opposite is the case, and the trend is unmistakable. The chart below plots life expectancy by state for the year 2013-14 against the total taxes (federal, state, and local) paid by its citizens. The tax year 2015 was chosen since those data were easily accessible. A trend line (linear regression) is also shown, which represents the best straight line that can be fit to these data. The raw data that I used are also shown in the table below. Federal, state, and local tax data for 2015 were obtained from sites here and here. Life expectancy data were obtained here.

The big “finding” is that “life expectancy is correlated with taxes”, and from correlation, which has no implication of cause, the author can’t help himself, as almost everybody can’t, and he jumps to cause. “Mortality statistics across the United States suggest that residents of blue states are healthier than those in red states.”

Yet one of the middling-lower taxed states has residents with the highest life expectancy. Grouping across states is also silly, as it’s the bigger cities that differ from rural areas, contrasts where we also find the biggest political differences.

Alabama looks to have the lowest life expectancy averaged across its residents. Raise your hand if you think it’s taxes that make any difference between it and, say, New Hampshire, which has one of the highest life expectancies, and low taxes.

Well you can do this sort of thing as well as I can, and probably have more patience. Here’s the writer’s conclusion:

Tax rate statistics suggest that higher taxes are favorably correlated with mortality. Call me crazy, but if I were a resident of a red state, I would be calling my local, state, and federal legislators, demanding that my taxes be raised, not lowered, and that the increased revenues be used to improve my health and the health of my fellow citizens. Starting now.

Okay, I’ll call him crazy. We went from correlation, and a weak one at that, and somewhat silly, to “Call your congressman now.”

People just can’t resist finding causes in wee p-values.

January 16, 2018 | 4 Comments

Free Data Science Class: Predictive Case Study 1, Part VIII


We’re continuing with the CGPA example. The data is on line, and of unknown origin, but good enough to use as an example.

We will build a correlational model, keeping ever in mind this model’s limitations. It can say nothing about cause, for instance.

As we discussed in earlier lessons, the model we will build is in reference to the decisions we will make. Our goal in this model is to make decisions regarding future students’ CGPAs given we have guesses of, or know, their HGPA, SAT, and possibly hours spent studying. We judge at least the first two in the causal path of CGPA. Our initial decision cares about getting CGPA to the nearest point (if you can’t recall why this is most crucially important — review!).

It would be best if we extended our earlier measurement-deduced model, so that we have the predictive model from the get go (if you do not remember what this means — review!). But that’s hard, and we’re lazy. So we’ll do what everybody does and use an ad hoc parameterized model, recognizing that all parameterized models are always approximations to the measurement reality.

Because this is an ad hoc parameterized model, we have several choices. Every choice is in response to a premise we have formed. Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises. All probability follows on our assumptions.

Now the multinomial logistic regression forms a parameter for every category (here we have 5, for CGPA = 0-4) and says those parameters are functions of parameterized measurements in a linear way. The math of all this is busy, but not too hard. Here is one source to examine the model in detail.

For instance, the parameter for CGPA = 0 is itself said to be a linear function of parameterized HGPA and SAT.
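A rough sketch in R of what “linear function of parameterized HGPA and SAT” means in the logistic version. The coefficients below are invented purely for illustration; they come from no fit. For simplicity every category gets its own linear piece, where real software usually pins one category down as a reference.

```r
# Hypothetical coefficients (intercept, sat, hgpa), one row per CGPA category.
# These numbers are made up; they come from no model fit.
beta <- rbind(
  "0" = c( 2.0, -0.002, -0.5),
  "1" = c( 1.0, -0.001, -0.2),
  "2" = c( 0.0,  0.000,  0.0),
  "3" = c(-4.0,  0.002,  0.6),
  "4" = c(-8.0,  0.003,  1.0)
)

new_student <- c(1, 1000, 2)        # intercept term, SAT, HGPA

eta  <- drop(beta %*% new_student)  # one linear piece per category
prob <- exp(eta) / sum(exp(eta))    # pushed through the logistic (softmax) link

round(prob, 3)
```

The linear pieces are scaffolding. Only the last line, the probabilities of the observable, is of any interest.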

These parameters do not exist, give no causal information, and are of no practical interest (no matter how interesting they are mathematically). For instance, they do not appear in what we really want, which is this:

    (8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,4.

We do not care about the parameters, which are only mathematical entities needed to get the model to work. But because we do not know the value of the parameters, the uncertainty in them, as it were, has to be specified. That is, a “prior” for them must be given. If we choose one prior, (8) will give one answer; if we choose a different prior, (8) will (likely) give a different answer. Same thing if we choose a different parameterized model: (8) will give different answers. This does not worry us because we remember all probability is conditional on the assumptions we make. CGPA does not “have” a probability! Indeed, the answers (8) gives using different models are usually much more varied than the answers given using the same model but different priors.
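Here is a small sketch in R of a prior changing the answer. It swaps in a much simpler model than the one in this lesson: a multinomial with a Dirichlet prior, ignoring SAT and HGPA entirely, applied to the counts of rounded CGPA in the class data (4, 17, 59, 16, 4). The two prior settings are arbitrary choices of mine.

```r
# Counts of rounded CGPA in the class data
counts <- c("0" = 4, "1" = 17, "2" = 59, "3" = 16, "4" = 4)

# Predictive probability of each category for a new student under a
# Dirichlet(alpha) prior on a plain multinomial (no SAT, no HGPA)
predictive <- function(counts, alpha) (counts + alpha) / sum(counts + alpha)

flat   <- predictive(counts, alpha = rep(1, 5))    # weak prior
strong <- predictive(counts, alpha = rep(20, 5))   # prior that insists on evenness

round(rbind(flat, strong), 3)
```

Same observations, two priors, two sets of probabilities. Neither is “the” probability of CGPA; each is correct given its premises.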

What prior should we use? Well, we’re lazy again. We’ll use whatever the software suggests, remembering other choices are possible.

Why not use the MNP R Package for “Fitting the Multinomial Probit Model”? But, wait. Probit is not the same as Logit. That’s true, so let’s update our ad hoc premise to say we really had in mind a multinomial probit model. If you do not have MNP installed, use this command, and follow the subsequent instructions about choosing a mirror.

install.packages('MNP', dependencies = TRUE)

There are other choices beside MNP, but unfortunately the software for multinomial regressions is not nearly as developed and as bullet proof as for ordinary regressions. MNP gives the predictive probabilities we want. But we’ll see that it can break. Beside that, our purpose is to understand the predictive philosophy and method, not to tout for a particular ad hoc model. What happens below goes for any model that can be put in the form of (8). This includes all machine learning, AI, etc.

The first thing is to ensure you have downloaded the data file cgpa.csv, and also the helper file briggs.class.R, which contains code we’ll use in this class. Warning: this file is updated frequently! For all the lawyers, I make no guarantee about this code. It might even destroy your computer, cause your wife to leave you, and encourage your children to become lawyers. Use at your own risk. Ensure Windows did not change the name of cgpa.csv to cgpa.csv.txt.

Save the files in a directory you create for the class. We’ll store that directory in the variable path. Remember, # comments out the rest of what follows on a line.

path = 'C:/Users/yourname/yourplace/' # for Windows
#path = '/home/yourname/yourplace/' # for Apple, Linux
# find the path to your file by looking at its properties
# everything in this class is in the same directory

source(paste(path,'briggs.class.R',sep='')) # runs the class code
x = read.csv(paste(path,'cgpa.csv',sep=''))
x$cgpa.o = x$cgpa # keeps an original copy of CGPA
x$cgpa = as.factor(roundTo(x$cgpa,1)) # rounds to nearest 1
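If you are curious what rounding “to nearest 1” does to the CGPAs, here is a stand-in written in base R. It is only my guess at the behavior; the name round_to and its body are hypothetical and are not the helper’s actual code.

```r
# Hypothetical stand-in: round x to the nearest multiple of base
round_to <- function(x, base = 1) base * round(x / base)

# A few CGPA values taken from the summary of the original variable
cgpa_examples <- c(0.05, 1.562, 1.985, 2.41, 4.01)
rounded <- round_to(cgpa_examples, 1)

rounded              # 0 2 2 2 4
as.factor(rounded)   # the categories the model will see
```

Exact halves follow whatever R’s round() does with them, which is good enough for a sketch.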

You should see this:

>  summary(x)
 cgpa        hgpa            sat           recomm          cgpa.o     
 0: 4   Min.   :0.330   Min.   : 400   Min.   : 2.00   Min.   :0.050  
 1:17   1st Qu.:1.640   1st Qu.: 852   1st Qu.: 4.00   1st Qu.:1.562  
 2:59   Median :1.930   Median :1036   Median : 5.00   Median :1.985  
 3:16   Mean   :2.049   Mean   :1015   Mean   : 5.19   Mean   :1.980  
 4: 4   3rd Qu.:2.535   3rd Qu.:1168   3rd Qu.: 6.00   3rd Qu.:2.410  
        Max.   :4.250   Max.   :1500   Max.   :10.00   Max.   :4.010  
> table(x$cgpa)

 0  1  2  3  4 
 4 17 59 16  4 

The measurement recomm we’ll deal with later. Next, the model.

require(MNP) # loads the package

fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=2000)
#fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=10000)

The model call is obvious enough, even if burnin = 2000, n.draws=2000 is opaque.

Depending on your system, the model fit might break. You might get an odd error message ("TruncNorm: lower bound is greater than upper bound") about inverting a matrix which you can investigate if you are inclined (the problem is in a handful of values in sat, and how the model starts up). This algorithm uses MCMC methods, and therefore cycles through a loop of size n.draws. All we need to know about this (for now) is that because this is a numerical approximation, larger numbers give less sloppy answers. Try n.draws=10000, or even five times that, if your system allows you to get away with it. The more you put, the longer it takes.
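Why do more draws give less sloppy answers? A minimal sketch in R using plain independent simulation, which is not the package’s MCMC algorithm (MCMC draws are correlated, so its answers are sloppier still for the same count). The target probability 0.69 is a made-up number.

```r
set.seed(42)
true_p  <- 0.69                    # made-up probability we pretend not to know
n_draws <- c(2000, 10000, 50000)

# Approximate true_p by the fraction of simulated draws falling below it
estimates <- sapply(n_draws, function(n) mean(runif(n) < true_p))

# Theoretical standard error of each approximation
mc_se <- sqrt(true_p * (1 - true_p) / n_draws)

data.frame(n_draws, estimates, mc_se)
```

The error shrinks like one over the square root of the number of draws: five times the draws buys a bit more than double the precision.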

We can look at the output of the model like this (this is only a partial output):

> summary(fit)

mnp(formula = cgpa ~ sat + hgpa, data = x, n.draws = 50000, burnin = 2000)

                    mean   std.dev.       2.5%  97.5%
(Intercept):1 -1.189e+00  2.143e+00 -7.918e+00  0.810
(Intercept):2 -1.003e+00  1.709e+00 -5.911e+00  0.664
(Intercept):3 -8.270e+00  3.903e+00 -1.630e+01 -1.038
(Intercept):4 -2.297e+00  3.369e+00 -1.203e+01 -0.003
sat:1          9.548e-04  1.597e-03 -3.958e-04  0.006
sat:2          1.065e-03  1.488e-03 -7.126e-06  0.005
sat:3          4.223e-03  2.655e-03  2.239e-05  0.010
sat:4          1.469e-03  2.202e-03  1.704e-06  0.008
hgpa:1         9.052e-02  3.722e-01 -5.079e-01  0.953
hgpa:2         1.768e-01  3.518e-01 -2.332e-01  1.188
hgpa:3         1.213e+00  6.610e-01  1.064e-01  2.609
hgpa:4         3.403e-01  5.242e-01 -7.266e-04  1.893

The Coefficients are the parameters spoken of above. The mean etc. are the estimates of these unobservable, not-very-interesting entities. Just keep in mind that a large coefficient does not mean its effect on the probability of CGPA = i is itself large.
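One quick way to see this, sketched in R: multiply each coefficient by the range of its measurement. The numbers are the posterior means for sat:3 and hgpa:3 from the output above and the minimum and maximum of sat and hgpa from summary(x). This is a crude look at the model’s internal linear scale only, and is not a statement about the probabilities themselves.

```r
# Posterior means for sat:3 and hgpa:3, copied from the summary output
coef_mean <- c(sat = 4.223e-03, hgpa = 1.213)

# Observed ranges of the two measurements, from summary(x)
obs_range <- c(sat = 1500 - 400, hgpa = 4.25 - 0.33)

coef_mean[["hgpa"]] / coef_mean[["sat"]]   # hgpa coefficient is roughly 287 times larger

shift <- coef_mean * obs_range             # movement across each full range
shift                                      # sat about 4.65, hgpa about 4.75
```

The hgpa coefficient dwarfs the sat coefficient, yet across the data each moves the linear piece by about the same amount. The size of a coefficient depends on the units of its measurement.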

We do care about the predictions. We want (8), so let's get it. Stare at (8). On the right hand side we need to guess values of SAT and HGPA for a future student. Let's do that for two students, one with a low SAT and HGPA, and another with high values. You shouldn't have to specify values of CGPA, since these are what we are predicting, but that's a limitation of this software.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4))
a=predict(fit, newdata = y, type='prob')$p

The syntax is decided by the creators of the MNP package. Anyway, here's what I got. You will NOT see the exact same numbers, since the answers are helter-skelter numerical approximations, but you'll be close.

> a
            0          1         2      3            4
[1,] 0.519000 0.24008333 0.2286875 0.0115 0.0007291667
[2,] 0.000125 0.04489583 0.1222917 0.6900 0.1426875000

There are two students, so two rows of predictions for each of the five categories. This says, for student (sat=400, hgpa=1), he'll most likely see a CGPA = 0. And for (sat=1500, hgpa=4), the most likely is a CGPA = 3. You can easily play with other scenarios. But, and this should be obvious, if (8) was our goal, we are done!
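If you would rather not eyeball the rows, here is a small sketch in R that rebuilds the matrix from the printed numbers (typed in by hand, so no model fit is needed) and picks out the most likely category for each student. The row labels are mine.

```r
# Predictive probabilities copied from the printed output
a <- matrix(
  c(0.519000, 0.24008333, 0.2286875, 0.0115, 0.0007291667,
    0.000125, 0.04489583, 0.1222917, 0.6900, 0.1426875000),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sat=400, hgpa=1", "sat=1500, hgpa=4"),
                  as.character(0:4))
)

rowSums(a)   # each row sums to 1, up to rounding in the printout

# Most likely CGPA category for each student
most_likely <- colnames(a)[apply(a, 1, which.max)]
most_likely  # "0" "3"

# Other questions come straight from the same rows, e.g. Pr(CGPA >= 3)
rowSums(a[, c("3", "4")])
```

Any decision you care about, not just the single most likely category, can be read off these probabilities.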

Next time we'll build on the scenarios, explore this model in more depth, and compare our model with classical ones.

Homework: Play with other scenarios. Advanced students can track down the objectionable values of sat that cause grief in the model fit (I wrote a script to do this, and know which ones they are). Or they can change the premises, by changing the starting values of the parameters. We didn't do that above, because most users will never do so, relying on the software to work "automatically".

The biggest homework is to think about the coefficients with respect to the prediction probabilities. Answer below!

January 12, 2018 | 5 Comments

Similarities Between ESP, Cold Fusion & Global Warming

Stream: Similarities Between ESP, Cold Fusion & Global Warming

At the climate website No Tricks Zone, there is a picture of various estimates of CO2 climate sensitivity. These are the guesses of how much the temperature would increase if atmospheric carbon dioxide were to double from its pre-industrial levels.

This sensitivity is measured as a “transient climate response” (TCR), noting the near-term effects, or by “equilibrium climate sensitivity” (ECS), which notes the long-term effects, assuming that CO2 stops increasing. The higher either of these numbers is, the more we have to worry about.

Each estimate is taken from a peer-reviewed scientific paper. The first comes in 2001 from the authors Andronova and Schlesinger, with the estimate of 3°C. The highest estimate (in this graph) is 6°C in 2002 from Gregory.

Not All Jokes are Funny

Then something funny happens.

Frame puts the estimate at about 2.8°C by 2005. Skeje guessed 2.8°C in 2014. Not pictured is a paper I co-wrote in 2015, which put the estimate of ECS at 1.0°C. (This paper led to a witch hunt and hysterical accusations of “climate denial”.)

Finally, Reinhart brings it down to about 0.2°C in 2017.

From this picture we can infer at least three things. First, the debate about global warming was not over in 2000, nor in 2001, nor is it over now. The sensitivity estimates would not have changed if the debate were over. Second, the good news is that we clearly have less to worry about than we thought. This is something to celebrate, right? Right?

The third inference is that we have seen this same graph before. Not once, but many times!

You Can’t Read My Mind

It looks exactly like the graph of extrasensory perception (ESP) effect size through time. (I wrote a book on the subject, available free at the bottom of this page.)

J.B. Rhine in the 1930s showed the backsides of playing cards to some folks and asked them to use their ESP to “read” the frontsides. Rhine claimed great success, as did Charles Honorton and Sharon Harper in the mid-1970s using the so-called ganzfeld. The 1970s were a time of high excitement in ESP research, with extraordinary claims coming from every direction.

But then came the 1980s and 1990s, a time when []

Unless you possess the ability to remotely view objects, click here to read the rest.

Addenda: US cold snap was a freak of nature, quick analysis finds. If global warming can’t explain it, it’s a “freak”, yet global warming was supposed to be a theory of how the atmosphere worked.

Also, ignore those lines and shaded gray envelopes on the plot. These are examples of the Deadly Sin of Reification. They substitute what did not happen (a model) for what did (the dots).