
Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

December 10, 2017 | 6 Comments

Statistical Consulting

Hi, gang. This is a placeholder page advertising my services as a consultant, speaker, and teacher.

I’m moving things hither and thither, making a place for featured posts, most of which will be of temporary importance, and others will be permanent fixtures. I’ll be shifting the pages about, editing them to make more sense.

Comments are open. The new theme is disconcerting, as all change is, but I think we’ll grow used to it. Suggestions for tweaks are welcome. Recall that the main purpose of this site is to feed me. I don’t have any formal position anywhere, and use this blog as advertising for myself (feel free to insert your jokes here).

The News sticky post will move to a page, and be updated when necessary.

I tried making most of the changes over the slow weekend, in the mornings and evenings. There were some unavoidable interruptions. Apologies for that. Every theme looks great on the samples, until you try them out on your own material, where suddenly many multiples of tweaks are discovered to be needed. I am still making these.

Our Summa Contra Gentiles series resumes next week.

I’ll update this post when and if necessary.

Update One thing that’s possible is to display all posts in toto instead of in excerpt, but this would mean two columns of narrow, full posts. It would save people from having to click into an article if all they want is to read it.

December 8, 2017 | 6 Comments

The Substitute For P-values Paper is Popular

The Substitute For P-values paper is popular. Received an email from the American Statistical Association informing me of the unusual viewing activity. The email copies this earlier email (I’m cutting out the names):

No problem! I also wanted to let you know of another article that appeared as one of “Taylor & Francis’ top ten Altmetrics articles” last week (and is still doing well). It’s “The Substitute for p-Values,” by William M. Briggs (Vol 112, Issue 519 of JASA). So far, it’s seen 149 tweets from 143 users, with an upper bound of 150,371 followers! Below is the Altmetric score:

All the best,

[A]

I had never heard of Altmetric, but on looking at their list of the top 100 papers of 2015, paper number 100 had a score of 854 (top had 2782). Fame still awaits.

Paper 100, incidentally, was “Human language reveals a universal positivity bias.” Not at this blog, buster.

The main email said this:

Dear Dr. Briggs, I just thought I would make you aware that your comment “The Substitute for p-Values” (http://amstat.tandfonline.com/doi/full/10.1080/01621459.2017.1311264)

Has been viewed more than 3,000 times and is still very popular on social media (see below).

Thank you so much for your contribution to JASA! [E], ASA Journals Manager

The link to the official paper is above (here too). The original post about it is here. The book page for Uncertainty, which contains all the meat and proofs of contentions in the paper, is here. Uncertainty can be bought here.

Don’t miss the free Data Science course, which puts all the ideas of the paper into action. This course is neither frequentist nor Bayesian nor machine learning/artificial intelligence, but pure probability.

Bonus correlation!

Just look at that! The editors “best books” next to readers’ favorite book. The p-value measuring this correlation must be mighty wee! Weer than wee! Wee wee. All the way home!

December 5, 2017 | 9 Comments

Free Data Science Class: Predictive Case Study 1, Part IV

Review!

Code coming next week!

Last time we decided to put ourselves in the mind of a dean and ask for the chance of CGPA falling into one of these buckets: 0, 1, 2, 3, 4. We started with a simple model to characterize our uncertainty in future CGPAs, which was this:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).

Now the “fixed math notions” mean, in this case, that we use a parameterized multinomial probability distribution (look it up anywhere). This model, via the introduction of non-observable, non-real parameters, little bits of math necessary for the equations to work out, gives the probability of belonging to one of the buckets, of which in this case there are 5: 0 through 4.

The parameters themselves are the focus of traditional statistical practice, in both its frequentist and Bayesian flavors. This misplaced concentration came about for at least two reasons: (a) the false belief that probabilities are real and thus so are parameters, at least “at” infinity and (b) the mistaking of knowledge of the parameters for knowledge of observables. The math for parameters (at infinity) is also easier than looking at observables. Probability does not exist, and (of course) we now know knowledge of the parameters is not knowledge of observables. We’ll bypass all of this and keep our vision fixed on what is of real interest.

Machine learning (and AI etc.) models have parameters, too, for the most part, but these are usually hidden away and observables are primary. This is a good thing, except that the ML community (we’ll lump all non-statistical probability and “fuzzy” and AI modelers into the ML camp) created for themselves new errors in philosophy. We’ll start discussing these this time.

Our “fixed math notions”, an assumption we made, but with good reason, include selecting a “prior” on the parameters of the model. We chose the Dirichlet; many others are possible. Our notions also selected the model. Thus, as is made clear in the notation, (5) is dependent on the notions. Change them, and you change the answer to (5). But so what? If we change the grading rules we also change the probability. Changing the old observations also changes the probability.
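To make the notions concrete, here is a minimal sketch, in Python (not the course code, which comes next week), of how (5) is computed under the Dirichlet-multinomial notions. The prior weights and bucket counts are invented purely for illustration.

    # Minimal sketch of the Dirichlet-multinomial predictive probability.
    # The counts and prior weights below are hypothetical.
    import numpy as np

    buckets = np.arange(5)                      # CGPA buckets 0,1,2,3,4
    alpha = np.ones(5)                          # one possible Dirichlet "prior"
    old_counts = np.array([3, 10, 25, 40, 12])  # hypothetical old observations

    # Pr(new CGPA = k | grading rules, old observations, these notions)
    pred = (old_counts + alpha) / (old_counts.sum() + alpha.sum())

    for k, p in zip(buckets, pred):
        print(f"Pr(CGPA = {k} | D, notions) = {p:.3f}")

The arithmetic makes the dependence plain: change the prior, the model, or the old observations, and the answer to (5) changes with them.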

There is an enormous amount of hand-wringing about the priors portion of the notions. Some of the concern is with getting the math right, which is fine. But much is because it is felt there are “correct” priors somewhere out there, usually living at Infinity, and if there are right ones, we can worry we might not have found them. There are also many complaints that (5) is reliant on the prior. But (5) is always reliant on the model, too, though few are concerned with that. (5) is dependent on everything we stick on the right hand side, including non-quantifiable evidence, as we saw last time. That (5) changes when we change our notions is not a bug, it is a feature.

The thought among both the statistical and ML communities is that a “correct” model exists, if only we can find it. Yet this is almost never true, except in those rare cases where we deduce the model (as is done in Uncertainty for a simple case). Even deduced models begin with simpler knowns or assumptions. Any time we use a parameterized model (or any ML model) we are making more or less ad hoc assumptions. Parameters always imply lurking infinities, either in measurement clarity or numbers of observations, infinities which will always be lacking in real life.

Let’s be clear: every model is conditional on the assumptions we make. If we knew the causes of the observable (here CGPA; review Part I) we could deduce the model, which would supply extreme probabilities, i.e. 0s and 1s. But since we cannot know the causes of grade points, we can instead opt for correlation models, as statistical and ML models are (any really complex model may have causal elements, such as in physics etc., but these won’t be completely causal and thus will be correlational in output).

This does not mean that our models are wrong. A wrong model would always misclassify and never correctly classify, and it would do so intentionally, as it were. This wrong model would be a causal model, too, only it would purposely lie about the causes in at least some instances.

No, our models are correlational, almost always, and therefore can’t be wrong in the sense just mentioned; neither can they be right in the causal sense. They can, however, be useful.

The conceit of statistical modelers is that, once we have in hand correlates to our observables, which in this case will be SAT scores and high school GPAs (HGPA), and if we increase our sample size enough, we’ll know exactly how SAT and HGPA “influence” CGPA. This is false. At best, we’ll sharpen our predictive probabilities to some extent, but we’ll hit a wall and go no further. This is because SAT scores do not cause CGPAs. We may know all there is to know about some uninteresting parameters inside some ad hoc model, but this certainty will not transfer to the observables, which may be as murky as ever. If this doesn’t make sense, the examples will clarify it.

The similar conceit of the ML crowd is that if only the proper quantity of correlates is measured, and (as with the statisticians) measured in sufficient number, all classification mistakes will disappear. This is false, too: unless we can measure all the causes of each and every person’s CGPA, the model will err in the sense that it will not produce extreme probabilities. Perfect classification is a chimera, and a sales pitch.
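Here is a small simulation, with every number invented, illustrating the wall. Even a classifier handed the best possible rule for an SAT-like correlate cannot classify perfectly when other causes go unmeasured, no matter how large the sample.

    # Sketch of the "wall": unmeasured causes cap classification skill.
    # All numbers are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)

    def simulate(n):
        sat = rng.normal(0, 1, n)          # measured correlate
        hidden = rng.normal(0, 1, n)       # unmeasured causes (effort, health, ...)
        score = 0.6 * sat + 0.8 * hidden   # latent performance
        return sat, np.digitize(score, [-1.5, -0.5, 0.5, 1.5])  # buckets 0..4

    # Best achievable rule given only SAT: the most probable bucket within
    # each thin SAT slice, estimated from an enormous training sample.
    train_sat, train_y = simulate(1_000_000)
    edges = np.linspace(-3, 3, 31)
    best_bucket = np.zeros(len(edges) + 1, dtype=int)
    for b in range(len(edges) + 1):
        mask = np.digitize(train_sat, edges) == b
        if mask.any():
            best_bucket[b] = np.bincount(train_y[mask], minlength=5).argmax()

    test_sat, test_y = simulate(200_000)
    pred = best_bucket[np.digitize(test_sat, edges)]
    print("accuracy with SAT only:", (pred == test_y).mean())  # well below 1

Raise the training size as high as you like; the accuracy settles near the same non-extreme value, because the unmeasured causes stay unmeasured.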

Just think: we already know we cannot know all the precise causes of many contingent events. For example, quantum measurements. No parameterized model nor the most sophisticated ML/AI/deep learning algorithm in the world, taking all known measurements as input, will classify better than the simple physics model. Perfection is not ours to have.

Next time we finally—finally!—get to the data. But remember we were in no hurry, and that we purposely emphasized the meaning and interpretation of models because there is so much misinformation here, and because these are, in the end, the only parts that matter.

December 4, 2017 | 2 Comments

There Is No “Problem” Of Old Evidence In Bayesian Theory

Update I often do a poor job setting the scene. Today we have the solution to an age-old problem (get it? get it?), a “problem” thought to be a reason not to adopt (certain aspects of) Bayesian theory or logical probability. I sometimes think solutions are easier to accept if they are at least as difficult as the supposed problems.

I was asked to comment by Bill Raynor on Deborah Mayo’s article “The Conversion of Subjective Bayesian, Colin Howson, & the problem of old evidence“.

Howson is Howson of Howson & Urbach, an influential book that showed the errors of frequentism, but then introduced a few new ones due to subjectivity. We’ve talked time and again on the impossibility that probability is subjective (where probability depends on how many scoops of ice cream the scientist had before taking measurements), but we’ve never yet tackled the so-called problem of old evidence. There isn’t one.

Though there is no problem of evidence, old or new, there are plenty of problems with misleading notation. All of this is in Uncertainty.

The biggest error, found everywhere in probability, is to only partially write down the evidence one has for a proposition, and then to let that information “float”, so that one falls prey to equivocation.

Mayo:

Consider Jay Kadane, a well-known subjective Bayesian statistician. According to Kadane, the probability statement: Pr(d(X) >= 1.96) = .025

“is a statement about d(X) before it is observed. After it is observed, the event {d(X) >= 1.96} either happened or did not happen and hence has probability either one or zero” (2011, p. 439).

Knowing d0= 1.96, (the specific value of the test statistic d(X)), Kadane is saying, there’s no more uncertainty about it.* But would he really give it probability 1? If the probability of the data x is 1, Glymour argues, then Pr(x|H) also is 1, but then Pr(H|x) = Pr(H)Pr(x|H)/Pr(x) = Pr(H), so there is no boost in probability for a hypothesis or model arrived at after x. So does that mean known data doesn’t supply evidence for H? (Known data are sometimes said to violate temporal novelty: data are temporally novel only if the hypothesis or claim of interest came first.) If it’s got probability 1, this seems to be blocked. That’s the old evidence problem. Subjective Bayesianism is faced with the old evidence problem if known evidence has probability 1, or so the argument goes.

Regular readers (or those who have understood Uncertainty) will see the problem. For those who have not yet read that fine, award-eligible book, here is the explanation.

To write “Pr(d(X) > 1.96)” is to make a mistake. The proposition “d(X) > 1.96” has no probability. Nothing has a probability. Just as all logical arguments require premises, so do all probabilities. Here the premises are missing, and they are later supplied in different ways, so that equivocation occurs. In this case, deadly equivocation.

We need a right hand side. We might write

     (1) Pr(d(X) > 1.96 | H),

where H is some compound, complex proposition that supplies information about the observable d(X), and what the (here anyway) ad hoc probability model for d(X) is. If this model allows quantification, we can calculate a value for (1). Unless that model insists “d(X) > 1.96” is impossible or certain, the probability will be non-extreme (i.e. not 0 or 1).

Suppose we actually observe some d(X_o) (o-for-observed). We can calculate

     (2) Pr(d(X) > d(X_o) | H)

and unless d(X_o) is impossible or certain, then again we’ll calculate some non-extreme number. (2) is almost identical with (1) but with a possibly different number than 1.96 for d(X_o). The following equation is not the same:

     (3) Pr( 1.96 >= 1.96 | H),

which indeed has a probability of 1.

Of course! “I observed what I observed” is a tautology where knowledge of H is irrelevant. The problem comes in deciding where to put the actual observation, on the right or the left hand side.

Take the standard evidence of a coin flip, C = “Two-sided object which when flipped must show one of h or t”; then Pr(h | C) = 1/2. One would not say, because one just observed a tail on an actual flip, that suddenly Pr(h | C) = 0. Pr(h | C) = 1/2 because that 1/2 is deduced from C about h. (h is the proposition “An h will be observed”.)

Pr(I saw an h | I saw an h & C) = 1, and Pr(A new h | I saw an h & C) = 1/2. It is not different from 1/2 because C says nothing about how to add evidence of new flips.

Suppose for ease d() is “multiply by 1” and H says X follows a standard normal (ad hoc is ad hoc, so why not?). Then

     (4) Pr(X > 1.96 | H) = 0.025.
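A quick check of (4) under the assumption that H makes X standard normal (any statistical library will do; this sketch uses scipy):

    from scipy.stats import norm

    # Pr(X > 1.96 | H), with H saying X is standard normal
    print(norm.sf(1.96))   # about 0.025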

If an X of (say) 0.37 is observed, then what does (4) equal? The same. But this is not (4):

     (5) Pr(0.37 > 1.96 | H) = 0,

and it equals 0 only because H includes, as it always does, tacit and implicit knowledge of math and grammar.

Or we might try this:

     (6) Pr(X > 1.96 | I saw an old X = 0.37 & H) = 0.025.

The answer is also the same, because H, like C, says nothing about how to take the old X and modify the model for X.

Now there are problems in this equation, too:

     (7) Pr(H|x) = Pr(H)Pr(x|H)/Pr(x) = Pr(H).

There is no such thing as “Pr(x)”, nor does “Pr(H)” exist, and we have already seen it is false that “Pr(x|H) = 1”.

Remember: Nothing has a probability! Probability does not exist. Probability, like logic, is a measure of a proposition of interest with respect to premises. If there are no premises, there is no logic and no probability.

Better notation is:

     (8) Pr(H|xME) = Pr(x|HME)Pr(H|ME)/Pr(x|ME),

where M is a proposition specifying information about the ad hoc parameterized probability model, H is usually a proposition saying something about one or more of the parameters of M, but it could also be a statement about the observable itself, and x is a proposition about some observable number. And E is a compound proposition that includes assumptions about all the obvious things.

There is no sense in which Pr(x|HME) or Pr(x|ME) equals 1 (unless we can deduce that via H or ME), before or after any observation. To say so is to swap in an incorrect probability formulation, as in (5) above.
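As a small numeric illustration of (8), and of the claim that neither Pr(x|HME) nor Pr(x|ME) equals 1 after x is seen, here is a sketch with an invented M (a binomial model) and invented hypotheses about its parameter:

    # Invented example: M says "x ~ Binomial(10, theta)", H says "theta = 0.7",
    # not-H says "theta = 0.3", and E gives Pr(H | M E) = 0.5.
    # The observation x is "6 successes in 10 trials".
    from scipy.stats import binom

    x, n = 6, 10
    pr_H = 0.5
    pr_x_given_HME = binom.pmf(x, n, 0.7)      # Pr(x | H M E), not 1
    pr_x_given_notHME = binom.pmf(x, n, 0.3)   # Pr(x | ~H M E)
    pr_x_given_ME = pr_H * pr_x_given_HME + (1 - pr_H) * pr_x_given_notHME

    pr_H_given_xME = pr_x_given_HME * pr_H / pr_x_given_ME
    print(pr_x_given_HME, pr_x_given_ME, pr_H_given_xME)

Both likelihood terms are computed from the right hand side of (8); nothing about having seen x forces either of them to 1.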

There is therefore no old evidence problem. There are many self-created problems, though, due to incorrect bookkeeping and faulty notation.