
Category: Class – Applied Statistics

July 23, 2018 | 6 Comments

How To Do Predictive Statistics: Part I New (Free) Software Introduction

Introduction

Here’s what we always want, but never get, using the old ways of doing statistics: The probability that some proposition Y is true given certain assumptions. Like this:

     (1) Pr(Y | assumptions).

What happens if we change the assumptions? We change the probability. This is not a bug, it is a feature. There is NO SUCH THING, not ever, as Pr(Y), i.e. Pr(Y|no assumptions). It is an impossibility. And that means frequentism as a theory of probability fails. As does any theory which takes for granted that things “have” a probability and it is our job to discover it. For proof of these claims, read this New York Times-eligible bestseller Uncertainty.

Just because you don’t write down assumptions, doesn’t mean they are not there. At the least there will be the assumptions about what Y means, about grammar and language, and maybe rules of math. There are always assumptions.

That means if we change the model, and since the model is an assumption, we change the probability. If we add or subtract “data”, and our model uses data, and since the model is an assumption, we change the probability. If we tweak the model, as in adjust it or modify its parts, like priors, and since the model is an assumption, we change the probability. Again, this is not a bug, but a necessary feature of probability. (Some people actually complain about how changing the model or prior changes the probability, but that’s because they believe in Pr(Y) which is impossible.)

Most models are ad hoc. The vast majority of models make rather gross approximations of Y, such as assuming it lies on the continuum. (No real thing does, though, oddly, the continuum is called the “real” line.) Regression is in this class of ad hociness. It is, by far, the most used model class, judging by published “research” (sorry for scare quotes). Regression is, or rather should be, this:

     (2) Pr(Y | new X, old X & Y, Model & assumptions),

which is the probability of our proposition Y given a new assumed or measured proposition X, the old measured values of X & Y, and the ad hoc model with its assumptions. There is no notion of cause in this set up, despite how folks describe their results. All (2) says is “This is the probability of Y given these things.” That’s it, and nothing more.

For shorthand we can write

     (3) Pr(Y | XDM),

where D is the old “data”, and the others are obvious. Change anything—as in anything—on the right hand side, change the probability of Y. For the last time, this is not a bug, it is a feature.

I’ve skipped over lots of detail, because regular readers have seen it all before, and because it’s in Uncertainty and in the other posts of this on-line class (of which this post is a member).

Practicum

One of the big reasons predictive methods have not taken off is the lack of free and easy-to-use out-of-the-box software. There are tons upon metric tons of software for classical parameter- and hypothesis-testing methods. But those methods are dead to us. We do not care about parameters and we most certainly do not care for hypothesis tests. We care only about things like eq. (2) (and 3).

The best solution would be to build this software from scratch, which requires time and money, and Yours Truly has neither in any abundance (I am wholly independent; no students, no grants, no affiliations of any kind). The next best is to build upon what’s already out there. This has the advantages of speed and familiarity, but the disadvantage of being stuck in a pre-determined format with all its inherited limitations.

We’ve tried other things in the past, but I’ve settled—as a start—on extending MCMCpack, with full acknowledgement of the severe limitations of simulation methods. Simulations are not magic, nor do they “have” probability. Nothing “has” probability. There is no such thing as “drawing” from a “random” distribution. Read the links: simulations are only reasonable approximations to difficult numerical integrations. That’s it, and nothing more.

MCMCpack model routines return approximations of posterior distributions of parameters of models. We have zero interest in parameters per se, since they are only part of model innards. Concentrating on parameters as if they were probabilities of observables (like Y) has led to vast, vast, incalculable over-certainties. All that is also in Uncertainty.

We can use the posteriors, though, together with knowledge of the model forms, to approximate (2). The idea is to take the approximated parameter posteriors and feed them into a second simulation-approximation to get (2). Simple as that.
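
To make the idea concrete, here is a minimal sketch of that second step, with made-up data and assuming MCMCregress’s usual output columns (this is not the forthcoming mcmc.pred.R code, only an illustration):

library(MCMCpack)

# Made-up old data D
d <- data.frame(y = rnorm(100), x1 = rnorm(100))
fit <- MCMCregress(y ~ x1, data = d)   # posterior draws of the intercept, x1, sigma2

# A new assumed X at which we want Pr(Y | new X, D, M)
x.new <- c(1, 0.5)                     # intercept term and x1 = 0.5

# One predictive draw of y per posterior draw of the parameters
betas  <- fit[, 1:2]                   # coefficient draws
sigma2 <- fit[, "sigma2"]              # error-variance draws
y.pred <- rnorm(nrow(fit), mean = betas %*% x.new, sd = sqrt(sigma2))

# Approximate, say, Pr(Y > 1 | new X, D, M)
mean(y.pred > 1)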

MCMCpack allows all kinds of priors to be input, and offers lots of other abilities to form models—recalling, of course, that I do not endorse their philosophical interpretation. Here, as just one of many examples, is MCMCregress. Look at everything you can set!


Usage:

MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000, thin = 1,
verbose = 0, seed = NA, beta.start = NA, b0 = 0, B0 = 0,
c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
marginal.likelihood = c("none", "Laplace", "Chib95"), ...)

The formula is usually typed right into the method, but unfortunately, for many of its models, MCMCpack does not store the formula as part of its output. So sometimes we’ll have to create the formula outside the method so we can store it. That’s easy in R, e.g.


form = formula('y ~ x1 + x2 + x3 + x4 * x5')

That simple but unfortunate notation, incidentally, is why many believe probability models say something about cause. After all, does not the formula say y is x1 plus x2 etc.? When, of course, in regression, the real formula is “central parameter of approximate normal distribution of y = x1 + x2 etc.” Normal distributions are always approximations, as are any continuous-based parameterized models. As long as we keep in mind the sometimes severe limitations this imposes, this is not a problem. Many (most?) don’t keep it in mind, however, and slip into committing the Deadly Sin of Reification, which is to assume the model is reality.

The strategy of extension is to take the output of MCMCmodel and feed it into a new method MCMCmodel.pred, for prediction. Because MCMCpack is not consistent in what it passes to its output, we sometimes have to track things like the model formula, the limits of y, and so on. This isn’t hard to do, but it makes the software less automatic. So ideally a wrapper would be put around each MCMCmodel that handles this. I have NOT yet written such wrappers. In fact, all the prediction methods thus far have zero error or bounds checking, and no sanity checks of any kind at all. That level of coding takes the most time (to my mind) and is the least fun. So you will have to be careful what you pass into the prediction methods. This, again, is a usage limitation, which can be corrected in time.
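
Until such wrappers exist, the bookkeeping can be done by hand. A hypothetical sketch (the list and its names are my own invention, not part of MCMCpack or the prediction code):

# Assume d is the data frame holding the old X & Y
form <- formula('y ~ x1 + x2')
fit  <- MCMCregress(form, data = d)

# Keep everything the prediction step will need in one place
obj <- list(fit = fit, form = form, data = d, y.limits = c(-Inf, Inf))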

What I’ll be introducing in further posts are prediction methods for, so far:

  1. MCMCregress, ordinary regression
  2. MCMClogit, ordinary logistic regression
  3. MCMCmnl, multinomial (bucket) regression
  4. MCMCpoisson, Poisson regression
  5. MCMCtobit, regular regression with censoring on y.

This represents a big chunk of routine procedures, and is therefore enough of a “package” for people to get started.

Regular readers will know that multinomial regression is shockingly underused. Why? Because, I say, each analysis should be related to the decisions that will be made with the model. If no decisions are going to be made, there is no reason to do the model! Two people can have three different (or more!) decisions to make with the same model structure, so it should come as no surprise that they, using the same data, can come to different probability conclusions and different decisions.

All decisions, like all probability, are conditional. Since all decisions are finite and discrete, the models should be finite and discrete too. No y is continuous; each falls naturally into buckets delimited by a measurement process and the decisions to be made. And the multinomial does that for us (to reasonable approximation). Well, we’ll come to that when the multinomial method is introduced.
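
As a toy illustration of the bucketing idea (the cut points below are arbitrary, standing in for whatever a real decision requires):

# Continuous-looking measurements, bucketed by the decision at hand:
# say the decision only cares whether y is low, medium, or high.
y <- c(2.3, 5.1, 7.8, 4.4, 9.9, 1.2)
y.bucket <- cut(y, breaks = c(-Inf, 3, 7, Inf),
                labels = c("low", "medium", "high"))
table(y.bucket)

# It is y.bucket, not y, that would go into a multinomial model like MCMCmnl().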

Consider this software to be version 0.1. It is strictly in the testing stage. At this point, it has very little versatility. It works, though; or, it works as near as I can tell. I am even more famous for typos in my code than in my writing, so that if there are bugs or misconceptions, do not be surprised.

Next time, we do MCMCregress.pred!

LEGAL DISCLAIMER: using this software might break your computer irrecoverably, cause black holes to form in your toilet when flushed, and encourage your children to become lawyers. Do not say you weren’t warned.

Update: Download the code: mcmc.pred.R, mcmc.pred.examples.R. Update 2: If you downloaded before, download again. This is version 0.2!

June 28, 2018 | 12 Comments

Making Random Draws Is Nuts

Let’s do a little science experiment together. Go into the closet and pull out an opaque sack or bag. Anything will do, even a large sock. If you can fit your hand in, it’s fine.

Have it?

Now reach in and pull out a random observation. Pinch between your fingers a piece of probability and carefully remove it. Hold on tight! Probability is slippery. We will call this a “draw from a probability distribution”.

Done?

What does yours look like? Nothing, you say? That’s odd. Mine also looks like nothing. Let’s try again, because drawing random observations is what statisticians do all the time. If we didn’t manage to find something, the fault must lie with us, and not with the idea.

The idea is that these “random draws” will tell us what the uncertainty in some proposition is (described below). “Random draws” are used in Gibbs sampling, Metropolis-Hastings, Markov Chain Monte Carlo, bootstrap, and the like, which we can call MCMC for short.

Maybe the sack is the problem. Doesn’t seem to be anything in there. Maybe that’s because there is no such thing as probability in the sense that it is not a physical thing, not a real property of objects? Nah.

Let’s leave that aside for now and trade the sack for a computer. After all, statisticians use computers. Reach instead inside your computer for a random draw. Still nothing? Did you have the ON/OFF switch in the OFF position?

You probably got nothing because when you reach into the computer for a random draw you have to do it in a specific way. Here’s how.

Step one: select a number between 0 and 1. Any will do. Close your eyes when picking, because in order to make the magic work, you have to pretend not to see it. I do not jest. Now since this pick will be on a finite, discrete machine, the number you get will be in some finite discrete set on 0 to 1. There is no harm in thinking this set is 0.1, 0.2, …, 0.9 (or indeed only, say, 0.3333 and 0.6666!). Call your number s. (Sometimes you have to pick more than one s at a time, but that is irrelevant to the philosophy.)

Step two: Transform s with a function, f(s). The function will turn s into a “random draw” from the “distribution” specified by f(). This function is like a map, like adding 1 to any number is. In other words, f(s) could be f(s) = s + 1. It will be more complicated than that, of course, but it’s the same idea. (And s might be more than one number, as mentioned.)

Step three: That f(s) is usually fed into an algorithm which transforms it again with another function, sort of like g(f(s)). That f(s) becomes input to a new function. The output of g(f(s)) is the answer we wanted, almost, which was the uncertainty of the proposition of interest.

Step four: Repeat Steps one-three many times. The result will be a pile of g(f(s)), each having a different value for every s found in Step one. From this pile, we can ask, “How many g(f(s)) are larger than x?” and use that as a guess for the probability of seeing values larger than x. And so on.
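
In R the whole recipe looks something like this; f and g below are stand-ins for whatever transforms your actual problem requires:

# Step one: "pick" s between 0 and 1 (eyes closed optional)
s <- runif(10000)

# Step two: transform s; e.g. the inverse CDF of a normal turns s into
# "draws" from that distribution
f <- function(s) qnorm(s, mean = 0, sd = 1)

# Step three: feed f(s) into whatever function answers the question
g <- function(x) x^2                  # a stand-in for the real calculation

# Step four: repeat and pile up the results
pile <- g(f(s))
mean(pile > 2)                        # approximates "how many g(f(s)) are larger than 2?"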

Steps two-four are reasonable and even necessary because we cannot often solve for the uncertainty of the proposition of interest analytically. The math is too hard. So we have to derive approximations. If you know any calculus, it is like finding approximations to integrals that don’t have easy solutions. You plot the curve, and bust it up into lots of sections, compute the area of each section, then add them up.

Same idea here. Except for the bit about magic. Let’s figure out what that is.

Now instead of picking “randomly”, we could just cycle through the allowable, available s, which we imagined could equal 0.1, 0.2, …, 0.9. That would give us 9 g(f(s))s. And that pile could be used as in Step four. No problem!

Of course, having only 9 in the pile would have the same effect as slicing an integral only coarsely. The approximation won’t be great. But it will be an approximation. The solution is obvious: increase the number of possible s. Maybe 0.05, 0.10, 0.15, …, 0.95 is better. Try it yourself! (There are all sorts of niceties about selecting good s as part of the steps which do not interest us philosophically. We’re not after efficiency here, but understanding.)
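
Cycling through the allowable s deterministically looks like this (same stand-in f and g as in the sketch above), and no “randomness” is required:

# No "random" picking: just march through a grid of s
s.grid <- seq(0.05, 0.95, by = 0.05)  # a finer grid gives a better approximation
pile   <- g(f(s.grid))
mean(pile > 2)                        # coarse, but an approximation all the same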

This still doesn’t explain the magic. To us, random means not predictable with certainty. Our sampling from s is not random, because we know with certainty what s is (as well as what f() and g() are). We are just approximating some hard math in a straightforward fashion. There is no mystery.

To some, though, random is a property of a thing. That’s why they insist on picking s with their eyes closed. The random of the s is real, in the same way probability is real. The idea is that the random of the s, as long as we keep our eyes closed, attaches itself to f(s), which inherits the random, and f(s) in turn paints g(f(s)) with its random. Thus, the answers to the questions we ask of the pile of g(f(s))s are also random. And random means real probability. The magic has been preserved!

As long as you keep your eyes closed, that is. Open them at any point and the random vanishes! Poof!

“Are you saying, Briggs, that those who believe in the random get the wrong answers?”

Nope. I said they believe in magic. Like I wrote in the link above, it’s like they believe gremlins are what make their cars go. It’s not that cars don’t go, it’s that gremlins are the wrong explanation.

“So what difference does it make, then, since they’re getting the right answers? You just like to complain.”

Because it’s nice to be right rather than wrong. Probability and randomness are not real features of the world. They are purely epistemic. Once people grasp that, we can leave behind frequentism and a whole host of other errors.

“You are a bad person. Why should I listen to you?”

Because I’m right.

June 26, 2018 | 13 Comments

There Is No Prior? What’s A Bayesian To Do? Relax, There’s No Model, Either

I saw colleague Deborah Mayo casting, or rather trying to cast, aspersions on Bayesian philosophy by saying there is “no prior”.

Bayesians might not agree, but it’s true. Mayo’s right. There is no prior.

There’s no prior, all right, but there’s no model, either. So into the tubes goes frequentism right ahead of (subjective) Bayes.

We here at WMBriggs.com adopt a third way: probability. Probability is neither frequentist nor Bayesian, as outlined in that magnificent book Uncertainty: The Soul of Modeling, Probability & Statistics.

Now specifically, Mayo tweeted “There’s no such thing a ‘the prior’! The onus on Bayesians is to defend & explain the meaning/obtainability of the ones they prefer out of many, types (even within a given umbrella. Is the principle of indifference snuck in?”

The way subjective Bayes works is to suppose an ad hoc, pulled from some mysterious dark realm, parameterized continuous probability model. Which is exactly the same way frequentism starts. Bayes puts ad hoc probabilities on the parameters; frequentism doesn’t. Bayes thus has two layers of ad hociness. Frequentism has at least that many, because while frequentism pretends there is no uncertainty in the parameters, except that which can be measured after infinite observations are taken, frequentism adds testing and other fallacious horrors on top of the ad hoc models.

The models don’t exist because they’re made up. The priors are also made up, and so they don’t exist, either. But a frequentist complaining about fiction to a (subjective) Bayesian is like a bureaucrat complaining about the size of government.

Frequentists and, yes, even objective Bayesians believe probability exists. That it’s a thing, that it’s a property of any measurement they care to conjure. Name a measurement, any measurement at all—number of scarf-wearing blue turtles that walk into North American EDs—and voilà!, a destination at infinity is instantaneously created at which the probability of this measurement lives. All we have to do to know this probability is take an infinite number of measurements—and there it is! We’ll then automatically know the probability of the infinity-plus-one measurement without any error.

No frequentist can know any probability because no infinite number of measures has yet been taken. Bayesians of the objective stripe are in the same epistemic canoe. Subjectivists carve flotation devices out of their imaginations and simply make up probability, guesses which are influenced by such things as how many jalapeno peppers they ate the day before and whether a grant is due this week or next month.

Bayesians think like frequentists. That’s because all Bayesians are first initiated into frequentism before they are allowed to be Bayesians. This is like making Catholic priests first become Mormon missionaries. Sounds silly, I know. But it’s a way to fill classrooms.

Frequentists, like Bayesians, and even native probabilists like ourselves can assume probabilities. They can all make statements like “If the probability of this measurement, assuming such-and-such information, is p, then etc. etc. etc.” That’s usually done to turn the measurement into math, and math is easier to work with than logic. Leads to the Deadly Sin of Reification too often, though; but that’s a subject for another time. Point is: there is nothing, save the rare computational error, wrong with this math.

Back to Mayo. Frequentists never give themselves an onus. On justifying their ad hoc models, that is, because they figure probability is real, and that if they didn’t guess just the right parameterized continuous model, it’ll be close enough by the time the happy trail ends at infinity.

Only infinity never comes.

You’d think that given all we know about the paradoxes that arise from the paths taken to reach infinity, and that most measurements are tiny in number, and that measurements themselves are often ambiguous to a high degree, that frequentists would be more circumspect. You would be wrong.

The Third Way is just probability. We take what we know of the measurement and from that deduce the probability. Change this knowledge, change the probability. No big deal. Probability won’t always be quantifiable, or easy, and it won’t always be clear that the continuous infinite approximations we make to our discrete finite measurements will be adequate, but mama never said life was fair. We leave that to SJWs.

If I had my druthers, no student would learn of Bayes (I mean the philosophy; the formula is fine, but is itself not necessary) or frequentism until well on his way to a PhD in historical statistics. We’d start with probability and end with it.

Maybe that’s why they don’t let me teach the kiddies anymore.

Update I’ll have an article on the Strong law and why it doesn’t prove probability is ontic, and why using it to show probability is ontic is a circular argument, and why (again) frequentism fails. Look for it after the 4th of July week. Next week will be mostly quiet. If you’re in a hurry, buy the book!

June 5, 2018 | 7 Comments

Lovely Example of Statistics Gone Bad

The graph above (biggified version here) was touted by Simon Kuestenmacher (who posts many beautiful maps). He said “This plot shows the objects that were found to be ‘the most distant object’ by astronomers over time. With ever improving tools we can peak further and further into space.”

The original is from Reddit’s r/dataisbeautiful, a forum where I am happy to say many of the commenters noticed the graph’s many flaws.

Don’t click over and don’t read below. Take a minute to first stare at the pic and see if you can see its problems.

Don’t cheat…

Try it first…

Problem #1: The Deadly Sin of Reification! The mortal sin of statistics. The blue line did not happen. The gray envelope did not happen. What happened were those teeny tiny too-small black dots, dots which fade into obscurity next to the majesty of the statistical model. Reality is lost, reality is replaced. The model becomes realer than reality.

You cannot help but be drawn to the continuous sweet blue line, with its guardian gray helpers, and think to yourself “What smooth progress!” The black dots become a distraction, an impediment, even. They soon disappear.

Problem #1 leads to Rule #1: If you want to show what happened, show what happened. The model did not happen. Reality happened. Show reality. Don’t show the model.

It’s not that models should never be examined. Of course they should. We want good model fits over past data. But since good model fits over past data are trivial to obtain—they are even more readily available than student excuses for missing homework—showing your audience the model fit when you want to show them what happened misses the point.

Of course, it’s well to separately show model fit when you want to honestly admit to model flaws. That leads to—

Problem #2: Probability Leakage! What’s the y-axis? “Distance of furthest object (parsecs).” Now I ask you: can the distance of the furthest object in parsecs be less than 0? No, sir, it cannot. But do both the blue line and the gray guardian drop well below 0? Yes, sir, they do. And does that imply the impossible happened? Yes: yes, it does.

The model has given real and substantial probability to events which could not have happened. The model is a bust, a tremendous failure. The model stinks and should be tossed.

Probability leakage is when a model gives positive probability to events we know are impossible. It is more common than you think. Much more common. Why? Because people choose the parametric over the predictive, when they should choose predictive over parametric. They show the plus-or-minus uncertainty in some who-cares model parameters and do not show, or even calculate, the uncertainty in the actual observable.
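
A toy version of the same mistake, with invented numbers and a plain normal-theory regression standing in for whatever model made the graph:

# Strictly positive observations, roughly exponential growth
set.seed(1)
year <- 1800:1950
dist <- exp(0.04 * (year - 1800) + rnorm(length(year), sd = 0.3))
fit  <- lm(dist ~ year)

# Approximate predictive spread at year 1850 (parameter + residual uncertainty)
p      <- predict(fit, newdata = data.frame(year = 1850), se.fit = TRUE)
s.pred <- sqrt(p$se.fit^2 + p$residual.scale^2)

pnorm(0, mean = p$fit, sd = s.pred)   # Pr(Y < 0): should be exactly 0, but is not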

I suspect that’s the case here, too. The gray guardians are, I think, the uncertainty in the parameter of the model, perhaps some sort of smoother or spline fit. They do not show the uncertainty in the actual distance. I suspect this because the gray guardian shrinks to near nothing at the end of the graph. But, of course, there must still be some healthy uncertainty in the most distant objects astronomers will find.

Parametric uncertainty, and indeed even parameter estimation, are largely of no value to man or beast. Problem #2 leads to Rule #2: You made a model to talk about uncertainty in some observable, so talk about uncertainty in the observable and not about some unobservable non-existent parameters inside your ad hoc model. That leads to—

Problem #3: We don’t know what will happen! The whole purpose of the model should have been to quantify uncertainty in the future. By (say) the year 2020, what is the most likely distance for the furthest object? And what uncertainty is there in that guess? We have no idea from this graph.

We should, too. Because every statistical model has an implicit predictive sense. It’s just that most people are so used to handling models in their past-fit parametric sense that they always forget the reason they created the model in the first place. And that was because they were interested in the now-forgotten observable.

Problem #3 leads to Rule #3: always show predictions for observables never seen before (in any way). If that was done here, the gray guardians would take on an entirely different role. They would be “more vertical”—up-and-down bands centered on dots in future years. There is no uncertainty in the year, only in the value of most distant object. And we’d imagine that that uncertainty would grow as the year does. We also know that the low point of this uncertainty can never fall below the already known most distant object.
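
Continuing the toy fit from the sketch above, Rule #3 in code is simply to ask the model about a year it has never seen and report the uncertainty in the observable itself, warts, leakage, and all:

# Predictive interval for the observable in a future year
predict(fit, newdata = data.frame(year = 2020), interval = "prediction")
# Returns columns fit, lwr, upr: the point guess and predictive bounds.
# An honest lwr could never fall below the most distant object already known.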

Conclusion: the graph is a dismal failure. But its failures are very, very, very common. See Uncertainty: The Soul of Modeling, Probability & Statistics for more of this type of analysis, including instruction on how to do it right.

Homework: Find examples of time series graphs that commit at least one of these errors. Post a link to it below so that others can see.