William M. Briggs

Statistician to the Stars!

Improper Language About Priors

A Christmas distribution of posteriors. Image source.

Suppose you decided (almost surely by some ad hoc rule) that the uncertainty in some thing (call it y) is best quantified by a normal distribution with central parameter θ and spread 1. Never mind how any of this comes about. What is the value of θ? Nobody knows.

Before we go further, the proper answer to that question almost always should be: why should I care? After all, our stated goal was to understand the uncertainty in y, not θ. Besides, θ can never be observed; but y can. How much effort should we spend on something which is beside the point?

If you answered “oodles”, you might consider statistics as a profession. If you thought “some” was right, stick around.

Way it works is that data is gathered (old y) which is then used to say things, not about new y, but about θ. Turns out Bayes’s theorem requires an initial guess of the values of θ. The guess is called “a prior” (distribution): the language that is used to describe it is the main subject today.

Some insist that the prior express “total ignorance”. What can that mean? I have a proposition (call it Q) about which I tell you nothing (other than it’s a proposition!). What is the probability Q is true? Well, given your total ignorance, there is none. You can’t, consistent with the evidence, say to yourself anything like, “Q has got to be contingent, therefore the probability Q is true is greater than 0 and less than 1.” Who said Q had to be contingent? You are in a state of “total ignorance” about Q: no probability exists.

The same is not true, and cannot be true, of θ. Our evidence positively tells us that “θ is a central parameter for a normal distribution.” There is a load of rich information in that proposition. We know lots about “normals”; how they give 0 probability to any observable, how they give non-zero probability to any interval on the real line, that θ expresses the central point and must be finite, and so on. It is thus impossible—as in impossible—for us to claim ignorance.

This makes another oft-heard phrase “non-informative prior” odd. I believe it originated from nervous once-frequentist recent converts to Bayesian theory. Frequentists hated (and still hate) the idea that priors could influence the outcome of an analysis (themselves forgetting nearly the whole of frequentist theory is ad hoc) and fresh Bayesians were anxious to show that priors weren’t especially important. Indeed, it can even be proved that in the face of rich and abundant information, the importance of the prior fades to nothing.
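To make the fading of the prior concrete, here is a minimal sketch, assuming a normal prior on θ (my choice for illustration; nothing above requires it) together with the normal-with-spread-1 model for y. The function names `posterior` and `prior_gap` are hypothetical. Two analysts start with wildly different guesses about θ, and the gap between their posterior means shrinks as the number of observations grows.

```python
def posterior(prior_mean, prior_sd, ybar, n):
    """Posterior mean and sd of theta for y ~ Normal(theta, 1) under a Normal prior.

    Assumption: the prior on theta is Normal(prior_mean, prior_sd**2).
    ybar is the average of the n old observations.
    """
    prior_prec = 1.0 / prior_sd ** 2
    post_prec = prior_prec + n  # each observation adds precision 1 (spread fixed at 1)
    post_mean = (prior_prec * prior_mean + n * ybar) / post_prec
    return post_mean, post_prec ** -0.5


def prior_gap(ybar, n):
    """Distance between posterior means under two opposed priors, centered at -5 and +5."""
    mean_a, _ = posterior(-5.0, 1.0, ybar, n)
    mean_b, _ = posterior(5.0, 1.0, ybar, n)
    return abs(mean_a - mean_b)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, prior_gap(ybar=2.0, n=n))
```

With these particular priors the gap works out to 10/(1 + n), so it never reaches zero, but it becomes as small as you like once the data are abundant.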

Information, alas, isn’t always abundant; thus the prior can matter. And why shouldn’t it? More on that question in a moment. But because some think the prior should matter as little as possible, it is often suggested that the prior on θ should be “uniform”. That means that, just like the normal itself, the probability θ takes any value is zero, the probability of any interval is non-zero; it also means that all intervals of the same length have the same probability.

But this doesn’t work. Actually, that’s a gross understatement. It fails spectacularly. The uniform prior on θ is no longer a probability, proved easily by taking the integral of the density (which equals 1) over the real line, which turns out to be infinite. That kind of maneuver sends out what philosopher David Stove called “distress signals.” Those who want uniform priors are aware that they are injecting non-probability into a probability problem, but still want to retain “non-informativity” so they call the result an “improper prior”. “Prior” makes it sound like it’s a probability, but “improper” acknowledges it isn’t. (Those who use improper priors justify them saying that the resultant posteriors are often, but not always, “proper” probabilities. Interestingly, “improper” priors in standard regression give identical results, though of course interpreted differently, to classical frequentism.)
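For those who like to see it written down, a short worked version for the running example (y normal with central parameter θ and spread 1). The symbols c, n and ȳ (the average of the n old observations) are my notation, not anything fixed above. No constant density can be scaled to integrate to 1 over the real line, yet pushing it through Bayes's theorem formally still returns a normal in θ, which is the sense in which the posterior can be "proper".

```latex
% A constant "density" c > 0 on the real line cannot be normalized:
\int_{-\infty}^{\infty} c \, d\theta = \infty \quad \text{for every } c > 0 .

% Formal posterior for \theta given old observations y_1, \dots, y_n:
p(\theta \mid y_1, \dots, y_n)
  \;\propto\; c \prod_{i=1}^{n} \exp\!\left( -\tfrac{1}{2} (y_i - \theta)^2 \right)
  \;\propto\; \exp\!\left( -\tfrac{n}{2} (\theta - \bar{y})^2 \right),

% that is, a normal with central parameter \bar{y} and variance 1/n:
\theta \mid y_1, \dots, y_n \;\sim\; \mathcal{N}\!\left( \bar{y}, \tfrac{1}{n} \right).
```

The second step uses the identity that the sum of (y_i − θ)² equals n(θ − ȳ)² plus a term that does not involve θ.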

Why shouldn’t the prior be allowed to inform our uncertainty in θ (and eventually in y)? The only answer I can see is the one I already gave: residual frequentist guilt. It seems obvious that whatever definite, positive information we have about θ should be used, the results following naturally.

What definite information do we have? Well, some of that has been given. But all that ignores whatever evidence we have about the problem at hand. Why are we using normal distributions in the first place? If we’re using past y to inform about θ, that means we know something about the measurement process. Shouldn’t information like that be included? Yes.

Suppose the unit in which we’re measuring y is inches. Then suppose you have to communicate your findings to a colleague in France, a country which strangely prefers centimeters. Turns out that if you assumed, like the normal, θ was infinitely precise (i.e. continuous), the two answers—inches or centimeters—would give different probabilities to different intervals (suitably back-transformed). How can it be that merely changing units of measurement changes probabilities! Well, that’s a good question. It’s usually answered with a blizzard of mathematics (example), none of which allays the fears of Bayesian critics.

The problem is that we have ignored information. The yardstick we used is not infinitely precise, but has, like any measuring device anywhere, limitations. The best (as in best) that we can do is to measure y from some finite set. Suppose this is to the nearest 1/16 of an inch. That means we can’t (or rather must) differentiate between 0″ and something less than 1/16″; it further means that we have some upper and lower limit. However we measure, the only possible results will fall into some finite set in any problem. Suppose this is 0″, 1/16″, 2/16″, …, 192/16″ (one foot; the exact units or set constituents do not matter, only that they exist does).

Well, 0″ = 0 cm, and 1/16″ = 0.15875 cm, and so on. Thus if the information was that any of the set were possible (in our next measurement of y), the probability of (say) 111/16″ is exactly the same as the probability of 17.6213 cm (we’ll always have to limit the number of digits in any number; thus 1/3 might in practice equal 0.333333 where the 3’s eventually end). And so on.
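A minimal sketch of the point, assuming each of the 193 possible measurements is given equal probability (my reading of "any of the set were possible"; other premises would give other numbers). The names `pmf_in` and `pmf_cm` are hypothetical. Converting units only relabels the elements of the set; the probability attached to each element, and to any collection of them, stays put.

```python
from fractions import Fraction

CM_PER_INCH = Fraction(254, 100)  # 1 inch = 2.54 cm exactly

# The finite measurement set: 0", 1/16", ..., 192/16" (one foot).
inches = [Fraction(k, 16) for k in range(193)]

# Assumption: every element of the set is equally probable.
pmf_in = {x: Fraction(1, len(inches)) for x in inches}

# Changing units relabels each element and leaves its probability untouched.
pmf_cm = {x * CM_PER_INCH: p for x, p in pmf_in.items()}

# Probability of "at most six inches" versus "at most 15.24 cm".
p_le_in = sum(p for x, p in pmf_in.items() if x <= 6)
p_le_cm = sum(p for x, p in pmf_cm.items() if x <= 6 * CM_PER_INCH)

if __name__ == "__main__":
    x = Fraction(111, 16)
    print(float(x * CM_PER_INCH), pmf_in[x], pmf_cm[x * CM_PER_INCH])
    print(p_le_in == p_le_cm)
```

Exact fractions are used so that no rounding sneaks in; in practice the centimeter labels would be cut off at some number of digits, as noted above.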

It turns out that if you take full account of the information, the units of measurement won’t matter! Notice also that the “prior” in this case was deduced from the available evidence; there was nothing ad hoc or “non-informative” about it at all (of course, other premises are possible leading to other deductions).

But then, with this information, we’re not really dealing with normal distributions. No parameters either: there is no θ in this setup. Ah. Is that so bad? We’ve given up the mathematical convenience continuity brings, but our reward is accuracy, and we never wander away from probability. We can still quantify the uncertainty in future (not yet seen) values of y given the old observations and knowledge of the measurement process, albeit at the price of more complicated formulas (which seem more complicated than they really are, at least because fewer people have worked on problems like these).

And we don’t really have to give up on continuity as an approximation. Here’s how it should work. First solve the problem at hand—quantifying the uncertainty in new (unseen) values of y given old ones and all the other premises available. I mean, calculate that exact answer. It will have some mathematical form, part of which will be dependent on the size or nature of the measurement process. Then let the number of elements in our measurement set grow “large”, i.e. take that formula to the limit (as recommended by, inter alia, Jaynes). Useful approximations will result. It will even be true that in some cases, the old stand-by, continuous-from-the-start answers will be rediscovered.
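The limiting idea can be illustrated with a toy, though I stress it is only a toy: the exact finite-set formula is not derived here, so the sketch below assumes a hypothetical discrete distribution on a grid, with weights shaped like a normal curve centered at a known value, purely to show what "take it to the limit" looks like. The function names, the grid limits of -10 and 10, and the center are all assumptions. As the grid step shrinks, the discrete probability of an interval approaches the continuous normal answer.

```python
import math


def discrete_interval_prob(a, b, center, step, lo=-10.0, hi=10.0):
    """P(a <= y <= b) when y can only take values on a finite grid.

    Assumption: grid point y gets weight proportional to exp(-(y - center)**2 / 2).
    """
    n = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n + 1)]
    weights = [math.exp(-0.5 * (y - center) ** 2) for y in grid]
    inside = sum(w for y, w in zip(grid, weights) if a <= y <= b)
    return inside / sum(weights)


def normal_interval_prob(a, b, center):
    """The continuous stand-by: normal with spread 1, via the error function."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - center) / math.sqrt(2.0)))
    return cdf(b) - cdf(a)


if __name__ == "__main__":
    exact = normal_interval_prob(-1.0, 1.0, 0.0)
    for step in (1 / 16, 1 / 128, 1 / 1024):
        approx = discrete_interval_prob(-1.0, 1.0, 0.0, step)
        print(step, approx, abs(approx - exact))
```

The steps are powers of two so the grid points are represented exactly; a sixteenth of an inch is where we started, and the finer grids stand in for ever more precise yardsticks.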

Best of all, we’ll have no distracting talk of “priors” and (parameter) “posteriors”. And we wouldn’t have to pretend continuous distributions (like the normal) are probabilities.

12 Comments

  1. I cannot help but repeat one of my several-year-old unanswered questions to you: how did you derive the claim that the probability of a Romney win was 0.8?

    Apply what you say here! The following steps might help.

    (1) Objective – Evidently, you wished to estimate the probability of Romney’s winning.

    (2) Data collection – What information/data were collected?

    (3) Bayesian framework – How did you form your prior and likelihood? How was (1) answered? Explain what the parameters are if you decide to use them. (Yep, you’d have to use them!) I’d accept subjective Bayesian analysis. If you are an expert in the area of polling and politics, why not use your knowledge to form your prior? Right?

  2. JH,

    I’m not telling you my Romney secrets! Except to say properly formed models aren’t always good ones.

    Down with frequentism! Bwahahahahahaha!

  3. Isn’t the problem with political predictions that one can very easily lack information that is crucial to the outcome? For example, Republicans deciding not to vote? If no one asks the question “are you going to vote?” then this can be missed. Especially if the question is “Who would you vote for?” How one phrases the questions, obtains the data, etc. is crucial. (Which is not to say this is Briggs’ method, of course.)

    Actually, that’s a problem anywhere with prediction. It’s limited to the input parameters. If anything is missed, the model fails. I’m sure it was a good model, though!

  4. My Dear Mr. Briggs,

    Secrets? That’s what my younger brother said when he had no answers. I still love him and would do anything for him though.

    So, still no answer.

    Down with frequentism! Bwahahahahahaha!

    I just emailed Santa; you did not make the Nice list this year. I am not laughing though. Perhaps you could make another wish. Or better yet, somehow convince people by doing, e.g., answering my question.

  5. JH: How did you get Santa’s email address, and why would Santa need to be notified anyway? He’s supposed to know what you’re up to no matter where you are and how you try to hide. So the probability that Santa has already placed Briggs on the naughty list if Santa feels what Briggs did was bad is 100%.

  6. Santa has a website where you can email him. Apparently he lives in Canada. Though I expect Norway, Russia, Denmark, and the US (via Alaskan claims) all want a piece of Santa’s Oil Reserves Workshop.

    http://emailsanta.com/

  7. I found several websites that are supposed to be for emailing Santa and one that lets you check on who’s on the naughty or nice list. I’d ask how Santa could have all those email addresses, but I gave up on that back when no one could explain all the Santas at all the malls.

    He probably lives in Canada because the Arctic is melting. There was a cartoon to that effect a few years back to scare kids into turning off lights and sending money to Greenpeace because the North Pole was melting. There is no end to the dishonesty of environmentalists.

  8. Sheri,

    This might not explain it but does illustrate the phenomenon.
    He’s everywhere!

    That’s the prior anyway. Not sure if it’s uninformative.

  9. DAV: I had forgotten how funny Ray Stevens could be! Actually, as we got older, we would make up stuff like Santa had a transporter and holographic projector, he could use a tesseract (like A Wrinkle in Time) or maybe everything looked smaller on the outside like the tardis. Sci fi can provide some interesting theories!

    Yes, I think that is the prior in this.

  10. Sheri, I did not notify Santa of anything. I asked him a question. Santa’s email address is a secret.

  11. JH: NO!!!! All those addresses and websites are FAKE!!!! NOOOOOOOOO!

    Wait, are we talking about Santa or Briggs here? I thought you asked Briggs a question. Now I am so confused….Okay, I’m confused most of the time, but more so now. 🙂

  12. As I understand it, one of the appeals of a Bayesian approach is being able to update your probabilities with new information.
    So what’s the difference between using some information to construct a non-uniform prior and using that information to update a uniform prior?

    Also, a practical example would be nice.

© 2016 William M. Briggs

Theme by Anders NorenUp ↑