It’s a lazy Saturday, so some musings today on entropy and information and probability. It’s about time we started tying these things together.
Things like the following are heard: “Given a list of premises, I judge the probability of rain tomorrow at 0%”. Swap in for “rain tomorrow” your favorite prediction: “the team wins”, “the customer buys”, “the man defaults”, and so on.
Now zero-percent, a probability of 0, is mighty strong. The strongest. It means the proposition is impossible: not unlikely, impossible. Impossible things cannot happen. Even God cannot do the impossible.
Yet impossible events occur all the time. It rains when it’s not supposed to, the team loses, the customer leaves, the man honors. And so on. Something has gone wrong.
The failure, as always, comes from not keeping assiduous track of our language. Equivocation creeps in unawares. Impossible doesn’t mean impossible but conditionally impossible. Why?
All (as in all) probability is conditional. That means some list of premises existed which allowed the judgement of a probability of 0. That is, the forecaster had some model—a model is a list of premises—which spit out “probability 0” for his proposition.
Since it did later rain, we have falsified this model, i.e. this list of premises. But we haven’t shown which of the main premises are false: we have only proved that at least one premise of the model isn’t so. The “model”, i.e. the collection of premises as a whole, is false, it is wrong, but there may be pieces of it which are true. So much we already know.
Introduce the idea of “probabilistic surprise.” A surprising event is a rare one. If our model—you simply must get into the habit of thinking of any model as a list of premises; however, “list of premises” is not as compact as “model”—says the probability of some proposition is low, then, conditional on that model, if the proposition is observed to be true, we are surprised. Rarer events are more surprising.
Think how you’d feel winning the lottery. The model easily lets us deduce the probability for the proposition “I win.” This is small; your shock at winning is correspondingly large.
Propositions which are certain conditional on the model are not surprising; indeed, since they must happen, since they have a probability of one, they are inevitable. Who could be surprised by what must occur? (Insert your own joke here.)
Lotteries and other dichotomous situations are great examples since we can easily track possibilities. Tracking possibilities isn’t always easy, nor even always possible with every model (list of premises!). Tracking means deducing every (as in every) proposition which has a probability relative to the model. In formal situations, we’re fine; informally, not so fine. So let’s stick with dichotomy, which is free beer.
It turns out that with this definition of “surprise”, or rather accepting the notion that we can quantify surprise, a move which is open to debate, we can deduce a formula for the amount of surprise we can expect, given that we have tracked the model, identified all the propositions, and deduced their conditional probabilities $p_i$. This formula is:
$$H = -\sum_i p_i \log p_i.$$
Yes, entropy. Now, isn’t that curious?
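For the simple models discussed here and in the comments below, the expected surprise is easy to compute. A minimal sketch in Python (natural logs, and the 0·log 0 = 0 convention; the helper name expected_surprise is mine, purely for illustration):

```python
import math

def expected_surprise(probs):
    """Entropy in nats: -sum of p_i * log(p_i), taking 0*log(0) = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Fair coin: two propositions ("heads", "tails"), each with probability 1/2.
print(expected_surprise([0.5, 0.5]))        # ~0.693

# Lottery model: a million tickets, exactly one of which wins.
n_tickets = 10**6
print(expected_surprise([1 / n_tickets] * n_tickets))   # ~13.8
```

These are the 0.7 and 13.8 that come up in the comments below.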
Update See below for comments about calculating entropies. I originally had calculations here (something I don’t normally do) and because of my sloppiness (and laziness) I distracted us from the main points.
What about impossible propositions? Once again, our habitual slackness with language leads to mistakes. If a conditionally impossible event happens, according to our derivation, our surprise should be infinite.
Infinite surprise!
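A quick numerical aside (a sketch only, not part of the argument): the surprise at a single proposition works out to -log(p), the “surprisal” mentioned in the comments below, and it grows without bound as the conditional probability shrinks.

```python
import math

# Surprise at a single proposition of conditional probability p is -log(p),
# which grows without bound as p shrinks toward zero.
for p in [0.5, 1e-3, 1e-6, 1e-12, 1e-100]:
    print(f"p = {p:g}   surprise = {-math.log(p):.1f} nats")
# At p = 0 exactly, -log(p) is undefined: the "infinite surprise" above.
```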
In one way, this is right. If something truly impossible happened—these events are defined just as we define necessary truths, i.e. from indubitable first principles and irresistible deduction—then our surprise would surely be infinite. And deadly. Who could handle the shock? And since entropy is often given a physical interpretation, impossible events imply the Trump of Doom.
But in another way it is obviously wrong. Impossible events which really occur always mean a broken model. They always imply that a false premise has snuck in and been believed.
The lesson (the only one we can do today) is that models which assign zero probabilities to contingent events are inherently flawed. (Contingent events are those which are not logically necessary or—you guessed it—logically impossible. It all ties together!)
Well, that’s it today. We haven’t done information nor scores nor a world of other things.
Briggs, “Infinite surprise! Infinite entropy!”
Zero, I believe.
Scotian,
Undefined, actually. Leads to Zero Times Infinity.
Sounds like the title of a new superhero comic.
Can’t we do zero times infinity?
There’s a zero percent chance of rain here today, according to The Weather Channel. In fact, there’s zero percent chance for the next seven days. (Of course, few actually believe the predictions.)
Briggs, I believe that L’Hopital’s rule applies here. I also think that you want Shannon’s surprisal which is defined as -Log(p) and not the entropy.
Wasn’t that what LTCM said? A nine sigma event? Not that anything was wrong with the model!
Sorry to say Briggs but it seems to me that this post has become a bit of a mess (happens to all of us from time to time).
First of all, entropy sums over all possibilities, so for the lottery (assuming all tickets were sold), entropy is approx. 13.8. In fact, you did perform the summation for the coin flip, in my view correctly, with the answer of 0.7 approx.
With the zero probability: if it means the probabilities for all events are 0, they do not sum to one, so the model is incomplete; certain events have not been specified. If, let’s say, we have two mutually exclusive events with probabilities 0 and 1, then the entropy is -1*log(1) - 0*log(0) = 0, because log(1) = 0 and, taking the limit of -p*log(p) as p tends to 0 for the 0*log(0) term, the sum is 0. There you go. This entropy is lower than for the coin flip (because: more ordered! all probability focused on one of the events).
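A quick numerical check of that limit (a Python sketch, nothing more):

```python
import math

# -p * log(p) tends to 0 as p tends to 0, which justifies taking 0*log(0) = 0.
for p in [1e-1, 1e-3, 1e-6, 1e-9]:
    print(f"p = {p:g}   -p*log(p) = {-p * math.log(p):.2e}")

# So the {0, 1} model has entropy 0, versus log(2) ~ 0.69 for the fair coin.
```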
Funny, my comment just disappeared. Will it reappear again later?
Sorry, so sorry for my impatience….
Cees,
Aha. More conditionality differences. You’re right about a balls-in-urn type model, with a million balls, just one labeled (say) “Win.” That’s the more usual answer, and I was stupid about not being clear.
I was instead thinking of premises which only allowed the deduction of two states, ‘Win’ and ‘Not win’. The entropy I have is correct for that. I was trying to steer clear of uniformity or symmetry for reasons that will be clear at a future date. Of course, a sort of symmetry is implied (or rather deduced!) here from urn-like thinking, which is what you say, so my explanation is, as you also say, a mess.
But zero probability is by no means impossible (get it? get it?). Given the standard “No men are over 2′ tall and Socrates is a man,” the probability of the proposition “Socrates is over 2′ tall” is 0. If we happen to observe a Socrates over 2′ tall, the jig is up! There is no limit; we are finite here. So there is (but see below) infinite surprise—if we believe the model.
You seem to be saying that impossible events are not allowed to be deduced via the model, which is not correct. We can deduce zero-probability events from many models. And that doesn’t stop us from seeing “impossible events,” i.e. things not predicted by the model.
Scotian,
Surprise is not the entropy (nor did I claim it was). I merely said that we can derive the “expected” surprise, which is entropy. Surprise is itself deduced from those simple premises I gave plus the premise that the surprise function is continuous. That’s the debatable part.
Homework: What arises if you drop the continuity premise and swap in a discrete finite one for the surprise function?
I also get 13.8 for the lottery example: -1e6 * (1e-6) log (1e-6) ≈ 13.8.
Forgetting to sum over all events leaves just the one term: -(1e-6) log (1e-6) ≈ 1.4e-5, i.e. practically 0.
I prefer to use base 2 logs so the answer is in bits. A coin flip comes out as H=1. Just my preference. To me it makes interpretation easier. Greater than 1 is worse than a coin flip; less than 1 is better.
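For comparison, a base-2 version of the calculation (a sketch; the entropy_bits helper is just an illustrative name, and the base only rescales the numbers):

```python
import math

def entropy_bits(probs):
    """Entropy in bits: -sum of p_i * log2(p_i), taking 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = entropy_bits([0.5, 0.5])
print(h)                  # 1.0 bit for the fair coin
print(h * math.log(2))    # ~0.693 nats: the same quantity, just a change of units
```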
Of course, there’s always inflation so maybe comparing to 0.7 is more realistic.
0 log 0 is defined = 0 for entropy calculations.
If one is making a frequency table, a zero entry often implies incomplete sampling, possibly because the event is rare.
If you think of the table as parameters for a beta distribution (or a Dirichlet distribution if there are more than 2 parameters), you are supplying 0 as one of the parameters. Is Beta(1,0) defined?
All,
Mea culpa. I was being lazy and put the entropy numbers in originally, which I don’t normally do (numbers are a distraction). I’m not in the least interested in the actual values of the entropy (which, being sloppy, I calculated with natural and not base-2 logs; see DAV’s comments).
The point I wanted to emphasize is the size of the surprise, which still holds (see Scotian’s second comment for what the deduced surprise function looks like). Expected entropy isn’t Surprise, and I didn’t make that entirely clear.
Lazy, lazy, lazy. Me, I mean.
You are right Briggs, I just realised it: from the viewpoint of the individual lottery ticket buyer, the entropy is 1.5e-5, so very close to 0.
I would say for this model, and the model of the coin flip as well, that surprise is 1/entropy.
I see your example of Socrates as an extreme form of the lottery case, very very low entropy: that is, (potential for) surprise is 1/entropy is very very big. So zero entropy corresponds to infinite surprise. And this is true whether you like or dislike limits 😉
Frankly, I think of entropy in terms of number of bits of information gained after observing an event (Shannon expressed in amount of information transmitted). The entropy of a single lottery ticket is closer to infinity than zero. If the event is certain, hearing that it occurred is zero information gain — you already knew it would.
With mutually exclusive events (e.g. H, T with a coin), P(T|H=1)=0. It would be a very large surprise indeed if the coin landed on edge or came up both heads and tails simultaneously. However, in the larger scheme of things both H and T are possible with a standard coin, so a frequency table for it is two-dimensional, consisting of two rows and two columns, and P(T|coin)=0.5.
Briggs, “Expected entropy isn’t Surprise”. No but the expected surprisal is the entropy. Maximize your uncertainty!
The information surprisal for a particular outcome is defined to be -log[Prob(outcome)]. For example, if the probability of winning a lottery is very small, then the surprisal for winning is -log(a very small probability). That is, if the winning probability is very small, one would be surprised a great deal if he wins.
The log can be taken to be base 2 (binary digits), base 10 (decimal digits), or natural log, depending on the application.
There are various entropy measures. Expected surprisal (Shannon entropy) is one of them, as Scotian has pointed out. They all have some basic properties – they can be used to measure the uncertainty in a random variable (defined as in probability theory) and are maximized when the distribution is uniform. If Prob(win)=1 and Prob(lose)=0, then Entropy = E(surprisal) = 1*(-log 1) + 0*(-log 0) = 0, indicating one is certain that he wins every time.
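Those two calculations, done numerically (a Python sketch; the surprisal and entropy helpers are illustrative names only):

```python
import math

def surprisal(p):
    """-log(p): the surprise at observing an outcome of probability p (nats)."""
    return -math.log(p)

def entropy(probs):
    """Expected surprisal, with the 0*log(0) = 0 convention."""
    return sum(p * surprisal(p) for p in probs if p > 0)

print(surprisal(1e-6))        # ~13.8: great surprise at winning the lottery
print(entropy([1.0, 0.0]))    # 0.0: no uncertainty when one outcome is certain
print(entropy([0.5, 0.5]))    # ~0.693: maximal uncertainty for two outcomes
```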
Laziness is a force of entropy!