William M. Briggs

Statistician to the Stars!


How Good Is That Model? Scoring Rules For Forecasts: Part II: Update


Read Part I

Part II

What we’re after is a score that calculates how close a prediction is to its eventual observation when the prediction is a probability. Now there are no such things as unconditional probabilities (see the Classic Posts page at the upper right for dozens of articles on this subject). Every probability is thus conditional on some evidence, i.e. premises. This evidence is called the model. Assuming no mistakes in computation or measurement (which, unfortunately, are not as rare as you’d like), the probability put out by the model is correct.

The probability is correct. But what about the model? If the model itself were correct, then every probability it announced would be 0 or 1, depending on whether the propositions put to it were deduced as true or false. Which is to say, the model would be causal and always accurate. We thus assume the models will not be correct, but that their probabilities are.

Enter scoring rules. These are some function of the prediction and outcome, usually a single number where lower is better (even when the forecast is a vector/matrix). Now there is a bit of a complexity involving proper scoring rules (a lot of fun math) which doesn’t make much real-life difference. A proper score is defined as one which (for our purposes) is “honest”. Suppose our scoring rule awarded all forecasts with probabilities greater than 0.9 a score of 0 regardless of the outcome, but used (probability – I(event occurred))^2 for probabilities 0.9 or less (the I() is an indicator function). Obviously, any forecaster seeking a reward would forecast all probabilities greater than 0.9 regardless of what his model said. (The analogy to politics is obvious here.) What we’re after are scores which are “fair” and which accurately rate forecast performance. These are called proper.
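To see how the broken rule above rewards dishonesty, here is a small simulation (the function name and the numbers are mine, for illustration only). A forecaster who games the rule by always announcing a probability above 0.9 earns a better score than the honest forecaster, whatever the events do:

```python
import random

random.seed(1)

def improper_score(p, outcome):
    # The broken rule from the text: a free 0 for any probability
    # above 0.9, squared error otherwise (lower is better).
    return 0.0 if p > 0.9 else (p - outcome) ** 2

# Suppose events actually occur 30% of the time.
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]

# The honest forecaster reports the true probability, 0.3.
honest = sum(improper_score(0.3, y) for y in outcomes) / len(outcomes)

# The gamer always reports 0.95, whatever his model says.
gamed = sum(improper_score(0.95, y) for y in outcomes) / len(outcomes)

print(honest)  # about 0.21: pays the squared penalty
print(gamed)   # exactly 0.0: never penalized, "wins"
```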

A common score for dichotomous outcomes is as above: (probability – I(event occurs))^2, or its mean in a collection of forecast-observation pairs. This is the Brier score, which is proper. The score is symmetric, meaning the penalty paid for having large probabilities and no outcome equals the penalty paid for having small probabilities and an outcome. Yet for many decisions, there is an asymmetry. You’d most likely feel less bad about a false positive on a medical test for pancreatic cancer than about a false negative—though it must be remembered that false positives are not free. Even small costs for false positives add up when screening millions. But that is a subject for another time.
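The Brier score itself is a one-liner. A minimal sketch (illustrative numbers only) showing the symmetry just described:

```python
def brier(p, outcome):
    # Brier score for one dichotomous forecast: (p - I(event))^2.
    # outcome is 1 if the event occurred, 0 if not; lower is better.
    return (p - outcome) ** 2

# Symmetry: a confident miss costs the same in either direction.
print(round(brier(0.9, 0), 2))  # 0.81: high probability, no event
print(round(brier(0.1, 1), 2))  # 0.81: low probability, event occurred

# Mean Brier score over a collection of forecast-observation pairs.
forecasts = [0.8, 0.2, 0.6, 0.9]
observed = [1, 0, 0, 1]
mean_brier = sum(brier(p, y) for p, y in zip(forecasts, observed)) / len(forecasts)
print(mean_brier)
```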

The Brier score is thus a kind of compromise when the actual decisions made based on forecasts for dichotomous events aren’t known. If the costs are known, and they are not symmetric, the Brier score holds no meaning. Neither does any score which doesn’t match the decisions made with the forecast! (I keep harping on this hoping it won’t be forgotten.)

There are only three main reasons to score a model. (1) To reward a winner: more than one model is in competition and something has to judge between them. (2) As a way to understand particularities of model performance: suppose our model did well—it had high probabilities—when events happened, but it did poorly—it still had high-ish probabilities—when events didn’t happen; this is good to know. (3) As a way to understand future performance.

The first reason brings in the idea of skill scores. You have a collection of temperature model forecasts and its proper score; say, it’s a 7. What does that 7 mean? What indeed? If your score was relevant to the decisions you made with the forecast, then you wouldn’t ask. It’s only because we’re releasing a general-purpose model into the wild which may be used for any number of different reasons that we must settle on a compromise score. That score might be the continuous ranked probability score (CRPS; see Gneiting and Raftery, Strictly Proper Scoring Rules, Prediction, and Estimation, 2007, JASA, 102, pp 359–378), which has many nice properties. This takes a probability forecast for a thing (which itself gives the probability of every possible thing that can happen) and the outcome and measures the “distance” between them (not all scores are distances in the mathematical sense). It gives a number. But how to rate it?
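For the curious, the CRPS can be estimated from an ensemble of draws from the forecast distribution via the identity CRPS(F, y) = E|X − y| − ½E|X − X′|, where X and X′ are independent draws from the forecast. A sketch (the function name and the example numbers are mine, not from any particular package):

```python
import numpy as np

rng = np.random.default_rng(42)

def crps_ensemble(samples, y):
    # Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    # with X, X' independent draws from the forecast distribution.
    # Lower is better; units are those of the forecast variable.
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# A temperature forecast of Normal(20, 2); the observation is 23.
ensemble = rng.normal(20.0, 2.0, size=2000)
print(crps_ensemble(ensemble, 23.0))

# A sharper, better-centered forecast earns a lower (better) score.
better = rng.normal(22.5, 1.0, size=2000)
print(crps_ensemble(better, 23.0))
```

For a point forecast (an ensemble of identical values) the CRPS collapses to the absolute error, which is one of its nice properties.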

There has to be a comparator, which is usually a model which is known to be inferior in some aspect. Take regression (see the Classic Posts for what regression really is; don’t assume you know, because you probably don’t). A complex model might have half a dozen or more “x” variables, the things we’re using to predict the “y” variable or outcome with. A naive model is one with no x variables. This is the model which always says the “y” outcome will be this-and-such fixed probability distribution. Proper scores are calculated for the complex and naive model and their difference (or some function of their difference) is taken.

This is the skill score. The complex model has skill with respect to the simpler model if it has a better proper score. It’s usually worked so that skill scores greater than 0 indicate superiority. If a complex model does not have skill with respect to the simpler model, the complex model should not be used. (Unless, of course, your real-life decision score is completely different than the proper score underlying skill.)
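One conventional way to work the difference so that scores greater than 0 indicate superiority is to normalize by the naive model's score. A sketch (this normalization is a common convention, not the only possible function of the difference):

```python
def skill_score(model_score, naive_score):
    # Positive means the complex model beats the naive reference;
    # 1 is perfect, 0 is no improvement, negative is worse than naive.
    # Assumes a proper score where lower is better.
    return (naive_score - model_score) / naive_score

# Suppose the complex regression scores 7 and the no-x naive model 10.
print(skill_score(7.0, 10.0))   # 0.3: a 30% improvement over naive

# A complex model that scores worse has negative skill: don't use it.
print(skill_score(12.0, 10.0))  # -0.2
```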

We’ll wrap it up in Part III.

Update Here is an example of an improper score in real life: tanking in sports.


Pascal’s Pensées, A Tour: V

Since our walk through Summa Contra Gentiles is going so well, why not let’s do the same with Pascal’s sketchbook on what we can now call Thinking Thursdays. We’ll use the Dutton Edition, freely available at Project Gutenberg. (I’m removing that edition’s footnotes.)

Previous post.

Now that the hacking is all sorted out, we’re back to our regularly scheduled program. Thinking Thursdays!

14 When a natural discourse paints a passion or an effect, one feels within oneself the truth of what one reads, which was there before, although one did not know it. Hence one is inclined to love him who makes us feel it, for he has not shown us his own riches, but ours. And thus this benefit renders him pleasing to us, besides that such community of intellect as we have with him necessarily inclines the heart to love.

Notes This is so, but then so is this: “For the time will come when they will not endure sound doctrine; but after their own lusts shall they heap to themselves teachers, having itching ears; And they shall turn away their ears from the truth, and shall be turned unto fables” (King James). Also see this translation: “For the time will come when people will not tolerate sound doctrine but, following their own desires and insatiable curiosity, will accumulate teachers and will stop listening to the truth and will be diverted to myths.”

15 Eloquence, which persuades by sweetness, not by authority; as a tyrant, not as a king.

16 Eloquence is an art of saying things in such a way—(1) that those to whom we speak may listen to them without pain and with pleasure; (2) that they feel themselves interested, so that self-love leads them more willingly to reflection upon it.

It consists, then, in a correspondence which we seek to establish between the head and the heart of those to whom we speak on the one hand, and, on the other, between the thoughts and the expressions which we employ. This assumes that we have studied well the heart of man so as to know all its powers, and then to find the just proportions of the discourse which we wish to adapt to them. We must put ourselves in the place of those who are to hear us, and make trial on our own heart of the turn which we give to our discourse in order to see whether one is made for the other, and whether we can assure ourselves that the hearer will be, as it were, forced to surrender. We ought to restrict ourselves, so far as possible, to the simple and natural, and not to magnify that which is little, or belittle that which is great. It is not enough that a thing be beautiful; it must be suitable to the subject, and there must be in it nothing of excess or defect.

Notes Point (2) does not say something good about listeners. While it’s true the speaker has a duty to ease pain, the listener is not excused labor. If he is, we’re back to tickling ears. Anyway, it’s clear that when Pascal said, “We ought to restrict ourselves, so far as possible, to the simple and natural, and not to magnify that which is little, or belittle that which is great” he proved that he would not have been a hit on the Internet. I’m also reminded of the late philosopher David Stove’s lament “You or I might perhaps be excused if we sometimes toyed with solipsism, especially when we reflect on the utter failure of our writings to produce the smallest effect in the alleged external world.” From “Epistemology and the Ishmael Effect.”

A reminder that we’re skipping some points, like 17, which states rivers are moving roads.

18 When we do not know the truth of a thing, it is of advantage that there should exist a common error which determines the mind of man, as, for example, the moon, to which is attributed the change of seasons, the progress of diseases, etc. For the chief malady of man is restless curiosity about things which he cannot understand; and it is not so bad for him to be in error as to be curious to no purpose.

The manner in which Epictetus, Montaigne, and Salomon de Tultie wrote, is the most usual, the most suggestive, the most remembered, and the oftenest quoted; because it is entirely composed of thoughts born from the common talk of life. As when we speak of the common error which exists among men that the moon is the cause of everything, we never fail to say that Salomon de Tultie says that when we do not know the truth of a thing, it is of advantage that there should exist a common error, etc.; which is the thought above.

Notes Montaigne would have made a great blogger. Epictetus, who did not publish, would have been hired by either a faithless university or some White House administration and then, at some point, abandoned when he went one quip too far. Incidentally, Salomon de Tultie is Pascal’s nom de plume, and Salomon is the French of Solomon. But what about that curious “it is of advantage that there should exist a common error”? The analogy I see is that every ship has only one captain. It is better sailing when all are in one accord (whether openly or not) than to have many hands pointing in different directions. This doesn’t preserve all ships from foundering, but it does most. And morale is better.


What Might Pope Francis’s Upcoming Encyclical Look Like?

For the love of money is the root of all evil: which while some coveted after, they have erred from the faith, and pierced themselves through with many sorrows.

So said St Paul in his first letter to Timothy, and human history is loaded with evidence confirming this view. Latterly, I say, money has been replaced in part by Theory. Pope Francis thinks Inequality. Which, he said, is the “fruit of the law of competitiveness that means strongest survive over the weak” which is the “logic of exploitation” and “waste”.

Or so he said in Italian to a group in Milan, his words translated by Vatican Insider. There is thus the very real danger here and elsewhere of missing nuances and even of incorrect wordings. So let’s tread carefully.

It is necessary, if we really want to solve problems and not get lost in sophistry, to get to the root of all evil which is inequity. To do this there are some priority decisions to be made: renouncing the absolute autonomy of markets and financial speculation and acting first on the structural causes of inequity.

Obviously, or at least I hope obviously, you cannot push the “strongest survive over the weak” metaphor too far. Neither “inequality.” If there were absolute equality, where the weak and strong are as one, there would be no Pope and no right or wrong ideas. Neither could there be politicians in charge to renounce absolute autonomy of markets or of anything else.

Incidentally, we mustn’t form a USA-centric view of the Pope’s words. Here, for instance, the markets are very much tied to government, the executives of one are the executives of the other. Market leaders assist (if I may be allowed the euphemism) the government in fashioning laws and regulations to their mutual benefit.

The Pope is interested in the kind of inequality that causes some of the world to go hungry. “[T]he number one concern must be for the actual person, how many people lack food on a daily basis and have stopped thinking about life, about family and social relationships, just fighting to survive?” And here comes the kicker:

“Despite the proliferation of different organizations and the international community on nutrition, the ‘paradox’ of John Paul II still stands: ‘There is food for everyone, but not everyone can eat,’ while ‘at the same time the excessive consumption and waste of food and the use of it for other means is there before our eyes.’”

Despite? Is that the right word? But he’s right about waste. The amount of food we toss out would have scandalized our ancestors. My maternal grandfather was fond of saying, and of enforcing, “Take what you want, but eat what you take.”

In a different venue (also translated), Pope Francis said that humans should think of themselves as lords but not masters of creation. This strikes me as accurate. In charge but restrained by natural law. The danger to those who slaver or fume over the Pope’s environmental words lies in thinking our environmental policy must consist in jumping from wanton disregard to unthinking worship. We dearly love a false dichotomy.

A Christian who does not protect Creation, who does not let it grow, is a Christian who does not care about the work of God, that work that was born from the love of God for us. And this is the first response to the first creation: protect creation, make it grow.

And from the Milan speech (with choppy translation grammar):

The earth is entrusted to us so it may be a mother to us, capable of sustaining each one of us. Once, I heard a beautiful thing: the earth is not a legacy that we have received from our parents rather it is on loan to us from our children, so that we safeguard it, nurture it and carry it forward for them. The earth is generous will never leave those who custody it lacking. The earth, which is the mother for all, demands our respect and non-violence or worse the arrogance the masters. We have to pass it on to our children improved, guarded, because it was a loan that they have given to us.

You have to read your own (right or left) political desires into this to have any policy of consequence flow from it. No definite directives can be implied from the Pope’s words. One cannot, for instance, argue that thus a carbon tax must follow. Neither can you say (which nobody does say) you can do whatever you want.

But many think or hope they can “leverage” the Pope to further their politics. Even now “eco-ambassadors” are flowing in great numbers to Rome to have a photo-op (secular blessing) because they are sure the Pope’s upcoming encyclical can be used by them as a bludgeon. They want in on what they are sure will be a good thing. We’ll see.


How Good Is That Model? Scoring Rules For Forecasts: Part I


Part I of III

All probability (which is to say, statistical) models have a predictive sense; indeed, they are only really useful in that sense. We don’t need models to tell us what happened. Our eyes can do that. Formal hypothesis testing, i.e. chasing after statistical “significance”, leads to great nonsense and is the cause of many interpretational errors. We need models to quantify the uncertainty of what has not yet been measured or made known to us. Throughout this series I take models in that sense (as all should).

Which is this. A model—a set of premises—is used to make predictions about some observable Y, a proposition. For example, a climate model might predict what the (operationally defined) global mean surface temperature will be at some time, and Y is the proposition “The temperature was observed at the time to be y”. What I have to say applies to all probability models of observable events. But I’ll use temperature as a running example because of its familiarity.

If a model said “The temperature at the time will be x” but it was really y, then the model has been falsified. The model is not true. Something is wrong with the model. The model said x would occur but y did. The model is falsified because it implied x would happen with certainty. Now the model may have always hit at every time up to this point, and it may continue hitting forever after, but it missed this time and all it takes is one mistake for a model to be falsified.

Incidentally, any falsified model must be tossed out. By which I mean that it must be replaced with something new. If any of the premises in a model are changed, even the smallest, least consequential one, strictly the old model becomes a new one.

But nobody throws out models for small mistakes. If our model predicted accurately at every time point but one, we’d be thrilled. And we’d be happy if “most of the time” our forecasts weren’t “too far off.” What gives? Since we don’t reject models which fail a single time or are not “too far off”, there must be hidden or tacit premises to the model. What can these look like?

Fuzz. A blurring that takes crystalline predictions and adds uncertainty to them, so that when we hear “The temperature will be x” we do not take the words at their literal meaning, and instead replace them with “The temperature will be about x”, where “about” is happily left vague. And this is not a problem because not all probability is (or should be!) quantifiable. This fuzz, quantified or not, saves the model from being falsified. Indeed, no probability model can ever be falsified unless that model becomes (at some point) dogmatic and says “X cannot happen” and we subsequently observe X.

Whether the fuzzy premises—yes, I know about fuzzy logic, the rediscovery and relabeling of classic probability, keeping all the old mistakes and adding in a few new ones—are put there by the model issuer or you is mostly irrelevant (unless you’re seeking whom to blame for model failure). The premises are there and keep the models from suffering fatal epistemological blows.

Since the models aren’t falsified, how do we judge how good they are? The best and most basic principle is how useful the models were to those who relied upon them. This means a good model to one person can be a poor one to another. A farmer may only care whether temperature predictions were accurate at distinguishing days below freezing, whereas the logistics manager of a factory cares about exact values for use in ordering heating oil. An environmentalist may only care that the forecast is one of doom while being utterly indifferent (or even hostile) to the actual outcome, so that he can sell his wares. The answer to “What makes a good model” is thus “it depends.”

Of course, since many decisions fall into broad categories we can say a little more. But in so saying, we must always remember that goodness depends on actual use.

Consider the beautiful game of petanque, wherein manly steel balls are thrown towards a target. Distance to the target is the measure of success. The throw may be thought of as a model forecast of 0 (always 0) and the observation the distance to the target. Forecast goodness is taken as that distance. Linear distance, or its average over the course of many forecasts, is thus a common measure of goodness. But only for those whose decisions are a linear function of the forecast. This is not the farmer seeking frost protection. Mean error (difference between forecast and observation) probably isn’t generally useful. One forecast error of -100 and another of +100 average to 0, which is highly misleading—but only to those who didn’t use the forecasts!
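The example from the paragraph above is worth running. A toy calculation (in Python, with the numbers from the text) showing why the mean error can mislead while the mean absolute, i.e. linear, error cannot:

```python
# Two forecast errors (forecast minus observation), as in the text.
errors = [-100.0, 100.0]

mean_error = sum(errors) / len(errors)
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(mean_error)      # 0.0: looks like a perfect forecaster
print(mean_abs_error)  # 100.0: the size of the misses actually felt
```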

You can easily imagine other functions of error as goodness measures. But since our mathematical imagination is fecund, and since there are an infinite number of functions, there will be no end to these analyses, a situation which at least provides us with an endless source of bickering. So it might be helpful to have other criteria to narrow our gaze. We also need ways to handle the fuzz, especially when it has been formally quantified. That’s to come.

Update Due to various scheduling this-and-thats, Part II of this series will run on Friday. Part III will run either Monday or Tuesday.


© 2015 William M. Briggs
