
Category: Class – Applied Statistics

November 28, 2017

Free Data Science Class: Predictive Case Study 1, Part III

You must review: Part I, II. Not reviewing is like coming to class late and saying “What did I miss?” Note the New & Improved title!

Here are the main points thus far: All probability is conditional on the assumptions made; not all probability is quantifiable or must involve observables; all analysis must revolve on ultimate decisions; unless deduced, all models (AI, ML, statistics) are ad hoc; all continuum-based models are approximations; and the Deadly Sin of Reification lurks.

We are using the data from Uncertainty, so that those bright souls who own the book can follow along. We are interested in predicting the college grade point of certain individuals at the end of their first year. We spent two sessions defining what we mean by this. We spend more time now on this most crucial question.

This is part of the process most neglected in the headlong rush to get to the computer, a neglect responsible for vast over-certainties.

Now we learned that CGPA is a finite-precision number, a number that belongs to an identifiable set, such as 0, 0.0625, and so on, and we know this because we know the scoring system of grades and we know the possible numbers of classes taken. The finite precision of CGPA can be annoyingly precise. Last time we were out at six or eight decimal places, precision far beyond any decision (except ranking) I can think to make.
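To make “identifiable set” concrete, here is a minimal sketch in Python, under my own invented assumptions of whole-point grades 0–4 and at most 16 equally weighted classes; the real grading rules will differ in their details.

    # Sketch: enumerate the finite, identifiable set of possible CGPAs,
    # assuming (for illustration only) whole-point grades 0-4 and at most
    # 16 equally weighted classes. Real grading rules differ in detail.
    from fractions import Fraction

    possible = set()
    for n_classes in range(1, 17):                    # 1 to 16 classes
        for total in range(0, 4 * n_classes + 1):     # total grade points earned
            possible.add(Fraction(total, n_classes))  # the exact average

    print(len(possible))             # size of the identifiable set
    print(sorted(possible)[:4])      # 0, 1/16 (= 0.0625), 1/15, 1/14

However the rules are actually written, the point stands: the set of possible CGPAs is finite and known before any data arrive.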

To concentrate on this decision I put myself in the mind of a Dean—and immediately began to wonder why all my professors aren’t producing overhead. Laying that aside (but still sharpening my ax) I want to predict the chance any given student will have a CGPA of 0, 1, 2, 3, or 4. These buckets are all I need for the decision at hand. Later, we’ll increase the precision.

Knowing nothing except that the grade must be one of these 5 numbers, the probability of a 4 is 1/5. This is the model:

    (1) Pr(CGPA = 4 | grading rules),

where “grading rules” is a proposition defining how CGPAs are calculated, and carrying information about the level of precision that is of interest to us, and possibly to nobody else; “grading rules” tells us CGPA will be in the buckets 0, 1, 2, 3, 4, for instance.

The numerical probability of 1/5 is deduced on the assumptions made; it is therefore the correct probability—given these assumptions. Notice this list of assumptions does not contain all the many things you may also know about GPAs. Many of these bytes of information will be non-quantified and unquantifiable, but if you take cognisance of any of them, they become part of a new model:

    (2) Pr(CGPA = 4 | grading rules, E),

where E is a compound proposition containing all the semi-formal and informal things (evidence) you know about GPAs, e.g. grade inflation. This model depends on E, and thus (2) will not likely give quantified or quantifiable answers. Just because our information doesn’t appear in the formal math does not make (2) not a model; or, said another way, our models are often much more than the formal math. If, say, E is only loose notions on the ubiquity of grade inflation, then (2) might equal “More than a 20% chance, I’ll tell you that much.”

To the data

We have made use of no observations so far, which proves, if it already wasn’t obvious, that observations are not needed to make probability judgments (which is why frequentism fails philosophically), and that our models are often more reliant upon intelligence not contained in (direct) observation.

But since this is a statistics-machine learning-artificial intelligence class, let’s bring some numbers in!

Let’s suppose that the only, the sole, the lone observation of past CGPAs was, say, 3. I mean, I have one old observation of CGPA = 3. I want now to compute

    (3) Pr(CGPA = 4 | grading rules, old observation).

Intuitively, we expect (3) to fall below 1/5, with the probability shifting toward a new CGPA = 3, because if all we saw was an old 3, there might be something special about 3s. That means we actually have this model, and not (3):

    (4) Pr(CGPA = 4 | grading rules, old observation, loose math notions).

There is nothing in the world wrong with model (4); it is the kind of mental model we all use all the time. Importantly, it is not necessarily inferior to this new model:

    (5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions),

where we move to formally define how all the parts on the right hand side mathematically relate to the left hand side.

How is this formality conducted?

Well, it can be deduced. Since CGPA can belong only to a fixed, finite set (as “grading rules” insists), we can deduce (5). In what sense? There will be some number of future values we want to predict: out of (say) 10 new students, how many As, Bs, etc. are we likely to see and with what chance? This is perfectly doable, but it is almost never done.

The beautious (you heard me: beautious) thing about this deduction is that no parameters are required in (5) (nor are any “hidden layers”, nor is any “training” needed). And since no parameters are required, no “priors” or arguments about priors crop up, and there is no need of hypothesis testing, parameter estimates, confidence intervals, or p-values. We simply produce the deduced probabilities. Which is what we wanted all along!

In Uncertainty, I show this deduction when the number of buckets is 2 (here it is 5). For modest n, the result is close to a well-known continuous-parameterized approximation (with “flat prior”), an approximation we’ll use later.

Here (see the book or this link for the derivation) (5) as an approximation works out to be

    (5) Pr(CGPA = 4 | GR, n_3 = 1, fixed math) = (1 + n_4)/(n + 5),

where n_j is the number of js observed in the old data, and n is the number of old data points; thus the probability of a new CGPA = 4 is 1/6; for a new CGPA = 3 it is 2/6; also “fixed math” has a certain meaning we explore next time. Model (5), then, is the answer we have been looking for!
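A minimal sketch of that arithmetic, in Python (the only data being the lone old 3):

    # Sketch of the approximation in (5): with k = 5 buckets (CGPA = 0,...,4),
    # the predictive probability of bucket j is (1 + n_j) / (n + k), where
    # n_j counts the old observations equal to j and n is their total.
    def predictive_probs(counts, k=5):
        n = sum(counts.values())
        return {j: (1 + counts.get(j, 0)) / (n + k) for j in range(k)}

    old_data = {3: 1}                  # the lone old observation: CGPA = 3
    print(predictive_probs(old_data))  # CGPA = 3 gets 2/6; the others 1/6 each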

Formally, this is the posterior predictive distribution for a multinomial model with a Dirichlet prior. It is an approximation, valid fully only at “the limit”. As an approximation, for small n, it will exaggerate probabilities, making them sharper than the exact result. (For that exact result for 2 buckets, see the book. If we used the exact result here, the probabilities for future CGPAs would, with n = 1, remain closer to 1/5.)

Now since most extant code and practice revolves around continuous-parameterized approximations, and we can make do with them, we’ll also use them. But we must always keep in mind, and I’ll remind us often, that these are approximations, and that spats about priors and so forth are always distractions. However, as the approximation is part of our right-hand-side assumptions, the models we deduce are still valid models. How to test which models worked best in our decision is a separate problem we’ll come to.

Homework: think about the differences in the models above, and how all are legitimate. Ambitious students can crack open Uncertainty and use it to track down the deduced solution for more than 2 buckets; cf. page 143. Report back to me.

September 18, 2017

Signal + Noise vs. Signal — Important Update

If we imagine these are atmospheric concentrations or stock price anomalies, this is a terrific example of reification, or replacing what did happen with what did not.

Update: I see that I failed below to demonstrate the ubiquity of the problem. So your homework is to search “testing trend time series” and similar terms and discover for yourself. Any kind of hypothesis test used on a time series counts.

My impetus was in reading an article about a paper some colleagues and I wrote about atmospheric ammonia. The author wrote, “The statistical correlation between hourly ammonia concentrations between measurement stations is weak due to large variability in local agricultural practice and in weather conditions. If data are aggregated to longer time scales, correlations between stations clearly increase due to the removal of noise at the hourly timescale.”

There’s the belief in “noise”, which does not exist, and there’s also the second (bigger) mistake: measuring the correlation of time series after smoothing, which increases (in absolute value) the correlation, as has been proved here and in Uncertainty many, many, many times. This happens even for two strings of absolutely unrelated, made-up numbers. Try it yourself.
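If you don’t want to take my word for it, here is a minimal sketch in Python (the window and series lengths are arbitrary):

    # Sketch: smooth two unrelated, made-up series and watch the correlation
    # (in absolute value) grow. Window length and series length are arbitrary.
    import numpy as np

    rng = np.random.default_rng(1)

    def running_mean(z, w):
        return np.convolve(z, np.ones(w) / w, mode="valid")

    raw, smoothed = [], []
    for _ in range(200):                   # 200 pairs of unrelated series
        x, y = rng.normal(size=(2, 500))
        raw.append(abs(np.corrcoef(x, y)[0, 1]))
        smoothed.append(abs(np.corrcoef(running_mean(x, 30),
                                        running_mean(y, 30))[0, 1]))

    print(np.mean(raw), np.mean(smoothed))  # mean |r| jumps after smoothing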

So you just look for mentions of “noise” in stock prices, and so on and see if I’m right about the scale of the problem.

Original article

Two weeks ago the high temperature on the wee island upon which I live was 82F (given my extreme sloth, I am making all details up).

Now for the non-trick question: What was the high temperature experienced by those who went out and about on that day?

If you are a subscriber to the signal+noise form of time series modeling, then your answer might be 78F, or perhaps 85F, or even some other figure altogether. But if you endorse the signal form of time series modeling, you will say 82F.

Switch examples. Three days back, the price of the Briggs Empire stock closed at $52 (there is only one share). Query: what was the cost of the stock at the close of the day?

Signal+noise folks might say $42.50, whereas signal people will say $52.

Another example. I was sitting at the radio AM DXing, pulling in a station from Claxton, Georgia, WCLA 1470 AM. The announcer came on and through the heavy static I thought I heard him give the final digit of a phone number as “scquatch”, or perhaps it was “hixsith”.

Here are two questions: (1) What number did I hear? (2) What number did the announcer say?

The signal+noise folks will hear question (1) but give the answer to (2) (they will answer (2) twice), whereas the signal folks will answer (1) with “scquatch or hixsith”, and answer (2) by saying, “Hey signal+noise guys, a little hand here?”

We have three different “time series”: temperature, stock price, radio audio. It should be obvious that everybody experiences the “numbers” or “values” of each of these series as they happen. If it is 82F outside, you feel the 82F and not another number (and don’t give me grief about fictional “heat indexes”); if the price is $52, that is what you will pay; if you hear “scquatch”, that is what you hear. You do not experience some other value to which ignorable noise has been added.

For any time series (and “any” includes our three), some thing or things caused each value. A whole host of physical states caused the 82 degrees; the mental and monetary states of a host of individuals caused the $52; a man’s voice plus antenna plus myriad other physical states (ionization of certain layers of the atmosphere, etc.) caused “scquatch” to emerge from the radio’s speakers.

In each case, if we knew—really knew—what these causes were, we would not only know the values, which we already knew because we experienced them, but we could predict with certainty what the coming values would be. Yet this list of causes will really only be available in artificial circumstances, such as simulations.

Of the three examples, there was only one in which there was a true signal hidden by “noise”, where noise is defined as that which is not signal. Temperature and stock price were pure signal. But all three are routinely treated in time series analysis as if they were composed of signal+noise. This mistake is caused by the Deadly Sin of Reification.

No model of any kind is needed for temperature and stock price; yet models are often introduced. You will see, indeed it is vanishingly rare not to see, a graph of temperature or price over-plotted with a model, perhaps a running-mean or some other kind of smoother, like a regression line. Funny thing about these graphs, the values will be fuzzed out or printed in light ink, while the model appears as bold, bright, and thick. The implication is always that the model is reality and values a corrupted form of reality. Whereas the opposite is true.

The radio audio needs a model to guess what the underlying reality was given the observed value. We do not pretend in these models to have identified the causes of the reality (of the values), only that the model is conditionally useful in putting probabilities on possible real values. These models are seen as correlational, and nobody is confused. (Actual models, depending on the level of sophistication, may have causal components, but since the number of causes will be great in most applications, these models are still mostly correlational.)

We agreed there will be many causes of temperature and stock price values. One of the causes of temperature is not season—how could the word “autumn” cause a temperature?—though we may condition on season (or date) to help us quantify our uncertainty in values. Season is not a cause, because we know there are causes of season, and that putting “season” (or date) into a model is only a crude proxy for knowledge of these causes.

Given an interest in season, we might display a model which characterizes the average (or some other measure) of uncertainty we might have in temperature values by season (or date), and from this various things might be learned. We could certainly use such a model to predict temperature. We could even say that our 82F was a value so many degrees higher or lower than some seasonal measure. But that will not make the 82F less real.
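A minimal sketch of that kind of summary, in Python, with made-up summer highs:

    # Sketch: condition on season to quantify uncertainty in temperature.
    # The summer highs below are invented; season is a proxy, not a cause.
    import numpy as np

    summer_highs = np.array([75, 79, 80, 82, 84, 86, 88, 77, 81, 83])
    seasonal_mean = summer_highs.mean()
    low, high = np.percentile(summer_highs, [10, 90])  # a crude predictive spread

    print(seasonal_mean)         # a seasonal measure, not reality
    print(82 - seasonal_mean)    # the observed 82F relative to that measure
    print(low, high)             # the spread of values seen in this season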

That 82F was not some “real” seasonal value corrupted by “noise”. It cannot be because season is not a cause: amount of solar insolation, atmospheric moisture content, entrainment of surrounding air, and on and on are causes, but not season.

Meteorologists do attempt a run at causes in their dynamic models, measuring some causes directly and others by proxy and still others by gross parameterization, but these dynamical models do not make the mistake of speaking of signal+noise. They will say the temperature was 82F because of this-and-such. But this will never be because some pure signal was overridden by polluting noise.

The gist is this. We do not need statistical models to tell us what happened, to tell us what values were experienced, because we already know these. Statistical models are almost always nothing but gross parameterization and are thus only useful in making predictions, thus they should only be used to guess the unknown. We certainly do not need them to tell us what happened, and this includes saying whether a “trend” was observed. We need only define “trend” and then just look.

Why carp about this? Because the signal+noise view brings in the Deadly Sin of Reification (especially in stock prices, where everybody is an after-the-fact expert), and that sin leads to the worse sin of over-certainty. And we all know where that leads.

Addendum

“But, Briggs. What if we measured temperature with error?”

Great question. Then we are in the radio audio case, where we want to guess what the real values were given our observation. There will be uncertainty in these guesses, some plus-or-minus to every supposed value. This uncertainty must always be carried “downstream” in all analyses of the values, though it very often isn’t. Guessing temperatures by proxy is a good example.
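A minimal sketch of carrying the uncertainty along, in Python (the readings and the plus-or-minus are invented):

    # Sketch: carrying measurement uncertainty downstream. Suppose, for
    # illustration, each reading is only good to +/- 0.5F; then any quantity
    # computed from the readings inherits that uncertainty.
    import numpy as np

    rng = np.random.default_rng(2)
    readings = np.array([82.0, 79.5, 85.0, 80.5])   # invented observed values

    draws = readings + rng.uniform(-0.5, 0.5, size=(5000, readings.size))
    mean_temp = draws.mean(axis=1)                  # a downstream quantity

    print(readings.mean())                          # the naive point value
    print(np.percentile(mean_temp, [5, 95]))        # uncertainty carried along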

I have more on this topic in Uncertainty: The Soul of Modeling, Probability & Statistics.

September 1, 2017

Taleb’s Curious Views On Probability — Part III: Ergodicity & All That

Read Part I, Part II

Ergodic in probability has a technical definition. Without going into mathematical details (which are fine except possibly when applied), a “sequence” is defined as a run of measurements of some observable. A sub-sequence is a portion of the sequence.

Here is where belief that probability is ontic causes trouble. First, no real sequence is of infinite length, thus no sub-sequence can be infinite. The observations are measurements, as said, of real things, say, stock prices. The measurements do not possess any properties beyond those in the things themselves, i.e. prices of stocks. The measurements do not have a mean in the sense of a parameter from a probability model; of course, arithmetic averages can be calculated from any observed sequence. But the measurements do not possess any parameter from any probability distribution that may be used to represent uncertainty in them. The measurements do not possess probability. This we learned in Part I.

With me?

Ergodic, or ergodicity, is the property that any sub-sequence of the measurements possesses the same probability characteristics as the entire sequence, or as other sub-sequences. Since no real sequence possesses probability characteristics in any ontic sense, the term is of no use in reality, however useful it might be in imagining infinite sequences of mathematical objects.

We might find some use for ergodicity, rescue it as it were, in the following way. A set of assumptions M, i.e. a model, is used to make predictions of a sequence up to some point t. After t, we might amend these assumptions, to say Mt, and make new predictions. Why this change at t? Only because there is some new assumption (or observation etc.) which impinged upon your mind.

Example: Use M for stock price y; at time t, the stock splits, and so M is amended to Mt to incorporate knowledge of the split. If M ever changes through time (because your assumptions, premises, etc. do), however often, then in practice we do not have ergodicity. In this sense, ergodicity is just like probability in being purely epistemic. But since we know we changed M, we don’t need to label that change “ergodic activity at time t”.
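A minimal sketch of amending M to Mt, in Python (the prices and the split date are invented):

    # Sketch: amending M to Mt at the time of a known 2-for-1 split. The
    # prices and split date are invented; the point is only that the change
    # in assumptions at t is ours, not a property possessed by the series.
    prices = [50, 51, 52, 26.5, 27, 26]   # the split occurs after the third price
    split_at, split_factor = 3, 2

    # Mt: put pre-split prices on the post-split scale before predicting
    adjusted = [p / split_factor if i < split_at else p
                for i, p in enumerate(prices)]
    print(adjusted)                       # [25.0, 25.5, 26.0, 26.5, 27, 26]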

Make sense?

Of course, since real sequences do not possess, in the ontic sense, ergodicity, there is no point in going and looking for it. You cannot find what doesn’t exist. For real sequences, you are always welcome to change your assumptions at any time. In this sense, it is you that creates practical ergodicity when you change M, which is how you know it’s there.

How do you know to change M? How indeed! That is ever the problem. There is no universal solution, save discovering the causes of y (which for stock prices isn’t going to happen).

Back to Taleb. His use of the term appears to assume the mathematical definition, which says probability exists; e.g. he says things like “detect when ergodicity is violated”. This is not only Taleb, of course, but most users of probability models. The error is common. It is why Taleb’s examples about ergodicity aren’t quite coherent. But it’s not his fault.

Switch to our last topic, repetition of exposure. This allows Taleb to run back to the precautionary principle he loves so well.

If one claimed that there is “statistical evidence that the plane is safe”, with a 98% confidence level (statistics are meaningless without such confidence), and acted on it, practically no experienced pilot would be alive today. In my war with the Monsanto machine, the advocates of genetically modified organisms (transgenics) kept countering me with benefit analyses (which were often bogus and doctored up), not tail risk analyses for repeated exposures.

Only frequentist statistics need confidence (and all readers of Uncertainty know the frequentist theory fails on multiple fronts, and is useful nowhere). Predictive probability does not.

It is true, and obvious, that if there is a risk in an act, repeating the act increases the overall risk.

What risk is there in, say, eating a GMO BLT? I have no idea, and neither does Taleb. There are well known benefits, though, as there always are when bacon is involved. Even if I knew of a risk, it may be that the cumulative benefits outweigh the cumulative risks. But I know of no risks save that “GMOs might hurt me”.

That statement is actually a tautology: it is equivalent to “GMOs might hurt me and they might not hurt me.” It is therefore of no use as an assumption in a model of S = “GMOs will hurt me”. Tautologies never add information; they are like multiplying by 1. S does not have a probability without assumptions.

I might, as Taleb likes to do in the precautionary principle (review!), use different assumptions, say, “Monsanto’s lawyers are jerks and their GMOs cause, when the circumstances are in place, small amounts of damage when eaten.” With that, we can form a medium to high probability that S is true, especially upon repeated exposure (it would be certainty, and not only high probability, except for that “circumstances” condition).

Now Monsanto’s lawyers are jerks. Suing because Monsanto’s DNA wanders via natural pollination into some poor innocent farmer’s field is evil and shouldn’t be allowed. But from these truths it does not follow Monsanto’s GMOs cause harm. You need more than just suspicions that they might cause harm, because “might” is a tautology.

It’s enough for Taleb, because he wants you to consider not only the harm that GMOs (or global warming) will cause you, but will cause all of humanity plus its pet parakeets. Yet he offers (as far as I can see) nothing more than the tautology as evidence for S, and however many times you multiply a tautology, it is still a tautology in the end. A thousand “might harms” is still one “might or might not harm”.

If you are determined to prove GMOs cause harm, you need to demonstrate how. And then you still haven’t demonstrated that the benefits of them outweigh these harms. There will be no one-size-fits-all decision there.

August 29, 2017

Taleb’s Curious Views On Probability — Part II: Skin in the Game


Read Part I

It is in one sense fortunate that the mathematical, or rather quantitative, roots of probability began with gambling. Routine gambles are easy to understand, and the calculations not only easy, but as models have great applicability to actual events. All know the story of how quantitative probability flourished, and flourishes, from these beginnings.

On the other hand, it has been difficult for probability to remember its more robust, fuller, and certainly more supportive roots, which are non-quantitative. That gambles were easily quantifiable and made for skillful models produced the false idea that all probability is, or should be, quantitative. And this led to the main error, discussed last time, that probability exists. It also produced a second error, which I won’t examine here (but have at length in Uncertainty), that probability is subjective.

Given the rules of craps—our premises—we can deduce the probability of winning and losing. We can also apply this model to real dice. And the same is true for card games, slot machines, and so on. These models have been found to work well. But even casinos change out worn dice and bent cards knowing the models are no longer as applicable.
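Here is a minimal sketch of that deduction for the standard pass-line bet, in Python:

    # Sketch: deduce the pass-line win probability in craps from the rules
    # alone (the premises); no observations are needed.
    from fractions import Fraction
    from collections import Counter

    two_dice = Counter(a + b for a in range(1, 7) for b in range(1, 7))
    p = {s: Fraction(c, 36) for s, c in two_dice.items()}

    win = p[7] + p[11]                                  # a natural on the come-out
    for point in (4, 5, 6, 8, 9, 10):                   # otherwise a point is set
        win += p[point] * p[point] / (p[point] + p[7])  # make the point before a 7

    print(win, float(win))                              # 244/495, about 0.4929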

These models work well for single gamblers (with assumed fortunes), but they cannot be applied to groups of gamblers, because how much and how long people, plus how many people, gamble cannot be captured by the simple premises. Here I agree with Taleb when he says about groups of gamblers, “Some may lose, some may win, and we can infer at the end of the day what the [casino’s] ‘edge’ is, that is, calculate the returns simply by counting the money left with the people who return.” This observational data is used to infer premises for a model beyond the premises available per game (which are easy).

Taleb continues: “We can thus figure out if the casino is properly pricing the odds.” The odds for each single game are deduced, so that means, at first glance, that the overall odds are also correct. But sometimes it pays for casinos to change single-game odds. If a slot machine gives few wins, few will use it (after word spreads); likewise, if one pays off well, more will use it. Observed behavior can help slide the single-game deduced odds to entice more gambling. Since behavior is volatile, so will be these models.

I also—everybody also—agree with Taleb that when a gambler goes bust he must stop playing. For some reason he calls going bust an “uncle point” (crying uncle?). Everybody also knows that because a certain gambler reaches an “uncle point”, other gamblers might still have money. This seems to be something of a revelation to Taleb, though, who calls the models applied to groups of gamblers “ensemble probability” models, and those applied to single gamblers (with known or assumed fortunes) “time probability” models.
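A minimal sketch of the two situations, in Python (the edge, bet, and bankroll are invented):

    # Sketch of the distinction: many gamblers playing one round each (what
    # Taleb calls "ensemble") versus one gambler pressing on toward a possible
    # uncle point ("time"). The edge, bet, and bankroll are invented.
    import random

    random.seed(3)
    P_WIN, BET, BANKROLL, ROUNDS = 0.49, 1, 20, 1000

    def play(rounds, bankroll):
        for _ in range(rounds):
            bankroll += BET if random.random() < P_WIN else -BET
            if bankroll <= 0:
                return 0                  # bust: he must stop playing
        return bankroll

    # Ensemble: 1000 gamblers, one round apiece; the average looks mild
    ensemble = [play(1, BANKROLL) for _ in range(1000)]
    print(sum(ensemble) / len(ensemble))  # near 20 minus the small edge

    # Time: estimate, by repetition, the chance a single gambler playing
    # 1000 rounds hits the uncle point somewhere along the way
    print(sum(play(ROUNDS, BANKROLL) == 0 for _ in range(1000)) / 1000)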

Taleb then argues, what isn’t a secret, that sometimes people use the wrong model. They’ll use a single-gambler model for a market (group), and a group model for a single-gambler. I don’t think this often happens, however, not with stocks, anyway, with so much money involved.

He says, “I effectively organized all my life around the point that sequence matters and the presence of ruin does not allow cost-benefit analyses; but it never hit me that the flaw in decision theory was so deep.”

Well, of course, in the presence of ruin, i.e. if one is ruined, the cost-benefit analysis is not flawed; it is as easy as can be. That the possibility of ruin exists does not reveal a flaw in decision theory, either.

I agree that decision theory has many flaws, but I see them differently. Many formal quantitative methods allow for impossible values (infinities or other large numbers), or they assume probabilities are real or they conflate probability and decision. Probability is not decision.

Taleb is concerned with “tails”, which is to say, large values. Now actual observed large values may or may not be well modeled; often they are not, and then Taleb’s criticism is spot on. For instance, normal distributions are as overused as the word “like” is in ordinary conversation. Other times there are possibilities in decision analysis for “tail” values that can’t be seen, and that’s a flaw with either the probability model or decision criterion (or both).
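A minimal sketch of the kind of thing that goes wrong, in Python; the Student-t with 3 degrees of freedom is an arbitrary heavier-tailed stand-in, not a recommendation:

    # Sketch: compare the weight a normal model gives to large values with a
    # heavier-tailed stand-in (Student-t, 3 degrees of freedom, scaled to unit
    # variance). The t is chosen only for illustration.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 2_000_000
    normal = rng.normal(size=n)
    heavy = rng.standard_t(df=3, size=n) / np.sqrt(3)   # unit variance

    for k in (3, 4, 5):
        print(k, (normal > k).mean(), (heavy > k).mean())
    # the heavy-tailed model gives far more weight to values this large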

Somehow Taleb believes people, unless they possess genius, cannot figure probability if they do not have “skin in the game”, his favorite marketing phrase. This is false, as is obvious. People who do not give a rat’s rear about an outcome are less likely to attend to the problem as closely as those who do care, which is clear enough. But having money on the line does not bring the psychic gift of probability awareness. Indeed, gamblers with much “skin in the game” are apt to be the worst estimators.

That’s enough for Part II. I’ll wrap it up in Part III, Ergodicity and all that.