This is a completion of the post I started two weeks ago, which shows that “predictive” or “observational” Bayes is better than classical, parametric Bayes, which in turn is far superior to frequentist hypothesis testing, which may be worse than just looking at your data. Actually, in many circumstances, just looking at your data is all you need.
Here’s the example for the advertising.csv data found on this page.
Twenty weeks of sales data for two marketing Campaigns, A and B. Our interest is in weekly sales. Here’s a boxplot of the data.
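For readers who want to follow along in R, something like the snippet below would draw such a boxplot. This is only a sketch: the file name and the column names (Campaign, Sales) are my assumptions about how the data are laid out.

```r
# Sketch only: assumes advertising.csv has columns named Campaign and Sales.
ads <- read.csv("advertising.csv")

# Weekly sales side by side for the two campaigns.
boxplot(Sales ~ Campaign, data = ads, ylab = "Weekly sales")
```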
It looks like we might be able to use normal distributions to quantify our uncertainty in weekly sales. But we must not say that “Sales are normally distributed.” Nothing in the world is “normally distributed.” Repeat that and make it part of you: nothing in the world is normally distributed. It is only our uncertainty that is given by a normal distribution.
Notice that Campaign B looks more squeezed than A. Like nearly all people who analyze data like this, we’ll ignore this non-ignorable twist at first, until we get to observational Bayes.
Now let’s run our hypothesis test, here in the form of a linear regression (which is the same as a t-test, and is more easily made general).
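A sketch of the kind of R call behind this test follows; it is not necessarily the exact code used here, and the file and column names (advertising.csv, Campaign, Sales) are assumptions.

```r
# Sketch only: an ordinary linear regression of Sales on Campaign, which with
# one two-level predictor is equivalent to an equal-variance two-sample t-test.
ads <- read.csv("advertising.csv")   # assumed columns: Campaign, Sales

fit <- lm(Sales ~ Campaign, data = ads)
summary(fit)    # rows for "(Intercept)" and the Campaign B shift
confint(fit)    # the frequentist confidence intervals discussed below
```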
[R regression output: estimates, t-statistics, and p-values for “(Intercept)” and Campaign B.]
Regression is this and nothing more: the modeling of the central parameter for the uncertainty in some observable, where the uncertainty is quantified by a normal distribution. Repeat that: the modeling of the central parameter for the uncertainty in some observable, where the uncertainty is quantified by a normal distribution.
There are two columns. The “(Intercept)” must (see the book for why) represent the central parameter for the normal distribution of weekly sales when in Campaign A. This is all this is, and is exactly what it is. The estimate for this central parameter, in frequentist theory, is 420. That is, given we knew we are in Campaign A, our uncertainty in weekly sales would be modeled by a normal distribution with best-guess central parameter 420 (and some spread parameter which, again like everybody else, we’ll ignore for now).
Nobody believes that the exact, precise value of this central parameter is 420. We could form the frequentist confidence interval in this parameter, which is 401 to 441. But then we remember that the only thing we can say about this interval is that either the true value of the parameter lies in this interval or it does not. We may not say that “There is a 95% chance the real value of the parameter lies in this interval.” The interval is, and is designed to be in frequentist theory, useless on its own. It only becomes meaningful if we can repeat our “experiment” an infinite number of times.
The test statistic we spoke of is here a version of the t-statistic (and here equals 42). The probability that, were we to repeat the experiment an infinite number of times, we would in these repetitions see a larger value of this statistic, given the premise that this central parameter equals 0, given the data we saw, and given our premise of using normal distributions, is 2.7 x 10^-33. There is no simpler way to say it. Importantly, we are not allowed to interpret this probability if we do not imagine infinite repetitions.
Now, this p-value is less than the magic number so we, by force of will, say “This central parameter does not equal 0.” On to the next line!
The second line represents the change in the central parameter when switching from Campaign A to Campaign B. The “null” hypothesis here, like in the line above, is that this parameter equals 0 (there is also the implicit premise that the spread parameter of A equals B). The p-value is not publishable (it equals 0.19), so we must say, “I have failed utterly to reject the ‘null’.” Which in plain English says you must accept that this parameter equals 0.
This in effect says that our uncertainty in weekly sales is the same for either Campaign A or B. We are not allowed to say (though most would), “There is no difference between A and B.” Because of course there are differences. And that ends the frequentist hypothesis test, with the conclusion “A and B are the same.” Even though the boxplots look like they do.
We can do the classical Bayesian version of the same thing and look at the posterior distributions of the parameters, as in this picture:
The first picture says that the first parameter (the “(Intercept)”) can be any number from -infinity to +infinity, but it is most likely between 390 and 450. That is all this says. The second picture says that the second parameter can take any of an infinite number of values but that it most likely lives between -20 and 60. Indeed, the vertical line helps us quantify the probability that this parameter is less than 0, which is about 9%. And thus ends the classical or parametric Bayesian analysis.
We already know everything about the data we have, so we need not attach any uncertainty to it. Our real question will be something like “What is the probability that B will be better than A in new data?” We can calculate this easily by “integrating out” the uncertainty in the unobservable parameters; the result is in this picture:
This is it: assuming just normal distributions (still also assuming equal spread parameters for both Campaigns), these are the probability distributions for values of future sales. Campaign B has higher probability of higher sales, and vice versa. The probability that future sales of Campaign B will be larger than Campaign A is (from this figure) 62%. Or we could ask any other question of interest to us about sales. What is the probability that sales will be greater than 500 for A and B? Or that B will be twice as big as A? Or anything. Do not become fixated on this question and this probability.
This is the modern, so-called predictive Bayesian approach.
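One way numbers like these could be reproduced is sketched below, under the flat-prior normal regression model with equal spread parameters. It is a sketch of the “integrating out” step, not the book’s code, and the file name, column names, and simulation sizes are assumptions.

```r
# Sketch only: posterior and posterior-predictive draws under the
# equal-variance normal model with the standard flat prior.
library(MASS)                               # for mvrnorm()

ads <- read.csv("advertising.csv")          # assumed columns: Campaign, Sales
fit <- lm(Sales ~ Campaign, data = ads)

X    <- model.matrix(fit)
n    <- nrow(X); k <- ncol(X)
XtXi <- solve(crossprod(X))
bhat <- coef(fit)
s2   <- sum(resid(fit)^2) / (n - k)

set.seed(1)
nsim <- 20000
# Under the flat prior:
#   sigma^2 | data        ~ (n - k) s^2 / chi-square(n - k)
#   beta | sigma^2, data  ~ Normal(bhat, sigma^2 (X'X)^-1)
sig2 <- (n - k) * s2 / rchisq(nsim, df = n - k)
beta <- t(sapply(sig2, function(v) mvrnorm(1, bhat, v * XtXi)))

# The parameter-posterior question from the figures above:
mean(beta[, 2] < 0)                         # roughly the 9% quoted earlier

# Predictive draws for one new week under each campaign, "integrating out"
# the parameters:
newA <- rnorm(nsim, beta[, 1],             sqrt(sig2))
newB <- rnorm(nsim, beta[, 1] + beta[, 2], sqrt(sig2))

mean(newB > newA)                           # roughly the 62% quoted above
mean(newA > 500); mean(newB > 500)          # P(sales > 500) for A and for B
mean(newB > 2 * newA)                       # P(B twice as big as A)
```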
Of course, the model we have so far assumed stinks because it doesn’t take into account what we observed in the actual data. The first thing to change is the assumption of equal variances; the second is to truncate the normal distributions so that no sales less than 0 are possible. That (via JAGS; not in the book) gives us this picture:
The open circles and dark diamonds are the means of the actual and predictive data. The horizontal lines show the range of 80% of the actual data, placed at the height below which 80% of the predictive data lies. Ignore these lines if they confuse you. The predictive model is close to the real data for Campaign B but not so close for Campaign A, except at the mean. This is probably because our uncertainty in A is not best represented by a normal distribution and would be better handled by a distribution that isn’t so symmetric.
The probability that new B sales are larger than new A sales is 65% (from this figure). The beauty of the observational or predictive approach is that we can ask any question of the observable data we want. Like, what’s the chance new B sales are 1.5 times new A sales? Why, that’s 4%. And so on.
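For readers who want to experiment, one way such a model might be written for JAGS is sketched below. This is not the author’s code (which, as he says, is not in the book); the priors, variable names, and run lengths are all assumptions.

```r
# Sketch only: unequal spread parameters and normal uncertainty truncated at 0.
library(rjags)

model_string <- "
model {
  for (i in 1:n) {
    # Each week's sales: normal uncertainty, truncated below at 0,
    # with a separate spread (precision) for each campaign.
    sales[i] ~ dnorm(mu[campaign[i]], tau[campaign[i]]) T(0,)
  }
  for (j in 1:2) {
    mu[j]    ~ dnorm(0, 1.0E-6)        # vague prior on the central parameter
    tau[j]   ~ dgamma(0.001, 0.001)    # vague prior on the precision
    sigma[j] <- 1 / sqrt(tau[j])
    # Posterior predictive draw for a new week in campaign j
    pred[j]  ~ dnorm(mu[j], tau[j]) T(0,)
  }
}"

ads <- read.csv("advertising.csv")     # assumed columns: Campaign, Sales
dat <- list(sales    = ads$Sales,
            campaign = as.integer(factor(ads$Campaign)),
            n        = nrow(ads))

jm   <- jags.model(textConnection(model_string), data = dat, n.chains = 3)
update(jm, 2000)                                   # burn-in
sims <- coda.samples(jm, "pred", n.iter = 20000)
pred <- as.matrix(sims)

mean(pred[, "pred[2]"] > pred[, "pred[1]"])        # P(new B > new A), ~65%
mean(pred[, "pred[2]"] > 1.5 * pred[, "pred[1]"])  # the 1.5x question, ~4%
```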
In other words, we can ask plain-English questions of the data and be answered with simple probabilities. There is no “magic cutoff” probability, either. The 65% may be important to one decision maker and ignorable to another. Whether to stay with A or move to B depends not just on this probability and this question: you can ask your own question, inputting the information relevant to you. For instance, A and B may cost different amounts, so that you have to be sure that B brings in 1.5 times the sales of A. Any question you want can be asked, and asked simply.
We’ll try to do some more complex examples soon.
Yesterday: understand that if (4) is larger than the magic number you must announce, “I fail to reject the ‘null,’” but you must never say, “I accept the ‘null.’”
Today: The p-value is not publishable (it equals 0.19), so we must say, “I have failed utterly to reject the ‘null’.” Which in plain English says you must accept that this parameter equals 0.
So, my boss has given me this data and wants a recommendation: do we move to campaign B or stay with campaign A? What do I tell him? We are not trying to get this published in any academic journals, but there will be real dollars behind the decision.
My take on it is that we could say that it appears that campaign B will lead to higher sales, but the data is not particularly conclusive, or the advantage is small at best.
And, if we were in academia, we would look at this data and say “we are on the right track, but we need more data,” and we would accumulate data until we have a sample such that p < 0.05. After our results are published, Professor Briggs rails against our methods, saying that we are overconfident in our results.
One would hope. Even ignoring spread differences, both are left skewed, with B even more so in the boxplot. Makes me wonder what it would take to drop the idea of “normally distributed.”
In real life, if I had to pick between A and B I’d go with B. It had more consistency and higher median sales. You can get that just from the boxplot; further analysis is overkill, assuming it wasn’t just a measurement fluke. But then the analysis was to determine that, eh? Since the outcome is inconclusive, I’d still go with B; it’s prettier.
But we must not say that “Sales is normally distributed.”
To be pedantic, that is bad grammar.
Note that “campaign B has a higher probability of higher sales” is not equal to “the probability that future sales of Campaign B will be larger than Campaign A.”
For example, define higher sales = sales > 500. Looking at the box-plot, I’d probably conclude that campaign (or stock) A has a higher probability of higher sales (returns) since campaign B hasn’t produced any sales more than 500.
So now, what is “better”? Do I want to invest in a campaign (stock) based on (1) the mean sale (return), (2) the probability of higher sales, or (3) P(B > A)? Well, I am not a risk taker, but (3) doesn’t seem to be a good strategy. Perhaps Doug M can explain this for us.
A frequentist method. A simple calculation of the empirical probability (a U statistic) gives me an estimate of 0.64. Since the probability is unobservable, it’d be difficult to show that the simple and distribution-free U estimate is not better than the Bayesian estimate of 0.62!
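For the curious, the empirical estimate described here takes only a couple of lines; the column names are assumptions, as before.

```r
# Sketch: the empirical (U-statistic) estimate of P(B > A) is the fraction of
# all (A week, B week) pairs in which the B week's sales exceed the A week's.
ads <- read.csv("advertising.csv")   # assumed columns: Campaign, Sales
a <- ads$Sales[ads$Campaign == "A"]
b <- ads$Sales[ads$Campaign == "B"]

mean(outer(b, a, ">"))               # about 0.64, as the comment says
mean(a > 500); mean(b > 500)         # the "sales > 500" question raised above
```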
That the parameters and probabilities are unobservable doesn’t mean they are all meaningless. Yes, it’ll be hard to verify your uncertainty assessment about the parameters because they are not observable; however, poor prediction results would cast doubt on any inference about the parameters.
The number 0.05 (the significance level) means that we set the probability of committing a Type I error at 0.05. Therefore, it’s supposed to be small, and 0.05 is commonly used. The significance level sets the rule, such as, “if the sample mean is greater than 20, then reject the null.” If set at 0.01, a different threshold will be used.
The probability that future sales of Campaign B will be larger than Campaign A is (from this figure) 62%.
We have to make a dichotomous decision here. Based on this statement, which campaign is your choice? Why? What is your magic cut-off probability?
JH
What if you are picking stocks? What do the box plots represent? Suppose that they represent the return profiles of asset A and asset B.
If A and B have the same central tendency, but B has a lower expected volatility, then B is a more attractive asset. However, seldom do we have an either/or choice to make. Usually, the question is how much of A and how much of B to buy. So, we may look at these two assets and say, let’s put 80% of our resources behind B and 20% of our resources behind A. But even more likely we would say that we will put 8% of our resources behind B and 2% of our resources behind A.
Doug M: you could determine a near-optimal hedge by Monte Carloing the snot out of it and picking the one that made you the most money.
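A rough sketch of what “Monte Carloing it” could look like is below. The bootstrap resampling and the mean-to-spread scoring rule are my own assumptions, not anything the commenters specified.

```r
# Sketch only: bootstrap the observed weekly figures, sweep over allocation
# weights, and score each mix. The score (mean divided by spread) is just one
# possible choice of "most money" criterion.
set.seed(1)
ads <- read.csv("advertising.csv")   # assumed columns: Campaign, Sales
a <- ads$Sales[ads$Campaign == "A"]
b <- ads$Sales[ads$Campaign == "B"]

weights <- seq(0, 1, by = 0.05)      # fraction of resources put behind B
nsim    <- 10000
score <- sapply(weights, function(w) {
  mix <- w * sample(b, nsim, replace = TRUE) +
         (1 - w) * sample(a, nsim, replace = TRUE)
  mean(mix) / sd(mix)
})
weights[which.max(score)]            # the split this criterion prefers
```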
OK, the predictive Bayesian approach is, well, more approachable than frequentist contortions. So tell me the practical procedure for posing and answering any questions I may have such as “What’s the chance that four years of college education has turned callous freshpersons into thoughtful seniors as measured by their responses to question X?”
I tried to follow this article, I really did.
I find it unfortunate that I can read an article on particle physics and can generally follow the ideas behind some very esoteric physics. Yet probability is so “out there” that I can’t make head nor tail of what it is you’re even talking about. Then again, I could not for the life of me understand electricity in high school physics either (I think the two are related).
Thanks, Mr. Briggs, for your ever-intriguing take on things, and for spending your time putting your thoughts into print for our entertainment (and possible education).
There’s no magic cutoff in freq stats either. You should choose alpha based on your special case. I.e., for how many heads observed would you declare a coin not fair, based on evidence from 100 flips?
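As a quick illustration of the coin question, the head counts that would or would not lead you to call the coin unfair in 100 flips depend on the alpha you pick; a small sketch, using an exact two-sided binomial test:

```r
# Sketch: which head counts out of 100 flips are NOT rejected as "fair"
# under a two-sided exact binomial test, for two choices of alpha.
heads <- 0:100
pvals <- sapply(heads, function(h) binom.test(h, 100, p = 0.5)$p.value)

range(heads[pvals >= 0.05])   # counts consistent with "fair" at alpha = 0.05
range(heads[pvals >= 0.01])   # counts consistent with "fair" at alpha = 0.01
```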
And no infinite number of trials is needed; that is quite a silly strawman. Infinity is just the theory (like limits in calculus: you do not need an actual infinite number of infinitely skinny rectangles to calculate areas under curves).
Justin