# Homework #1: Answer part II

In part I, we learned that all surveys, and in fact all statistical models, are valid only conditionally on some population (or information). We went into nauseating detail of the conditional information on our own survey of people who wear thinking suppression devices (TSDs; see the original posts), so I’ll skip repeating any of it again.

Today, we look at the data and ignore all other questions. The first matter we have to understand is: what are probability models and statistics for? Although we use the data we just observed to fit these models, they are not for that data. We do not need to ask probability questions of the data we just observed. If we want the probability that all the people in our sample wore TSDs, we just look and see if all wore them or not. The probability is 0 or 1, and it is 0 or 1 for any other question we can ask about the observed data (e.g., what is the probability that half or more wore them? Again, 0 or 1).

Thus, statistics are useful only for making inferences about unobserved data: usually future data, but really just unknown to you. If you want to make statements or quantify uncertainty in data you have not yet seen, then you need probability models. Some would say statistics are useful for making inferences about unobserved and unobservable parameters, but I’ll try to dissuade you of that opinion in this essay. We have to start, however, with describing what these parameters are and why so much attention is devoted to them.

Before we do, we have to return to our question, which was roughly phrased in English as “How many people wear TSDs?”, and we have to turn it into a mathematical question. We do this by forming a probability model for the English question. If you’ve read some of my earlier posts, you might recall that we have an essentially infinite choice of models which we could use. What we would like is to limit our choice to a few or, best of all, to logically deduce the exact model given some set of information that we believe true.

Here is one such statement: `M1` = “The probability that somebody wears a TSD (at the locations and times specified for our exactly defined population subset) is fixed, or constant, and knowing whether one person wears a TSD gives us no information whether any other person wears a TSD.” (Whenever you see `M1`, substitute the sentence “The probability…”)

Is `M1` true? Almost certainly not. For example, if two people walk by our observation spot together, say a couple, it might be less likely for either to wear a TSD than it is for two separate people. Again, people (not all people, anyway) aren’t going to wear a TSD at all hours equally often, and not equally often at all locations within our subset either.

But let’s suppose that `M1` is true anyway. Why? Because this is what everybody else does in similar situations, which they do because it allows them to write a simple and familiar probability model for the number of people `x` out of `n` wearing TSDs. Here is the model for the data we just observed:

`Pr( x = k | n, θ, M1)`

This is actually just a script or shorthand for the model, which is some mathematical equation (binomial distribution), and not of real interest; however it is useful to learn how to read the script. From left to right, it is the probability that the number of people `x` equals some number `k` given we know `n`, something called θ, and `M1` is true. This is the mathematical way of writing the English question.
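To make the script concrete, here is a minimal Python sketch of the binomial formula it stands for. The value `theta = 0.1` and the sample size of 10 are made-up numbers chosen only so the function can be evaluated (θ is not actually known), and the function name `binom_pmf` is an illustrative choice, not something from the post.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """Pr(x = k | n, theta, M1): binomial probability of k wearers out of n."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


# Hypothetical inputs: 10 people observed, theta assumed to be 0.1.
n, theta = 10, 0.1
probs = [binom_pmf(k, n, theta) for k in range(n + 1)]

print(probs[0])    # probability that nobody wears a TSD
print(sum(probs))  # probabilities over k = 0..n sum to 1, up to rounding
```

Reading the call `binom_pmf(k, n, theta)` from left to right mirrors reading the script: the thing we are uncertain about comes first, and everything we are treating as known comes after.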

The variable `x` is more shorthand meaning “number of people who wore a TSD”. Before we did our experiment, we did not know the value of `x`, so we say it was “random.” After we see the data we know `k`, the actual number of new people out of the `n` people we saw who did wear a TSD. OK so far? We already understand what `M1` is, so all that is left to explain is θ. What is it?

It is a parameter, which if you recall previous posts, is an unobservable, unmeasurable number, but which is necessary in order to formulate our probability model. Some people incorrectly call θ “the probability that a single person wears a TSD.” This is false and is an example of the atrocious and confusing terminology so often used in statistics (look in any introductory text and you’ll see what I mean). θ, while giving the appearance of one, is no sort of probability at all. It would be a probability if we knew its value. But we do not: and if we did know, we never would have bothered collecting data in the first place! Now, look carefully. θ is written on the right hand side of the “|”, which is where we put all the stuff that we believe we know, so again it looks as if we are saying we know θ, so it looks like a probability.

But this is because the model is incomplete. Why? Remember that we don’t really need to model the observed data if that is all we are interested in. So the model we have written is only part of a model for future data. There are several pieces that are missing. Those pieces are: another probability model for the value of θ; a model for just the observed data; a model for the uncertainty in θ given the observed data; and the data model itself again. All of these are mathematically manipulated to produce this creature

`Pr( xnew = knew | nnew, xold, nold, M1)`

which is the true probability model for new data given what we observed with the old data. There is no way that I can even hope to explain this new model without resorting to some heavy mathematics. This is in part why classical statistics just stops with the fragmentary model, because it’s easier. In that tradition, people create a (non-verifiable) point estimate of θ, which means just plugging some value for θ into the probability model fragment, and then call themselves done.
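For readers who do want a glimpse of the mathematics, the missing pieces combine in a single integral. What follows is a sketch under an assumption not stated at this point in the post: that the prior on θ is a Beta(a, b) distribution (the author mentions a Jeffreys prior, which is a = b = 1/2, in the comments). The symbols a, b, and B (the beta function) are introduced here only for illustration.

```latex
\Pr(x_{new} = k_{new} \mid n_{new}, x_{old}, n_{old}, M_1)
  = \int_0^1 \Pr(x_{new} = k_{new} \mid n_{new}, \theta, M_1)\,
             p(\theta \mid x_{old}, n_{old}, M_1)\, d\theta
  = \binom{n_{new}}{k_{new}}
    \frac{B\left(a + x_{old} + k_{new},\; b + n_{old} - x_{old} + n_{new} - k_{new}\right)}
         {B\left(a + x_{old},\; b + n_{old} - x_{old}\right)}
```

The first factor inside the integral is the fragmentary model from before; the second is the uncertainty in θ given the old data. Because θ is integrated out, it no longer appears to the right of the bar.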

Well, almost done. Good statisticians will give you some measure of uncertainty of the guess of θ, some plus or minus interval. (If you haven’t already, go back and read the post “It depends on what the meaning of mean means.”) The classical estimate used for θ is just the computed mean, the average of the past data. So the plus and minus interval will only be for the guess of the mean. In other words, just as it was in regression models, it will be too narrow and people will be overconfident when predicting new data.

All this is very confusing, so now (finally!) we can return to the data collected by those folks who turned in their homework and work through some examples.

There were 6 separate collections, which I’ll lump together with the clear knowledge that this violates the limits of our population subset (two samples were taken in foreign countries, one China and one New Jersey). This gave `x = 58` and `n = 635`.

The traditional estimate of θ is `58/635 = 0.091`, with the plus minus interval of `0.07 to 0.12`. Well, so what? Remember that our goal is to estimate the number of people who wear TSDs, so this classical estimate of θ is not of much use.
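The post does not say how this interval was computed, although the author later mentions R's `binom.test`, which reports the exact (Clopper-Pearson) interval. Assuming that is the method, the following Python sketch finds the same kind of interval by bisection using only the standard library. The function names are hypothetical, and the result should land close to, but not necessarily exactly on, the rounded figures quoted above.

```python
from math import exp, lgamma, log


def binom_cdf(k: int, n: int, p: float) -> float:
    """Pr(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


def bisect_increasing(g, lo: float = 0.0, hi: float = 1.0, steps: int = 100) -> float:
    """Root of an increasing function g on (lo, hi) by plain bisection."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson interval for theta given x successes out of n."""
    # Lower limit: the theta at which Pr(X >= x) rises to alpha / 2.
    lower = 0.0 if x == 0 else bisect_increasing(
        lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper limit: the theta at which Pr(X <= x) falls to alpha / 2.
    upper = 1.0 if x == n else bisect_increasing(
        lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper


x, n = 58, 635
theta_hat = x / n
lower, upper = exact_interval(x, n)
print(f"estimate {theta_hat:.3f}, interval {lower:.3f} to {upper:.3f}")
```

Note that everything printed here is a statement about θ, not about the number of people who wear TSDs.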

If we just plug in the best estimate of θ to estimate, out of 300 million (the approximate population of the U.S.A.), how many wear TSDs, we get a guess of 27.4 million with a plus-minus window of 27.39 to 27.41 million, which is a pretty tight guess! The length of that interval is only about 20,000 people wide. This is being pretty sure of ourselves, isn’t it?

If we use the modern estimate, we get a guess of 25.5 million, with a plus-minus window of about 19.3 to 31.7 million, which is much wider and hence more realistic. The length of this interval is 12.4 million! Why is this interval so much larger? It’s because we took full account of our uncertainty in the guess of θ, which the classical plug-in guess did not (we essentially recompute a new guess for every possible value of θ and weight them by the probability that θ equals each value: but that takes some math).
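The contrast between the two guesses can be sketched in a few lines of Python. This is an illustration under assumptions, not a reconstruction of the post's calculation: it assumes the plug-in guess is a binomial with θ fixed at 58/635, and that the modern guess is a beta-binomial with a Jeffreys Beta(1/2, 1/2) prior (as the author indicates in the comments). It summarises each by its mean and standard deviation rather than by an interval, so its numbers will not match the post's figures exactly. The function names are hypothetical.

```python
from math import sqrt


def plug_in_summary(n_new: int, x_old: int, n_old: int) -> tuple[float, float]:
    """Mean and sd of Binomial(n_new, theta_hat), with theta_hat = x_old / n_old."""
    theta_hat = x_old / n_old
    mean = n_new * theta_hat
    sd = sqrt(n_new * theta_hat * (1.0 - theta_hat))
    return mean, sd


def beta_binomial_summary(n_new: int, x_old: int, n_old: int,
                          a: float = 0.5, b: float = 0.5) -> tuple[float, float]:
    """Mean and sd of the beta-binomial predictive under a Beta(a, b) prior."""
    a_post = a + x_old                # posterior beta parameters
    b_post = b + n_old - x_old
    s = a_post + b_post
    mean = n_new * a_post / s
    var = n_new * a_post * b_post * (s + n_new) / (s**2 * (s + 1.0))
    return mean, sqrt(var)


x_old, n_old = 58, 635
n_new = 300_000_000
plug_mean, plug_sd = plug_in_summary(n_new, x_old, n_old)
bb_mean, bb_sd = beta_binomial_summary(n_new, x_old, n_old)

print(f"plug-in:       mean {plug_mean:,.0f}, sd {plug_sd:,.0f}")
print(f"beta-binomial: mean {bb_mean:,.0f}, sd {bb_sd:,.0f}")
print(f"sd ratio:      {bb_sd / plug_sd:,.0f}")
```

The plug-in standard deviation only reflects coin-flip variation around a θ treated as known, while the beta-binomial standard deviation also carries the uncertainty in θ, which does not shrink as `n_new` grows. Calling the same two functions with a small `n_new`, such as 80, gives standard deviations of similar size.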

Perhaps these numbers are too large to think about easily, so let’s do another example and ask how many people riding a car on the F train wear a TSD. The car at rush hour holds, say, 80 people. The classical guess is 7, with +/- of 3 to 13. The modern guess is also 7 with +/- of 2 to 12. Much closer to each other, right?

Well, how about all the students in a typical college? There might be about 20,000 students. The classical guess is 1830 with +/- 1750 to 1910. The modern is 1700 with +/- 1280 to 2120.

We begin to see a pattern. As the number of new people increases, the modern guess becomes a little lower than the classical one, and the uncertainty in the modern guess is realistically much larger. This begins to explain, however, why so many people are happy enough with the classical guesses: many samples of interest will be somewhat small, so all the extra work that goes into computing the modern estimate doesn’t seem worth it.

Unfortunately, that is only true because we had such a large initial data collection. If, for example, we only had Steve Hempell’s, which was `x = 1` and `n = 41`, and we were interested still in the F train, then the classical guess is 2 with +/- 0 to 5; and the modern guess 4 +/- 0 to 13! The difference between the two methods is again large enough to make a difference.
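For a train car of 80 people the predictive distributions are small enough to tabulate exactly. The sketch below does so for the `x = 1`, `n = 41` sample, again assuming a Jeffreys Beta(1/2, 1/2) prior for the modern guess and taking a central 95% interval from the cumulative probabilities. Both the interval rule and the function names are assumptions made for illustration, so the endpoints need not reproduce the post's numbers.

```python
from math import exp, lgamma, log


def log_choose(n: int, k: int) -> float:
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(p: float, q: float) -> float:
    return lgamma(p) + lgamma(q) - lgamma(p + q)


def binom_pmf(k: int, n: int, theta: float) -> float:
    """Plug-in predictive: Binomial(n, theta) probability of k."""
    return exp(log_choose(n, k) + k * log(theta) + (n - k) * log(1.0 - theta))


def beta_binom_pmf(k: int, n: int, a: float, b: float) -> float:
    """Beta-binomial probability of k, with theta ~ Beta(a, b) integrated out."""
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))


def central_interval(pmf_values, level: float = 0.95):
    """Smallest k at which the cumulative probability reaches each tail cutoff."""
    tail = (1.0 - level) / 2.0
    lower = upper = None
    cumulative = 0.0
    for k, p in enumerate(pmf_values):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = k
        if upper is None and cumulative >= 1.0 - tail:
            upper = k
            break
    return lower, upper


x_old, n_old, n_new = 1, 41, 80
theta_hat = x_old / n_old
a_post, b_post = 0.5 + x_old, 0.5 + n_old - x_old  # Jeffreys prior assumed

plug = [binom_pmf(k, n_new, theta_hat) for k in range(n_new + 1)]
bb = [beta_binom_pmf(k, n_new, a_post, b_post) for k in range(n_new + 1)]

lo_plug, hi_plug = central_interval(plug)
lo_bb, hi_bb = central_interval(bb)
print(f"plug-in interval:       {lo_plug} to {hi_plug}")
print(f"beta-binomial interval: {lo_bb} to {hi_bb}")
```

With so little old data the uncertainty in θ is large, so the beta-binomial interval reaches noticeably further to the right than the plug-in one.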

Once again, we have done a huge amount of work for a very, very simple problem. I hope you have read this far, but I would not have blamed you if you hadn’t because, I am very sorry to say, we are not done yet. Everybody who remembers `M1` raise their hands? Not too many. Yes, all these guesses were conditional on `M1` being true. What if it isn’t? At the least, it means that the guesses we made are off a little and that we must widen our plus-minus intervals to take into account our uncertainty in the correctness of our model.

Which I won’t do because I am, and you are probably, too fatigued. This is a very simple problem, like I said. Imagine problems with even more complicated statistics where uncertainty comes at you from every direction. There the differences between the classical and modern way are even more apparent. Here is the second answer for our homework:

1. Too many people are far too certain about too many things

## 31 Thoughts

1. I finally decided to write a comment on your blog. I just wanted to say good job. I really enjoy reading your posts.

Tina Russell

2. Bruce Foutch says:

To paraphrase Gary Larson’s The Far Side:

“Mr. Briggs, may I be excused? My brain is full.”

My brain capacity aside, I thank you for continuing to provide an understandable overview of a very difficult subject. In some of my other activities the term “sheeple” is often used with obvious connotation. Therefore, another thank you for doing all this work to help some of us sheeple break free of the flock.

3. noahpoah says:

Well, I finished the post, and I’m not fatigued (but that’s just because I have a higher than normal/healthy capacity for reading about statistics). I’m curious to know more about how you calculated the classical and ‘modern’ error ranges on theta.

Also, I know it’s nit-picky, and correct me if I’m wrong, but x/n isn’t the (or a) mean. The parameter in a binomial model is a probability, though not an observed, or even observable, one. Rather, it’s the purported underlying probability of a ‘success’ (e.g., the presence of a TSD).

4. Bernie says:

Noahpoah!!
Good post. I am also interested in the mysterious variance between classic and modern error ranges.

5. Bernie says:

Matt:
Interesting post.
Now I have a question that is more related to psychological predispositions to see patterns where none exist.
My wife has an idiographic memory. She swears that when she sees the star or galaxy screen saver that she sees a recurring pattern. I argue that in principle, because it is a screen saver, the apparent patterns are simply apparent and not real, because the screen saver is driven by one or more random number generators governing position, brightness, direction so that no pixels will be “burned in”. By comparison, I have a lousy memory so have a small chance of seeing a pattern if one did in fact exist. My wife’s response to randomness is that randomness is not infinite!! How can we definitively resolve this question and save our 30 year marriage!!

p.S. She had trouble with logarithms, so there is no deep mathematical reasoning in her potential aphorism.

6. Briggs says:

Noah,

The estimate `x/n` is indeed the mean: think of everybody who wore a TSD having `x = 1`, and everybody who did not having `x = 0`. So `x/n` is just the mean.

The classic interval on θ is just the standard confidence interval assuming a binomial distribution (in R `binom.test`).

The modern interval uses a binomial too, with a beta Jeffreys prior on θ. The posterior on θ, which gives the interval, is also beta.

The classic plug-in predictive is still binomial; the modern predictive interval turns out to be a beta-binomial.

I’m putting all this stuff in my Stats 101 book.

Briggs

7. Briggs says:

Bernie,

I have to agree with your wife. Each pattern she sees is certainly real—unless her eyes are tricking her into seeing dots where none are; visual illusions are common enough. But supposing that she isn’t hallucinating, then each pattern she sees is there.

She’s also right about the computer-generated randomness not being infinite.

However, so what? I mean, some algorithm (I’ll describe this in my posting debunking the mysticism of randomness) chose those points and placed them on the screen in the pattern you see.

The tautology “every pattern is a pattern” might help us understand what is happening. Not every pattern produced matches exactly something else familiar, like a face or car or etc. But every pattern matches inexactly something familiar; the matching can be as inexact as you like, using any closeness metric.

Thus, it is possible that your wife sees patterns that seem to repeat themselves, however inexactly. If you marry that to a desire to find common points, then you are almost sure to come up with matches.

In other words, I think you were both right.

Briggs

8. Bernie says:

Matt:
Thanks for the response – I think! I can’t tell you how pleased my wife is. She will never listen to me again – but then it was a rare event before!!

As to your response – I absolutely agree. But I do believe that the ability to see patterns in complex arrays of data is a function of the attributes of the array, the observer’s cognitive and physiological capabilities, and the observer’s belief or expectation that patterns do or don’t exist. My wife’s visual memory predisposes her to recognize repeated identical patterns that others would not see and her unfamiliarity with how screen savers are created leaves her with an expectation that patterns will exist.

On the other hand, can you write out the difference in classic and modern interval using standard notation?

Finally, when is the book due?

9. steven mosher says:

WTF Over ( hey I had a TS/SAR, you crypto guy)

“with a beta Jeffreys prior on θ The posterior on θ which gives the interval, is also beta.”

That Engilsh is greek to me son. splain it agin. use picktures and crayons and short words.

10. Briggs says:

Steven,

More pictures and equations coming soon. I’ll let you page count the first draft of my book.

Briggs

11. steven mosher says:

i started reading the book, so far so good. a couple typos and a bit chatty, but i like the style. keep up the good work

12. Briggs says:

Steven,

That’s a pretty old copy. When I get back from Spain in about two weeks, I’ll be working more full time on the two books. I’ll post about them later.

13. steven mosher says:

Ok, I’m looking forward to it

14. W J B says:
15. Let me try to answer Stephen’s question.

Bayesians (the moderns) assume in effect that a smaller set of possible outcomes are possible than do frequentists (the classicals). Frequentists set up an experiment with rules. Bayesians wing it. Both use math, but the Bayesians use MORE math, because they think they are smarter than dumb old frequentists. Bayesians have more self-confidence; frequentists have less. And it shows in the results!

The lesson here is that people who wear thinking suppression devices generally have more fun, whereas the thinkers are all bound up in worry knots. Ever really look at Rodin’s statue The Thinker? Try to contort your body into that position. I guarantee it hurts!

16. Briggs says:

Mike,

Ouch. Hmm. I would say that there is more math involved, usually, in frequentist calculations. After all, each problem requires a new statistic, whose distribution must be calculated (and approximated many different ways). And of course, there are usually dozens of competitor statistics for every problem. In fact, most mathematical statisticians are frequentists. Which isn’t surprising, since a lot of those guys come to it from probability, the mathematical field of probability, I mean, which tends to eschew interpretation and focus on theorems.

All of which is fine. But, loosely, statistics is applied probability, so I argue the probability statements you make for real data have to be realistic. Which also makes me unlike subjective Bayesians. I am not one of those creatures either. Besides willful subjectivity, they too concentrate their attentions almost solely on non-observable parameters, just like frequentists.

My goal is now to convince you that philosophy matters. That we have taken a lot for granted when we should not have been.

Oh, yeah. Just like Jesse “The Mind” Ventura, I tried out The Thinker and found the position comes natural. I’m thinking of putting my pose on my business cards, because I look good.

17. Bernie says:

It sounds like there should be some interesting philosophically oriented references that contrast “frequentists” and Bayesians. Any recommendations for those who have forgotten their calculus?

18. Bernie says:

Matt:
I know my homework is late, but I don’t normally hang around public places staring at people. But this morning I had to take the T and being forced to stand I had nothing else to do but stare at people. So, between 8:10 and 8:25 I observed 45 people of highly varied backgrounds. Of these 6 were wearing ear pieces (to what they were attached I know not). They seemed to be moving to some unseen rhythm so I doubt they were books on tape or language tapes or the modern equivalents.
Of the six, 3 were men and 3 were women.

19. Steve Hempell says:

Briggs,

This is OT, but there is some very interesting stuff statistically going on at Watts up with that. Can we get another statistician in on this? Just think it’s interesting that some different statistics are being applied to the temperature data – somewhat like you did with RSS.

20. Dear Dr. Briggs,

If you could help out…

“Pr( xnew = knew | nnew, xold, nold, M1)”

Would it be too much to ask that you take a look at the statements you’re making, and maybe use a larger or different font? I’m hoping that at least some of the folks who view my blog come over here to learn something new.

I really like what you’re doing. But…I don’t know if it’s my browser (Mozilla) but your statements are hard to read. Middle-age problem? Mebbe.

Also, a little more narrative on the meaning of these statements would be nice. A lot of people haven’t been in a classroom for a lotta years. I appreciate what you’re doing. Mebbe some footies for the over forty crowd? Please?

I’d hate to think that folks who are marginally math challenged give up when they hit the statements.

Oh, and for the advanced guys commenting here, please don’t get harsh on comments from newbs like me, okay? I tend to ask the most stupid questions. But am like a pit bull after that. Tenacious, that’s me.

Thank you for what you’re doing. Teaching is like a drug, innit?

21. mr b says:

going the distance—I will.

22. Dr. Briggs,

I apologize. It was late and I wasn’t thinking. I love your blog and look forward to your treatise on the mysticism of randomness. I hope you also touch on the Laws of Probability and why they aren’t called theories.

I agree that philosophy matters. Statistics is more than playing with numbers; it is the mathematical language of rationality and the scientific method. It is the Cult of Logical Inference, and as a card-carrying cultist I accept on faith that logical inferences exist. Propositions are either probably true or probably false, and we can measure the probability, but only because the Truth is out there somewhere. The moderns and classicals are not different in that respect.

As for looking good, you’ll have to go some to beat Miss Teen South Carolina in a short skirt. But this blog is exploring some lovely ideas, so I’ll have to factor that in.

PS Are you saying that even though A Rod bats .333, he never gets a third of a hit?

23. re: # 20

And lucia and tamino, too. And of course tamino and Steve McIntyre on PCA.

It’s all getting very confusing to me.

If stats was maths we wouldn’t be in this situation. In maths either it is or it isn’t. Plus some physical causality would be very helpful.

24. steven mosher says:

Would be nice to get a Bayesian angle on Lucia’s work.. Mostly so I could get a grasp of the Bayesian thing

25. Bernie says:

I think the big belly linked to dementia story fits into this topic.
See for example http://abcnews.go.com/Health/wireStory?id=4534488
First, I hasten to add that I am not overweight and have a 34″ waist.
What struck me was how poorly articulated were the findings from this study – which is also over 2 years old to boot. The logic boils down to – we see a correlation between body shape and dementia, therefore something that is associated with body shape causes changes in the brain that leads to dementia. At this point, the authors would have been better off saying
“we are not certain if this reflects a real causal relationship and if there is a real causal relationship what the nature of that process is.”
But then the reporter would be forced to ask themselves is the story worth printing. Insurance companies – like Kaiser Permanente who sponsored the study – may simply take the actuarial view and scare the fat off people.

26. Martin Ringo says:

Suppose your model were the following:
The probability of any individual wearing the device, D, is a function of the group (read age, sex, nationality, education, etc. etc.), the time (hour, day of week, etc.), some random component, which we might call taste preferences, and whether the individual was wearing the device in period t-1.
`P(i(g),t) = f( g, t, error) + rho*X(i,t-1)`

Now if we take a single sample, even very large, we have both a nasty problem with the distribution of the estimator — probably can only get it with empirical methods — and a nastier problem of sampling. We certainly don’t have the classical random sample with i.i.d. variates.

If I told you that group probabilities (for some “average” t and error) were essentially uniform, wouldn’t you have a tendency to ignore any one or two or M-1 samples (where M being the smallest number to feel confident about the distribution of group probabilities)?

Anyway, while I am not an epidemiologist — although I do get sick — it seems that the basic problem is one of epidemiology, at least to assessing the population propensity of incidence or carrying.

27. Bruce Foutch says:

Good morning Dr. Briggs,

“horror vacui”

I see an empty green tab at the top of your blog page. I wonder if you would consider making it into a “References” tab, providing links to references mentioned in your blog and/or to information, including your past blog entries, that you think might provide some of us non-statistician folks a core of basic information that we can refer to, to better follow along with your teachings.

Perhaps the following paper and rebuttal might be candidates:

- Published August 30, 2005 – ESSAY: Why Most Published Research Findings Are False. Ioannidis JPA. PLoS Medicine Vol. 2, No. 8, e124. doi:10.1371/journal.pmed.0020124

- Published April 24, 2007 – CORRESPONDENCE: Why Most Published Research Findings Are False: Problems in the Analysis. Goodman S, Greenland S. PLoS Medicine Vol. 4, No. 4, e168. doi:10.1371/journal.pmed.0040168

28. Chuck Peterson says:

Imagine the study done in 10 random cities worldwide in 10 locations each (ranked say into 10 economic zones, 5 easily near transportation and 5 not or something, where we get a representative mix of people). Percentage of people using an electronic device of some sort; computer, cell phone, music player. Results for each city:

1, 20, 15, 8, 30, 55, 45, 2, 18, 5

“A recent study of cities worldwide shows one out of five people uses some sort of electronic device when outside.”

Even if the same number were counted in each location in each city (say 1000 each) and the cities are somewhat fairly dispersed and not all alike, does that 20% really tell us anything? Also, perhaps the next day the numbers would be vastly different. And where were the locations of the places with the lower percentages versus higher? What time of day? Places where 1000 people were observed in what period of time versus another? This tells us nothing.

29. TCO says:

30. TCO says:

And please move the bookmark thing to the other side of the comment post or get rid of it or something. I hit it every time I try to post.