
Author: Briggs

March 21, 2008 | 31 Comments

Homework #1: Answer part II

In part I, we learned that all surveys, and in fact all statistical models, are valid only conditionally on some population (or information). We went into nauseating detail of the conditional information on our own survey of people who wear thinking suppression devices (TSDs; see the original posts), so I’ll skip repeating any of it again.

Today, we look at the data and ignore all other questions. The first matter we have to understand is: what are probability models and statistics for? Although we use the data we just observed to fit these models, they are not for that data. We do not need to ask probability questions of the data we just observed. If we want the probability that all the people in our sample wore TSDs, we just look and see if all wore them or not. The probability is 0 or 1, and is 0 or 1 for any other question we can ask about the observed data (e.g. what is the probability that half or more wore them? Again, 0 or 1).

Thus, statistics are useful only for making inferences about unobserved data: usually future data, but really just unknown to you. If you want to make statements or quantify uncertainty in data you have not yet seen, then you need probability models. Some would say statistics are useful for making inferences about unobserved and unobservable parameters, but I’ll try to dissuade you of that opinion in this essay. We have to start, however, with describing what these parameters are and why so much attention is devoted to them.

Before we do, we have to return to our question, which was roughly phrased in English as “How many people wear TSDs?”, and we have to turn it into a mathematical question. We do this by forming a probability model for the English question. If you’ve read some of my earlier posts, you might recall that we have an essentially infinite choice of models which we could use. What we would like is if we could limit our choice to a few or, best of all, to logically deduce the exact model given some set of information that we believe true.

Here is one such statement: M1 = “The probability that somebody wears a TSD (at the locations and times specified for our exactly defined population subset) is fixed, or constant, and knowing whether one person wears a TSD gives us no information whether any other person wears a TSD.” (Whenever you see M1, substitute the sentence “The probability…”)

Is M1 true? Almost certainly not. For example, if two people walk by our observation spot together, say a couple, it might be less likely for either to wear a TSD than it is for two separate people. Again people (not all people, anyway) aren’t going to wear a TSD at all hours equally often, and not equally often at all locations within our subset either.

But let’s suppose that M1 is true anyway. Why? Because this is what everybody else does in similar situations, which they do because it allows them to write a simple and familiar probability model for the number of people x out of n wearing TSDs. Here is the model for the data we just observed:

Pr( x = k | n, θ, M1)

This is actually just a script or shorthand for the model, which is some mathematical equation (binomial distribution), and not of real interest; however it is useful to learn how to read the script. From left to right, it is the probability that the number of people x equals some number k given we know n, something called θ, and M1 is true. This is the mathematical way of writing the English question.
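For readers who want to see the equation behind the script, here is the standard binomial formula, written with the same symbols as above (the notation M_1 for M1 is only a typesetting choice):

```latex
\Pr(x = k \mid n, \theta, M_1) \;=\; \binom{n}{k}\,\theta^{k}\,(1-\theta)^{n-k},
\qquad k = 0, 1, \dots, n .
```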

The variable x is more shorthand meaning “number of people who wore a TSD”. Before we did our experiment, we did not know the value of x, so we say it was “random.” After we see the data we know k, the actual number of people out of the n people we saw who did wear a TSD. OK so far? We already understand what M1 is, so all that is left to explain is θ. What is it?

It is a parameter, which if you recall previous posts, is an unobservable, unmeasurable number, but which is necessary in order to formulate our probability model. Some people incorrectly call θ “the probability that a single person wears a TSD.” This is false and is an example of the atrocious and confusing terminology so often used in statistics (look in any introductory text and you’ll see what I mean). θ, while giving the appearance of one, is no sort of probability at all. It would be a probability if we knew its value. But we do not: and if we did know, we never would have bothered collecting data in the first place! Now, look carefully. θ is written on the right hand side of the “|”, which is where we put all the stuff that we believe we know, so again it looks as if we are saying we know θ, so it looks like a probability.

But this is because the model is incomplete. Why? Remember that we don’t really need to model the observed data if that is all we are interested in. So the model we have written is only part of a model for future data. There are several pieces that are missing. Those pieces are another probability model for the value of θ, a model for just the observed data, a model for the uncertainty in θ given the observed data, and the data model itself again, all of which are mathematically manipulated to produce this creature:

Pr( x_new = k_new | n_new, x_old, n_old, M1)

which is the true probability model for new data given what we observed with the old data. There is no way that I can even hope to explain this new model without resorting to some heavy mathematics. This is in part why classical statistics just stops with the fragmentary model, because it’s easier. In that tradition, people create a (non-verifiable) point estimate of θ, which means just plugging some value for θ into the probability model fragment, and then call themselves done.
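A rough sketch of that manipulation, for the curious. The first line is just the rule of total probability applied to the unknown θ. The second line holds only under an extra assumption that is not stated in this post: that the model for θ before seeing data is a Beta(α, β) distribution (α = β = 1 would be the "flat" choice). Here B is the beta function.

```latex
\Pr(x_{new} = k_{new} \mid n_{new}, x_{old}, n_{old}, M_1)
  = \int_0^1 \Pr(x_{new} = k_{new} \mid n_{new}, \theta, M_1)\;
             p(\theta \mid x_{old}, n_{old}, M_1)\, d\theta

  = \binom{n_{new}}{k_{new}}
    \frac{B\!\left(k_{new} + \alpha + x_{old},\;
                   n_{new} - k_{new} + \beta + n_{old} - x_{old}\right)}
         {B\!\left(\alpha + x_{old},\; \beta + n_{old} - x_{old}\right)}
  \quad \text{(assuming a Beta}(\alpha, \beta)\text{ prior)}
```

Notice that θ no longer appears on the right hand side of the “|”: it has been averaged out.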

Well, almost done. Good statisticians will give you some measure of uncertainty of the guess of θ, some plus or minus interval. (If you haven’t already, go back and read the post “It depends on what the meaning of mean means.”) The classical estimate used for θ is just the computed mean, the average of the past data. So the plus and minus interval will only be for the guess of the mean. In other words, just as it was in regression models, it will be too narrow and people will be overconfident when predicting new data.

All this is very confusing, so now (finally!) we can return to the data collected by those folks who turned in their homework and work through some examples.

There were 6 separate collections, which I’ll lump together with the clear knowledge that this violates the limits of our population subset (two samples were taken in foreign countries, one in China and one in New Jersey). This gave x = 58 and n = 635.

The traditional estimate of θ is 58/635 = 0.091, with the plus minus interval of 0.07 to 0.12. Well, so what? Remember that our goal is to estimate the number of people who wear TSDs, so this classical estimate of θ is not of much use.
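The post does not say which classical formula produced the plus-minus interval. As a minimal sketch, here is one common choice, the Wilson score interval at the 95% level (an assumption on my part, as is the function name `wilson_interval`), which happens to round to the same numbers:

```python
import math

def wilson_interval(x, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

x_old, n_old = 58, 635          # pooled homework data from the post
theta_hat = x_old / n_old       # classical point estimate of theta
lo, hi = wilson_interval(x_old, n_old)
print(f"theta_hat = {theta_hat:.3f}, interval {lo:.2f} to {hi:.2f}")
```

Other classical intervals (for example the simpler "estimate plus or minus two standard errors") give slightly different end points.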

If we just plug in the best estimate of θ to estimate, out of 300 million (the approximate population of the U.S.A.), how many wear TSDs, we get a guess of 27.4 million with a plus-minus window of 27.39 to 27.41 million, which is a pretty tight guess! The length of that interval is only about 20,000 people wide. This is being pretty sure of ourselves, isn’t it?
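The plug-in arithmetic can be sketched as follows, assuming the window is the usual normal approximation (mean plus or minus 1.96 standard deviations) to a binomial with θ fixed at its estimate. The variable names are mine.

```python
import math

x_old, n_old = 58, 635
n_new = 300_000_000                  # approximate U.S. population used in the post

theta_hat = x_old / n_old            # theta is treated as if it were known exactly
plug_mean = n_new * theta_hat
plug_sd = math.sqrt(n_new * theta_hat * (1 - theta_hat))   # binomial standard deviation

plug_lo = plug_mean - 1.96 * plug_sd
plug_hi = plug_mean + 1.96 * plug_sd
print(f"guess {plug_mean / 1e6:.1f} million, "
      f"window {plug_lo / 1e6:.2f} to {plug_hi / 1e6:.2f} million")
```

The window is tiny because the only uncertainty left in this calculation is the binomial scatter around a θ we are pretending to know.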

If we use the modern estimate, we get a guess of 25.5 million, with a plus-minus window of about 19.3 to 31.7 million, which is much wider and hence more realistic. The length of this interval is 12.4 million! Why is this interval so much larger? It’s because we took full account of our uncertainty in the guess of θ, which the classical plug-in guess did not (we essentially recompute a new guess for every possible value of θ and weight them by the probability that θ equals each value: but that takes some math).
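Here is a minimal sketch of why the window widens so much. It assumes a flat Beta(1, 1) prior on θ, which is my assumption and not necessarily what was used for the figures above, so the numbers it prints need not match them. Under that assumption the predictive distribution is a beta-binomial, whose mean and standard deviation have closed forms.

```python
import math

def beta_binomial_mean_sd(n_new, a, b):
    """Mean and standard deviation of a beta-binomial(n_new, a, b) distribution."""
    mean = n_new * a / (a + b)
    var = n_new * a * b * (a + b + n_new) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

x_old, n_old = 58, 635
n_new = 300_000_000

# ASSUMPTION: flat Beta(1, 1) prior, so the posterior for theta is Beta(1 + x, 1 + n - x).
a, b = 1 + x_old, 1 + (n_old - x_old)
pred_mean, pred_sd = beta_binomial_mean_sd(n_new, a, b)

# Plug-in standard deviation for comparison (theta treated as known).
theta_hat = x_old / n_old
plug_sd = math.sqrt(n_new * theta_hat * (1 - theta_hat))

print(f"predictive sd: {pred_sd / 1e6:.2f} million")
print(f"plug-in sd:    {plug_sd / 1e6:.4f} million")
```

The predictive standard deviation comes out in the millions while the plug-in one is in the thousands: nearly all of the real uncertainty is uncertainty about θ, which the plug-in guess throws away.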

Perhaps these numbers are too large to think about easily, so let’s do another example and ask how many people riding a car on the F train wear a TSD. The car at rush hour holds, say, 80 people. The classical guess is 7, with +/- of 3 to 13. The modern guess is also 7 with +/- of 2 to 12. Much closer to each other, right?

Well, how about all the students in a typical college? There might be about 20,000 students. The classical guess is 1830 with +/- 1750 to 1910. The modern is 1700 with +/- 1280 to 2120.

We begin to see a pattern. As the number of new people increases, the modern guess becomes a little lower than the classical one, and the uncertainty in the modern guess is realistically much larger. This begins to explain, however, why so many people are happy enough with the classical guesses: many samples of interest will be somewhat small, so all the extra work that goes into computing the modern estimate doesn’t seem worth it.

Unfortunately, that is only true because we had such a large initial data collection. If, for example, we only had Steve Hempell’s, which was x = 1 and n = 41, and we were interested still in the F train, then the classical guess is 2 with +/- 0 to 5; and the modern guess 4 +/- 0 to 13! The difference between the two methods is again large enough to make a difference.
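For a small car-sized problem the predictive distribution can be computed exactly. The sketch below again assumes a flat Beta(1, 1) prior (my assumption, so the interval end points may differ from those quoted above) and uses helper names of my own invention.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n_new, a, b):
    """Pr(x_new = k) for a beta-binomial(n_new, a, b) distribution."""
    log_choose = (math.lgamma(n_new + 1) - math.lgamma(k + 1)
                  - math.lgamma(n_new - k + 1))
    return math.exp(log_choose + log_beta(k + a, n_new - k + b) - log_beta(a, b))

def central_interval(pmf, level=0.95):
    """Smallest k values at which the cumulative probability reaches each tail cutoff."""
    tail = (1 - level) / 2
    cum, lo, hi = 0.0, None, None
    for k, p in enumerate(pmf):
        cum += p
        if lo is None and cum >= tail:
            lo = k
        if hi is None and cum >= 1 - tail:
            hi = k
            break
    return lo, hi

x_old, n_old = 1, 41          # Steve Hempell's sample
n_new = 80                    # one F train car at rush hour
a, b = 1 + x_old, 1 + (n_old - x_old)   # ASSUMPTION: flat prior on theta

pmf = [beta_binomial_pmf(k, n_new, a, b) for k in range(n_new + 1)]
pred_mean = sum(k * p for k, p in enumerate(pmf))
pred_lo, pred_hi = central_interval(pmf)
print(f"modern guess about {pred_mean:.1f}, window {pred_lo} to {pred_hi}")
```

With only 41 observations, the uncertainty in θ dominates even for a prediction about 80 people, which is why the two methods part ways again.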

Once again, we have done a huge amount of work for a very, very simple problem. I hope you have read this far, but I would not have blamed you if you hadn’t because, I am very sorry to say, we are not done yet. Everybody who remembers M1 raise their hands? Not too many. Yes, all these guesses were conditional on M1 being true. What if it isn’t? At the least, it means that the guesses we made are off a little and that we must widen our plus-minus intervals to take into account our uncertainty in the correctness of our model.

Which I won’t do because I am, and you are probably, too fatigued. This is a very simple problem, like I said. Imagine problems with even more complicated statistics where uncertainty comes at you from every direction. There the differences between the classical and modern way are even more apparent. Here is the second answer for our homework:

  1. Too many people are far too certain about too many things
March 20, 2008 | 4 Comments

Homework #1: Answer part I

A couple of days ago I gave out homework. I asked my loyal readers to count how many people walked by them and to keep track of how many of those people wore a thinking-suppression device like an I-pod etc. Like every teacher, my heart soared like a hawk when some of the students actually completed the task. Visit the original thread’s comments to see the “raw” data.

The project was obviously to recreate a survey of the kind which we see daily: e.g. What percent of Americans favor a carbon tax? What fraction of the voters want “change”? How many prefer Brand A? And so on.

Here is how a newspaper might present the results from our survey:

More consumers are endangering their hearing than ever before, according to new research by Over 20% of consumers now never leave the house without an I-pod or I-pod-like device.

“Music is very popular” said Dr Briggs, “And now it’s easier than ever before to listen to it.” This might help explain the rise in tinnitus reports, according to some sources. Dr So Undzo of the Send Us Money to Battle Tinnitus Foundation was quoted as saying, “Blah blah blah.” He also said, “Blah blah blah blah blah.” &tc. &tc.

Despite its farcical nature, this “news” report is no different than the dozens that show up on TV, the radio, and everywhere else. In order to tell a newsworthy story, it extrapolates wildly from the data at hand, it gives you no idea who collected the original data or why (for money? for notoriety?) or how (by observation? by interview?), or of any of the statistical methods used to manipulate the data. In short: it is very nearly worthless. The only advantage a story like this has is that it can be written before any data is actually taken, saving time and money to the news organization issuing it.

But you already knew all that. So let’s talk about the real problem with statistics. Beware, however, that some of this is dull labor, requiring attention to detail, and probably too much work for too little content. However, that’s how they get you, by hoping you pass by quickly and say “close enough.”

We had five to six responses to the homework so far, but we’ll start with the first one from Steve Hempell. He saw n=41 people and counted m=1 wearing a thinking-suppression device (TSD). He sat on a bench in a small town during spring break to watch citizens pass by.

The first thing we need to have securely in our minds is what question we want to answer with this data. The obvious one is “How many people regularly wear a TSD?” This innocent query begins our troubles.

What do we mean by “people”? All people? There are a little over 6 billion humans now. Do we want an estimate from that group? What about historical, i.e. dead, people, or those yet to be born? How far back into the future or past do we want to go? Are we talking of people “now”? Maybe, but we still have to define “now”: does it mean in a year or two, or just the day the survey was taken or a few days into the future? Trivial details? Well, we’ll see. Let’s settle on the week after the survey was taken so that our question becomes “How many people in the week after our survey was taken regularly wear a TSD?”

We’re still not done with “people” and haven’t decided whether it was all humans or some subset. The most common subset is “U.S. Americans” (as Miss Teen South Carolina would have phrased it). But all U.S. citizens? Presumably, infants do not wear TSDs, nor do many in nursing homes or in other incarcerations. Were infants even counted in the survey? Older people in general, experience tells us, do not often wear TSDs. As I think about this question, I find myself unable to rigorously quantify the subset of interest. If I say “All U.S. citizens” then my eventual estimate would probably be too high, given this small sample. If I say, “U.S. citizens between the ages of 15 and 55” then I might do better, but the survey is of less interest.

To pick something concrete, we’ll go with “All U.S. citizens” which modifies our question to “How many U.S. citizens in the week after our survey was taken regularly wear a TSD?”

Sigh. Not done yet. We still have to tackle “regularly” and the bigger question of whether or not our sample fairly represents the population we have in mind, and that would still leave the largest, most error-prone area: what exactly is a TSD? I-pods were identified, but how about cell phones or Blackberries and on and on? Frankly, however, I am bored.

Like I said, though, boredom is the point. No one wants to invest as much time in each survey they meet as we have in this simple one. No matter how concrete the appropriate population in a survey seems to you, it can mean something entirely different to somebody else; each person can take away their own definition. This ambiguity, while frustrating to me, is gold to marketers, pollsters, and “researchers.” So vaguely worded are surveys that the reader can supply any meaning they want to its results. Although they are usually not consciously aware of it, people read surveys like they read horoscopes or psychic readings: they always seem accurate or to confirm people’s worst fears or hopes.

An objection might have occurred to you. “Sure, these complex surveys are ambiguous. But there are simple polls that are easy to understand. The best example is ‘Who will you vote for, Candidate A or B?’ Not much to confuse there.”

You mean, since a poll is a prediction of ballot results, besides trusting that the pollster found a population representative of people who will actually vote on election day? That no event between the time the poll was taken and the election occurs that will cause people to change their minds? And—pay attention here—nobody lied to the pollster?

“Oh, too few people lie to make a difference.” Yeah? Well, I live in New York City and I like to tell the story of the exit polls taken for the presidential race between Kerry and Bush. Those polls had Kerry ahead by about 10 to 1, a non-surprising result, and one which confirmed people’s prior beliefs. The pollsters asked tons of voters and were spread throughout the city in an attempt to obtain the most representative sample they could. Not everybody would answer them, of course, and that is still another problem which is impossible to tackle.

But when the actual results were tallied, Kerry won by a margin of only a little under 5 to 1. Sure, he still won, but the real shocker is that so many people lied to the pollster. And why? Well, this is New York City, and in Manhattan particularly, you just cannot easily admit to being a Bush supporter (then or now). At the least, doing so invites ridicule, and who needs that? Simpler just to lie and say, “I voted for Kerry.”

We have done a lot and we still haven’t answered the question of how to handle the actual data!

Here are the answers to part I of the homework.

  1. The applicability of all surveys is conditional on a population which must be, though rarely is, rigorously defined.
  2. All surveys have significant measurement error that has nothing to do with the actual numerical data.
  3. Because of this, people are too certain when reading or interpreting the results of surveys.

In part II, if we are not already worn down, we will learn how to—finally!—handle the data.

March 17, 2008 | 13 Comments

Homework #1

I was reminded of this homework problem that I give my students as I was riding in on the F train this morning. It is a very good problem because it is exceedingly simple and nicely demonstrates two problems of the classical way of looking at statistics.

All you need to do this homework is a busy place and some free time, about 20 minutes.

Find a spot where people congregate or pass by. Be sure to carefully and concretely specify this place: keep its boundaries fixed and rigid for the duration of the homework.

Count the people in the spot, either all at once, or as they pass by for some fixed time (decided in advance). Also count the number of people who are wearing some sort of thinking-suppression device. There are obviously any number of other things you can take note of, like sex, age, etc., but we’ll ignore all of them.

Report back to me (in the Comments) the two numbers: the total number of people, and the number wearing thinking-suppression devices, which will be less than or equal to the total. Also note details of your spot.

We are obviously going to be talking about forming ratios and estimating probabilities. I’ll discuss what all this means, and what it does not mean, once a few people have turned in the assignment.

Oh, yes. A thinking-suppression device is anything like an I-Pod, MP3 player, etc. etc.

March 14, 2008 | 10 Comments


Here is the link to the symposium which I mentioned a few weeks back. It is being sponsored by the Ramón Areces Foundation and the Royal Academy of Sciences of Spain, and will be held in Madrid on the 2nd and 3rd of April. Part of the introduction says:

The Royal Academy of Sciences of Spain and the Ramón Areces Foundation wish to contribute to the creation of an informed public opinion on global change in the country. To this end, they are organising a two-day symposium aimed at scientists from different fields, decision makers and general public. Existing facts and analysis tools will be discussed, and the robustness and uncertainties of predictions made on the basis of the former, critically assessed. The meeting will provide a scientific view of existing knowledge on climate change and its expected consequences. Existing physical, chemical and mathematical tools will be discussed and climate effects will be analysed together with other concurrent changes, which tend to be overlooked in the climate change scenarios.

Presentations by the different contributors will emphasise existing scientific evidence as well as the strengths and weaknesses of predictions made on the basis of available data and modelling tools. Contributors are encouraged to express their opinions on the most relevant problems concerning the topics they will present, including scientific issues, main threats and possible mitigation or adaptation strategies.

The program is now online. My talk is entitled “Robustness and uncertainties of climate change predictions”. The deadline for me to turn it in is today. I am still working on it and not at all satisfied that I have done a good job with my topic. I am simultaneously writing a paper and the talk, and I will post both of them here, not un-coincidentally, on 1 April.

The gist of my talk I have summarized:

Global warming is not important by itself: it becomes significant only when its effects are consequential to humans. The distinction between questions like “Will it warm?” and “What will happen if it warms?” is under-appreciated or conflated. For example, when asking how likely are the results of a study of global warming’s effects, we are apt to confuse the likelihood of global warming as a phenomenon with what might happen because of global warming, when of course the two kinds of questions and likelihoods are entirely separate.

Because of the frequency of confusion, I want to follow the path to the conclusion of one particular study whose results state A = “There will be more kidney and liver disease, ambulance trips, etc. because of global warming.” I start from first principles, and untangle and carefully focus on the chain of causation leading up to this central claim, and quantify the uncertainty of the steps along the way.

In short, I will estimate the probability that AGW is real, the probability that some claim of global warming’s effects is true given global warming is true, and the unconditional probability that the effect is true. That’s not too much to tackle, is it?
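As a rough sketch of how those three probabilities fit together, write W for “global warming is real” (W is a symbol introduced here only for illustration) and A for the study’s claim. The rule of total probability then gives:

```latex
\Pr(A) \;=\; \Pr(A \mid W)\,\Pr(W) \;+\; \Pr(A \mid \text{not } W)\,\Pr(\text{not } W)
```

Since both Pr(A | W) and Pr(W) are at most 1, their product can only be as large as the smaller of the two: being nearly sure of W does not by itself make A likely.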

Thank God there will be simultaneous translation of the conference, because my Spanish is getting worse and worse the more I think about it. If I was going to play soccer, then I’d be on more familiar ground. I do know how to ask that a ball be passed to me because I am alone and unguarded, and how to offer constructive criticism to a fellow teammate for not recognizing this fact and for taking a ridiculous shot at goal himself. But I am not sure how this language would apply to global warming.