Posts filed under 'Bad Statistics'
The Institute of Mathematical Statistics (I am a member) has issued a report on the widespread misuse of Citation Statistics.
The full report may be found here.
The non-surprising main findings are:
- Statistics are not more accurate when they are improperly used; statistics can mislead when they are misused or misunderstood.
- The objectivity of citations is illusory because the meaning of citations is not well-understood. A citation’s meaning can be very far from “impact”.
- While having a single number to judge quality is indeed simple, it can lead to a shallow understanding of something as complicated as research. Numbers are not inherently superior to sound judgments.
The last point is not just relevant to citation statistics, but applies equally well to many areas, such as (thanks to Bernie for reminding me of this) trying to quantify “climate sensitivity” with just one number.
More findings from the report:
- For journals, the impact factor is most often used for ranking. This is a simple average derived from the distribution of citations for a collection of articles in the journal. The average captures only a small amount of information about that distribution, and it is a rather crude statistic. In addition, there are many confounding factors when judging journals by citations, and any comparison of journals requires caution when using impact factors. Using the impact factor alone to judge a journal is like using weight alone to judge a person’s health.
- For papers, instead of relying on the actual count of citations to compare individual papers, people frequently substitute the impact factor of the journals in which the papers appear. They believe that higher impact factors must mean higher citation counts. But this is often not the case! This is a pervasive misuse of statistics that needs to be challenged whenever and wherever it occurs.
- For individual scientists, complete citation records can be difficult to compare. As a consequence, there have been attempts to find simple statistics that capture the full complexity of a scientist’s citation record with a single number. The most notable of these is the h‐index, which seems to be gaining in popularity. But even a casual inspection of the h‐index and its variants shows that these are naive attempts to understand complicated citation records. While they capture a small amount of information about the distribution of a scientist’s citations, they lose crucial information that is essential for the assessment of research.
I can report that many in medicine fixate on, and are enthralled by, a journal’s “impact factor”, which is, as the report says, a horrible statistic with an awful-sounding name. The “h index” of a scientist is “the largest n for which he/she has published n articles, each with at least n citations.”
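To make the quoted definition concrete, here is a minimal sketch of the h-index calculation. The function name and the two citation records are invented for illustration and are not taken from the report.

```python
def h_index(citations):
    """Largest n such that n papers each have at least n citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Two made-up citation records, invented purely for illustration.
steady = [5, 5, 5, 5, 5]              # 25 citations in total
one_hit = [900, 40, 12, 7, 5, 0, 0]   # 964 citations in total

print(h_index(steady), sum(steady))
print(h_index(one_hit), sum(one_hit))
```

Both records receive the same h-index although their citation distributions look nothing alike, which is the kind of lost information the report complains about.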
Naturally, now that we statisticians have weighed in on the matter, we can expect a complete stoppage in the usage of citation statistics.
June 30th, 2008
Nothing is true unless it has been demonstrated and published in a peer-reviewed journal. For example, until last week, many people suspected that when men look at nearly or completely naked women, they tend to be distracted. Anybody who believed that was foolish to do so because it had never been “scientifically” proven.
If they did believe it, they probably did so based on their academically-discredited intuitions. Amateurs.
But scientific researchers Bram Van den Bergh, Siegfried Dewitte, and Luk Warlop have finally lent scientific credibility to the popular belief, which we are now free to label as “scientific.” These researchers published their stunning findings in the June 2008 issue of the Journal of Consumer Research. The journal article was summarized in a newspaper report here.
The title of their article is “Bikinis Instigate Generalized Impatience in Intertemporal Choice.” Their abstract follows:
Neuroscientific studies demonstrate that erotic stimuli activate the reward circuitry processing monetary and drug rewards. Theoretically, a general reward system may give rise to nonspecific effects: exposure to “hot stimuli” from one domain may thus affect decisions in a different domain. We show that exposure to sexy cues leads to more impatience in intertemporal choice between monetary rewards. Highlighting the role of a general reward circuitry, we demonstrate that individuals with a sensitive reward system are more susceptible to the effect of sex cues, that the effect generalizes to nonmonetary rewards, and that satiation attenuates the effect.
If you cannot read this, do not worry, for it is not written in English, but in academese, a language which frequently borrows English words but changes their meanings, and which otherwise has no similarity to plain English. Luckily for you, dear reader, I have been trained in academese and can translate the abstract for you:
When men look at naked women, their brains get excited and they have thoughts of getting lucky. When men see naked women, they get distracted and cannot concentrate on the tasks at hand. When we showed a group of men pictures of nearly naked women, they lost patience with a betting game we tried playing with them. The hornier the men were the less they were interested in our game, and in anything else we had to say. After a while, the men got bored of looking at the same women and wanted to move on.
As I said, this is ground-breaking research as it brings to light relationships of men to naked women never before suspected.
Rumor has it the three researchers, who are from Belgium, plan on studying the effects of increasing dosages of the C2H5OH molecule on men’s perception of female attractiveness. I, for one, cannot wait to find out.
June 12th, 2008
I have never, and will never, read Vanity Fair. Given our culture is already saturated, more mindless celebrity tittle tattle written by besotted suck-ups I do not need. So I missed the piece on Bill Clinton that suggested he might have suffered from a malady called “pump head”, brought on by his heart surgery.
Melinda Back, at the Wall Street Journal, wrote an article on this subject today (I have no idea how long that link will be good) which alerted me to the topic.
When surgeons cut a guy open to chop away at his heart, they usually stop it from beating (presumably, this makes it less slippery). They then hook up a machine, a pump, to oxygenate and circulate the patient’s blood. Some people are concerned that the machine, which is certainly necessary, causes harm, usually mental degradation, to those patients who live through the surgery. Lots of mechanisms have been proposed which might cause this harm, but there is no agreement or even direct evidence that any of them actually do cause harm.
“Pump head”, not to put too fine a point on it, is bunk.
The first “diagnosing” of this strange malady came from a series of experiments that gave people before- and after-surgery mental exams. The researchers found that a certain proportion of people scored worse on the after-surgery tests, which confirmed the idea that people get dumber after having been on the pump.
To show this, they created a conglomeration of the tests that were given using a dicey statistical technique called “factor analysis,” a method with which it is far too easy to generate spurious results. But even given that this method was applied properly and conservatively, there is still a large, glaring error in these analyses.
It is true that some people scored worse on the conglomeration-test after surgery. This is the sole evidence for “pump head.” But it is also true that some people scored better! In fact, the same exact proportion of people who scored worse, scored better. This means you could just as easily write a paper suggesting open-heart surgery as a method to boost IQ!
The problem was that the original researchers never bothered to look for people who scored better, only those who scored worse; they only examined those patients who looked like what they hoped they would look like, that is, those who seemed to get dumber.
What’s really going on is nothing more than the banal phenomenon of “regression to the mean.” If you take a test, some days you will do better, other days worse. Everybody has a natural background variability. Now, if you do score high one day, chances are that the next time you take the test, you will achieve only your average performance. Same thing if you first tested low: next time, you’re likely to improve.
If you look at a bunch of people who take the test, and create two groups, one with those who scored high and another with those who scored low, and then later re-test both groups, the high group will show lower scores on the re-test, and the low group will show higher scores. It is impossible for the situation to be other than this.
This phenomenon is a boon to researchers who want to prove spurious effects, because, as I said, it is impossible for it not to manifest itself. You can prove the efficacy of or show the potential harm of absolutely any therapy this way.
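The test and re-test story is easy to simulate. The sketch below is only an illustration under assumed numbers: the score scale, the two standard deviations, and the 20% cut-offs for the "high" and "low" groups are all made up, and no treatment of any kind is applied between the two tests.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

N_PEOPLE = 10_000
ABILITY_SD = 10.0   # hypothetical spread of true ability between people
NOISE_SD = 10.0     # hypothetical day-to-day variability within a person

# Each score is true ability plus independent noise; nothing happens
# to anybody between the first test and the re-test.
ability = [random.gauss(100.0, ABILITY_SD) for _ in range(N_PEOPLE)]
test1 = [a + random.gauss(0.0, NOISE_SD) for a in ability]
test2 = [a + random.gauss(0.0, NOISE_SD) for a in ability]

# Split on the FIRST test only: bottom 20% and top 20%.
order = sorted(range(N_PEOPLE), key=lambda i: test1[i])
k = N_PEOPLE // 5
low_group, high_group = order[:k], order[-k:]


def group_means(indices):
    """Average first-test and re-test score for the given people."""
    return mean(test1[i] for i in indices), mean(test2[i] for i in indices)


low_first, low_retest = group_means(low_group)
high_first, high_retest = group_means(high_group)

# Share of everybody who scored worse the second time around.
worse = sum(t2 < t1 for t1, t2 in zip(test1, test2)) / N_PEOPLE

print(f"low group:  {low_first:.1f} -> {low_retest:.1f}")
print(f"high group: {high_first:.1f} -> {high_retest:.1f}")
print(f"fraction scoring worse on re-test: {worse:.2f}")
```

In this setup the high group drifts down and the low group drifts up, while about half of all people score worse and about half score better, with no surgery and no pump anywhere in the simulation.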
So pump head, so far as it has been demonstrated in tests like these, is nonsense.
This means that Bill Clinton is probably no dumber now than he was before.
June 10th, 2008
Thanks to a hot tip from Lucia, over at the Diet Diary, I have become wiser about spam. I installed the wp-spamfree plug-in and we’ll see how that works.
OLD “I have been getting an enormous amount of spam over the past week (1000s of postings a day; all caught by the spam filter), so I am shutting off comments for 24 hours in the hope this will get me off some spam lists. Sorry for the inconvenience.”
May 22nd, 2008
There are several global climate models (GCMs) produced by many different groups. There are a half dozen from the USA, some from the UK Met Office, a well known one from Australia, and so on. GCMs are a truly global effort. These GCMs are of course referenced by the IPCC, and each version is known to the creators of the other versions.
Much is made of the fact that these various GCMs show rough agreement with each other. People have the sense that, since so many “different” GCMs agree, we should have more confidence that what they say is true. Today I will discuss why this view is false. This is not an easy subject, so we will take it slowly.
Suppose first that you and I want to predict tomorrow’s high temperature in Central Park in New York City (this example naturally works for anything we want to predict, from stock prices to the number of people who will vote for a certain USA presidential candidate). I have a weather model called MMatt. I run this model on my computer and it predicts 66 degrees F. I then give you this model so that you can run it on your computer, but you are vain and rename the model to MMe. You make the change, run the model, and announce that MMe predicts 66 degrees F.
Are we now more confident that tomorrow’s high temperature will be 66 because two different models predicted that number?
Obviously not.
The reason is that changing the name does not change the model. Simply running the model twice, or a dozen, or a hundred times, does not give us any more evidence than if we ran it just once. We reach the same conclusion if instead of predicting tomorrow’s high temperature, we use GCMs to predict next year’s global mean temperature: no matter how many times we run the model, or how many different places in the world we run it, we are no more confident of the final prediction than if we only ran the model once.
So Point One of why multiple GCMs agreeing is not that exciting is that if all the different GCMs are really the same model but each just has a different name, then we have not gained new information by running the models many times. And we might suspect that if somebody keeps telling us that “all the models agree” to imply there is greater certainty, he either might not understand this simple point or he has ulterior motives.
Are all the many GCMs touted by the IPCC the same except for name? No. Since they are not, then we might hope to gain much new information from examining all of them. Unfortunately, they are not, and cannot be, that different either. We cannot here go into detail of each component of each model (books are written on these subjects), but we can make some broad conclusions.
The atmosphere, like the ocean, is a fluid and it flows like one. The fundamental equations of motion that govern this flow are known. They cannot differ from model to model; or to state this positively, they will be the same in each model. On paper, anyway, because those equations have to be approximated in a computer, and there is not universal agreement, nor is there a proof, of the best way to do this. So the manner each GCM implements this approximation might be different, and these differences might cause the outputs to differ (though this is not guaranteed).
The equations describing the physics of a photon of sunlight interacting with our atmosphere are also known, but these interactions happen on a scale too small to model, so the effects of sunlight must be parameterized, which is a semi-statistical semi-physical guess of how the small scale effects accumulate to the large scale used in GCMs. Parameterization schemes can differ from model to model and these differences almost certainly will cause the outputs to differ.
And so on for the other components of the models. Already, then, it begins to look like there might be a lot of different information available from the many GCMs, so we would be right to make something of the cases where these models agree. Not quite.
The groups that build the GCMs do not work independently of one another (nor should they). They read and write for the same journals, attend the same conferences, and are familiar with each other’s work. In fact, many of the components used in the different GCMs are the same, even exactly the same, in more than one model. The same person or persons may be responsible, through some line of research, for a particular parameterization used in all the models. Computer code is shared. Thus, while there are some reasons for differing output (and we haven’t covered all of them yet), there are many more reasons that the output should agree.
Results from different GCMs are thus not independent, so our enthusiasm generated because they all roughly agree should at least be tempered, until we understand how dependent the models are.
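A small simulation shows why dependence matters. It is a toy sketch, not a climate model: I assume each model's error is the sum of a component shared by all models (common code, common parameterizations) and a component of its own, and the standard deviations below are invented so that every single model has the same total error variance of 1 in both scenarios.

```python
import random
from statistics import pvariance

random.seed(1)  # fixed seed so the run is repeatable


def ensemble_mean_error_variance(n_models, shared_sd, own_sd, n_trials=20_000):
    """Variance of the ensemble-mean error when each model's error is
    a shared component plus an independent component of its own."""
    mean_errors = []
    for _ in range(n_trials):
        shared = random.gauss(0.0, shared_sd)
        members = [shared + random.gauss(0.0, own_sd) for _ in range(n_models)]
        mean_errors.append(sum(members) / n_models)
    return pvariance(mean_errors)


# Twenty truly independent models: averaging shrinks the error a lot.
independent = ensemble_mean_error_variance(20, shared_sd=0.0, own_sd=1.0)

# Twenty models sharing most of their error (0.8**2 + 0.6**2 = 1):
# averaging removes only the small independent part.
dependent = ensemble_mean_error_variance(20, shared_sd=0.8, own_sd=0.6)

print(f"independent models: {independent:.3f}")
print(f"dependent models:   {dependent:.3f}")
```

Under these assumptions the averaged error variance is close to 1/20 for independent models but stays near 0.64 + 0.36/20 for the dependent ones, so twenty agreeing models tell us little more than one.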
This next part is tricky, so stay with me. The models differ in more ways than just the physical representations previously noted. They also differ in strictly computational ways and through different hypotheses of how, for example, CO2 should be treated. Some models use a coarse grid point representation of the earth and others use a finer grid: the first method generally attempts to do better with the physics but sacrifices resolution, the second method attempts to provide a finer look at the world, while typically sacrificing accuracy in other parts of the model. While the positive feedback in temperature caused by increasing CO2 is the same in spirit for all models, the exact way it is implemented in each can differ.
Now, each climate model, as a result of the many approximations that must be made, has, if you like, hundreds (even thousands) of knobs that can be dialed to and fro. Each twist of the dial produces a difference in the output. Tweaking these dials, then, is a necessary part of the model building process. The models are tuned so that they, as closely as possible, first are able to produce climate that looks like the past, already observed, climate. Much time is spent tuning and tweaking the models so that they can, at least roughly, reproduce past climate. Thus, the fact that all the GCMs can roughly represent the past climate is again not as interesting as it first seemed. They better had, or nobody would seriously consider the model as a contender.
Reproducing past data is a necessary but not sufficient condition that the models can predict future data. Thus, it is also not at all clear how these tweakings affect the accuracy in predicting new data, which is data that was not used in any way to build the models, that is, future data. Predicting future data has several components.
It might be that one of the models, say GCM1, is the best of the bunch in the sense that it matches most closely future data. If this is always the case, if GCM1 is always closest (using some proper measure of skill), then it means that the other models are not as good, they are wrong in some way, and thus they should be ignored when making predictions. The fact that they come close to GCM1 should not give us more reason to believe the predictions made by GCM1. The other models are not providing new information in this case. This argument, which is admittedly subtle, also holds if a certain group of GCMs are always better than the remainder of models. Only the close group can be considered independent evidence.
Even if you don’t follow—or believe—that argument, there is also the problem of how to quantify the certainty of the GCM predictions. I often see pictures like this:

Each horizontal line represents the output of a GCM, say predicting next year’s average global temperature. It is often thought that the spread of the outputs can be used to describe a probability distribution over the possible future temperatures. The probability distribution is the black curve drawn over the predictions, and neatly captures the range of possibilities. This particular picture looks to say that there is about a 90% chance that the temperature will be between 10 and 14 degrees. It is at this point that people fool themselves, probably because the uncertainty in the forecast has become prettily quantified by some sophisticated statistical routines. But the probability estimate is just plain wrong.
How do I know this? Suppose that each of the eight GCMs predicted that the temperature will be 12 degrees. Would we then say, would anybody say, that we are now 100% certain in the prediction?
Again, obviously not. Nobody would believe that if all GCMs agreed exactly (or nearly so) that we would be 100% certain of the outcome. Why? Because everybody knows that these models are not perfect.
The exact same situation was met by meteorologists when they tried this trick with weather forecasts (this is called ensemble forecasting). They found two things. First, the probability forecasts made by this averaging process were far too sure: the probabilities, like our black curve, were too tight and had to be made much wider. Second, the averages were usually biased, meaning that the individual forecasts should all be shifted upwards or downwards by some amount.
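Here is a deliberately crude sketch of both problems. The first function builds a 90% interval straight from the spread of the model outputs; the second shifts the forecast by the average past error and sets the width from the size of past errors. The function names and the list of past errors are made up for illustration, and real ensemble calibration methods are more elaborate than this.

```python
from statistics import mean, stdev

Z90 = 1.645  # normal multiplier for a central 90% interval


def naive_interval(ensemble):
    """90% interval read directly off the spread of the model outputs."""
    centre, spread = mean(ensemble), stdev(ensemble)
    return centre - Z90 * spread, centre + Z90 * spread


def calibrated_interval(ensemble, past_errors):
    """Shift by the average past error and use the spread of past errors.

    past_errors holds (ensemble mean - observed value) for earlier forecasts.
    """
    bias = mean(past_errors)
    centre = mean(ensemble) - bias
    width = stdev(past_errors)
    return centre - Z90 * width, centre + Z90 * width


# Eight models that all predict exactly 12 degrees.
ensemble = [12.0] * 8

# Made-up record of past errors: the ensemble mean ran warm each time.
past_errors = [0.9, 1.4, 0.3, 1.1, 0.6, 1.3]

print(naive_interval(ensemble))
print(calibrated_interval(ensemble, past_errors))
```

With perfect agreement the naive interval collapses to a single point, which is the "100% certain" absurdity, while the corrected interval is both shifted lower and has a width greater than zero.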
This should also be true for GCMs, but the fact has not yet been widely recognized. The amount of certainty we have in future predictions should be less, but we also have to consider the bias. Right now, all GCMs are predicting warmer temperatures than are actually occurring. That means the GCMs are wrong, or biased, or both. The GCM forecasts should be shifted lower, and our certainty in their predictions should be decreased.
All of this implies that we should take the agreement of GCMs far less seriously than is often supposed. And if anything, the fact that the GCMs routinely over-predict is positive evidence of something: that some of the suppositions of the models are wrong.
April 8th, 2008
In part I, we learned that all surveys, and in fact all statistical models, are valid only conditionally on some population (or information). We went into nauseating detail of the conditional information on our own survey of people who wear thinking suppression devices (TSDs; see the original posts), so I’ll skip repeating any of it again.
Today, we look at the data and ignore all other questions. The first matter we have to understand is: what are probability models and statistics for? Although we use the data we just observed to fit these models, they are not for that data. We do not need to ask probability questions of the data we just observed, there is no need to. If we want the probability that all the people in our sample wore TSDs, we just look and see if all wore them or not. The probability is 0 or 1, and is 0 or 1 for any other question we can ask about the observed data (e.g. what is the probability that half or more wore them? again, 0 or 1).
Thus, statistics are useful only for making inferences about unobserved data: usually future data, but really just unknown to you. If you want to make statements or quantify uncertainty in data you have not yet seen, then you need probability models. Some would say statistics are useful for making inferences about unobserved and unobservable parameters, but I’ll try to dissuade you of that opinion in this essay. We have to start, however, with describing what these parameters are and why so much attention is devoted to them.
Before we do, we have to return to our question, which was roughly phrased in English as “How many people wear TSDs?”, and we have to turn it into a mathematical question. We do this by forming a probability model for the English question. If you’ve read some of my earlier posts, you might recall that we have an essentially infinite choice of models which we could use. What we would like is if we could limit our choice to a few or, best of all, to logically deduce the exact model given some set of information that we believe true.
Here is one such statement: M1 = “The probability that somebody wears a TSD (at the locations and times specified for our exactly defined population subset) is fixed, or constant, and knowing whether one person wears a TSD gives us no information whether any other person wears a TSD.” (Whenever you see M1, substitute the sentence “The probability…”)
Is M1 true? Almost certainly not. For example, if two people walk by our observation spot together, say a couple, it might be less likely for either to wear a TSD than it is for two separate people. Again people (not all people, anyway) aren’t going to wear a TSD at all hours equally often, and not equally often at all locations within our subset either.
But let’s suppose that M1 is true anyway. Why? Because this is what everybody else does in similar situations, which they do because it allows them to write a simple and familiar probability model for the number of people x out of n wearing TSDs. Here is the model for the data we just observed:
Pr( x = k | n, θ, M1)
This is actually just a script or shorthand for the model, which is some mathematical equation (binomial distribution), and not of real interest; however it is useful to learn how to read the script. From left to right, it is the probability that the number of people x equals some number k given we know n, something called θ, and M1 is true. This is the mathematical way of writing the English question.
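For readers who want to see the equation behind the script, the binomial model can be written in a few lines. This is a minimal sketch: the function name is mine, and the value of θ used in the example is arbitrary, chosen only so the code has something to compute.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Pr(x = k | n, theta, M1): probability that exactly k of n people
    wear a TSD when theta is constant and people are independent."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


# n = 41 matches one of the homework samples; theta = 0.1 is an
# arbitrary illustrative value, not an estimate.
n, theta = 41, 0.1
probs = [binomial_pmf(k, n, theta) for k in range(n + 1)]

print(sum(probs))                 # the probabilities over all k add up to 1
print(binomial_pmf(1, n, theta))  # chance of seeing exactly one wearer
```

Notice that the function cannot be evaluated at all until somebody supplies a number for θ, which is the difficulty the next paragraphs take up.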
The variable x is more shorthand meaning “number of people who wore a TSD”. Before we did our experiment, we did not know the value of x, so we say it was “random.” After we see the data we know k, the actual number of new people out of the n people we saw who did wear a TSD. OK so far? We already understand what M1 is, so all that is left to explain is θ. What is it?
It is a parameter, which if you recall previous posts, is an unobservable, unmeasurable number, but which is necessary in order to formulate our probability model. Some people incorrectly call θ “the probability that a single person wears a TSD.” This is false and is an example of the atrocious and confusing terminology so often used in statistics (look in any introductory text and you’ll see what I mean). θ, while giving the appearance of one, is no sort of probability at all. It would be a probability if we knew its value. But we do not: and if we did know, we never would have bothered collecting data in the first place! Now, look carefully. θ is written on the right hand side of the “|”, which is where we put all the stuff that we believe we know, so again it looks as if we are saying we know θ, so it looks like a probability.
But this is because the model is incomplete. Why? Remember that we don’t really need to model the observed data if that is all we are interested in. So the model we have written is only part of a model for future data. There are several pieces that are missing. Those pieces are another probability model for the value of θ, a model for just the observed data, a model for the uncertainty in θ given the observed data, the data model itself again, which are all mathematically manipulated to produce this creature
Pr( x_new = k_new | n_new, x_old, n_old, M1)
which is the true probability model for new data given what we observed with the old data. There is no way that I can even hope to explain this new model without resorting to some heavy mathematics. This is in part why classical statistics just stops with the fragmentary model, because it’s easier. In that tradition, people create a (non-verifiable) point estimate of θ, which means just plugging some value for θ into the probability model fragment, and then call themselves done.
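The heavy mathematics has a compact answer in one special case. If we assume, purely for this sketch, that every value of θ was equally likely before seeing data (a uniform prior, which is my assumption and not something stated above), then θ integrates out and the model for new data is the beta-binomial distribution. The function names are mine, and the intervals this produces need not match any quoted elsewhere.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive_pmf(k_new, n_new, x_old, n_old):
    """Pr(x_new = k_new | n_new, x_old, n_old, M1) under a uniform prior.

    With that (assumed) prior the uncertainty in theta after the old data
    is Beta(x_old + 1, n_old - x_old + 1); theta itself never appears.
    """
    a = x_old + 1
    b = n_old - x_old + 1
    log_choose = lgamma(n_new + 1) - lgamma(k_new + 1) - lgamma(n_new - k_new + 1)
    return exp(log_choose + log_beta(k_new + a, n_new - k_new + b) - log_beta(a, b))


def central_interval(pmf_values, mass=0.95):
    """Smallest and largest k left after trimming (1 - mass)/2 per tail."""
    tail = (1.0 - mass) / 2.0
    cumulative, low, high = 0.0, None, None
    for k, p in enumerate(pmf_values):
        cumulative += p
        if low is None and cumulative >= tail:
            low = k
        if high is None and cumulative >= 1.0 - tail:
            high = k
    return low, high


# Example: 80 new people after having seen 58 wearers among 635.
pmf = [predictive_pmf(k, 80, 58, 635) for k in range(81)]
print(central_interval(pmf))
```

The point of the sketch is only that θ has disappeared from the left and right of the bar: the inputs are old data and the size of the new sample, and the output is a probability for data we have not yet seen.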
Well, almost done. Good statisticians will give you some measure of uncertainty of the guess of θ, some plus or minus interval. (If you haven’t already, go back and read the post “It depends on what the meaning of mean means.”) The classical estimate used for θ is just the computed mean, the average of the past data. So the plus and minus interval will only be for the guess of the mean. In other words, just as it was in regression models, it will be too narrow and people will be overconfident when predicting new data.
All this is very confusing, so now (finally!) we can return to the data collected by those folks who turned in their homework and work through some examples.
There were 6 separate collections, which I’ll lump together with the clear knowledge that this violates the limits of our population subset (two samples were taken in foreign countries, one China and one New Jersey). This gave x = 58 and n = 635.
The traditional estimate of θ is 58/635 = 0.091, with the plus minus interval of 0.07 to 0.12. Well, so what? Remember that our goal is to estimate the number of people who wear TSDs, so this classical estimate of θ is not of much use.
If we just plug in the best estimate of θ to estimate, out of 300 million (the approximate population of the U.S.A.), how many wear TSDs, we get a guess of 27.4 million with a plus-minus window of 27.39 to 27.41 million, which is a pretty tight guess! That interval is only about 20,000 people wide. This is being pretty sure of ourselves, isn’t it?
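The plug-in arithmetic can be checked in a few lines. The sketch assumes the usual normal-approximation formulas with a 1.96 multiplier; which interval formula was actually used for the numbers above is not stated, so small differences in the last digit are to be expected.

```python
from math import sqrt

Z95 = 1.96  # normal multiplier for a 95% interval (an assumption here)


def classical_estimate(x, n):
    """Point estimate of theta with a normal-approximation interval."""
    theta_hat = x / n
    se = sqrt(theta_hat * (1.0 - theta_hat) / n)
    return theta_hat, theta_hat - Z95 * se, theta_hat + Z95 * se


def plug_in_count_interval(theta_hat, n_new):
    """Interval for the count among n_new people, treating theta_hat
    as if it were known exactly (this is the plug-in step)."""
    centre = n_new * theta_hat
    sd = sqrt(n_new * theta_hat * (1.0 - theta_hat))
    return centre, centre - Z95 * sd, centre + Z95 * sd


theta_hat, lo, hi = classical_estimate(58, 635)
centre, c_lo, c_hi = plug_in_count_interval(theta_hat, 300_000_000)

print(f"theta estimate: {theta_hat:.3f} ({lo:.3f} to {hi:.3f})")
print(f"plug-in count:  {centre:,.0f} ({c_lo:,.0f} to {c_hi:,.0f})")
print(f"interval width: {c_hi - c_lo:,.0f}")
```

The width comes out at roughly 20,000 people because the only uncertainty left in the calculation is binomial scatter around a θ that is pretended to be known.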
If we use the modern estimate, we get a guess of 25.5 million, with a plus-minus window of about 19.3 to 31.7 million, which is much wider and hence more realistic. The length of this interval is 12.4 million! Why is this interval so much larger? It’s because we took full account of our uncertainty in the guess of θ, which the classical plug-in guess did not (we essentially recompute a new guess for every possible value of θ and weight them by the probability that θ equals each value: but that takes some math).
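One way to see where the extra width comes from is to split the variance of the new count into two pieces: scatter we would have even if θ were known, and scatter due to not knowing θ. The sketch again assumes a uniform prior on θ, which is my choice for illustration, so its numbers are not expected to reproduce the ones quoted above.

```python
from math import sqrt


def predictive_variance_parts(n_new, x_old, n_old):
    """Mean of the new count plus the two pieces of its variance,
    assuming theta ~ Beta(x_old + 1, n_old - x_old + 1)."""
    a = x_old + 1
    b = n_old - x_old + 1
    m = a / (a + b)                            # mean of theta
    v = a * b / ((a + b) ** 2 * (a + b + 1))   # variance of theta
    # Law of total variance: E[Var(x | theta)] + Var(E[x | theta])
    sampling_part = n_new * (m * (1.0 - m) - v)   # grows like n_new
    theta_part = n_new**2 * v                     # grows like n_new squared
    return n_new * m, sampling_part, theta_part


for n_new in (80, 20_000, 300_000_000):
    centre, sampling_part, theta_part = predictive_variance_parts(n_new, 58, 635)
    sd = sqrt(sampling_part + theta_part)
    print(f"n_new={n_new:>11,}  mean={centre:,.0f}  sd={sd:,.1f}  "
          f"share from theta={theta_part / (sampling_part + theta_part):.2f}")
```

Because the θ piece grows with the square of the number of new people, it is negligible for a subway car and overwhelming for a country, and the plug-in method throws that piece away.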
Perhaps these numbers are too large to think about easily, so let’s do another example and ask how many people riding a car on the F train wear a TSD. The car at rush hour holds, say, 80 people. The classical guess is 7, with +/- of 3 to 13. The modern guess is also 7 with +/- of 2 to 12. Much closer to each other, right?
Well, how about all the students in a typical college? There might be about 20,000 students. The classical guess is 1830 with +/- 1750 to 1910. The modern is 1700 with +/- 1280 to 2120.
We begin to see a pattern. As the number of new people increases, the modern guess becomes a little lower than the classical one, and the uncertainty in the modern guess is realistically much larger. This begins to explain, however, why so many people are happy enough with the classical guesses: many samples of interest will be somewhat small, so all the extra work that goes into computing the modern estimate doesn’t seem worth it.
Unfortunately, that is only true because we had such a large initial data collection. If, for example, we only had Steve Hempell’s, which was x = 1 and n = 41, and we were interested still in the F train, then the classical guess is 2 with +/- 0 to 5; and the modern guess 4 +/- 0 to 13! The difference between the two methods is again large enough to make a difference.
Once again, we have done a huge amount of work for a very, very simple problem. I hope you have read this far, but I would not have blamed you if you hadn’t because, I am very sorry to say, we are not done yet. Everybody who remembers M1 raise their hands? Not too many. Yes, all these guesses were conditional on M1 being true. What if it isn’t? At the least, it means that the guesses we made are off a little and that we must widen our plus-minus intervals to take into account our uncertainty in the correctness of our model.
Which I won’t do because I am, and you are probably, too fatigued. This is a very simple problem, like I said. Imagine problems with even more complicated statistics where uncertainty comes at you from every direction. There the differences between the classical and modern way are even more apparent. Here is the second answer for our homework:
- Too many people are far too certain about too many things
March 21st, 2008
A couple of days ago I gave out homework. I asked my loyal readers to count how many people walked by them and to keep track of how many of those people wore a thinking-suppression device like an I-pod etc. Like every teacher, my heart soared like a hawk when some of the students actually completed the task. Visit the original thread’s comments to see the “raw” data.
The project was obviously to recreate a survey of the kind which we see daily: e.g. What percent of Americans favor a carbon tax? What fraction of the voters want “change”? How many prefer Brand A? And so on.
Here is how a newspaper might present the results from our survey:
More consumers are endangering their hearing than ever before, according to new research by WMBriggs.com. Over 20% of consumers now never leave the house without an I-pod or I-pod-like device.
“Music is very popular” said Dr Briggs, “And now it’s easier than ever before to listen to it.” This might help explain the rise in tinnitus reports, according to some sources. Dr So Undzo of the Send Us Money to Battle Tinnitus Foundation was quoted as saying, “Blah blah blah.” He also said, “Blah blah blah blah blah.” &tc. &tc.
Despite its farcical nature, this “news” report is no different than the dozens that show up on TV, the radio, and everywhere else. In order to tell a newsworthy story, it extrapolates wildly from the data at hand, it gives you no idea who collected the original data or why (for money? for notoriety?) or how (by observation? by interview?), or of any of the statistical methods used to manipulate the data. In short: it is very nearly worthless. The only advantage a story like this has is that it can be written before any data is actually taken, saving time and money to the news organization issuing it.
But you already knew all that. So let’s talk about the real problem with statistics. Beware, however, that some of this is dull labor, requiring attention to detail, and probably too much work for too little content. However, that’s how they get you, by hoping you pass by quickly and say “close enough.”
We had five to six responses to the homework so far, but we’ll start with the first one from Steve Hempell. He saw n=41 people and counted m=1 wearing a thinking-suppression device (TSD). He sat on a bench in a small town during spring break to watch citizens pass by.
The first thing we need to have securely in our minds is what question we want to answer with this data. The obvious one is “How many people regularly wear a TSD?” This innocent query begins our troubles.
What do we mean by “people”? All people? There are a little over 6 billion humans now. Do we want an estimate from that group? What about historical, i.e. dead, people, or those yet to be born? How far into the future or back into the past do we want to go? Are we talking of people “now”? Maybe, but we still have to define “now”: does it mean in a year or two, or just the day the survey was taken or a few days into the future? Trivial details? Well, we’ll see. Let’s settle on the week after the survey was taken so that our question becomes “How many people in the week after our survey was taken regularly wear a TSD?”
We’re still not done with “people” and haven’t decided whether it was all humans or some subset. The most common subset is “U.S. Americans” (as Miss Teen South Carolina would have phrased it). But all U.S. citizens? Presumably, infants do not wear TSDs, nor do many in nursing homes or in other incarcerations. Were infants even counted in the survey? Older people in general, experience tells us, do not often wear TSDs. As I think about this question, I find myself unable to rigorously quantify the subset of interest. If I say “All U.S. citizens” then my eventual estimate would probably be too high, given this small sample. If I say, “U.S. citizens between the ages of 15 and 55” then I might do better, but the survey is of less interest.
To pick something concrete, we’ll go with “All U.S. citizens” which modifies our question to “How many U.S. citizens in the week after our survey was taken regularly wear a TSD?”
Sigh. Not done yet. We still have to tackle “regularly” and the bigger question of whether or not our sample fairly represents the population we have in mind, and that would still leave the largest, most error-prone area: what exactly is a TSD? I-pods were identified, but how about cell phones or Blackberries and on and on? Frankly, however, I am bored.
Like I said, though, boredom is the point. No one wants to invest as much time in each survey they meet as we have in this simple one. No matter how concrete the appropriate population in a survey seems to you, it can mean something entirely different to somebody else; each person can take away their own definition. This ambiguity, while frustrating to me, is gold to marketers, pollsters, and “researchers.” So vaguely worded are surveys that the reader can supply any meaning they want to its results. Although they are not usually consciously aware of it, people read surveys like they read horoscopes or psychic readings: they always seem accurate or to confirm people’s worst fears or hopes.
An objection might have occurred to you. “Sure, these complex surveys are ambiguous. But there are simple polls that are easy to understand. The best example is ‘Who will you vote for, Candidate A or B?’ Not much to confuse there.”
You mean, since a poll is a prediction of ballot results, besides trusting that the pollster found a population representative of people who will actually vote on election day? That no event between the time the poll was taken and the election occurs that will cause people to change their minds? And—pay attention here—nobody lied to the pollster?
“Oh, too few people lie to make a difference.” Yeah? Well, I live in New York City and I like to tell the story of the exit polls taken for the presidential race between Kerry and Bush. Those polls had Kerry ahead by about 10 to 1, a non-surprising result, and one which confirmed people’s prior beliefs. The pollsters asked tons of voters and were spread throughout the city in an attempt to obtain the most representative sample they could. Not everybody would answer them, of course, and that is still another problem which is impossible to tackle.
But when the actual results were tallied, Kerry won by a margin of only a little under 5 to 1. Sure, he still won, but the real shocker is that so many people lied to the pollster. And why? Well, this is New York City, and in Manhattan particularly, you just cannot easily admit to being a Bush supporter (then or now). At the least, doing so invites ridicule, and who needs that? Simpler just to lie and say, “I voted for Kerry.”
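To put rough numbers on that gap, here is a small sketch in R (the function name odds_to_share is made up for illustration). It converts the two “k to 1” ratios into vote shares and asks what fraction of Bush voters would have had to say “Kerry” to turn one into the other. It assumes the tally was exactly 5 to 1, that the sample was otherwise representative, and that non-response played no part. None of those is guaranteed, so treat the result as an illustration and not a measurement.

```r
# Convert "k to 1" odds in Kerry's favor into Kerry's share of the two-way vote.
odds_to_share <- function(k) k / (k + 1)

poll_kerry   <- odds_to_share(10)  # exit poll: about 10 to 1
actual_kerry <- odds_to_share(5)   # tally: "a little under 5 to 1", rounded to 5 (assumption)

poll_bush   <- 1 - poll_kerry      # Bush share reported to the pollsters
actual_bush <- 1 - actual_kerry    # Bush share in the tally

# Fraction of Bush voters who would have had to answer "Kerry",
# if misreporting were the only source of the gap (assumption).
misreport <- 1 - poll_bush / actual_bush

print(round(c(poll_kerry = poll_kerry,
              actual_kerry = actual_kerry,
              misreport = misreport), 3))
```

Under those assumptions the poll implies a Kerry share of about 91%, the tally about 83%, and a bit under half of the Bush voters polled would have had to misreport their vote.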
We have done a lot and we still haven’t answered the question of how to handle the actual data!
Here are the answers to part I of the homework.
- The applicability of all surveys is conditional on a population which must be, though rarely is, rigorously defined.
- All surveys have significant measurement error that has nothing to do with the actual numerical data.
- Because of this, people are too certain when reading or interpreting the results of surveys.
In part II, if we are not already worn down, we will learn how to—finally!—handle the data.
March 20th, 2008
I was reminded of this homework problem that I give my students as I was riding in on the F train this morning. It is a very good problem because it is exceedingly simple and nicely demonstrates two problems of the classical way of looking at statistics.
All you need to do this homework is a busy place and some free time, about 20 minutes.
Find a spot where people congregate or pass by. Be sure to carefully and concretely specify this place: keep its boundaries fixed and rigid for the duration of the homework.
Count the people in the spot, either all at once, or as they pass by for some fixed time (decided in advance). Also count the number of people who are wearing some sort of thinking-suppression device. There are obviously any number of other things you can take note of, like sex, age, etc., but we’ll ignore all of them.
Report back to me (in the Comments) the two numbers: the total number of people, and the number wearing thinking-suppression devices, which will be less than or equal to the total. Also note details of your spot.
We are obviously going to be talking about forming ratios and estimating probabilities. I’ll discuss what all this means, and what it does not mean, once a few people have turned in the assignment.
Oh, yes. A thinking-suppression device is anything like an I-Pod, MP3 player, etc. etc.
March 17th, 2008
I often say—it is even the main theme of this blog—that people are too certain. This is especially true when people report results from classical statistics, or use classical methods when implementing modern, Bayesian theory. The picture below illustrates exactly what I mean, but there is a lot to it, so let’s proceed carefully.
Look first only at the jagged line, which is something labeled “Anomaly”; it is obviously a time series of some kind over a period of years. This is the data that we observe, i.e. that we can physically measure. It, to emphasize, is a real, tangible thing, and actually exists independent of whatever anybody might think. This is a ridiculously trivial point, but it is one which must be absolutely clear in your mind before we go on.
I am interested in explaining this data, and by that I mean, I want to posit a theory or model that says, “This is how this data came to have these values.” Suppose the model I start with is
A: y = a + b*t
where y are the observed values I want to predict, a and b are something called parameters, and t is for time, or the year, which goes from 1955 to 2005. Just for fun, I’ll plug in some numbers for the parameters so that my actual model is
A’: y = -139 + 0.07*t
The result of applying model A’ gives the little circles. How does this model fit?

Badly. Almost never do the circles actually meet with any of the observed values. If someone had used our model to predict the observed data, he almost never would have been right. Another way to say this is
Pr(y = observed) ~ 0.04
or the chance that the model equals the observed values is about 4%.
We have a model and have used it to make predictions, and we’re right some of the time, but there is still tremendous uncertainty in our predictions left. It would be best if we could quantify this uncertainty so that if we give this model to someone to use, they’ll know what they are getting into. This is done using probability models, and the usual way to extend our model is called regression, which is this
B: y = a + b*t + OS
where the model has the same form as before except for the addition of the term OS. What this model is saying is that “The observed values exactly equal this straight line plus some Other Stuff that I do not know about.” Since we do not know the actual values of OS, we say that they are random.
Here is an interesting fact: model A, and its practical implementation A’, stunk. Even more, it is easy to see that there are no values of a and b that can turn model A into a perfect model, for the obvious reason that a straight line just does not fit through this data. But model B always can be made to fit perfectly! No matter where you draw a straight line, you can always add to it Other Stuff so that it fits the observed series exactly. Since this is the case, restrictions are always placed on OS (in the form of parameters) so that we can get some kind of handle on quantifying our uncertainty in it. That is a subject for another day.
Today, we are mainly interested in finding values of a and b so that our model B fits as well as possible. But since no straight line can fit perfectly, we will weaken our definition of “fit” to say we want the best straight line that minimizes the error we make using that straight line to predict the observed values. Doing this allows us to guess values of a and b.
Using classical or Bayesian methods of finding these guesses leads to model A’. But we are not sure that the values we have picked for a and b are absolutely correct, are we? The value for b might have been 0.07001, might it not? Or a might have been -138.994.
Since we are not certain that our guesses are perfectly correct, we have to quantify our uncertainty in them. Classical methodology does this by computing a p-value, which for b is 0.00052. Bayesian methodology does this by computing a posterior probability of b > 0 given the data, which is 0.9997. I won’t explain either of these measures here, but you can believe me when I tell you that they are excellent, meaning that we are pretty darn sure that our guess of b is close to its true value.
Close, but not exactly on; nor is it for a, which means that we still have to account for our uncertainty in these guesses in our predictions of the observables. The Bayesian (and classical¹) way to approximate this is shown in the dashed blue lines. These tell us that there is a 95% chance that the expected value of y is between these lines. This is good news. Using model B, and taking account of our uncertainty in guessing the parameters, we can then say the mean value of y is not just a fixed number, but a number plus or minus something, and that we are 95% sure that this interval contains the actual mean value of y. And that interval looks pretty good!
Time to celebrate! No, sorry, it’s not. There is one huge thing still wrong with this model: we cannot ever measure a mean. The y that pops out of our model is a mean and shares a certain quality with the parameters a and b, which is that they are unobservable, nonphysical quantities. They do not exist in nature; they are artificial constructs, part of the model, but you will never find a mean(y), a, or b anywhere, not ever.
Nearly all of statistics, classical and Bayesian, focuses its attention on parameters and means and on making probability statements about these entities. These statements are not wrong, but they are usually beside the point. A parameter almost never has meaning by itself. Most importantly, the probability statements we make about parameters always fool us into thinking we are more certain than we should be. We can be dead certain about the value of a parameter, while still being completely in the dark about the value of an actual observable.
For example, for model B, we said that we had a nice, low p-value and a wonderfully high posterior probability that b was nonzero. So what? Suppose I knew the exact value of b to as many decimal places as you like. Would this knowledge also tell us the exact value of the observable? No. Well, we can compute the confidence or credible interval to get us close, which is what the blue lines are. Do these blue lines encompass about 95% of the observed data points? They do not: they only get about 20%. It must be stressed that the 95% interval is for the mean, which is itself an unobservable parameter. What we really want to know about is the data values themselves.
To say something about them requires a step beyond the classical methods. What we have to do is to completely account for our uncertainty in the values of a and b, but also in the parameters that make up OS. Doing that produces the red dashed lines. These say, “There is a 95% chance that the observed values will be between these lines.”
Now you can see that the prediction interval—which is about 4 times wider than the mean interval—is accurate. Now you can see that you are far, far less certain than what you normally would have been had you only used traditional statistical methods. And it’s all because you cannot measure a mean.
In particular, if we wanted to make a forecast for 2006, one year beyond the data we observed, the classical method would predict 4.5 with interval 3.3 to 5.7. But the true prediction of the observable, while still 4.5, has the interval 0.5 to 9, which is about three and a half times wider than the previous interval.
…but wait again! (“Uh oh, now what’s he going to do?”)
These intervals are still too narrow! See that tiny dotted line that oscillates through the data? That’s the same model as A’ but with a sine wave added on to it, to account for possible cyclicity of the data. Oh, my. The red interval we just triumphantly created is true given that model B is true. But what if model B was wrong? Is there any chance that it is? Of course there is. This is getting tedious, which is why so many people stop at means, but if we want to make good predictions we also have to account for our uncertainty in the model. But we’re probably all exhausted by now, so we’ll save that task for another day.
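The gap between the two kinds of interval is easy to reproduce. The sketch below does not use the series in the picture (that data is not given here); it makes up a series with a similar trend and an assumed noise level (sd = 2), fits the straight line by least squares, and compares how many observed values fall inside the interval for the mean against how many fall inside the interval for the observable. It uses base R’s classical prediction interval as a stand-in for the predictive interval described above; under the standard noninformative prior the two coincide, but other priors would give somewhat different numbers. The names d, covered, and width are only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Made-up series (NOT the post's data): a linear trend plus Gaussian "Other Stuff".
d <- data.frame(year = 1955:2005)
d$y <- -139 + 0.07 * d$year + rnorm(nrow(d), sd = 2)  # sd = 2 is an assumption

fit <- lm(y ~ year, data = d)  # least-squares guesses for a and b

# 95% interval for the unobservable mean of y at each year
mean_int <- predict(fit, newdata = d, interval = "confidence", level = 0.95)
# 95% interval for a new observable y at each year
pred_int <- predict(fit, newdata = d, interval = "prediction", level = 0.95)

# Fraction of the observed values that fall inside an interval
covered <- function(int) mean(d$y >= int[, "lwr"] & d$y <= int[, "upr"])
coverage <- c(mean_interval = covered(mean_int),
              prediction_interval = covered(pred_int))
print(round(coverage, 2))

# One-step-ahead forecast for 2006 under both intervals
new <- data.frame(year = 2006)
mean_2006 <- predict(fit, newdata = new, interval = "confidence")
pred_2006 <- predict(fit, newdata = new, interval = "prediction")
width <- function(int) unname(int[, "upr"] - int[, "lwr"])
print(width(pred_2006) / width(mean_2006))  # how many times wider
```

Both forecasts share the same point prediction; only the width changes. The interval for the mean shrinks as more data arrives, while the interval for the observable can never get narrower than the spread of the Other Stuff.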
¹ Given the model and priors I used, this is true.
March 9th, 2008
I am, of course, a statistician. So perhaps it will seem unusual to you when I say I wish there were fewer statistics done. And by that I mean that I’d like to see less statistical modeling done. I am happy to have more data collected, but am far less sanguine about the proliferation of studies based on statistical methods.
There are lots of reasons for this, which I will detail from time to time, but one of the main ones is how easy it is to mislead yourself, particularly if you use statistical procedures in a cookbook fashion. It takes more than a recipe to make an eatable cake.
Among the worst offenders are methods like data mining, sometimes called knowledge discovery, neural networks, and other methods that “automatically” find “significant” relationships between sets of data. In theory, there is nothing wrong with any of these methods. They are not, by themselves, evil. But they become pernicious when used without a true understanding of the data and the possible causal relationships that exist.
However, these methods are in continuous use and are highly touted. An oft-quoted success of data mining was the time a grocery store noticed that unaccompanied men who bought diapers also bought beer. A relationship between data which, we are told, would have gone unnoticed were it not for “powerful computer models.”
I don’t want to appear too negative: these methods can work and they are often used wisely. They can uncover previously unsuspected relationships that can be confirmed or disconfirmed upon collecting new data. Things only go sour when this second step, verifying the relationships with independent data, is ignored. Unfortunately, the temptation to forgo the all-important second step is usually overwhelming. Pressures such as cost of collecting new data, the desire to publish quickly, an inflated sense of certainty, and so on, all contribute to this prematurity.
Stepwise
Stepwise regression is a procedure to find the “best” model to predict y given a set of x’s. The y might be the item most likely bought (like beer) given a set of possible explanatory variables x, like x1 sex, x2 total amount spent, x3 diapers purchased or not, and on and on. The y might instead be total amount spent at a mall, or the probability of defaulting on a loan, or any other response you want to predict. The possibilities for the explanatory variables, the x’s, are limited only to your imagination and ability to collect data.
A regression takes the y and tries to find a multi-dimensional straight line fit between itself and the x’s (e.g., a two-dimensional straight line is a plane). Not all of the x’s will be “statistically significant”¹; those that are not are eliminated from the final equation. We only want to keep those x’s that are helpful in explaining y. In order to do that, we need to have some measure of model “goodness”. The best measure of model goodness is one which measures how well that model does predicting independent data, which is data that in no way was used to fit the model. But obviously, we do not always have such data at hand, so we need another measure. One that is often picked is the Akaike Information Criterion (AIC), which measures how well the model fits the data that was used to fit the model, with a penalty for each additional parameter.
Confusing? You don’t actually need to know anything about the AIC other than that lower numbers are better. Besides, the computer does the work for you, so you never have to actually learn about the AIC. What happens is that many combinations of x’s are tried, one by one, an AIC is computed for that combination, and the combination that has the lowest AIC becomes the “best” model. For example, combination 1 might contain (x2, x17, x22), while combination 2 might contain (x1, x3). When the number of x’s is large, the number of possible combinations is huge, so some sort of automatic process is needed to find the best model.
A summary: all your data is fed into a computer, and you want to model a response based on a large number of possible explanatory variables. The computer sorts through many of the possible combinations of these explanatory variables, rates them by a model goodness criterion, and picks the one that is best. What could go wrong?
To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations for y’s and 50 x’s (each of 100 observations of course). All of the observations were just made up numbers, each giving no information about the other. There are no relationships between the x’s and the y.² The computer, then, should tell me that the best model is no model at all.
But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usual criterion for being kept in a model (explained below), so my final model is this one:
| explan. | p-value | Pr(beta x > 0 given data) |
|---|---|---|
| x7 | 0.0053 | 0.991 |
| x21 | 0.046 | 0.976 |
| x27 | 0.00045 | 0.996 |
| x43 | 0.0063 | 0.996 |
In classical statistics, an explanatory variable is kept in the model if it has a p-value < 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don't worry if you don't understand what any of that means; just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R² of 0.26, which is considered excellent in many fields (like marketing or sociology; R² is a number between 0 and 1, higher numbers are better).
Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7=“married or not”, x21=“number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and travel money spent.
And you would be believed.
I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.
Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.
¹ I will explain this unfortunate term later.
² I first did a “univariate analysis” and only fed into the stepwise routine those x’s which singly had p-values < 0.1. This is done to ease the computational burden of checking all models by first eliminating those x’s which are unlikely to be “important.” This is also a distressingly common procedure.
Here is the simulation code, to be run in the free and open source R statistical software:
library(MASS)                          # provides stepAIC; load once per session
n <- 100
y <- rnorm(n)                          # "random" response
X <- matrix(rnorm(n * n/2), n, n/2)    # "random" x's, 50 columns
f <- numeric(n/2)                      # f stores the univariate p-values
for (i in 1:(n/2)) {
  # univariate analysis: p-value of the F-test for y ~ x_i
  f[i] <- anova(lm(y ~ X[, i]))$"Pr(>F)"[1]
}
keep <- f < 0.1                        # only keep x's with p-values < 0.1
w <- data.frame(y, X[, keep, drop = FALSE])  # y plus the x's that passed the screen
fit <- lm(y ~ ., data = w)             # model object to feed into stepwise
fit.aic <- stepAIC(fit)                # stepwise selection by AIC
summary(fit.aic)                       # final model summary
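The all-important second step, checking the model against independent data, takes only a few more lines. The following is a sketch and not the simulation above: it uses base R’s step() in place of stepAIC so that nothing beyond a stock R installation is needed, it skips the univariate screening, and the seed and the names train, test, in_r2, and out_r2 are arbitrary choices made for illustration. It fits a stepwise model to pure noise, then asks how well that model predicts a fresh batch of equally meaningless numbers.

```r
set.seed(2008)  # arbitrary seed, only so the sketch is reproducible
n <- 100        # observations
p <- 50         # candidate explanatory variables, all pure noise

# Training data: y is unrelated to every x by construction.
train <- data.frame(y = rnorm(n), as.data.frame(matrix(rnorm(n * p), n, p)))

# Backward stepwise selection by AIC, starting from the full model.
fit <- step(lm(y ~ ., data = train), trace = 0)

# Independent data generated the same way, never seen by the fit.
test <- data.frame(y = rnorm(n), as.data.frame(matrix(rnorm(n * p), n, p)))
pred <- predict(fit, newdata = test)

in_r2  <- summary(fit)$r.squared  # fit to the data used to build the model
out_r2 <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)

print(length(coef(fit)) - 1)      # number of noise variables "found"
print(round(c(in_sample = in_r2, out_of_sample = out_r2), 2))
```

An out-of-sample R² at or below zero means the model predicts the new data no better than simply guessing the average of the new y’s, which is what should happen when there was never any relationship to find.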
February 25th, 2008