# Author: Briggs

June 5, 2008 | 5 Comments

## How to Cheat: Stats 101 Chapter 14

I’ve decided to jump ahead a few chapters. Chapters 10 to 13 are very important and cover material that comprises 90% of all the actual statistics that is practiced by civilians. These are topics like “testing” and regression: how they are done in classical and Bayesian statistics, why these methods are too sure of themselves, and why observable statistics is the only proper way.

But I can tell that I am testing the patience of my audience, so I will leave these more technical chapters for the book itself.

Thus, here I return to something eminently practical: HOW TO CHEAT WITH STATISTICS.

It is important these days for people to know how to get away with as much as they possibly can. This chapter shows you how to do it.

There are no cheap methods like data fudging or just plain lying—those techniques are for pikers. No: what I give you is genuine, sophisticated gold. Tricks you can actually use and get away with. Tricks that work.

I must be out of my mind to give these secrets away for free, but it is a measure of how much I love you, my audience, my faithful readers.

Only an excerpt is in this posting. To get the whole Chapter, you’ll have to download it. Here is the link.

#### 2. Conditioning

A typical academic study is one, say, that gathers two groups of college kids, maybe about 50 in each set, and has them do some task or asks them to rate something. Another study gathers data from a small area, say a neighborhood in a city, where the sample size may be as high as a few hundred, and asks sociological and economic questions of the people that live there. A medical study might try two treatments in two groups of a hundred or so people. When the data from these studies are in, the results are compiled and papers are published. Claims are made in these papers. The college kids paper will say that people act one way and not another; the city paper will say that poor people have less money; and the medical paper will claim treatment A is better than treatment B.

We already know that if all these researchers wanted to do was to say something about their datasets, then they do not need statistics or probability models. They can look at their data and say, yes, more people got better under treatment A than under treatment B. They would be finished. Evidently, the creators of these studies do not want to make statements only about the past data; they want to imply their findings are more widely applicable.

By far the majority of these kinds of studies, published in academic journals, concern humans. As of this writing, there are over 6.6 billion humans alive, about 100 billion are dead, and God only knows how many more are yet to live. Incidentally, whatever you do, do not mention these facts in your results (unless, of course, you happen to be writing about demography); it will weaken your argument.

Are the results from the college kids study applicable to all humans? All those that lived in the past, those that will live in the future, even those that live now but not in the town in which the college lies? Those who are in their 50s or 80s, or who are younger than 10? Poorer people and those with enough money to “get a degree”? (Kids go to college to “get a degree” nowadays, and not usually for anything else. Well, maybe socialization. These are rational choices given the way things are.) Kids at other universities? Let’s be clear: the researchers will gather data on their 100 kids, create a probability model, and since they have read this book, they will not just make a statement about the parameters, but calculate the probability distribution of future observables. The only problem is, to whom do we apply this probability distribution?

Before we answer that, think about the medical trial, which was conducted at a hospital in a city on the East Coast of the United States of America. The physicians also use their data to create a probability distribution of future patients. But who exactly are these patients? People who live in other cities on the East Coast? Anywhere in the USA? Canada, too? Or only cities of a certain size? Or do the future patients merely have to “look like” the patients in the old data; that is, be of the same ages, sex ratio, weights, economic condition, have eaten the same things in their lifetimes, traveled to the same places, engaged in the same activities, and so on? Would it have applied to the people who used to be alive, and to people not yet born, indefinitely into the future?

Nobody knows the answers to these questions, which is highly in your favor, especially if you have just completed a study using data “at hand”, that is, data that was easy for you to collect. You certainly want to imply that your results are as broadly applicable as possible because this makes you more of an expert than somebody who merely claims to know the habits of a small group of college kids in the year 2008 only, in city C, who are unmarried, between 19 and 22 years old, and whose parents are upper middle class, etc. Openly stressing these limitations might be noble and correct, but it will not get you far. State your results in terms of all people. For example, say “People choose option A over B which gives weight to our theory of psychology.” Do not say, “College kids in our freshman psychology class, who might not be anything like the rest of the population, carried out an experiment for us (and surely they took this task seriously) and…”

Same thing in the medical trial. Emphasize your small p-value, and spend more time talking about how the two groups of patients (those that received treatment A and those that got B) did not differ from one another. Tell how there were roughly equal numbers of men and women in both treatments, and the same with age, weight, etc. This is an excellent strategy because it is useful information: if the two groups did differ, then your results may be biased. It is also a wonderful distraction because it allows you to ignore or downplay the discussion of how your results might only be useful for a small subset of patients.

#### 5. Publishable p-values

Most journals, say in medicine or those serving fields ending with “ology”, are slaves to p-values. Papers have a difficult, if not impossible, time getting published unless authors can demonstrate for their study a p-value that is publishable, that is, one that is less than 0.05. Sometimes, the data are not cooperative and the p-value that you get from using a common statistic is too large to see the light of print. This is bad news, because if you are an academic, you must publish papers or else you can’t get grants, and if you don’t get grants, then you do not bring money into your university, and if you don’t bring money into your university, you do not get tenure, and if you do not get tenure, then you are out the door and you feel shame.

So small p-values are important. I of course advise against using classical statistics methods, but if you are forced to (and some journal editors insist on it), then all is not lost if an initial large p-value is found. In fact, I would go so far as to say that if you cannot find a publishable p-value in any situation, then you are not trying hard enough. There are several ways to lower your p-value.

The best known is to increase your sample size. This one is a lock. Let’s take a look at the t-test statistic from Chapter 10 to see why.

(see the book)

There is a mathematical phrase that begins “without loss of generality” which I now invoke by letting, for ease of notation, $n_A = n_B = n$ and $s_A^2 = s_B^2 = s^2$, so that $t(x)$ becomes

(see the book)

Remember that we want a large statistic, a large $t$, the larger the better, because larger values of $t$ mean smaller p-values. Do you see the trick? A larger $n$ means a larger $t$! All you have to do is to increase your sample size and just wait for the small p-values to start rolling in. This trick always works in any classical situation, even when the difference $\bar{x}_A - \bar{x}_B$ is too small to be of interest to anybody. This is why having a small p-value is called attaining statistical significance and not practical or useful significance.
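The book's equation is not reproduced in this excerpt, so the following is only a sketch that assumes the usual two-sample form with the simplification above, $t = (\bar{x}_A - \bar{x}_B) / (s\sqrt{2/n})$. The difference in means and the standard deviation are made-up values, chosen so the difference is trivially small.

```r
# Two-sample t statistic with equal group sizes n and a common standard
# deviation s (assumed standard form): t = (xA - xB) / (s * sqrt(2 / n)).
diff_means <- 0.1   # hypothetical, practically trivial difference in means
s <- 1              # hypothetical common standard deviation

n <- c(10, 100, 1000, 10000)                      # per-group sample sizes
t_stat <- diff_means / (s * sqrt(2 / n))          # grows like sqrt(n)
p_value <- 2 * pt(-abs(t_stat), df = 2 * n - 2)   # two-sided p-value

print(data.frame(n = n, t = round(t_stat, 2), p = signif(p_value, 2)))
```

The observed difference never changes; only the sample size does, and the p-value shrinks with it.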

Incidentally, this trick also works in Bayesian statistics in the sense that the posterior distribution of $\mu_A - \mu_B$ will have most probability above or below zero. But it fails miserably in modern observable statistics because a trivial difference in $\mu_A - \mu_B$ won’t make a tinker’s dam worth of difference in the probability distribution of future observables.

The next trick, if you cannot increase your sample size, is to change your statistic. This comes from the useful loophole in classical theory that there is no rule which specifies which statistic you must use in any situation. Thus, through some creativity and willingness to spend time with your statistical software, you can create small p-values where other people see only despair. This isn’t so easy to do in R because you have to know the names of the alternate statistics, but it’s cake in SAS, which usually prints out dozens of statistics in standard cases, which is one reason SAS is worth its exorbitant price. Look around at the advertising brochures of statistical software and you will see that they openly boast of the large number of tests on offer.

For example, for use in “testing differences between proportions”, just off the top of my head I can think of the z statistic, the proportions test with and without correction for continuity (two or three to choose from here), chi-squared test, Fisher’s exact test, McNemar’s test, logistic regression. There are dozens more and teams of academic statisticians constantly add to the pile. Don’t believe it? Here’s a small table of these tests for the TSD/Sex data from Chapter 11.

| Test | p-value |
| --- | --- |
| Prop test | 0.78 |
| Fisher’s | 0.70 |
| Logistic Reg. | 0.52 |
| chi-squared | 0.50 |
| z test | 0.49 |
| McNemar’s | 0.24 |

That I was only able to get to 0.24 just means I didn’t try hard enough. Which is the correct p-value? They all are; that’s the beauty of this trick. Not one of these p-values is more “right” than any other one. Each is valid. If all you know is classical statistics, let this knowledge sink in. It should prove to you that p-values are not what you probably thought they were.
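To see the effect for yourself, here is a rough sketch in base R that runs four of these tests on one and the same 2x2 table. The counts are invented for illustration (they are not the TSD/Sex data), so the p-values will not match the table above; the point is only that they disagree with each other.

```r
# Hypothetical 2x2 data: successes out of 30 trials in each of two groups.
successes <- c(A = 12, B = 17)
trials    <- c(A = 30, B = 30)
tab <- rbind(success = successes, failure = trials - successes)

# Logistic regression on the same counts (group B compared with group A).
group <- factor(c("A", "B"))
fit <- glm(cbind(successes, trials - successes) ~ group, family = binomial)

p_values <- c(
  prop.test(successes, trials)$p.value,             # continuity corrected
  chisq.test(tab, correct = FALSE)$p.value,         # no correction
  fisher.test(tab)$p.value,                         # exact test
  summary(fit)$coefficients["groupB", "Pr(>|z|)"]   # Wald test
)
names(p_values) <- c("prop test (corrected)", "chi-squared (uncorrected)",
                     "Fisher's exact", "logistic regression (Wald)")
print(round(sort(p_values, decreasing = TRUE), 3))
```

Same data, four answers, and nothing in classical theory says which one to report.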

For “testing differences between means”, there is the t-test (a couple of versions of this, actually), Wilcoxon test (also called Mann-Whitney), sign tests, Spearman correlation tests, Kendall’s tau, Kruskal-Wallis test, Kolmogorov-Smirnov test, permutation test, Friedman two-way analysis of variance (I’m running out of breath) and many more. Here are some of those tests for the advertising data:

| Test | p-value |
| --- | --- |
| Spearman | 0.87 |
| Permutation | 0.20 |
| t-test | 0.19 |
| Wilcoxon | 0.14 |
| Kolmogorov-Smirnov | 0.08 |

Nearly there!

Please remember that in this example, like the previous one, the data are the same; the only thing that changes is the classical statistical test.

The key to this deceit is to never admit what you did. When it comes time to write up your result, boldly and authoritatively state, “We used Johnston’s (Johnston, 1983) frammilax test for differences in means.” Tossing in a citation always cows potential critics; tossing in two or more guarantees editorial acquiescence. Do not tell the reader that you went through a dozen tests to find the lowest p-value. Act as if “Johnston’s test” was what you had in mind all along.

This technique is unavailable in Bayesian or observable statistics. True, you can change your default prior distribution on the parameters or even change the model (see below), but editors in most fields are still suspicious of modern methods and tend to be conservative and will likely insist on a well-known default. There will be more room for creativity in, say, ten years when modern methods become familiar.

Our last option, if you cannot lower your p-value any other way, is to change what is accepted as publishable. So, instead of a p-value of 0.05, use 0.10 and just state that this is the level you consider as statistically significant. I haven’t seen any other number besides 0.10, however, so if your p-value is larger than this the best you can do is to claim that your results are “suggestive” or “in the expected direction.” Don’t scoff, because this sometimes works.

You can really only get away with this in secondary and tertiary journals (which luckily are increasing in number) or in certain fields where the standard of evidence is low, or when your finding is one which people want to be true. This worked for second-hand smoking studies, for example, and currently works for anything negatively associated with global warming.

May 30, 2008 | 4 Comments

## Starting to lose you: Stats 101 Chapter 9

The going started getting tough last Chapter. It doesn’t get any easier here. But stick with it, because once you finish with this Chapter you will see the difference between classical/Bayesian and modern statistics.

Here is the gist:

1. Quantify your uncertainty in an observable using a probability distribution
2. The distribution will have parameters which you do not know
3. Quantify your uncertainty in the parameters using probability
4. Collect observable data, which will give you your updated information about the parameters which you still do not know and which still have to be quantified by a probability distribution
5. Since you do not care about the parameters, and you do care about future observables, you quantify your uncertainty in these future observables given the uncertainty you still have in the parameters (through the information present in the old data).

If you stop at the parameters, step 4, then you are a regular Bayesian, and you will be too certain of yourself.

This Chapter shows you why. The computer code mentioned in the homework won’t be on-line for a week or so. Again, some of you won’t be able to see all Greek characters, and none of the pictures are given. You have to download the chapter. Here is the link.

### CHAPTER 9

Estimating and Observables

#### 1. Binomial estimation

In the 2007-2008 season, the Central Michigan football team won 7 out of 12 regular season games. How many games will they win in the 2008-2009 season? In Chapter 4, we learned to quantify the uncertainty in this number using a binomial distribution, but we assumed we knew $p$, the probability of winning any single game. If we do not know $p$, we can use the old data from last season to help us make a guess about its value. It helps to think of this old data as a string of wins and losses. So, for the old $x$, we saw $x_1 = 0, x_2 = 1, \ldots, x_{12} = 1$, which we can summarize by $k = \sum_i x_i$, where $k = 7$ is the total number of wins in $n = 12$ games.

Here’s the binomial distribution written with an unknown parameter

(see the book)

where $\theta$ is the success parameter and $k$ the number of successes we observed out of $n$ chances.

How do we estimate $\theta$? Two ways again, a classical and a modern. The classical consists of picking some function of the observed data and calling it $\hat{\theta}$, and then forming a confidence interval. In R we can get both at once with this function

binom.test(7,12)

where you will see, among other things (ignore those other things for now),

```
95 percent confidence interval:
 0.2766697 0.8483478
sample estimates:
probability of success
             0.5833333
```

This means that $\hat{\theta} = 7/12 \approx 0.58$, so again the estimate is just the arithmetic mean. The 95% confidence interval is 0.28 to 0.85. Easy. This confidence interval has the same interpretation as the one for $\mu$, which means you cannot say there is a 95% chance that $\theta$ is in this interval. You can only say, “either $\theta$ is in this interval or it is not.”
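If you want the numbers rather than the printout, the object that `binom.test` returns can be picked apart. A minimal sketch (the variable names are mine, not the book's):

```r
res <- binom.test(7, 12)            # 7 wins in 12 games

theta_hat <- unname(res$estimate)   # classical point estimate, k / n
ci <- as.numeric(res$conf.int)      # 95% confidence interval (Clopper-Pearson)

print(round(theta_hat, 4))   # 0.5833
print(round(ci, 4))          # 0.2767 0.8483
```

Base R documents this interval as the Clopper-Pearson ("exact") interval; other classical intervals exist and would give slightly different endpoints.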

Here is Bayes’s theorem again, written as functions like we did for the normal distribution

(see the book)

We know $p(k \mid n, \theta, E_B)$ (this is the binomial distribution), but we need to specify $p(\theta \mid E_B)$, which describes what we know about the success parameter before we see any data, given only $E_B$ ($p(k \mid n, E_B)$ will pop out using the same mathematics that gave us $p(x \mid E_N)$ in equation (17)). We know that $\theta$ can be any number between 0 and 1; we also know that it cannot be exactly 0 or 1 (see the homework). Since it can be any number between 0 and 1, and we have no a priori knowledge of which number is more likely than any other, it may be best to suppose that each possible value is equally likely. This is the flat prior again. (Like before, there are more choices for this prior distribution, but given even a modest sample size, the differences in the distribution of future observables due to them are negligible.) Again, technically $E_B$ should be modified to contain this information. After we take the data, we can plot $p(\theta \mid k, n, E_B)$ and see the entire uncertainty in $\theta$, or we can pick a “best” value, which is (roughly) $\hat{\theta} = 7/12 \approx 0.58$, or we can say that there is a 95% chance that $\theta$ is in the (approximate) interval 0.28 to 0.85. I say “roughly” and “approximate” here, because the classical approximation to the exact Bayesian solution isn’t wonderful for the binomial distribution when the sample size is small. The homework will show you how to compute the precise answers using R.
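The homework code is not part of this excerpt, but one way to get the exact answer can be sketched. With a flat prior and binomial data, the posterior for $\theta$ is a Beta distribution with parameters $k + 1$ and $n - k + 1$; that much is standard. The choice of an equal-tailed interval below is my assumption, and the book may summarize the posterior differently.

```r
k <- 7; n <- 12
a <- k + 1        # flat prior Beta(1, 1) updated with k successes
b <- n - k + 1    # ... and n - k failures

post_mode <- (a - 1) / (a + b - 2)          # most probable theta, equals k / n
cred_int  <- qbeta(c(0.025, 0.975), a, b)   # equal-tailed 95% credible interval

print(round(c(mode = post_mode, lower = cred_int[1], upper = cred_int[2]), 3))
```

Running `curve(dbeta(x, a, b), 0, 1)` afterwards draws the whole posterior. The exact interval comes out narrower than the classical 0.28 to 0.85, which is the sense in which the classical numbers are only approximate here.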

#### 2. Back to observables

In our hot little hands, we now have an estimate of $\theta$ which equals about 0.58. Does this answer the question we started with?

That question was “How many games will CMU win in the 2008-2009 season?” Knowing that $\theta$ equals something like 0.58 does not answer this. Knowing that there is a 95% chance that $\theta$ is some number between 0.28 and 0.85 also does not answer the question. This question is not about the unobservable parameter $\theta$, but about the future (in the sense of not yet seen) observable data. Now what? This is one of the key sections in this entire book, so take a steady pace here.

Suppose $\theta$ was exactly equal to 0.58. Then how many games will CMU win? We obviously don’t know the exact number even if we knew $\theta$, but we could calculate the probability of winning 0 through 12 games using the binomial distribution, just as we did in Chapters 3 and 4. We could even draw the picture of the entire probability distribution given that $\theta$ was exactly equal to 0.58. But $\theta$ might not be 0.58, right? There is some uncertainty in its value, which is quantified by $p(\theta \mid k_{old}, n_{old}, E_B)$, where now I have put the subscript “old” on the old data values to make it explicit that we are talking about the uncertainty in $\theta$ given previously observed data. The parameter might equal, say, 0.08, and it also might equal 0.98, or any other value between 0 and 1. In each of these cases, given that $\theta$ exactly equalled these numbers, we could draw a probability distribution for future games won, or $k_{new}$, given $n_{new} = 12$ (12 games next season) and given the value of $\theta$.

Let us draw the probability distribution expressing our uncertainty in $k_{new}$ given $n_{new} = 12$ (and $E_B$) for three different possible values of $\theta$.

(see the book)

If $\theta$ does equal 0.08, we can see that the most likely number of games won next season is 1. But if $\theta$ equals 0.58, the most likely number of games won is 7; while if $\theta$ equals 0.98, then CMU will most likely win all their games.
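The pictures are in the book, but the numbers behind them are easy to recompute. A small sketch using `dbinom`, with variable names of my own choosing:

```r
n_new <- 12
thetas <- c(0.08, 0.58, 0.98)

# One column of probabilities Pr(k_new = 0, ..., 12) per assumed theta.
probs <- sapply(thetas, function(th) dbinom(0:n_new, size = n_new, prob = th))
colnames(probs) <- paste0("theta=", thetas)

# Most likely number of wins under each theta
# (which.max is 1-based while k_new starts at 0, hence the - 1).
most_likely <- apply(probs, 2, which.max) - 1
print(most_likely)   # 1, 7 and 12 wins respectively
```

Calling `barplot(probs[, 2], names.arg = 0:n_new)` gives a rough version of the middle picture.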

This means that the picture on the far left describes our uncertainty in $k_{new}$ if $\theta = 0.08$. What is the probability that $\theta = 0.08$? We can get it from equation (19), from $p(\theta \mid k_{old} = 7, n_{old} = 12, E_B)$. The chance of $\theta = 0.08$ is about 1 in 100 million (we’ll learn how the computer does these calculations in the homework). Not very big! This means that we are very, very unlikely to have our uncertainty quantified by the picture on the left. What is the chance that $\theta = 0.98$? About 3 in a trillion! Even less likely. How about 0.58? About 3 in 10,000. Still not too likely, but far more likely than either of those other values.

We could go through the same exercise for all the other values that $\theta$ could take, each time drawing a picture of the probability distribution of $k_{new}$. Each one of these would have a certain probability of being the correct probability distribution for the future data, given that its value of $\theta$ was the correct value. But since we don’t know the actual value of $\theta$, but we do know the chance that $\theta$ takes any value, we can take a weighted sum of these individual probability distributions to produce one overall probability distribution that completely specifies our uncertainty in $k_{new}$ given all the possible values of $\theta$. This will leave us with

(see the book)

Stare at equation (20) for two minutes without blinking. This, in words, is the probability distribution that tells us everything we need to know about future observables $k_{new}$ given that we know there will be $n_{new}$ chances for success this year, also given that we have seen the past observables $k_{old}$ and $n_{old}$, and assuming $E_B$ is true. Think about this. You do not know what future values of $k$ will be, do you? You do know what the past values are, right? So this is the way to describe your uncertainty in what you do not know given what you do know, taking full account of the uncertainty in $\theta$, which is not of real interest anyway.

The way to get to this equation uses math that is beyond what we can do in this class, but that is unimportant, because the software can handle it for you. This picture shows you what happens. The solid lines are the probability distribution in equation (20). The circles plotted over it are the probability distribution of a regular binomial assuming $\theta$ exactly equals 0.58. The key thing to notice is that the circles distribution, which assumes $\theta = 0.58$, is too tight, too certain. It says the center values of 6 to 8 are more certain than is warranted (their probability is higher than in the actual distribution). It agrees, coincidentally only, with the probability that the future number of wins will be 5 or 9, but then gives too little probability for wins less than 5 or greater than 9.

The actual distribution of future observable data (20) will always be wider, more diffuse and spread out, less certain, than any distribution with a fixed $\theta$. This means we must account for uncertainty in the parameter. If we do not, we will be too certain. And if all we do is focus on the parameter, using classical or Bayesian estimates, and we do not think about the future observables, we will be far, far more certain than we should be.
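Equation (20) itself is only in the book, so treat the following as a sketch of the idea rather than the book's code. It assumes the flat prior, in which case the weighted sum over all values of $\theta$ has a closed form (the beta-binomial distribution), and it compares that with the plug-in binomial that fixes $\theta$ at 7/12.

```r
k_old <- 7; n_old <- 12; n_new <- 12
a <- k_old + 1            # posterior for theta is Beta(a, b) under a flat prior
b <- n_old - k_old + 1

k_new <- 0:n_new
# Predictive: binomial probabilities averaged over the Beta(a, b) posterior.
predictive <- choose(n_new, k_new) *
  beta(a + k_new, b + n_new - k_new) / beta(a, b)
# Plug-in: binomial with theta fixed at its best guess k_old / n_old.
plug_in <- dbinom(k_new, n_new, k_old / n_old)

print(round(cbind(k_new, predictive, plug_in), 3))
```

In the printed table the plug-in column is higher for 6 to 8 wins and lower in both tails, which is the over-certainty described above.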

#### 3. Even more observables

Let’s return to the petanque example and see if we can do the same thing for the normal distribution that we just did for the binomial. The classical guess of the central parameter was $\hat{\mu} = -1.8$ cm, which matches the best guess Bayesian estimate. The confidence/credible interval was -6.8 cm to 2.8 cm. In modern statistics, we can say that there is a 95% chance that $\mu$ is in this interval. We also have a guess for $\sigma$, and a corresponding interval, but I didn’t show it; the software will calculate it. We do have to think about $\sigma$ as well as $\mu$, however: both parameters are necessary to fully specify the normal distribution.

As in the binomial example, we do not know what the exact value of $(\mu, \sigma)$ is. But we have the posterior probability distribution $p(\mu, \sigma \mid x_{old}, E_N)$ to help us make a guess. For every particular possible value of $(\mu, \sigma)$, we can draw a picture of the probability distribution for future $x$ given that that particular value is the exact value.

(see the book)

The picture shows the probability densities for $x_{new}$ for three possible values of $(\mu, \sigma)$. If ($\mu = -6.8$ cm, $\sigma = 4.4$ cm), the most likely values of $x_{new}$ are around -7 cm, with most probability given to values from -20 cm to 0 cm. On the other hand, if ($\mu = 2.8$ cm, $\sigma = 8.4$ cm), the most likely values of $x_{new}$ are a little larger than 0 cm, but with most probability for values between -20 cm and 30 cm. If ($\mu = -1.8$ cm, $\sigma = 6.4$ cm), future values of $x$ are intermediate between the other two guesses. These three pictures were drawn (using the Advanced code from Chapter 5) assuming that the values of $(\mu, \sigma)$ are the correct ones. Of course, they might be the right values, but we do not know that. Instead, each of these three guesses, and every other possible combination of $(\mu, \sigma)$, has a certain probability, given $x_{old}$, of being true.

Given the old data, we can calculate the probability that $(\mu, \sigma)$ equals each of these guesses (and equals every other possible combination of values). We can then weight each of the new $x$ distributions according to these probabilities and draw a picture of the distributions of new values given old ones (and the evidence $E_N$) like we just did for the binomial distribution. This is

(see the book)

Here is a picture of this distribution (generated by the computer, of course)

(see the book)

The solid line is equation (21), and the dashed line is a normal distribution with ($\mu = -1.8$ cm, $\sigma = 6.4$ cm). The two distributions do not look very different, but they certainly are, especially for very large or very small values of $x_{new}$. The dashed line is too narrow, giving too much probability for too narrow a range of $x_{new}$. In fact, under distribution (21), values greater than 10 cm are twice as likely as under the normal distribution where we plugged in a single guess of $(\mu, \sigma)$; values greater than 20 cm are six times as likely. The same thing is repeated for values less than -10 cm, or less than -20 cm, and so on. Go back and read Chapter 6 to refamiliarize yourself with the fact that very small changes in the central or variance parameter can cause large changes in the probability of extreme numbers.

The point again, like in the binomial example, is that using the plug-in normal distribution, the one where you assume you know the exact value of $(\mu, \sigma)$, leads you to be far more certain than you really should be. You need to take full account of the uncertainty in your guesses of $(\mu, \sigma)$; only then will you be able to fully quantify the uncertainty in the future values $x_{new}$.
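A rough sketch of the same comparison for the normal case follows. Several things here are assumptions and not taken from the book: it uses the common reference prior $p(\mu, \sigma) \propto 1/\sigma$, under which the predictive distribution is a Student t with $n - 1$ degrees of freedom, it treats 6.4 cm as the sample standard deviation, and the sample size `n` is hypothetical because the excerpt never states it. The ratios it prints will therefore not reproduce the "twice" and "six times" quoted above.

```r
xbar <- -1.8   # best guess for mu, in cm (from the text)
s    <- 6.4    # best guess for sigma, in cm (treated as the sample sd)
n    <- 9      # HYPOTHETICAL sample size; not given in the excerpt

pred_scale <- s * sqrt(1 + 1 / n)   # predictive scale is wider than s

cutoffs <- c(10, 20)                # cm
# Pr(x_new > cutoff) when (mu, sigma) are plugged in as if known.
plug_in_tail <- pnorm(cutoffs, mean = xbar, sd = s, lower.tail = FALSE)
# Pr(x_new > cutoff) after averaging over the uncertainty in (mu, sigma).
predictive_tail <- pt((cutoffs - xbar) / pred_scale, df = n - 1,
                      lower.tail = FALSE)

print(round(cbind(cutoffs, plug_in_tail, predictive_tail,
                  ratio = predictive_tail / plug_in_tail), 4))
```

Whatever value of `n` you try, the predictive tail probabilities stay larger than the plug-in ones, and the gap widens as the cutoff moves further out.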

May 27, 2008 | No comments

## Stats 101: Chapter 8

This is where it starts to get complicated; this is where old school statistics and new school start diverging. And I don’t even start the new new school.

Parameters are defined and then heavily deemphasized. Nearly the entire purpose of old and new school statistics is devoted to unobservable parameters. This is very unfortunate, because people go away from a parameter analysis far, far too certain about what is of real interest. Which is to say, observable data. New new school statistics acknowledges this, but not until Chap 9.

Confidence intervals are introduced and fully disparaged. Few people can remember that a confidence interval has no meaning; which is a polite way of saying they are meaningless. In finite samples of data, that is, which are the only samples I know about. The key bit of fun is summarized. You can only make one statement about your confidence interval, i.e. the interval you created using your observed data, and it is this: this interval either contains the true value of the parameter or it does not. Isn’t that exciting?

Some, or all, of the Greek letters below might not show up on your screen. Sorry about that. I haven’t the time to make the blog posting look as pretty as the PDF file. Consider this, as always, a teaser.

### CHAPTER 8

Estimating

#### 1. Background

Let’s go back to the petanque example, where we wanted to quantify our uncertainty in the distance $x$ the boule landed from the cochonette. We approximated this using a normal distribution with parameters $m = 0$ cm and $s = 10$ cm. With these parameters in hand, we could easily quantify uncertainty in questions like $X$ = “The boule will land at least 17 cm away” with the formula $\Pr(X \mid m = 0 \text{ cm}, s = 10 \text{ cm}, E_N) = \Pr(x > 17 \text{ cm} \mid m = 0 \text{ cm}, s = 10 \text{ cm}, E_N)$. R even gave us the number with 1-pnorm(17,0,10) (about 4.5%). But where did the values of $m = 0$ cm and $s = 10$ cm come from?

It was easy to compute the probability of statements like $X$ when we knew the probability distribution quantifying its uncertainty and the value of that distribution’s parameters. In the petanque example, this meant knowing that $E_N$ was true and also knowing the values of $m$ and $s$. Here, knowing means just what it says: knowing for certain. But most of the time we do not know $E_N$ is true, nor do we know the values of $m$ and $s$. In this Chapter, we will assume we do in fact know $E_N$ is true. We won’t question that assumption until a few Chapters down the road. But, even given $E_N$ is true, we still have to discern the values of its parameters somehow.

So how do we learn what these values are? There are some situations where we are able to deduce either some or all of the parameters’ values, but these situations are shockingly few in number. Nearly all the time, we are forced to guess. Now, if we do guess (and there is nothing wrong with guessing when you do not know), it should be clear that we will not be certain that the values we guessed are the correct ones. That is to say, we will be uncertain, and when we are uncertain what do we do? We quantify our uncertainty using probability.

At least, that is what we do nowadays. But then-a-days, people did not quantify their uncertainty in the guesses they made. They just made the guesses, said some odd things, and then stopped. We will not stop. We will quantify our uncertainty in the parameters and then go back to what is of main interest, questions like what is the probability that X is true? X is called an observable, in the sense that it is a statement about an observable number x, in this case an actual, measurable distance. We do not care about the parameter values per se. We need to make a guess at them, yes, otherwise we could not get the probability of X. But the fact that a parameter has a particular value is usually not of great interest.

It isn’t of tremendous interest nowadays, but again, then-a-days, it was the only interest. Like I said, people developed a method to guess the parameter values, made the guess, then stopped. This has led people to be far too certain of themselves, because it’s easy to confuse the values of the parameters with the values of the observables. And when I tell you that then-a-days was only as far away as yesterday, you might start to be concerned.

Nearly all of classical statistics, and most of Bayesian statistics, is concerned with parameters. The advantage the latter method has over the former is that Bayesian statistics acknowledges the uncertainty in the parameter guesses and quantifies that uncertainty using probability. Classical statistics, still the dominant method in use by non-statisticians, makes some bizarre statements in order to avoid directly mentioning uncertainty. Since classical statistics is ubiquitous, you will have to learn these methods so you can understand the claims people (attempt to) make.

#### 2. Parameters and Observables

Here is the situation: you have never heard of petanque before and do not know a boule from a bowl from a hole in the ground. You know that you have to quantify $x$, which is some kind of distance. You are assuming that $E_N$ is true, and so you know you have to specify $m$ and $s$ before you can make a guess about any value of $x$.

Before we get too far, let’s set up the problem. When we know the values of the parameters, like we have so far, we write them in Latin letters, like $m$ and $s$ for the normal, or $p$ for the binomial. We always write unknown and unobservable parameters as Greek letters, usually $\mu$ and $\sigma$ for the normal and $\theta$ for the binomial. Here is the normal distribution (density function) written with unknown parameters:

(see the book)

where ? is the central parameter, and ? 2 is the variance parameter, and where the equation is written as a function of the two unknowns, N(?, ?). This emphasizes that we have a different uncertainty in x for every possible value of ? and ? (it makes no difference if we talk of ? or ? 2 , one is just the square root of the other).

You may have wondered what was meant by that phrase “unobservable parameters” last paragraph (if not, you should have wondered). Here is a key fact that you must always remember: not you, not me, not anybody, can ever measure the value of a parameter (of a probability distribution). They simply cannot be seen. We cannot even see the parameters when we know their values. Parameters do not exist in nature as physical, measurable entities. If you like, you can think of them as guides for helping us understand the uncertainty of observables. We can, for example, observe the distance the boule lands from the cochonette. We cannot, however, observe the m even if we know its value, and we cannot observe ? either. Observables, the reason for creating the probability distributions in the first place, must always be of primary interest for this reason.

So how do we learn about the parameters if we cannot observe them? Usually, we have some past data, past values of x, that we can use to tell us something about that distribution’s parameters. The information we gather about the parameters then tells us something about data we have not yet seen, which is usually future data. For example, suppose we have gathered the results of hundreds, say 200, of past throws of boules. What can we say about this past data? We can calculate the arithmetic mean of it, the median, the various quantiles and so on. We can say this many throws were greater than 20 cm, this many less. We can calculate any function of the observed data we want (means and medians etc. are just functions of the data), and we can make all these calculations never knowing, or even needing to know, what the parameter values are. Let me be clear: we can make just about any statement we want about the past observed data and we never need to know the parameter values! What possible good are they if all we wanted to know was about the past data?
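To make the point concrete, here is a small R sketch. The ten distances are invented purely for illustration (they are not data from the book), and the name `x_old` is just a convenient label. Every line is a plain function of the observed numbers; no parameter appears anywhere.

```r
# Hypothetical distances (cm) from boule to cochonette for ten past throws.
# These values are made up for illustration only.
x_old <- c(12, 25, 8, 31, 19, 22, 5, 40, 17, 21)

mean(x_old)                      # arithmetic mean of the past throws
median(x_old)                    # middle value of the past throws
quantile(x_old, c(0.25, 0.75))   # first and third quartiles
sum(x_old > 20)                  # how many throws landed more than 20 cm away
sum(x_old < 20)                  # how many landed closer than 20 cm
```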

There is only one reason to learn anything about the parameters. This is to make statements about future data, or more generally about data that we have not yet seen. That data may be old, say archaeological data; all that matters is that it is unknown to you (and what does “unknown” mean?). That is it. Take your time to understand this. We have, in hand, a collection of data x_old, and we know we can compute any function (mean etc.) we want of it, but we know we will, at some time, see new data x_new (data we have not yet seen), and we want to now say something about this x_new. We want to quantify our uncertainty in x_new, and to do that we need a probability distribution, and a probability distribution needs parameters.

The main point again: we use old data to make statements about data we have not yet seen.

May 22, 2008 | 8 Comments

## Stats 101: Chapter 7

#### Update. I idiotically forgot to put a link. Here it is.

Chapter 7 is Reality. This is usually Chapter 1 in most intro stats books. Those other books invariably start students with topics like “measures of central tendency” and “kinds of experiments” etc. Nothing necessarily wrong with any of this, but the student usually has no idea why he should care about “central tendency” in the first place. Why memorize formulas for means and (population or other) standard deviations? What use are these things in understanding how to quantify uncertainty?

So I put these topics off until the reader realizes that understanding uncertainty is paramount. The whole chapter is nuts and bolts about how to read data into R and do some elementary manipulations. Like Chapter 5, it’s not thrilling reading, but necessary. The homework for 7 asks readers to download a set of R functions at http://wmbriggs.com/book/Rcode.R, but it’s not there yet because I’m still polishing the code.

Some of the formatting is off in the LaTeX source, but I won’t fix that until I’m happy with the final text. No pictures are here; all are in the book.

### CHAPTER 7

Reality

#### 1. Kinds of data

Somewhere, sometime, somehow, somebody is going to ask you to create some kind of data set (that time is sooner than you think; see the homework). Here is an example of such a set, written as you might see it in a spreadsheet (a good, free open-source spreadsheet is Open Office, www.openoffice.org):

| Q1 | … | Sex | Income | Nodules | Ridiculous |
|---|---|---|---|---|---|
| rust | … | M | 10 | 7 | Y |
| taupe | … | F | | 3 | N |
| … | | | | | |
| ochre | … | F | 12 | 2 | Y |

This data is part of a survey asking people their favorite colors (Q1), while recording their sex, annual income, the number of sub-occipital nodules on their brain, and whether the interviewee thought the subject ridiculous. There is a lot we can learn from this simple fragment.

The first is to always use full, readable, English names for the variables. What about Q1, which was indeed the first question on the survey? Why not just call it “Q1”? “Q1” is a lot easier to type than “favorite color”. Believe me, two weeks after you store this data, you will not, no matter how much you swear you will, remember that Q1 was favorite color. Neither will anybody else. And nobody will be able to guess that Q1 means favorite color.

Can you suggest a better name? How about “favcol”, which has fewer letters than “favorite color”, and is therefore easier to type? What are you, lazy? You can’t type a few extra letters to save yourself a lot of grief later on?

How about just “favorite color”? Well, not so good either, because why? Because of that space between “favorite” and “color”; most software cannot handle spaces in names. Alternatives are to put an underscore or a period between the words: “favorite_color” or “favorite.color”. Some people like to cram the words together camel style, like “favoriteColor” (the occasional bump of capital letters is supposed to look like a camel: I didn’t name it). Whichever style you choose, be consistent! In any case, nobody will have any trouble understanding that “favoriteColor” means “favorite color”.
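As a quick illustration of what R itself does with a space in a name, here is a short sketch. `make.names()` is a base R function, and `read.csv()` applies it to column headers by default; the two-line CSV text is a made-up stand-in for a real file.

```r
# make.names() converts characters that are not allowed in an R name,
# such as a space, into a period.
make.names("favorite color")

# read.csv() runs headers through make.names() by default
# (check.names = TRUE), so a header with a space arrives with a period.
# The inline text below is invented for illustration.
d <- read.csv(text = "favorite color,Sex\ntaupe,F")
names(d)
```

Choosing a clean name yourself, before the data ever reaches the software, avoids this kind of silent renaming.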

Notice, too, that the colors entered under “Q1” use the full English name for the color. Spaces are OK in the actual data, just not in variable names: for example, “burnt orange” is fine. Do not do what many sad people do and use a code for the colors. For example, 1=taupe, 2=envy green, 3=fuchsia, etc. What are you trying to do with a code anyway? Hide your work from Nazi spies? Never use codes.

That goes for variables like “Sex”, too. I cannot tell you how many times I have opened up a data set where I have seen Sex coded as “1” and “2”, or “0” and “1”. How can anybody remember which number was which sex? They cannot. And there is no reason to. With data like this, abbreviation is harmless. Nobody, except for the politically correct, will confuse the fact that “M” means male and “F” female. But if you are worried about it, then type out the whole thing.

Similarly for “Ridiculous”, where I have used the abbreviation “Y” for yes and “N” for no. Sometimes a “0” and “1” for “N” and “Y” are acceptable. For example, in the data set we’ll use in a moment, “Vomiting” is coded that way. And, after all, 0/1 is the binary no/yes of computer language, so this is OK. But if there is the least chance of ambiguity for a data value, type the whole answer out. Do not be lazy; you will be saving yourself time later.

It should be obvious, but store numbers as numbers. Height, weight, income, age, etc., etc. Do not use any symbols with the numbers. Store a weight as “213” and not “213 lbs”. If you are worried you will forget that weight is in pounds, name the variable Weight.LBS or something similar.

What if one of your interviewees refused to answer a question? This will often happen for questions like “Income”. How should you code that? Leave his answer blank! For God’s sake, whatever you do, do not think you are being clever and put in some mystery code that, to you, means “missing.” I have seen countless times where somebody thought that putting in a “99” or a “999” for a missing income was a good idea. The computer does not know that 999 means “missing”; it thinks it is just what it looks like: the number 999. So when you compute an average income, that 999 becomes part of the average. Also don’t use a period, the full stop. That’s a holdover from an ancient piece of software (that some people are still forced to use).
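Here is a small R sketch of the damage a mystery code does. The incomes are invented for illustration, and the variable names are hypothetical.

```r
# Hypothetical incomes (in thousands); the third person refused to answer.
income_coded <- c(10, 12, 999, 14)   # refusal disguised as the number 999
income_blank <- c(10, 12, NA, 14)    # refusal left blank; R stores it as NA

mean(income_coded)                   # the 999 is averaged in like any number
mean(income_blank, na.rm = TRUE)     # the missing value is skipped

# A blank numeric cell in a CSV file is read in as NA automatically.
# The inline text is a made-up stand-in for a real file.
d <- read.csv(text = "Sex,Income\nM,10\nF,\nF,12")
is.na(d$Income)                      # flags the blank entry as missing
```

Note that `mean()` only skips missing values when asked to with `na.rm = TRUE`; without it the result is `NA`, which at least tells you that something is missing.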

There are times when an answer is purposely missing, and a blank should not be used. For example, if “Income” is less than 20000, then the interviewee gets an extra question that people who make more than 20000 do not get. Usually, this kind of rule can be handled trivially in the analysis, but if you want to show that somebody should not have answered and not that they did not answer, then use a code such as “PM” for “purposely missing”. Even better would be to write “purposely missing”, so that somebody who is looking at your data three months down the road doesn’t have to expend a great deal of energy on interpreting what “PM” means.

Try to use a real database to store your data, and keep away from spreadsheets if you can. A real database can be coded so that all possible responses for a variable like “Race” are pre-coded, eliminating the chance of typos, which are certain to occur in spreadsheets.

Here’s something you don’t often get from those other textbooks, but which is a great truth. You will spend from 80 to 90% of your time in any statistical analysis just getting the data into a form readable by you and your software. This may sound like the kind of thing you often hear from teachers, while you think to yourself, “Ho, ho, ho. He has to tell us things like that just to give us something to worry about. But it’s a ridiculous exaggeration. I’ll either (a) spend 10-15% of my time, or (b) have somebody do it for me.” I am here to tell you that the answers to these are (a) there is no known way in the universe for this to be true, and (b) Ha ha ha!

#### 2. Databases

The absolute best thing to do is to store your data in a database. I often use the free and open source MySQL (.com, of course). Knowing how to design, set up, and use such a database is beyond what most people want to do on their own. So most, at least for simple studies, opt for spreadsheets. These can be fine, though they are prone to error, usually typos. For instance, the codings “Y” and “Y ” might look the same to you, but they are different inside a computer: one has a space, one doesn’t. The computer thinks these are as different as “Q” and “W”. This kind of typo is extraordinarily common because you cannot see blank spaces easily on a computer screen. To see if you have suffered from it, after you get your data into R type `levels(myVariableName)` (substituting your own variable’s name) and each of the levels, like “Y” and “Y ”, will be displayed. If you see something like this, you’ll have to go back to your spreadsheet and locate the offending entries and correct them.
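The check can be sketched as follows, using a made-up vector of answers rather than a real file. One caveat: since R 4.0.0, `read.csv()` no longer turns text columns into factors by default, so `levels()` on such a column may return `NULL`; wrapping the column in `factor()`, or using `unique()`, works either way. The `trimws()` step at the end is an alternative fix inside R, offered here as a suggestion; it is not the correct-the-spreadsheet remedy recommended above.

```r
# Hypothetical answers; the third "Y" was typed with a trailing space.
ridiculous <- c("Y", "N", "Y ", "Y")

levels(factor(ridiculous))   # three levels appear where two were intended
table(ridiculous)            # counts per level make the stray entry visible

# Possible in-R repair (an assumption, not the text's method):
# strip leading and trailing blanks, then check the levels again.
cleaned <- trimws(ridiculous)
levels(factor(cleaned))
```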

A lot of overhead is built into spreadsheets. Most of it has to do with prettifying the rows and columns: bold headings, colored backgrounds, and so on. Absolutely none of this does anything for the statistical analysis, so we have to simplify the spreadsheet a bit.

The most common way to do this is to save the spreadsheet as a CSV file. CSV stands for Comma Separated Values. It means exactly what it says. The values from the spreadsheet are saved to an ordinary text file (ASCII file), and each column is separated by a comma. An example from one row from the dataset we’ll be using is

`0,0,0,0,39,"black","male","Y",17.1,80,102.4,0`

Note the clever insertion of commas between each value.

What this means is that you cannot actually use commas in your data. For example, you cannot store an income value as “10,000”; instead, you should use “10000”. Also note that there is no dollar sign.

Now, in some countries, where the tendrils of modern society have not yet reached, people unfortunately routinely use commas in place of decimal points. Thus, “3.42” written here is “3,42” written there. You obviously cannot save the latter in a CSV file because the computer will think that comma in “3,42” is one of the commas that separates the values, which it is not. The way to overcome this without having to change the data is to change the delimiter to something other than a comma; perhaps a semicolon or a pound sign; any kind of symbol which you know won’t be in the regular data. For example, if you used an @ symbol, your CSV file would look like

`0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0`

The only trick will be figuring out how to do this. In Open Office, it’s particularly easy: after opening up the spreadsheet and selecting “Save As”, select the box “Edit Filter settings” and choose your own symbol instead of the default comma. A common mistake is to type an entry into, say, an Opinion variable, where a person’s exact words are the answer. Guard against using a comma in these words, else the computer will think you have extra variables: the computer thinks there is a variable between each comma.
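On the R side, reading such a file only requires telling `read.csv()` which delimiter was used. The sketch below feeds the example row in as inline text so that it runs without a file; no header row is assumed, so R names the columns V1, V2, and so on. The second part uses `read.csv2()`, base R's variant for semicolon-delimited files with a comma as the decimal mark; its two-line input is made up for illustration.

```r
# The example row, delimited by @ instead of commas.
line <- '0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0'
one_row <- read.csv(text = line, sep = "@", header = FALSE)
ncol(one_row)    # number of values found in the row
one_row$V5       # the fifth value

# Semicolon delimiter with a decimal comma (invented two-line input).
eu <- read.csv2(text = "a;b\n3,42;7")
eu$a             # read as a number, not as text
```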

#### 3. Summaries

It’s finally time to play with real data. This is, in my experience, another panic point. But it need not be. Just take your time and follow each step. It is quite easy.

The first trick is to download the data onto your computer. Go to the book website and download the file appendicitis.csv and save it somewhere on your hard disk in a place you can remember. The place where it is is called the path. That is, your hard drive has a sort of hierarchy, a map where the files are stored. If you are on a Windows machine, this is usually the `C:/` drive (yes, the slash is backwards on purpose, because R thinks like a Linux computer, or Apple, which has the slashes the other way). Create your own directory, say, mydata (do not put a space in the name of the folder), and put the appendicitis file there. So the path to the file is `C:/mydata/appendicitis.csv`. Easy, right? If you are on a Linux or Mac, it’s the same idea. The path on a Mac is usually something like `/Users/YOURNAME/mydata/appendicitis.csv`. On a Linux box it might be `/home/YOURNAME/mydata/appendicitis.csv`. Simple!

Open R. Then type this exact command:

`x = read.csv(url("http://wmbriggs.com/book/appendicitis.csv"))`

There is a lot going on here, so let’s go through it step by step. Ignore the `x =` bit for a moment and concentrate on the part that reads `read.csv(...)`. This built-in R function reads a CSV file. Well, what else would you have expected from its name? Inside that function is another one called `url()`, whose argument is the same thing you type into any web browser. The thing you type is called the URL, the Uniform Resource Locator, or web address. What we are doing is telling R to read a CSV file directly off the web. Pretty neat!

If you had saved the file directly to your hard drive, you would have loaded it like this

`x = read.csv("C:/mydata/appendicitis.csv")`

where you have to substitute the correct path, but which is otherwise just as easy.
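If you are unsure whether you have the path right, a small check before reading can save some head scratching. This is only a sketch: the path is the example one from above, and `file.path()` and `file.exists()` are base R functions.

```r
# Build the example path; file.path() joins the pieces with forward slashes.
path <- file.path("C:", "mydata", "appendicitis.csv")
path

# Read the file only if it is really there; otherwise say so.
if (file.exists(path)) {
  x <- read.csv(path)
} else {
  message("No file found at ", path, "; check the path and try again.")
}
```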

The last thing to know is that when the CSV file is read in it is stored in R’s memory in the object I called x. R calls these objects data frames. Why didn’t they call them data sets? I have no idea. How did I know to use an x, why did I choose that name to store my data? No reason at all except habit. You can call the dataset anything you want. Call it mydata if you want. It just doesn’t matter.

Now type just `x` and hit enter. You’ll see all the data scroll by. Too much to look at, so let’s summarize it:

`summary(x)`

This is data taken on patients admitted to an emergency room with right lower quadrant pain (in the area the appendix is located) in order to find a model to better predict appendicitis (Birkhahn et al., 2006). Each of the variables was thought to have some bearing on this question. We’ll talk more about this data later. Right now, we’re just playing around. When we run the command we get the summary statistics for each variable in x. What it shows is the mean, which is just the arithmetic average of the data, the median, which is the point at which 50% of the data values are larger and 50% smaller, the 1st Qu., which is the first quartile and is the point at which 25% of the data values are smaller, the 3rd Qu., which is the third quartile and is the point at which 75% of the data values are smaller (and 25% are larger, right?). Also given is the Min., which is the minimum value, and the Max., which is the maximum. Last is NA’s, which is the number, if any, of missing values. These kinds of statistics only show for data coded as numbers, i.e. numerical data. For data that is textual, also called categorical or factorial data, the first few levels of categories are shown with a count of the number of rows (observations) that are in that category.
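If you cannot reach the website, the same behavior can be seen on a tiny stand-in. The data frame below is invented for illustration (its column names and values are not taken from appendicitis.csv); it has one numerical column with a missing value and one categorical column.

```r
# A made-up five-row data frame standing in for the real data.
toy <- data.frame(
  Age  = c(39, 25, NA, 61, 47),
  Race = factor(c("black", "white", "white", "asian", "black"))
)

summary(toy)       # numerical column: Min., quartiles, Median, Mean, Max., NA's
                   # categorical column: a count of rows in each level
summary(toy$Age)   # the same statistics for a single variable
```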

You will notice that variables like Pregnancy are not categorical, but are numerical, which is why we see the statistics and not a category count. Pregnancy is a 0/1 variable and is technically categorical; however, like I said above, it is obvious that “0” means “not pregnant”, so there is no ambiguity. The advantage to storing data in this way is that the numerical mean is then the proportion of people having Pregnancy =1 (think about this!).
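The claim about the mean is easy to check. In this sketch the ten 0/1 values are made up for illustration; they are not the Pregnancy values from the dataset.

```r
# Hypothetical 0/1 indicator for ten patients (1 = yes, 0 = no).
pregnancy <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)

mean(pregnancy)                            # the arithmetic mean...
sum(pregnancy == 1) / length(pregnancy)    # ...equals the proportion of 1s
```

Adding up 0s and 1s just counts the 1s, and dividing by the number of rows turns that count into a proportion.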

Let’s just look at the variable Age for now. It turns out we can apply the summary function on individual variables, and not just on data frames. Inside the computer, the variable age is different than Age (why?). So try `summary(Age)`. What happens? You get the error message `Error in summary(Age) : object "Age" not found.` But it’s certainly there!

You can read lots of different datasets into R at the same time, which is very convenient. I work on a lot of medical datasets and every one of them has the variable Age. How does R know which Age belongs to which dataset? By only recognizing one dataset at a time, through the mechanism of attaching the dataset directly to memory, to R’s internal search path. To attach a dataset, type

`attach(x)`

Yes, this is painful to remember, but necessary to keep different datasets separate. Anyway, try `summary(Age)` again (by using the up arrow on your keyboard to recall previously typed commands) and you’ll see it works.
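For completeness, base R also offers two ways of naming the dataset explicitly instead of attaching it: the `$` operator and `with()`. These are alternatives, not the approach used in this chapter, and the little data frame here is made up so the sketch runs on its own.

```r
# A made-up data frame with a single Age column.
toy <- data.frame(Age = c(39, 25, 61, 47))

summary(toy$Age)          # "the Age that lives in toy"
with(toy, summary(Age))   # evaluate summary(Age) inside toy

# Names are case sensitive: Age is a column of toy, age is not.
"Age" %in% names(toy)
"age" %in% names(toy)
```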

Incidentally, summary is one of those functions that you can always try on anything in R. You can’t break anything, so there is no harm in giving it a go.