Archive for May 22nd, 2008

May 22 2008

Stats 101: Chapter 7

Published by Briggs under Uncategorized

Update #2. I moronically uploaded a blank document. I have no idea how. It’s all better now.

Update. I idiotically forgot to put a link. Here it is.

Chapter 7 is Reality. This is usually Chapter 1 in most intro stats books. Those other books invariably start students with topics like “measures of central tendency” and “kinds of experiments” etc. Nothing necessarily wrong with any of this, but the student usually has no idea why he should care about “central tendency” in the first place. Why memorize formulas for means and (population or other) standard deviations? What use are these things in understanding how to quantify uncertainty?

So I put these topics off until the reader realizes that understanding uncertainty is paramount. The whole chapter is nuts and bolts about how to read data into R and do some elementary manipulations. Like Chapter 5, it’s not thrilling reading, but necessary. The homework for 7 asks readers to download a set of R functions at http://wmbriggs.com/book/Rcode.R, but it’s not there yet because I’m still polishing the code.

Some of the formatting is off in the LaTeX source, but I won’t fix that until I’m happy with the final text. No pictures are here; all are in the book.

CHAPTER 7

Reality

1. Kinds of data

Somewhere, sometime, somehow, somebody is going to ask you to create some kind of data set (that time is sooner than you think; see the homework). Here is an example of such a set, written as you might see it in a spreadsheet (a good, free open-source spreadsheet is Open Office, www.openoffice.org):

Q1, …, Sex, Income, Nodules, Ridiculous
rust, …, M, 10, 7 , Y
taupe, …, F, , 3 , N
….
ochre, …, F, 12, 2 , Y

This data is part of a survey asking people their favorite colors (Q1), while recording their sex, annual income, the number of sub-occipital nodules on their brain, and whether the interviewee thought the subject ridiculous. There is a lot we can learn from this simple fragment.

The first is to always use full, readable, English names for the variables. What about Q1, which was indeed the first question on the survey? Why not just call it “Q1”? “Q1” is a lot easier to type than “favorite color”. Believe me, two weeks after you store this data, you will not, no matter how much you swear you will, remember that Q1 was favorite color. Neither will anybody else. And nobody will be able to guess that Q1 means favorite color.

Can you suggest a better name? How about “favcol”, which has fewer letters than “favorite color”, and is therefore easier to type? What are you, lazy? You can’t type a few extra letters to save yourself a lot of grief later on?

How about just “favorite color”? Well, not so good either, because why? Because of that space between “favorite” and “color”; most software cannot handle spaces in names. Alternatives are to put an underscore or a period between the words: “favorite_color” or “favorite.color”. Some people like to cram the words together camel style, like “favoriteColor” (the occasional bump of capital letters is supposed to look like a camel: I didn’t name it). Whichever style you choose, be consistent! In any case, nobody will have any trouble understanding that “favoriteColor” means “favorite color”.

Notice, too, that the colors entered under “Q1” use the full English name for the color. Spaces are OK in the actual data, just not in variable names: for example, “burnt orange” is fine. Do not do what many sad people do and use a code for the colors. For example, 1=taupe, 2=envy green, 3=fuchsia, etc. What are you trying to do with a code anyway? Hide your work from Nazi spies? Never use codes.

That goes for variables like “Sex”, too. I cannot tell you how many times I have opened up a data set where I have seen Sex coded as “1” and “2”, or “0” and “1”. How can anybody remember which number was which sex? They cannot. And there is no reason to. With data like this, abbreviation is harmless. Nobody, except for the politically correct, will confuse the fact that “M” means male and “F” female. But if you are worried about it, then type out the whole thing.

Similarly for “Ridiculous”, where I have used the abbreviation “Y” for yes and “N” for no. Sometimes a “0” and “1” for “N” and “Y” are acceptable. For example, in the data set we’ll use in a moment, “Vomiting” is coded that way. And, after all, 0/1 is the binary no/yes of computer language, so this is OK. But if there is the least chance of ambiguity for a data value, type the whole answer out. Do not be lazy; you will be saving yourself time later.

It should be obvious, but store numbers as numbers. Height, weight, income, age, etc., etc. Do not use any symbols with the numbers. Store a weight as “213” and not “213 lbs”. If you are worried you will forget that weight is in pounds, name the variable Weight.LBS or something similar.

What if one of your interviewees refused to answer a question? This will often happen for questions like “Income”. How should you code that? Leave his answer blank! For God’s sake, whatever you do, do not think you are being clever and put in some mystery code that, to you, means “missing.” I have seen countless times where somebody thought that putting in a “99” or a “999” for a missing income was a good idea. The computer does not know that 999 means “missing”; it thinks it is just what it looks like: the number 999. So when you compute an average income, that 999 becomes part of the average. Also don’t use a period, the full stop. That’s a holdover from an ancient piece of software (that some people are still forced to use).
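To see what a mystery code does to an average, here is a minimal sketch in R. The incomes are made up for illustration; the only assumption is that a blank in a numeric column arrives in R as NA, which is how read.csv treats empty numeric fields.

```r
# Hypothetical incomes (in thousands); the third person refused to answer.
income_coded <- c(10, 12, 999, 15)   # refusal hidden behind a "999" code
income_blank <- c(10, 12, NA, 15)    # refusal left blank, so R stores NA

mean(income_coded)                   # the fake 999 is averaged in with the real answers
mean(income_blank)                   # NA: R will not quietly guess at a missing value
mean(income_blank, na.rm = TRUE)     # average of the three real answers only
```

The NA forces you to decide, out loud, what to do about the missing answer; the 999 just silently ruins the mean.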

There are times when an answer is purposely missing, and a blank should not be used. For example, if “Income” is less than 20000, then the interviewee gets an extra question that people who make more than 20000 do not get. Usually, this kind of rule can be handled trivially in the analysis, but if you want to show that somebody should not have answered and not that they did not answer, then use a code such as “PM” for “purposely missing”. Even better would be to write “purposely missing”, so that somebody who is looking at your data three months down the road doesn’t have to expend a great deal of energy on interpreting what “PM” means.

Try to use a real database to store your data, and keep away from spreadsheets if you can. A real database can be coded so that all possible responses for a variable like “Race” are pre-coded, eliminating the chance of typos, which are certain to occur in spreadsheets.

Here’s something you don’t often get from those other textbooks, but which is a great truth. You will spend from 80 to 90% of your time in any statistical analysis just getting the data into a form readable by you and your software. This may sound like the kind of thing you often hear from teachers, while you think to yourself, “Ho, ho, ho. He has to tell us things like that just to give us something to worry about. But it’s a ridiculous exaggeration. I’ll either (a) spend 10-15% of my time, or (b) have somebody do it for me.” I am here to tell you that the answers to these are (a) there is no known way in the universe for this to be true, and (b) Ha ha ha!

2. Databases

The absolute best thing to do is to store your data in a database. I often use the free and open source MySQL (.com, of course). Knowing how to design, set up, and use such a database is beyond what most people want to do on their own. So most, at least for simple studies, opt for spreadsheets. These can be fine, though they are prone to error, usually typos. For instance, the codings “Y” and “Y ” might look the same to you, but they are different inside a computer: one has a space, one doesn’t. The computer thinks these are as different as “Q” and “W”. This kind of typo is extraordinarily common because you cannot see blank spaces easily on a computer screen. To see if you have suffered from it, after you get your data into R type levels(my variable name) and each of the levels, like “Y” and “Y ”, will be displayed. If you see something like this, you’ll have to go back to your spreadsheet and locate the offending entries and correct them.
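A small sketch of what the stray-space problem looks like in R, using made-up answers. The second half shows one possible repair inside R with trimws(); that is an alternative I am adding for illustration, not a replacement for correcting the spreadsheet itself.

```r
# Hypothetical answers; the second entry, "Y ", carries a trailing space.
ridiculous <- factor(c("Y", "Y ", "N", "Y"))
levels(ridiculous)     # three levels instead of two: "Y" and "Y " are kept apart

# One possible repair inside R: strip the stray whitespace, then rebuild the factor.
cleaned <- factor(trimws(as.character(ridiculous)))
levels(cleaned)        # only "N" and "Y" remain
```

Fixing the source file is still the safer habit, since anybody else who opens the spreadsheet will hit the same typo.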

A lot of overhead is built into spreadsheets. Most of it has to do with prettifying the rows and columns: bold headings, colored backgrounds, and so on. Absolutely none of this does anything for the statistical analysis, so we have to simplify the spreadsheet a bit.

The most common way to do this is to save the spreadsheet as a CSV file. CSV stands for Comma Separated Values. It means exactly what it says. The values from the spreadsheet are saved to an ordinary text file (ASCII file), and each column is separated by a comma. An example from one row from the dataset we’ll be using is

0,0,0,0,39,"black","male","Y",17.1,80,102.4,0

Note the clever insertion of commas between each value.

What this means is that you cannot actually use commas in your data. For example, you cannot store an income value as “10,000”; instead, you should use “10000”. Also note that there is no dollar sign.

Now, in some countries, where the tendrils of modern society have not yet reached, people unfortunately routinely use commas in place of decimal points. Thus, “3.42” written here is “3,42” written there. You obviously cannot save the latter in a CSV file because the computer will think that the comma in “3,42” is one of the commas that separates the values, which it is not. The way to overcome this without having to change the data is to change the delimiter to something other than a comma; perhaps a semicolon or a pound sign; any kind of symbol which you know won’t be in the regular data. For example, if you used an @ symbol, your CSV file would look like

0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0

The only trick will be figuring out how to do this. In Open Office, it’s particularly easy: after opening up the spreadsheet and selecting “Save As”, select the box “Edit Filter settings” and choose your own symbol instead of the default comma. A common mistake is to type an entry into, say, an Opinion variable, where a person’s exact words are the answer. Guard against using a comma in these words, else the computer will think you have extra variables: the computer thinks there is a variable between each pair of commas.
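On the R side, reading a file with a different delimiter is a matter of telling read.csv which symbol was used. Here is a minimal sketch using the one-row example from above; it passes the row as text rather than a file name so it can be run as is, and it assumes there is no header row. The last lines use read.csv2, which expects semicolons between values and commas as decimal marks; the two numbers in that example are made up.

```r
# The one-row example, delimited with @ instead of commas.
line <- '0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0'
row <- read.csv(text = line, sep = "@", header = FALSE)
ncol(row)      # 12 columns, named V1 through V12 because there is no header
row$V5         # 39

# Decimal commas with semicolons between the values (hypothetical numbers).
euro <- read.csv2(text = "3,42;1,5", header = FALSE)
euro$V1        # 3.42, read as an ordinary number
```

With a real file you would give the path or URL in place of the text argument and keep the same sep setting.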

3. Summaries

It’s finally time to play with real data. This is, in my experience, another panic point. But it need not be. Just take your time and follow each step. It is quite easy.

The first trick is to download the data onto your computer. Go to the book website, download the file appendicitis.csv, and save it somewhere on your hard disk in a place you can remember. The place where the file lives is called the path. That is, your hard drive has a sort of hierarchy, a map where the files are stored. If you are on a Windows machine, this is usually the C:/ drive (yes, the slash is backwards on purpose, because R thinks like a Linux computer, or Apple, which has the slashes the other way). Create your own directory, say, mydata (do not put a space in the name of the folder), and put the appendicitis file there. So the path to the file is C:/mydata/appendicitis.csv. Easy, right? If you are on a Linux or Mac, it’s the same idea. The path on a Mac is usually something like /Users/YOURNAME/mydata/appendicitis.csv. On a Linux box it might be /home/YOURNAME/mydata/appendicitis.csv. Simple!

Open R. Then type this exact command:

x = read.csv(url("http://wmbriggs.com/book/appendicitis.csv"))

There is a lot going on here, so let’s go through it step by step. Ignore the x = bit for a moment and concentrate on the part that reads read.csv(...). This built-in R function reads a CSV file. Well, what else would you have expected from its name? Inside that function is another one called url(), whose argument is the same thing you type into any web browser. The thing you type is called the URL, the Uniform Resource Locator, or web address. What we are doing is telling R to read a CSV file directly off the web. Pretty neat!

If you had saved the file directly to your hard drive, you would have loaded it like this

x = read.csv("C:/mydata/appendicitis.csv")

where you have to substitute the correct path, but otherwise it is just as easy.

The last thing to know is that when the CSV file is read in it is stored in R’s memory in the object I called x. R calls these objects data frames. Why didn’t they call them data sets? I have no idea. How did I know to use an x, why did I choose that name to store my data? No reason at all except habit. You can call the dataset anything you want. Call it mydata if you want. It just doesn’t matter.

Now type just x and hit enter. You’ll see all the data scroll by. Too much to look at, so let’s summarize it:

summary(x)

This is data taken on patients admitted to an emergency room with right lower quadrant pain (in the area the appendix is located) in order to find a model to better predict appendicitis (Birkhahn et al., 2006). Each of the variables was thought to have some bearing on this question. We’ll talk more about this data later. Right now, we’re just playing around. When we run the command we get the summary statistics for each variable in x. What it shows is the mean, which is just the arithmetic average of the data; the median, which is the point at which 50% of the data values are larger and 50% smaller; the 1st Qu., which is the first quartile and is the point at which 25% of the data values are smaller; and the 3rd Qu., which is the third quartile and is the point at which 75% of the data values are smaller (and 25% are larger, right?). Also given are the Min., which is the minimum value, and the Max., which is the maximum. Last is NA’s, which is the number, if any, of missing values. These kinds of statistics only show for data coded as numbers, i.e. numerical data. For data that is textual, also called categorical or factorial data, the first few levels of categories are shown with a count of the number of rows (observations) that are in that category.
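If you cannot get to the appendicitis file right now, the same behavior can be seen on a couple of made-up variables. This is only a sketch: the ages and sexes below are invented and have nothing to do with the real data.

```r
# Made-up numerical variable with one missing value.
age <- c(21, 34, 39, 45, 62, NA)
summary(age)      # Min., 1st Qu., Median, Mean, 3rd Qu., Max., and a count of NA's

# The quartiles on their own, ignoring the missing value.
quantile(age, c(0.25, 0.50, 0.75), na.rm = TRUE)

# Made-up categorical variable: summary() counts rows per category instead.
sex <- factor(c("male", "female", "female", "male", "female", "male"))
summary(sex)
```

Numbers get the six statistics plus the NA count; categories get a simple tally.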

You will notice that variables like Pregnancy are not categorical, but are numerical, which is why we see the statistics and not a category count. Pregnancy is a 0/1 variable and is technically categorical; however, like I said above, it is obvious that “0” means “not pregnant”, so there is no ambiguity. The advantage to storing data in this way is that the numerical mean is then the proportion of people having Pregnancy = 1 (think about this!).
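A quick sketch of why the mean of a 0/1 variable is a proportion, using ten invented codes rather than the real Pregnancy column:

```r
# Hypothetical 0/1 codes for ten patients (not the appendicitis data).
pregnancy <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

mean(pregnancy)                           # adds up the 1s and divides by 10
sum(pregnancy == 1) / length(pregnancy)   # the proportion counted by hand: the same number
```

Adding up 0s and 1s just counts the 1s, so dividing by the number of rows gives the fraction of rows equal to 1.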

Let’s just look at the variable Age for now. It turns out we can apply the summary function on individual variables, and not just on data frames. Inside the computer, the variable age is different than Age (why?). So try summary(Age). What happens? You get the error message Error in summary(Age) : object "Age" not found. But it’s certainly there!

You can read lots of different datasets into R at the same time, which is very convenient. I work on a lot of medical datasets and every one of them has the variable Age. How does R know which Age belongs to which dataset? By only recognizing one dataset at a time, through the mechanism of attaching the dataset directly to memory, to R’s internal search path. To attach a dataset, type

attach(x)

Yes, this is painful to remember, but necessary to keep different datasets separate. Anyway, try summary(Age) again (by using the up arrow on your keyboard to recall previously typed commands) and you’ll see it works.
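For completeness, R also lets you name the dataset each time instead of attaching it. The sketch below uses a tiny made-up data frame standing in for x; the $ and with() forms are standard R, offered here as alternatives and not as what the homework expects.

```r
# A tiny made-up data frame standing in for x (the real one comes from read.csv).
x <- data.frame(Age = c(23, 35, 41, 58), Vomiting = c(0, 1, 1, 0))

summary(x$Age)          # the $ picks the column Age out of the data frame x
with(x, summary(Age))   # with() looks Age up inside x for this one call only
```

Both give the same answer as attach(x) followed by summary(Age), and they make it explicit which dataset the Age came from when several are loaded.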

Incidentally, summary is one of those functions that you can always try on anything in R. You can’t break anything, so there is no harm in giving it a go.

May 22 2008

Stats 101: Chapter 6

Published by Briggs under Good Statistics, Philosophy

It was one of those days yesterday. I got two chapters up, but did not give anybody a way to get them! Here it is.

These are the last two “basics” Chapters. 6 first, and it is a little thin, so I’ll probably expand it later. It’s sort of a transition between probability where we know everything to statistics where we don’t. And by “everything” I mean the parameters of probability models. I want the reader to build up a little intuition before it starts to get rough.

The most important part of 6 is the homework, which I usually spend a lot of time with in class.

In a couple of days we start the good stuff. Book link.

CHAPTER 6

Normalities & Oddities

1. Standard Normal

Suppose x|m, s, EN ~ N(m, s); then there turns out to be a trick that can make x easier to work with, especially if you have to do any calculations by hand (which, nowadays, will be rarely). Let z = (x-m)/s, then z|m, s, EN ~ N(0, 1). It works for any m and s. Isn’t that nifty? Lots of fun facts about z can be found in any statistics textbook that weighs over 1 pound (these tidbits are usually in the form of impenetrable tables located in the back of the books).

What makes this useful is that Pr(z > 2|0, 1, EN) ≈ Pr(z > 1.96|0, 1, EN) = 0.025 and Pr(z < -2|0, 1, EN) ≈ Pr(z < -1.96|0, 1, EN) = 0.025: or, in words, the probability that z is bigger than 2 or less than negative 2 is about 0.05, which is a magic (I mean real voodoo) value in classical statistics. We already learned how to do this in R, last Chapter.

In Chapter 4, a homework question explained the rules of petanque, which is a game more people should play. Suppose the distance the boule lands from the cochonette is x centimeters. We do not know what x will be in advance, and so we (approximately) quantify our uncertainty in it using a normal distribution with parameters m = 0 cm and s = 10 cm. If x > 0 cm it means the boule lands beyond the cochonette, and if x < 0 cm it means the boule lands in front of the cochonette. You are out on the field playing, far from any computer, and the urge comes upon you to discover the probability that x > 30 cm. First thing to do is to calculate z, which equals (30 cm - 0 cm)/10 cm = 3 (the cm cancel). What is Pr(z > 3|0, 1, EN)? No idea; well, some idea. It must be less than 0.025, since we have all memorized that Pr(z > 2|0, 1, EN) ≈ 0.025. The larger z is, the more improbable it becomes (right?). Let’s say as a guess 1%. When you get home, you can open R and plug in 1-pnorm(3) and see that the actual probability is about 0.1%, so we were off by an order of magnitude (a power of 10), which is a lot, and which proves once again that computers are better at math than we are.
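The standardizing trick and the direct route give the same answer in R, which is a handy check on your arithmetic. A minimal sketch, using the petanque parameters from the text; the lower.tail = FALSE argument asks pnorm for the probability above the value rather than below it.

```r
m <- 0                       # central parameter, in cm
s <- 10                      # spread parameter, in cm

z <- (30 - m) / s            # standardize by hand: z = 3
1 - pnorm(z)                 # Pr(z > 3) on the standard normal

# The same probability without standardizing first.
pnorm(30, mean = m, sd = s, lower.tail = FALSE)
```

Both lines return a number a little above 0.001, which is the "about 0.1%" quoted above.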

2. Nonstandard Normal

The standard normal example is useful for developing your probabilistic intuition. Since normal distributions are used so often, we will spend some more time thinking about some consequences of using them. Doing this will give you a better feel for how to quantify uncertainty.

Below is a picture of two normal distributions. The one with the solid line has m1 = 0 and s1 = 1; the dashed line has m2 = 0.5 and also s2 = 1. In other words, the two distributions differ only in their central parameter; they have the same variance parameter. Obviously, large values are more likely according to distribution 2, and smaller values are more likely given distribution 1, as a simple consequence of m2 > m1. However, once we get to values of about x = 4 or so, it doesn’t look like the distributions are that different. (Cue the spooky music.) Or are they?

Under the main picture are two others. The one on the left is exactly like the main picture, except that it focuses only on the range of x = 3.5 to x = 5. If we blow it up like this, we can see that it is still more likely to see large values of x using distribution 2.

How much more likely? The picture on the right divides the probabilities of seeing x or larger with distribution 2 by distribution 1, and so shows how much more likely it is to see larger values with distribution 2 than 1. For example, pick x = 4. It is about 7.5 times more likely to see an x = 4 or larger with distribution 2. That’s a lot! By the time we get out to x = 5, we are 12 times more likely to see values this large with distribution 2. The point is that even very small changes in the central parameters lead to large differences in the probabilities of “extreme” values of x.

(see the book)
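Since the picture is not reproduced here, the ratio it plots can be computed directly. This is a sketch, not the code used for the book's figure; tail_ratio is a name I made up, and its defaults are the parameters given above (m1 = 0, m2 = 0.5, both with s = 1).

```r
# Pr(X >= x) under distribution 2 divided by Pr(X >= x) under distribution 1.
tail_ratio <- function(x, m1 = 0, m2 = 0.5, s = 1) {
  pnorm(x, m2, s, lower.tail = FALSE) / pnorm(x, m1, s, lower.tail = FALSE)
}

tail_ratio(4)                     # a bit over 7
tail_ratio(5)                     # close to 12
tail_ratio(seq(3.5, 5, by = 0.5)) # the ratio keeps growing as x moves outward
```

The ratio climbs steadily over the range 3.5 to 5, which is the pattern the right-hand panel is described as showing.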

This next picture again shows two different distributions, this time with m1 = m2 = 0 with s1 = 1 and s2 = 1.1. In other words, both distributions have the same central parameters, but distribution 2 has a variance parameter that is slightly larger. The normal density plots do not look very different, do they? The dashed line, which is still distribution 2, has a peak slightly under distribution 1’s, but the difference looks pretty small.

(see the book)

The bottom panels are the same as before. The one on the left blows up the area where x > 3.5 and x < 5. A big difference still exists. And the ratio of probabilities is still very large. It's not shown, but the plot on the right would be duplicated (or mirrored, actually) if we looked at x > -5 and x < -3.5. It is more probable to see extreme events in either direction (positive or negative) using distribution 2.
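The same kind of check works when only the spread differs. Again a sketch under the stated parameters (m = 0 for both, s1 = 1, s2 = 1.1); spread_ratio is a made-up helper name, and the last two lines look at the lower tail to show the mirroring.

```r
# Pr(X >= x) with s2 divided by Pr(X >= x) with s1, same central parameter.
spread_ratio <- function(x, s1 = 1, s2 = 1.1, m = 0) {
  pnorm(x, m, s2, lower.tail = FALSE) / pnorm(x, m, s1, lower.tail = FALSE)
}

spread_ratio(4)    # several times more likely under distribution 2
spread_ratio(5)    # and larger still further out

# Mirror image: Pr(X <= -4) under each distribution gives the same ratio.
pnorm(-4, 0, 1.1) / pnorm(-4, 0, 1)
```

A 10% bump in the spread parameter is enough to make values beyond 4 several times more probable, in both directions.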

The surprising consequence is that very small changes in either the central parameter or the variance parameter can lead to very large differences at the extremes. Examples of these phenomena are easily found in real life, but my heightened political sensitivity precludes me from publicly pointing any of these out.

3. Intuition

We have learned probability and some formal distributions, but we have not yet moved to statistics. Before we do so, let us try to develop some intuition about the kinds of problems and solutions we will see before getting to technicalities. There are a number of concepts that will be important, but I don’t want to give them a name, because there is no need to memorize jargon, while it is incredibly important that you develop a solid understanding of uncertainty.

The well-known Uncle Ted Nugent’s chain of Kill ‘em and Grill ‘em Venison Burger restaurants sells both Coke and Pepsi, and their internal audit shows they sell about an equal amount of each. The busy Times Square branch of the chain has about 5000 customers a day, while the store in tiny Gaylord, Michigan sees only about 100 customers. Which location is more likely to sell, on any given day, at least 2 times more Pepsi than Coke?

A useful technique for solving questions like this is exaggeration. For instance, the question is asking about a difference in location. What differs between those places? Only one thing, the number of customers. One site gets about 5000 people a day, the other only 100. Let’s exaggerate that difference and solve a simpler problem. For example, suppose Times Square still gets 5000 a day, but Gaylord only gets 1 a day. The information is that the probability of selling a Coke is roughly equal to the probability of selling a Pepsi. This means that, at Gaylord, to that 1 customer on that day, they will either sell 1 Coke or 1 Pepsi. If they sell a Pepsi, Gaylord has certainly sold more than 2 times as much Pepsi as Coke. The chance of that happening is 50%. What is two times as much Pepsi as Coke at Times Square? A lot more Pepsi, certainly. So it’s far more likely for Gaylord to sell a greater proportion of Pepsi because they see fewer customers. The lesson is that when the “sample size” is small, we are more likely to see extreme events.
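The exaggeration argument can be checked with numbers if you are willing to make some simplifying assumptions. In this sketch I assume every customer buys exactly one drink, Coke or Pepsi with probability 0.5 each, independently of everybody else, so the Pepsi count follows a binomial distribution. The function name p_twice_pepsi is made up for illustration.

```r
# Probability of selling at least twice as much Pepsi as Coke among n customers.
p_twice_pepsi <- function(n, p = 0.5) {
  # pepsi >= 2 * coke with coke = n - pepsi means pepsi >= 2n/3;
  # ceiling() gives the smallest whole number of Pepsis that qualifies.
  needed <- ceiling(2 * n / 3)
  # Pr(pepsi >= needed) is Pr(pepsi > needed - 1).
  pbinom(needed - 1, size = n, prob = p, lower.tail = FALSE)
}

p_twice_pepsi(1)      # 0.5: the exaggerated one-customer Gaylord
p_twice_pepsi(100)    # tiny, but it does happen now and then
p_twice_pepsi(5000)   # so small it may as well be zero
```

Under these assumptions the real Gaylord, with 100 customers, is still vastly more likely than Times Square to have a lopsided Pepsi day.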

What is the length of the first Chinese Emperor Qin Shi Huangdi’s nose? You don’t know? Well, you can make a guess. How likely is it that your guess is correct? Not very likely. Suppose that you decide to ask everybody you know to also guess, and then average all the answers together in an attempt to get a better guess. How likely is it that this averaged-guess is perfectly correct? No more likely. If you haven’t a clue about the nose, and nobody else does either, then averaging ignorance is no better than single ignorance. The lesson is that just because a large group of people agree on an opinion, it is not necessarily more probable that that opinion, or average of opinions, is correct. Uninformed opinion of a large group of people is not necessarily more likely to be correct than the opinion of the lone nut job on the corner. Think about this the next time you hear the results of a poll or survey.

You already possess other probabilistic intuition. For example, suppose, given some evidence E, the probability of A is 0.0000001 (A is something that might be given many opportunities to happen, e.g. winning the lottery). How often will A happen? Right. Not very often. But if you give A a lot of chances to occur, will A eventually happen? It’s very likely to.
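How many chances count as "a lot"? A rough sketch, assuming the opportunities are independent of one another (an assumption I am adding; the text does not say so). The numbers of chances are made up; the probability of A is the one given above.

```r
p <- 0.0000001                    # Pr(A | E) for a single opportunity
chances <- c(1, 1e6, 1e7, 1e8)    # hypothetical numbers of independent opportunities

# A happens at least once unless it fails every single time.
1 - (1 - p)^chances
```

With ten million chances A is already more likely than not to have happened at least once, and with a hundred million it is all but certain.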

Every player in petanque gets to throw three boules. What are the chances that I get all three within 5 cm? This is a compound problem, so let’s break it apart. How do we find out how likely it is to be within 5 cm of the cochonette? Well, that means the boule can be 5 cm in front of the cochonette, right near it, or up to 5 cm beyond it. The chance of this happening is Pr(-5 cm < x < 5 cm|m = 0 cm, s = 10 cm, EN). We learned how to calculate the probability of being in an interval last chapter:

pnorm(5,0,10)-pnorm(-5,0,10)

This equals about 0.38, which is the chance that one boule lands within 5 cm, in front of or beyond, the cochonette. What is the chance that all of them land that close? Well, that means the first one does and the second one and the third. What probability rule do we use now? The second, which tells us to multiply the probabilities together, which is 0.38^3 ≈ 0.055. The important thing to recall, when confronted with problems of this sort: do not panic. Try to break apart the complex problem into bite-size pieces.
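The whole compound calculation fits in two lines of R. This sketch assumes, as the multiplication rule requires, that the three throws do not influence one another.

```r
# Chance one boule lands within 5 cm of the cochonette (m = 0 cm, s = 10 cm).
one_boule <- pnorm(5, 0, 10) - pnorm(-5, 0, 10)
one_boule        # about 0.38

# Chance all three boules land that close: multiply the three probabilities.
one_boule^3      # about 0.056 (0.055 if you round to 0.38 before cubing)
```

So even for a player whose single throws land close more than a third of the time, three close throws in a row happen only about once in eighteen turns.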

May 22 2008

Comments restored

Published by Briggs under Bad statistics

Thanks to a hot tip from Lucia, over at the Diet Diary, I have become wiser about spam. I installed the wp-spamfree plug-in and we’ll see how that works.

OLD “I have been getting an enormous amount of spam over the past week (1000s of postings a day; all caught by the spam filter), so I am shutting off comments for 24 hours in the hope this will get me off some spam lists. Sorry for the inconvenience.”
