Update #2. I moronically uploaded a blank document. I have no idea how. It’s all better now.
Update. I idiotically forgot to put a link. Here it is.
Chapter 7 is Reality. This is usually Chapter 1 in most intro stats books. Those other books invariably start students with topics like “measures of central tendency” and “kinds of experiments” etc. Nothing necessarily wrong with any of this, but the student usually has no idea why he should care about “central tendency” in the first place. Why memorize formulas for means and (population or other) standard deviations? What use are these things in understanding how to quantify uncertainty?
So I put these topics off until the reader realizes that understanding uncertainty is paramount. The whole chapter is nuts and bolts about how to read data into R and do some elementary manipulations. Like Chapter 5, it’s not thrilling reading, but necessary. The homework for 7 asks readers to download a set of R functions at http://wmbriggs.com/book/Rcode.R, but it’s not there yet because I’m still polishing the code.
Some of the formatting is off in the LaTeX source, but I won’t fix that until I’m happy with the final text. No pictures are here; all are in the book.
1. Kinds of data
Somewhere, sometime, somehow, somebody is going to ask you to create some kind of data set (that time is sooner than you think; see the homework). Here is an example of such a set, written as you might see it in a spreadsheet (a good, free open-source spreadsheet is Open Office, www.openoffice.org):
This data is part of a survey asking people their favorite colors (Q1), while recording their sex, annual income, the number of sub-occipital nodules on their brain, and whether or not the interviewee thought the subject ridiculous. There is a lot we can learn from this simple fragment.
The first is to always use full, readable, English names for the variables. What about Q1, which was indeed the first question on the survey? Why not just call it “Q1”? “Q1” is a lot easier to type than “favorite color”. Believe me, two weeks after you store this data, you will not, no matter how much you swear you will, remember that Q1 was favorite color. Neither will anybody else. And nobody will be able to guess that Q1 means favorite color.
Can you suggest a better name? How about “favcol”, which has fewer letters than “favorite color”, and is therefore easier to type? What are you, lazy? You can’t type a few extra letters to save yourself a lot of grief later on?
How about just “favorite color”? Well, not so good either. Why? Because of that space between “favorite” and “color”; most software cannot handle spaces in names. Alternatives are to put an underscore or a period between the words: “favorite_color” or “favorite.color”. Some people like to cram the words together camel style, like “favoriteColor” (the occasional bump of capital letters is supposed to look like a camel: I didn’t name it). Whichever style you choose, be consistent! In any case, nobody will have any trouble understanding that “favoriteColor” means “favorite color”.
Notice, too, that the colors entered under “Q1” use the full English name for the color. Spaces are OK in the actual data, just not in variable names: for example, “burnt orange” is fine. Do not do what many sad people do and use a code for the colors. For example, 1=taupe, 2=envy green, 3=fuchsia, etc. What are you trying to do with a code anyway? Hide your work from Nazi spies? Never use codes.
That goes for variables like “Sex”, too. I cannot tell you how many times I have opened up a data set where I have seen Sex coded as “1” and “2”, or “0” and “1”. How can anybody remember which number was which sex? They cannot. And there is no reason to. With data like this, abbreviation is harmless. Nobody, except for the politically correct, will confuse the fact that “M” means male and “F” female. But if you are worried about it, then type out the whole thing.
Similarly for “Ridiculous”, where I have used the abbreviation “Y” for yes and “N” for no. Sometimes a “0” and “1” for “N” and “Y” are acceptable. For example, in the data set we’ll use in a moment, “Vomiting” is coded that way. And, after all, 0/1 is the binary no/yes of computer language, so this is OK. But if there is the least chance of ambiguity for a data value, type the whole answer out. Do not be lazy; you will be saving yourself time later.
It should be obvious, but store numbers as numbers. Height, weight, income, age, etc., etc. Do not use any symbols with the numbers. Store a weight as “213” and not “213 lbs”. If you are worried you will forget that weight is in pounds, name the variable Weight.LBS or something similar.
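To see why the unit has to stay out of the cell, here is a minimal sketch using two made-up one-column files (the values and the name Weight.LBS are only for illustration). R treats the column with “lbs” in it as text, so no arithmetic is possible until somebody cleans it up.

```r
# Hypothetical weights, typed with and without the unit in the cell.
with_units <- read.csv(text = "Weight\n213 lbs\n180 lbs")
as_numbers <- read.csv(text = "Weight.LBS\n213\n180")

is.numeric(with_units$Weight)       # FALSE: "213 lbs" is stored as text
is.numeric(as_numbers$Weight.LBS)   # TRUE: plain numbers stay numbers
mean(as_numbers$Weight.LBS)         # 196.5
```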
What if one of your interviewees refused to answer a question? This will often happen for questions like “Income”. How should you code that? Leave his answer blank! For God’s sake, whatever you do, do not think you are being clever and put in some mystery code that, to you, means “missing.” I have seen countless times where somebody thought that putting in a “99” or a “999” for a missing income was a good idea. The computer does not know that 999 means “missing”; it thinks it is just what it looks like: the number 999. So when you compute an average income, that 999 becomes part of the average. Also don’t use a period, the full stop. That’s a holdover from an ancient piece of software (that some people are still forced to use).
There are times when an answer is purposely missing, and a blank should not be used. For example, if “Income” is less than 20000, then the interviewee gets an extra question that people who make more than 20000 do not get. Usually, this kind of rule can be handled trivially in the analysis, but if you want to show that somebody should not have answered and not that they did not answer, then use a code such as “PM” for “purposely missing”. Even better would be to write “purposely missing”, so that somebody who is looking at your data three months down the road doesn’t have to expend a great deal of energy on interpreting what “PM” means.
Try to use a real database to store your data, and keep away from spreadsheets if you can. A real database can be coded so that all possible responses for a variable like “Race” are pre-coded, eliminating the chance of typos, which are certain to occur in spreadsheets.
Here’s something you don’t often get from those other textbooks, but which is a great truth. You will spend from 80 to 90% of your time in any statistical analysis just getting the data into a form readable for you and your software. This may sound like the kind of thing you often hear from teachers, while you think to yourself, “Ho, ho, ho. He has to tell us things like that just to give us something to worry about. But it’s a ridiculous exaggeration. I’ll either (a) spend 10-15% of my time, or (b) have somebody do it for me.” I am here to tell you that the answers to these are (a) there is no known way in the universe for this to be true, and (b) Ha ha ha!
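A small sketch of the damage a mystery code does, using four invented incomes (not from any real survey). The argument na.rm = TRUE tells mean to skip missing values; a blank cell in a numeric column of a CSV file is read by R as NA, its marker for “missing”.

```r
# Hypothetical incomes; the third respondent declined to answer.
income_coded <- c(42000, 58000, 999, 61000)   # missing answer hidden behind a 999 code
income_blank <- c(42000, 58000, NA, 61000)    # missing answer left blank (NA in R)

mean(income_coded)                 # 40499.75: the 999 is silently averaged in
mean(income_blank, na.rm = TRUE)   # averages only the three real answers

# A blank cell in a numeric CSV column arrives in R as NA.
survey <- read.csv(text = "Sex,Income\nM,42000\nF,58000\nM,\nF,61000")
is.na(survey$Income)
```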
The absolute best thing to do is to store your data in a database. I often use the free and open source MySQL (.com, of course). Knowing how to design, set up, and use such a database is beyond what most people want to do on their own. So most, at least for simple studies, opt for spreadsheets. These can be fine, though they are prone to error, usually typos. For instance, the codings “Y” and “Y ” might look the same to you, but they are different inside a computer: one has a space, one doesn’t. The computer thinks these are as different as “Q” and “W”. This kind of typo is extraordinarily common because you cannot see blank spaces easily on a computer screen. To see if you have suffered from it, after you get your data into R type levels(my variable name) and each of the levels, like “Y” and “Y ” will be displayed. If you see something like this, you’ll have to go back to your spreadsheet and locate the offending entries and correct them.
A lot of overhead is built into spreadsheets. Most of it has to do with prettifying the rows and columns: bold headings, colored backgrounds, and so on. Absolutely none of this does anything for the statistical analysis, so we have to simplify the spreadsheet a bit.
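Here is a rough sketch of hunting for the stray-space typo, with five invented answers. One caution: since R version 4.0, read.csv keeps text columns as plain character strings by default, and levels() returns NULL on those, so the sketch wraps the variable in factor() first; table() works either way. Cleaning with trimws() inside R is an alternative to fixing the spreadsheet, not a replacement for knowing the typo is there.

```r
# Hypothetical answers; the second entry is "Y " with a trailing space.
ridiculous <- c("Y", "Y ", "N", "Y", "N")

table(ridiculous)            # "Y" and "Y " are counted as separate categories
levels(factor(ridiculous))   # three levels where there should be two

# trimws() is base R and strips leading and trailing whitespace.
cleaned <- trimws(ridiculous)
levels(factor(cleaned))      # back to "N" and "Y"
```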
The most common way to do this is to save the spreadsheet as a CSV file. CSV stands for Comma Separated Values. It means exactly what it says. The values from the spreadsheet are saved to an ordinary text file (ASCII file), and each column is separated by a comma. An example from one row from the dataset we’ll be using is
Note the clever insertion of commas between each value.
What this means is that you cannot actually use commas in your data. For example, you cannot store an income value as “10,000”; instead, you should use “10000”. Also note that there is no dollar sign.
Now, in some countries, where the tendrils of modern society have not yet reached, people unfortunately routinely use commas in place of decimal points. Thus, “3.42” written here is “3,42” written there. You obviously cannot save the latter in a CSV file because the computer will think that comma in “3,42” is one of the commas that separates the values, which it is not. The way to overcome this without having to change the data is to change the delimiter to something other than a comma; perhaps a semicolon or a pound sign; any kind of symbol which you know won’t be in the regular data. For example, if you used an @ symbol, your CSV file would look like
The only trick will be figuring out how to do this. In Open Office, it’s particularly easy: after opening up the spreadsheet and selecting “Save As”, select the box “Edit Filter settings” and choose your own symbol instead of the default comma. A common mistake is to type an entry into, say, an Opinion variable, where a person’s exact words are the answer. Guard against using a comma in these words, or else the computer will think you have extra variables: the computer thinks there is a variable between each comma.
It’s finally time to play with real data. This is, in my experience, another panic point. But it need not be. Just take your time and follow each step. It is quite easy.
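Once the file is saved with a different delimiter, R has to be told about it. A minimal sketch, assuming two tiny invented files (the column names and values are made up): the sep argument of read.csv names the delimiter and dec names the decimal mark.

```r
# Hypothetical file using ";" between values and "," as the decimal mark.
semi_text <- "Age;Temperature\n34;37,2\n51;38,6"
patients <- read.csv(text = semi_text, sep = ";", dec = ",")
patients$Temperature   # read as the numbers 37.2 and 38.6

# The same idea with an "@" delimiter and ordinary decimal points.
at_text <- "Age@Temperature\n34@37.2\n51@38.6"
patients_at <- read.csv(text = at_text, sep = "@")
patients_at
```

With a real file you would give the path in place of the text argument.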
The first trick is to download the data onto your computer. Go to the book website and download the file appendicitis.csv and save it somewhere on your hard disk in a place you can remember. The place where it is stored is called the path. That is, your hard drive has a sort of hierarchy, a map of where the files are stored. If you are on a Windows machine, this is usually the
C:/ drive (yes, the slash is backwards on purpose, because R thinks like a Linux computer, or Apple, which has the slashes the other way). Create your own directory, say, mydata (do not put a space in the name of the folder), and put the appendicitis file there. So the path to the file is
C:/mydata/appendicitis.csv. Easy, right? If you are on a Linux or Mac, it’s the same idea. The path on a Mac is usually something like
/Users/YOURNAME/mydata/appendicitis.csv. On a Linux box it might be
Open R. Then type this exact command:
x = read.csv(url("http://wmbriggs.com/book/appendicitis.csv"))
There is a lot going on here, so let’s go through it step by step. Ignore the
x = bit for a moment and concentrate on the part that reads
read.csv(...). This built-in R function reads a CSV file. Well, what else would you have expected from its name? Inside that function is another one called
url(), whose argument is the same thing you type into any web browser. The thing you type is called the URL, the Uniform Resource Locator, or web address. What we are doing is telling R to read a CSV file directly off the web. Pretty neat!
If you had saved the file directly to your hard drive, you would have loaded it like this
x = read.csv("C:/mydata/appendicitis.csv")
where you have to substitute the correct path, but otherwise it is just as easy.
The last thing to know is that when the CSV file is read in it is stored in R’s memory in the object I called x. R calls these objects data frames. Why didn’t they call them data sets? I have no idea. How did I know to use an x, why did I choose that name to store my data? No reason at all except habit. You can call the dataset anything you want. Call it mydata if you want. It just doesn’t matter.
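If you want to practice the read-it-in step without touching the real file, here is a self-contained sketch. It writes a tiny invented CSV (three made-up rows, nothing to do with the appendicitis data) to a temporary folder, reads it back under the name mydata, and asks a few basic questions of the resulting data frame.

```r
# Write a small hypothetical CSV to a temporary file so the example runs anywhere.
path <- file.path(tempdir(), "example.csv")
writeLines(c("Age,Sex,Temperature",
             "34,M,37.2",
             "51,F,38.6",
             "27,F,"), path)

mydata <- read.csv(path)   # any valid name works in place of x

class(mydata)   # "data.frame"
dim(mydata)     # number of rows, then number of columns
names(mydata)   # variable names, taken from the header row
head(mydata)    # just the first few rows
```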
Now type just
x and hit enter. You’ll see all the data scroll by. Too much to look at, so let’s summarize it:
summary(x)
This is data taken on patients admitted to an emergency room with right lower quadrant pain (in the area the appendix is located) in order to find a model to better predict appendicitis (Birkhahn et al., 2006). Each of the variables was thought to have some bearing on this question. We’ll talk more about this data later. Right now, we’re just playing around. When we run the command we get the summary statistics for each variable in x. What it shows is the mean, which is just the arithmetic average of the data; the median, which is the point at which 50% of the data values are larger and 50% smaller; the 1st Qu., which is the first quartile and is the point at which 25% of the data values are smaller; and the 3rd Qu., which is the third quartile and is the point at which 75% of the data values are smaller (and 25% are larger, right?). Also given is the Min., which is the minimum value, and the Max., which is the maximum. Last is NA’s, which is the number, if any, of missing values. These kinds of statistics only show for data coded as numbers, i.e. numerical data. For data that is textual, also called categorical or factorial data, the first few levels of categories are shown with a count of the number of rows (observations) that are in that category.
You will notice that variables like Pregnancy are not categorical, but are numerical, which is why we see the statistics and not a category count. Pregnancy is a 0/1 variable and is technically categorical; however, like I said above, it is obvious that “0” means “not pregnant”, so there is no ambiguity. The advantage to storing data in this way is that the numerical mean is then the proportion of people having Pregnancy =1 (think about this!).
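If thinking about it has not done the trick, a quick check with ten invented 0/1 values shows why: adding up the ones counts the “yes” answers, and dividing by the number of rows turns that count into a proportion.

```r
# Hypothetical 0/1 indicator for ten patients (1 = pregnant, 0 = not).
pregnancy <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)

mean(pregnancy)                      # 0.3
sum(pregnancy) / length(pregnancy)   # the same thing: 3 ones out of 10 rows
```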
Let’s just look at the variable Age for now. It turns out we can apply the summary function on individual variables, and not just on data frames. Inside the computer, the variable age is different than Age (why?). So try
summary(Age). What happens? You get the error message
Error in summary(Age) : object "Age" not found. But it’s certainly there!
You can read lots of different datasets into R at the same time, which is very convenient. I work on a lot of medical datasets and every one of them has the variable Age. How does R know which Age belongs to which dataset? By only recognizing one dataset at a time, through the mechanism of attaching the dataset directly to memory, to R’s internal search path. To attach a dataset, type
attach(x)
Yes, this is painful to remember, but necessary to keep different datasets separate. Anyway, try
summary(Age) again (by using the up arrow on your keyboard to recall previously typed commands) and you’ll see it works.
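For the curious, here is a sketch of the same idea with two invented data frames that both contain an Age. Besides attaching, base R offers two other ways to say which dataset you mean: the $ operator and the with() function. The names study_a and study_b are made up for the illustration.

```r
# Two hypothetical data frames that both contain a variable called Age.
study_a <- data.frame(Age = c(34, 51, 27, 62))
study_b <- data.frame(Age = c(8, 11, 9))

summary(study_a$Age)          # $ names the data frame explicitly
with(study_b, summary(Age))   # with() looks Age up inside study_b only

attach(study_a)   # put study_a on R's search path
mean(Age)         # Age is now found: 43.5
detach(study_a)   # take it off again before attaching another dataset
```

Detaching when you are done keeps a later Age from being picked up from the wrong dataset.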
Incidentally, summary is one of those functions that you can always try on anything in R. You can’t break anything, so there is no harm in giving it a go.
The number one, unalterable rule that you must obey when beginning work with a new dataset is always look at the data first! Too many people forget this rule to their ultimate embarrassment.
The summary() function is easy and gives you information on the distribution of your data in text. But it’s usually easier to see what’s going on with pictures. The visual equivalents of summary are boxplot, hist, and table. Let’s do a boxplot first; it’s easy:
boxplot(Age)
The y-axis shows the values of Age. The center line on the boxplot is the median, and the outer edges of the box are the first and third quartiles. By default in R, the lines (whiskers) extend to the most extreme data values that lie within 1.5 times the height of the box from its edges. Any values beyond the whiskers are drawn as individual dots.
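You do not have to take the picture on faith. The numbers R uses to draw a boxplot can be inspected with boxplot.stats(), as in this sketch with ten invented ages, one of them unusually large.

```r
# Hypothetical ages, including one unusually old patient.
age <- c(21, 25, 28, 30, 33, 35, 38, 41, 44, 90)

quantile(age, c(0.25, 0.5, 0.75))   # quartiles and median, as in summary()

# boxplot.stats() returns the five numbers the plot is built from
# (lower whisker, lower edge of box, median, upper edge of box, upper whisker)
# and, in $out, the values drawn as individual dots.
b <- boxplot.stats(age)
b$stats
b$out   # 90

boxplot(age, ylab = "Age")
```

The box edges reported by boxplot.stats() can differ slightly from the quartiles printed by quantile(), because the two functions use different conventions for small samples.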
Next up is
hist(Age), which tries to do exactly the same thing as boxplot, which is to give you a visual summary of the range and likelihood of various data values.
You can’t do a boxplot on data like Race, because that variable is categorical. Instead, do a table by
table(Race) to get a count of each category. This is OK, but just gives the counts when frequently you want the proportions. To get those, you have to make a table of the table (yes, this is a pain):
prop.table(table(Race))
plot is another one of those commands, like summary, that you can always try on anything. It never hurts and you can’t break anything.
I originally included these plots in the book so you could see them, but I decided against doing this to guard against your laziness in the homework. Do these commands yourself!
5. Extra: Advanced topics
Temperature is one of the variables. You can try the summary command on it and it works just fine. Sometimes you only want the mean and don’t need all the other business, so you can use the function
mean(Temperature). Try it and you get
 NA. What gives? Do a
summary(Temperature) and you’ll see that there are 7 missing values. The function mean is too stupid to give you a mean in the presence of missing values. In a way, this is a good thing, because it forces you to recall that you have an incomplete dataset, and that should give you pause. Why are the values missing? It could be important. You can get around the missing values by typing
mean(Temperature, na.rm=T), which says take the mean, and remove (
rm) the missing (
na) values. The
=T means TRUE (you could also type the whole word out as TRUE; use capitals). The mean will then be computed. R is wonderful, but sometimes the way it handles missing values is a pain in the ass.
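The same behavior can be reproduced on a toy vector (five invented readings, two of them missing), which makes it easier to see what each piece does.

```r
# Hypothetical temperatures with two missing readings.
temperature <- c(37.2, NA, 38.6, 36.9, NA)

mean(temperature)                 # NA: one missing value makes the whole mean missing
sum(is.na(temperature))           # 2: how many readings are missing
mean(temperature, na.rm = TRUE)   # mean of the three observed values
```

Writing TRUE in full is a little safer than T, since T is an ordinary variable that can be overwritten.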
A back-of-the-envelope drawing that you can make by hand is called a stem-and-leaf plot: it does not require you to first sort your data, but you do have to discover the minimum and maximum values. In R it is
stem(Age)
Histograms and boxplots are very old, were wonderful in their day, and in some cases (discrete data) are just the thing, but we can do better with numbers that are better approximated as continuous (see Chapter 4), like Age. For those, use a density estimate, which is, in a sense, an automated superior histogram. To do this in R type
plot(density(Age))
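To compare the two pictures side by side, here is a sketch on simulated ages (made up with rnorm purely so the example runs on its own; they are not the appendicitis data). It draws the density estimate by itself, then lays it over a histogram of the same numbers.

```r
# Hypothetical ages, simulated only so the example is self-contained.
set.seed(1)
age <- round(rnorm(200, mean = 40, sd = 12))

d <- density(age)   # kernel density estimate with R's default settings
plot(d, main = "Density estimate of Age", xlab = "Age")

# freq = FALSE puts the histogram on the same scale as the density curve.
hist(age, freq = FALSE, main = "Histogram with density overlay", xlab = "Age")
lines(d)
```

If the variable has missing values, density() will complain; density(age, na.rm = TRUE) gets around that.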
You can assign the output of any function to a new variable, created by you. So, if you want to store the table for Race, type
fit = table(Race), where I chose the name fit for no good reason. All the table results are now in fit. To see it, just type fit. This makes getting proportions easier because you can now
prop.table(fit). You could also
plot(prop.table(table(Sex))) or any categorical variable; try
pairs(x, panel=panel.smooth) and see what happens.
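Putting the table pieces together, a minimal sketch with ten invented category labels (the letters stand in for whatever categories your variable has):

```r
# Hypothetical categorical variable with three categories.
race <- c("A", "B", "A", "C", "A", "B", "A", "C", "B", "A")

fit <- table(race)   # counts per category, stored for reuse
fit                  # A: 5, B: 3, C: 2
prop.table(fit)      # counts divided by the total, so they sum to 1

round(100 * prop.table(fit), 1)   # the same thing as percentages

plot(prop.table(fit), ylab = "Proportion")
```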