MA 213-B Fall 2010

This page is designed to be a tutorial of sorts which covers some of the basic tasks one can accomplish with R. Much more complete documentation can be found here. You are invited to read it, search for more help online, and try to come up with your own solutions to the homework before looking at the examples I provide below. Don't just copy the entire program without making any changes: your apparent lack of effort will make you look bad, the program may not do exactly what the homework assignment asks, and, most importantly, you won't learn anything.

To download files, right-click them and choose "save link as..." or something like that.

Data Files

HW 1

HW 2

HW 3

HW 8

Setting Up R

R should work in any GNU/Linux flavor, OS X, and Windows. In all cases, there is a very nice GUI which allows you to edit script files and run selected portions of them. In fact, you should not have to use the command line at all, unless you really want to. Nothing works better than a hands-on demonstration, so feel free to bring your computer to class or stop by my office if you need help to get started.

One of the common pitfalls is being unable to find the data file. If this is happening to you, make sure that R's current working directory (GUI has an option for that) is set to wherever you put your data.

It also came to my attention that OS X may sometimes save text files in UTF-16, which poses problems for R. If you are trying to read a file, and the file is found, but the data seems to be ALL wrong, make sure you save the text file as UTF-8 plain text.

Data Files

fake-relative-data.txt - data file for HW 1.

UMprofsalary.csv

WPOWER50.csv

NZBIRDS.csv

HW 1 Project

R will happily eat the plain text data, so that is what I am going to give it. I saved my (fake) data as a plain text file with rows separated by new lines and columns separated by commas. You can download and use fake-relative-data.txt to continue with this tutorial.

Vanilla R comes with two command line tools: R, the interactive interpreter and Rscript, a command which makes it easier to run scripts and produce the output ready for printing. The way I use R, I place my data and my scripts in a single directory and then run R and Rscript from there. Here I will proceed as if I started an interactive session.

Before I do anything else, I have to read the data:

x <- read.table("fake-relative-data.txt", sep=",", header=TRUE)

Here x is the name I gave to the data table. R calls these tables "data frames" and thinks of them as of lists of vertical columns. The columns, in turn, are one-dimensional arrays or vectors. <- is an assignment operator; the name on the left will take the value on the right. The function read.table() reads a file and returns the data found within. The options passed to it specify, respectively, the name of the plain text data file relative to the working directory, the list of characters which separate data columns, and whether or not there is a header row present in the file.

I can check that the data was loaded by typing

Calling an object by name causes R to print its contents, so the line above should print the entire table.

My next task is to create a column with age categories. I could create it as a new vector, but here it makes more sense to add it to the existing frame x. I start by preparing two vectors: one with the age intervals and another one with the text labels. To make a vector, I call the function c() which creates a vector out of its arguments.

age_classes <- c(0,1,12,19,29,69,Inf)
age_class_names <- c("baby", "child", "teen", "young adult", "adult", "senior")

The way age_classes is defined, it will correspond to intervals $(0, 1], (1, 12], …, (69, ∞]$ . The new column with age classes can be created now.

x$age.class <- cut(x$age, age_classes, age_class_names)

x$age.class above refers to the column of x named age.class. Since no such column exists in the data frame x, R creates one automagically. The function cut() takes the age column from x and creates a new vector with categories given in age_class_names, according to intervals defined by age_classes.

Now I can display a pie chart for the newly created column:

pie(table(x$age.class), main="age categories, all data points")

To display the pie chart for just males, I need to select just the rows where x$gender is male. Here is one way to do it, by creating a new data frame:

just_males <- x[x$gender == "male",]
pie(table(just_males$age.class), main="age categories, males only")

x[x$gender == "male",] tells R to select all rows of x where the gender is male, and all columns. This is just one of the ways to select a subset of entries in a data frame, and it's a pretty fancy one. Less complicated examples are

x[,4], the 4th column,
x[,], the entire frame,
x[2:4,], rows 2 through 4, all columns
x[2,4], the entry in the 2nd row, 4th column.

In order to save all this hard work for later use, I can put all of these instructions (and more) in a file.

hw-1.R

I could run this file with Rscript, or I could load it into GUI, select everything and execute. Since you are probably going to do the latter, I prepared the script so that it saves the graphics into a local pdf file.

HW 2

Exercise 1

Get UMprofsalary.csv. R will read it if you say

x <- read.csv("UMprofsalary.csv")

There are 4 columns in that file, and I can see their names by saying

colnames(x)

To save time and effort, I also say

attach(x)

and now I can refer to columns in the table x by their names, so I can call the fourth column pay2007 instead of x$pay2007.

To draw a histogram of the salary I can say

hist(pay2007, breaks=12, xlab="salary", main="umass professors' salary in 2007", col="pink")

and it will produce a histogram with about 12 pink bars and the appropriate labels.

The mean and the median of a data column can be displayed in a variety of ways.

mean(pay2007)
median(pay2007)
summary(pay2007)

To finish it off, I need to produce a graphical display which compares the salaries for different ranks of professors. Without reading the whole table, I can make R to display all the ranks present in the data:

table(jobtitle)

And here is one neat way to produce the required side-by-side display:

boxplot(pay2007 ~ jobtitle, names=c("assoc. prof.","assist. prof.","prof."), ylab="salary", main="professors' salary by rank")

Unlike many other graphical displays, boxplot understands this fancy formula (pay2007 ~ jobtitle) and produces lined up salary box plots, one for each job title category. The rest is just labeling.

Exercise 2.18

This is the file: NZBIRDS.csv, you can read it as usual.

x <- read.csv("NZBIRDS.csv")
attach(x)

Display the extinction pie chart:

pie(table(Extinct), main="extinction proportion")

Before we proceed, let's look how by() function works. This, for example, will break up x into two tables according to flight capability and display summary for each:

by(x, Flight, summary)

This will break up the table x by habitat and compute the mean egg length for each habitat:

by(x, Habitat, function(z) mean(as.numeric(z$Egg.Length)))

The assignment asks to display extinction data by flight capability. I begin by preparing the graphical device by telling R to display plots on a grid with one row and two columns.

par(mfrow=c(1,2))

The following command will take care of two plots at the same time (it only looks so complicated because I really try to label everything nicely):

by(x, Flight, function(z) pie(table(z$Extinct), main="extinction proportion", sub=sprintf("flight: %s", z$Flight[1])))

OK, let's do the same, but for the habitat. There are three different habitat categories, so I will use a two-by-two grid.

par(mfrow=c(2,2))
by(x, Habitat, function(z) pie(table(z$Extinct), main="extinction proportion", sub=sprintf("habitat: %s", z$Habitat[1])))

Professor Ray pointed out even better ways to compare several sets of data: a side-by-side barplot and a stacked barplot. It's a much easier way to script, too. You can do this immediately after reading and attaching the frame x. This will create a frequency table for both the extinction status and the habitat. Print it and you will see what I mean:

z <- table(Extinct, Habitat)
z

And this you can plot as a stacked barplot

barplot(z, legend=rownames(z), col=c("skyblue","orange","limegreen"))

or as a side-by-side barplot

barplot(z, legend=rownames(z), beside=T, col=c("plum","orchid","royalblue"))

HW 3

There are just a few things thing you need to know to grok this homework. Given a data column named foo, you can display the standard deviation with

sd(foo)

and something like the 20th percentile with

quantile(foo, 0.2)

Additionally, if your column has NA values (like the NZ birds file does in Egg.Length) then a lot of statistics will fail to compute unless you tell R how to deal with NA. You can remove them by saying, for example,

quantile(foo, 0.2, na.rm=T)

HW 8

NZBIRDS.csv is back, and now we need to sample 50 out of 130-ish egg lengths and compute some sample statistics. After we loaded the file with

x <- read.csv("NZBIRDS.csv")
attach(x)

we should first get rid of NA values (there are two of them in this data set)

y <- na.omit(Egg.Length)

Now y is a list of egg lengths, with NA entries omitted. To get a pseudo-random sample of size 50 without replacement, use

s <- sample(y, 50)

Note that our code can be made much more compact if we are to give up some readability:

s <- sample(na.omit(read.csv("NZBIRDS.csv")$Egg.Length), 50)

To compute the confidence interval for the population mean, we can find the standard error, which is the radius of the confidence interval. For example, for a 90% confidence interval ( $α = 0.1$ ) we need the Z score which cuts off 95% of the distribution on the left. Below, we are literally computing $Z_{α / 2} \frac{s}{\sqrt{n}}$

st.err <- qnorm(0.95)*sd(s)/sqrt(length(s))

The interval itself is, of course, mean(s) $\pm$ st.err.

MA 213-B Fall 2010

Table of Contents

Setting Up R

Data Files

HW 1 Project

HW 2

Exercise 1

Exercise 2.18

HW 3

HW 8