Summer 2010 — R: Basics

Content

  1. Intro
  2. Vectors and Assignment
    1. Vectors
    2. Assignment
    3. Vector Arithmetic
    4. Indexing
    5. Handy Vector Functions
  3. Logical Operators
    1. %in%
  4. Data Input-Output
    1. Input
    2. Output
  5. Data Frames
    1. Data Frame Indexing and Subsetting
  6. Factors
    1. levels()
    2. relevel()
    3. reorder()

Intro

The goal for this section will be to get you comfortable with futzing through R. For the most part, these notes are pulled together from Week 1 and Week 2 of the previous R Study Group notes. I'll only be going through this material for one week, so that we can get directly to the more advanced stuff.

To practice data input-output from your local computer, download and unzip the data from Keith Johnson's book Quantitative Methods in Linguistics, here [ZIP].

Vectors and Assignment

Vectors are the most basic data types you'll work with in R, and assignment of values to variables is the most basic operation you'll utilize.

Vectors

Vectors are essentially lists of data, and can contain characters, numbers, or TRUE/FALSE values, but not mixed types. There are a number of ways to create vectors in R, and data manipulation will frequently produce subvectors of data. The table below has some commands for creating vectors.

1:10
-- This produces a vector of integers from 1 to 10. Reversing the order of the numbers (10:1) will produce a vector of decreasing values.
c(...)
-- This produces a vector of whatever is passed as an argument to c(), e.g. c(1,2,3,4)
seq(from,to,...)
-- This produces a sequence of numbers either by a given increment or evenly spaced to a given length.
   seq(1,10,by=0.5)
   seq(1,10,length=11)
rep(x,...)
-- This produces a vector of repetitions of x by a given number of times.
   rep(1,6)
   rep(1:3,2)
   rep("hello world",4)

Mixed type vectors will automatically be simplified to the least restrictive type. Boolean logical values will be simplified to numerical 1's and 0's in a mixed vector with numbers, and everything will be simplified to character strings in a mixed vector with characters. This problem will usually need to be addressed when loading poorly formatted data into R. (For more marginal problems of dealing with integers vs. floating points, see here.)
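
Here is a minimal sketch of that simplification, using made-up values (the variable names are only for illustration). The last few lines show what tends to happen with poorly formatted data: one stray string turns a whole column of numbers into character strings.

```r
# Logical mixed with numeric: TRUE and FALSE become 1 and 0
mixed.num <- c(TRUE, FALSE, 3)
mixed.num
class(mixed.num)

# Anything mixed with a character string becomes a character string
mixed.chr <- c(1, TRUE, "a")
mixed.chr
class(mixed.chr)

# A stray "n/a" makes the whole vector character. Converting back to
# numbers gives NA for the value that can't be read as a number.
bad <- c("12", "15", "n/a")
fixed <- suppressWarnings(as.numeric(bad))
fixed
```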

Here's a tip for converting a vector of TRUE/FALSE values to 1's and 0's (sometimes people want to do this).

x <- sample(c(T,F), 20, replace = T)
x*1
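
A related sketch, with a short fixed vector in place of the random sample so the results are the same every time. Because TRUE counts as 1 and FALSE as 0, sum() and mean() give the number and the proportion of TRUE values.

```r
x <- c(TRUE, FALSE, TRUE, TRUE)

x * 1           # the trick above
as.numeric(x)   # the same conversion, spelled out

n.true <- sum(x)      # how many TRUE values
prop.true <- mean(x)  # what proportion are TRUE
n.true
prop.true
```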

Assignment

The assignment operator in R is <- or ->. The following are equivalent assignments:

x <- 1:10
1:10 ->x
The variable x now represents the vector of integers from 1 to 10. Variables in R are case sensitive, so if you assign something to x you have not assigned it to X.

Also note that functions and operations never directly modify variables. Below we discuss some vector arithmetic. The operation x*2 does not modify x, but rather produces output which can be assigned to a new variable, or back to the original variable. Without assigning the output of a function or operation to some variable, that output is not available in any way to further computation.

x
x*3
x*3->x
x

Vector Arithmetic

Just like you can carry out arithmetic functions on numbers, you can do the same to vectors. In fact, single numbers are treated as vectors of length 1, so all arithmetic is actually vector arithmetic. The arithmetic functions in R (even * /) are all element-wise.

x <- 1:10
x
x+1
x*2
x^2
x/3
log(x)
You can also do arithmetic operations on two multilength vectors. R will repeat the shorter vector so it is the same length as the longer vector.
x*x
x+x

y<-c(1,2)
x*y
If the length of the longer vector is not a multiple of the length of the shorter vector, you will still get output, but R will print a warning.
x<-rep(1,10)
y <- c(1,2,3)
x*y
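
To make the repetition easier to see, here is a small sketch with a shorter vector. The tryCatch() call is only there to capture the warning as text instead of printing it. The vector names are made up for the example.

```r
x <- 1:6
y <- c(1, 2)

# y is repeated three times: 1*1, 2*2, 3*1, 4*2, 5*1, 6*2
recycled <- x * y
recycled

# 6 is not a multiple of 4, so R warns about the mismatch
z <- c(1, 2, 3, 4)
msg <- tryCatch(x * z, warning = function(w) conditionMessage(w))
msg
```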

Indexing

Elements in a vector are indexed, and can be referenced by their index. Note: in R, unlike many languages, indexing begins with 1.

LETTERS
LETTERS[1]
LETTERS[1:4]
LETTERS[c(1,3,5)]

You can also access subvectors using a vector of TRUE or FALSE values as an index. Usually, this TRUE/FALSE vector will be the same length as the vector it is indexing.

t.f<- rep(F,26)
t.f
LETTERS[t.f]
t.f[3]<- T
LETTERS[t.f]

x<-1:10
x
x>5
x[x>5]

If you use a vector of Boolean values as an index which is shorter than the vector it is indexing, R will repeat it so it is the same length.

x[c(T,F)]
x[c(F,T)]
As might be expected, any vector of Boolean values, if preceded by ! will produce a vector of the opposite values.
t.f
!t.f
LETTERS[t.f]
LETTERS[!t.f]

Handy Vector Functions

  • Numeric Vectors
    • sum()
    • mean()
    • sd() (standard deviation)
    • max()
    • min()
    • median()
    • range()
  • Any Vector
    • rev() (reverse)
    • unique() (list of unique vector values)
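
A quick sketch of these functions on made-up numbers. The vectors v and w are invented for the example.

```r
v <- c(4, 8, 15, 16, 23, 42)

sum(v)
mean(v)
sd(v)       # standard deviation
max(v)
min(v)
median(v)
range(v)    # the minimum and the maximum together

rev(v)      # the same values in reverse order

w <- c("a", "b", "a", "b", "b")
unique(w)   # each distinct value once
```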

Logical Operators

The following operators will return a vector of TRUE and FALSE values.

==    exactly equal to
!=    not equal to
>     greater than
>=    greater than or equal to
<     less than
<=    less than or equal to

Multilength vectors behave the same for these logical operators as for arithmetic operators.

x <- 1:10
y <- x+1

x < y
x == x

The following operators work on vectors of T and F values, or on logical statements that produce them.

!x
-- not x. All T become F, and vice versa.

x|y
-- x or y

   x  y  x|y
   T  T  T
   T  F  T
   F  T  T
   F  F  F

x&y
-- x and y

   x  y  x&y
   T  T  T
   T  F  F
   F  T  F
   F  F  F
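
These operators are most useful for combining conditions inside an index. A short sketch:

```r
x <- 1:10

x > 3 & x < 8               # TRUE only where both conditions hold
middle <- x[x > 3 & x < 8]  # values between 3 and 8
edges  <- x[x < 3 | x > 8]  # values at either end
small  <- x[!(x > 3)]       # values that are not greater than 3

middle
edges
small
```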

%in%

I found out about the %in% operator too late. In x %in% y, it matches every element of the x vector against the y vector, and returns TRUE for x[i] if x[i] is also a member of y. The result is a logical vector the same length as x.

3 %in% 1:10
1:10 %in% 3
-3:3 %in% 1:10

(-3:3)[-3:3 %in% 1:10]
(-3:3)[!(-3:3 %in% 1:10)]
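
The same idea works for character vectors. Here is a sketch with made-up segment labels, picking out the vowels:

```r
segments <- c("p", "a", "t", "i", "k", "u")
vowels   <- c("a", "e", "i", "o", "u")

is.vowel <- segments %in% vowels
is.vowel              # one TRUE or FALSE per element of segments

segments[is.vowel]    # just the vowels
segments[!is.vowel]   # everything else
```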

Data Input-Output

Here, we'll talk about how to load data into R, and how to write data back out to a data file. When loading a data file into R, you are just loading it into the R workspace. Any alterations or modifications you make to the data frame will not be reflected in the file in your system, just in the copy in the R workspace.

Input

For the most part, data that will be loaded into R will be in the form of data tables. Ideally, they will be plain text, and delimited either by tabs or other whitespace, or by commas (.csv files).

The two most likely commands to be used will be read.delim() or read.csv(). read.delim() is used to read in files delimited by tabs, and read.csv() for files delimited by commas. Both of these functions assume that the first line is a header. If this is not the case, then set the option header = F. If the file you are looking at is delimited neither by tabs nor by commas, you should try read.table() (which the first two are based on) with the option sep given the appropriate delimiter. As always see ?read.table for more info on all of these functions.

These read.table commands will read in the data as a data frame, a particular data structure in R that we'll talk about below. Be sure to assign the output of read.table() to a variable, or else it will just print the file, and that's no good to no one.

For example, we could read in the data in cherokeeVOT from Johnson 2008 (here again [ZIP]) as follows.

vot <- read.delim("PATH/cherokeeVOT.txt")
#or
vot <- read.table("PATH/cherokeeVOT.txt",sep = "\t",header = T)
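
For a file that has neither tabs nor commas, and no header line, a sketch might look like the following. It writes a tiny made-up file to a temporary location first, so that it runs without any downloads. The semicolon delimiter and the values are assumptions for the example, not the format of the Johnson data.

```r
# Write three made-up lines to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("67;1971;k", "127;1971;k", "106;2001;t"), tmp)

# No header in the file, so say so and supply the column names
toy <- read.table(tmp, sep = ";", header = FALSE,
                  col.names = c("VOT", "year", "Consonant"))
toy
```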

On Windows machines, the file delimiter to use in R is still "/", even though if you copy the path, it will be delimited by "\". That's because "\" is the escape character inside R strings. If you're copying a path into the prompt with the "\" delimiter, change them all to "/" (or double each one to "\\") to successfully load the file.
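
A small sketch of the options for writing a Windows path in R. The path is made up, and nothing is actually read from it.

```r
# A made-up Windows path, written three ways
p.forward <- "C:/data/cherokeeVOT.txt"
p.doubled <- "C:\\data\\cherokeeVOT.txt"   # each \\ is a single backslash
p.built   <- file.path("C:", "data", "cherokeeVOT.txt")

# file.path() joins the pieces with "/"
p.built == p.forward

# Turn a backslash path into a forward slash one
p.converted <- gsub("\\", "/", p.doubled, fixed = TRUE)
p.converted == p.forward
```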

There is also the very handy file.choose() function, which will open up a file browser for you to navigate to the file.

vot <- read.delim(file.choose())

I'm not actually a huge fan of the file.choose() function. If you mess up somewhere along the line, it makes it harder to rerun your code. If you insist on using the file.choose() function, I would suggest assigning the output of file.choose() to a variable first.

votPATH <- file.choose()
vot <- read.delim(votPATH)

Output

If you want to save a data frame that you've edited or enriched, use write.csv() or write.table(). As the name suggests, write.csv() writes the data frame to a comma separated file. Note, if one of your variables has a comma in it (like "City,State"), do not write to a csv with quoting turned off, because it will ruin your life. I personally don't like a few of the default settings to these commands, so here's what I change:

sep
-- If using write.table(), I set this to "\t", which writes a tab delimited file. The default is a single space: " "
quote
-- I set this to F. By default, character strings, and the character labels for factors will be enclosed in quotes. I don't like this so much, and it makes it harder to read by eye or for some other script.
row.names
-- I set this to F. By default, it will write a column for the row names in the data frame. These are usually meaningless, just being the row number.
The final command will usually look something like this:

write.table(vot,file = "PATH/vot.txt",sep = "\t",quote = F,row.names = F)

Again, on a Windows machine, use "/" even though the standard file delimiter is "\".
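
To check what these settings actually produce, here is a sketch that writes a small made-up data frame to a temporary file and reads it back. The data frame toy and the temporary path are stand-ins for your own data and PATH.

```r
toy <- data.frame(VOT = c(67, 127, 106),
                  year = c(1971, 1971, 2001),
                  Consonant = c("k", "k", "t"))

out.path <- tempfile(fileext = ".txt")
write.table(toy, file = out.path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# The raw lines: tab delimited, no quotes, no row names
raw.lines <- readLines(out.path)
raw.lines

# Reading it back gives the same rows and columns
toy2 <- read.delim(out.path)
toy2
```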

Data Frames

Data frames are the format in which you are most likely to work with data in R.

Let's focus on the Cherokee VOT data from Johnson 2008. To see how many rows and columns it has, use nrow() and ncol().

ncol(vot)
nrow(vot)

There are 3 variables coded for 44 observations in the data. To see the column names, which are the same as the variable names here, use colnames().

colnames(vot)

To see the first 6 or last 6 observations, use head() and tail().

head(vot)
  VOT year Consonant
1  67 1971         k
2 127 1971         k
3  79 1971         k
4 150 1971         k
5  53 1971         k
6  65 1971         k
tail(vot)
   VOT year Consonant
39 106 2001         t
40  54 2001         t
41  49 2001         t
42  56 2001         t
43  58 2001         t
44  97 2001         t

To get 5 number summaries of numerical data in a data frame and counts for factors, use summary(). (summary() is a general purpose function, and generates unique and interesting output for many different objects.)

summary(vot)
      VOT              year      Consonant
 Min.   : 45.00   Min.   :1971   k:21
 1st Qu.: 71.50   1st Qu.:1971   t:23
 Median : 81.50   Median :2001
 Mean   : 96.45   Mean   :1989
 3rd Qu.:120.25   3rd Qu.:2001
 Max.   :193.00   Max.   :2001

Data Frame Indexing and Subsetting

Indexing in data frames can work almost exactly like it does in matrices. To see the first 6 rows of a data frame, you can use head() -or- data[1:6,]

vot[1:6,]

Likewise, you can see a column either by using $, or by giving the column index or name.

vot[ ,"VOT"]
head(vot[ ,c("VOT","Consonant","year")])
head(vot[ ,c("year","VOT","Consonant")])
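
Here is a sketch of the different ways to get at a single column, using a small made-up data frame with the same column names as vot. All four give back the same vector.

```r
toy <- data.frame(VOT = c(67, 127, 106),
                  year = c(1971, 1971, 2001),
                  Consonant = c("k", "k", "t"))

by.dollar <- toy$VOT        # using $
by.name   <- toy[ , "VOT"]  # using the column name
by.index  <- toy[ , 1]      # using the column index
by.double <- toy[["VOT"]]   # double brackets also work

by.dollar

# Asking for more than one column gives back a data frame
toy[ , c("year", "VOT")]
```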

Frequently, you might want to pull out subsets of the data frame, or subsets of variables based on certain conditions. There are a few ways to do this. First, you could generate a vector of boolean values, and use this as an index for the data frame.

vot$VOT > 100
vot[vot$VOT > 100,]
vot[vot$Consonant == "t",]

If you just want to see variable values for a given factor level, there are at least two ways to do this.

vot[vot$year == 1971,]$VOT
#or
vot$VOT[vot$year == 1971]

The first selects a subset of the data frame, then takes the VOT variable from that subset. The second takes the variable first, then takes a subset of that. I'm not sure if one approach is more efficient, more common, or preferable to the other.

You could also use the subset() function. It takes a data frame as its first argument, a conditional expression as its second, and an optional select argument, which takes a vector of variable names to return.

subset(vot,VOT>100 & Consonant == "k")
subset(vot,VOT>100 & Consonant == "k",select = c("VOT","year"))

subset(vot,VOT>100 & Consonant == "k")$VOT
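
As a sketch, here are subset() and a boolean index side by side on a small made-up data frame. On data like this, with no missing values, both pick out the same rows.

```r
toy <- data.frame(VOT = c(67, 127, 150, 106, 54),
                  year = c(1971, 1971, 1971, 2001, 2001),
                  Consonant = c("k", "k", "k", "t", "t"))

# With subset(), column names can be used directly
a <- subset(toy, VOT > 100 & Consonant == "k")

# With a boolean index, each column needs the toy$ prefix
b <- toy[toy$VOT > 100 & toy$Consonant == "k", ]

a
b
```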

If you want to save time by not needing to do the $ reference for a data frame you're working with a lot, you can use the attach() command, which will allow you to directly reference the variable names. Note, however, that any type conversions you make to a data frame's variable will not be reflected in the attached version. Once you've finished using the data frame, you can remove it from the search path with detach().
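
The caveat about changes not being reflected in the attached version is easy to trip over, so here is a sketch of it with a small made-up data frame. The conversion from milliseconds to seconds is just an example of a change made after attaching.

```r
toy <- data.frame(VOT = c(67, 127, 106),
                  Consonant = c("k", "k", "t"))

attach(toy)
mean(VOT)                    # VOT can be referenced directly

# Change the data frame itself (milliseconds to seconds)
toy$VOT <- toy$VOT / 1000

attached.mean <- mean(VOT)      # still the old values
frame.mean    <- mean(toy$VOT)  # the new values
attached.mean
frame.mean

detach(toy)                  # remove it from the search path
```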

Factors

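For the most part, if we're using data which has some variable coded as a character string, it is actually a factor with specific levels. (Since R 4.0.0, read.table() and its relatives leave strings as character vectors unless you set stringsAsFactors = TRUE.)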

A good way to grasp the structure of factors is to build one from scratch. Say we had collected class data in a study, and coded lower working class, upper working class, and lower middle class with numeric codes, 1, 2, and 3, respectively. First, let's simulate a sample of 10 speakers. sample() will simulate such a sample. See ?sample for more details.

class <- sample(1:3, 10, replace = T)
class

Your numbers will probably be different. Right now, class is a numerical vector. The factor() function will convert it to a factor vector.

factor(class)

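We can also add meaningful labels to the factor levels. (Giving levels = 1:3 keeps this working even if one of the classes happens to be missing from your sample.)

class.f<-factor(class,levels = 1:3,labels = c("LWC","UWC","LMC"))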
class.f

In essence, the factor vector has a representation like this:

Vector    Dictionary
3         1 = "LWC"
1         2 = "UWC"
1         3 = "LMC"
2
3
3
3
2
3
3

Compare the following printed output:

class
class.l<-c("LWC","UWC","LMC")
class.l
class.l[class]
class.f
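
You can also pull the two parts of a factor apart directly. This sketch uses fixed codes in place of sample(), so the output is the same every time. as.integer() gives the vector of codes, and levels() gives the dictionary.

```r
class <- c(3, 1, 1, 2, 3)
class.f <- factor(class, levels = 1:3,
                  labels = c("LWC", "UWC", "LMC"))

codes <- as.integer(class.f)   # the vector of codes
dict  <- levels(class.f)       # the dictionary
codes
dict

# Looking each code up in the dictionary rebuilds the labels
dict[codes]
as.character(class.f)
```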

The factor vector that we built is unordered. That is, the factor levels define groups rather than an ordering. It is possible to also define ordered factors, which will strongly affect model fitting. I haven't used ordered factors much myself, so I don't really understand how they work (but I should).
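
For what it's worth, here is a small sketch of building an ordered factor with the ordered = TRUE argument of factor(). It only shows printing and comparison, not model fitting, and the three values are made up.

```r
class.o <- factor(c("UWC", "LWC", "LMC"),
                  levels = c("LWC", "UWC", "LMC"),
                  ordered = TRUE)

class.o           # the levels are printed with < between them
is.ordered(class.o)

# Comparisons follow the order of the levels
below.lmc <- class.o < "LMC"
below.lmc
highest <- as.character(max(class.o))
highest
```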

levels()

The levels() command accesses the levels of a factor. It will either produce a character vector of the current levels, or allow you to edit the levels by assigning a new character vector to it.

levels(class.f)
old.levels<-levels(class.f)
new.levels <- c("LowerWorkingClass","UpperWorkingClass","LowerMiddleClass")
levels(class.f)<-new.levels
class.f

relevel()

Don't use levels() to try to change the order of levels. Here's a demonstration of what will happen:

x <- factor(1:3, labels = c("one","two","three"))
x
levels(x) <- c("two","one","three")
x

All we've done is change the labels of each level, which ends up scrambling their meaningful labels. If we really wanted "two" to be the first level (this will be important for fitting regression models), we should use relevel().

x <- factor(1:3, labels = c("one","two","three"))
x
x <- relevel(x, "two")
x

relevel() has reordered the levels, making "two" the first one, but has kept the labels for each level the same.
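
Putting the two approaches side by side makes the difference clearer. In this sketch, bad and good are made-up names for the two results.

```r
x <- factor(1:3, labels = c("one", "two", "three"))

# Assigning to levels() relabels the observations
bad <- x
levels(bad) <- c("two", "one", "three")

# relevel() changes the level order and leaves the observations alone
good <- relevel(x, "two")

as.character(bad)    # the first two observations are now mislabeled
as.character(good)   # the observations are unchanged
levels(good)         # but "two" is now the first level
```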

reorder()

There is also a pretty nifty function called reorder() which will reorder the levels of a factor according to their values on some other dimension.

x <- as.factor(c(rep("A",10),rep("B",10)))
value <- c(rnorm(10,mean = 1), rnorm(10, mean = -1))
tapply(value, x, mean)

x <- reorder(x, value, mean)
tapply(value, x, mean)
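
Since rnorm() gives different numbers on every run, here is the same idea as a sketch with fixed, made-up values, so you can check the result by hand. The group means are A = 6, B = 2 and C = 9.

```r
g <- factor(c("A", "A", "B", "B", "C", "C"))
score <- c(5, 7, 1, 3, 8, 10)

levels(g)                 # alphabetical to start with
tapply(score, g, mean)

# Order the levels by each group's mean score, lowest first
g.up <- reorder(g, score, mean)
levels(g.up)

# Negating the scores reverses the order, highest first
g.down <- reorder(g, -score, mean)
levels(g.down)
```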