Summer 2010 — R: Basics
Content
Intro
The goal for this section will be to get you comfortable with futzing through R. For the most part, these notes are pulled together from Week 1 and Week 2 of the previous R Study Group notes. I'll only be going through this material for one week, so that we can get directly to the more advanced stuff.
To practice data input-output from your local computer, download and unzip the data from Keith Johnson's book Quantitative Methods in Linguistics. Here[ZIP]
Vectors and Assignment
Vectors are the most basic data types you'll work with in R, and assignment of values to variables is the most basic operation you'll utilize.
Vectors
Vectors are essentially lists of data, and can contain characters, numbers, or TRUE/FALSE values, but not mixed types. There are a number of ways to create vectors in R, and frequently doing data manipulation will produce subvectors of data. The table below has some commands for creating vectors.
Command | What it does |
---|---|
1:10 | Produces a vector of integers from 1 to 10. Reversing the order of the numbers (10:1) produces a vector of decreasing values. |
c() | Produces a vector of whatever is passed as an argument to c(). |
seq() | Produces a sequence of numbers either by a given increment or evenly spaced to a given length, e.g. seq(1,10,length=11). |
rep() | Produces a vector of repetitions of its first argument, e.g. rep(1:3,2) or rep("hello world",4). |
Mixed type vectors will automatically be simplified to the least restrictive type. Boolean logical values will be simplified to numerical 1's and 0's in a mixed vector with numbers, and everything will be simplified to character strings in a mixed vector with characters. This problem will usually need to be addressed when loading poorly formatted data into R. (For more marginal problems of dealing with integers vs. floating points, see here.)
Here's a tip for converting a vector of TRUE and FALSE values to 1's and 0's: multiply it by 1.
x <- sample(c(T,F), 20, replace = T)
x*1
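The simplification of mixed types described above can be checked directly with class(). This is only a small sketch; the variable names are made up for illustration.

```r
# Mixing types in c() converts everything to the most general type present.
lgl_num   <- c(TRUE, FALSE, 3)     # logical + numeric   -> numeric
num_chr   <- c(1, 2, "three")      # numeric + character -> character
all_three <- c(TRUE, 1, "a")       # all three types     -> character

class(lgl_num)     # "numeric"
lgl_num            # 1 0 3
class(num_chr)     # "character"
num_chr            # "1" "2" "three"
all_three          # "TRUE" "1" "a"
```

If a column of numbers unexpectedly shows up as character strings after loading a file, a stray non-numeric entry in that column is a likely cause.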
Assignment
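The assignment operator in R is <-. The arrow can also be written pointing the other way (->), as long as it points at the variable being assigned to.
x <- 1:10
1:10 ->x
The variable x now holds the integers from 1 to 10.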
Also note that functions and operations never directly modify variables. Below we discuss some vector arithmetic. The operation x*3, for example, leaves x unchanged unless its result is assigned back to x.
x
x*3
x*3->x
x
Vector Arithmetic
Just like you can carry out arithmetic functions on numbers, you can do the same to vectors. In fact, single numbers are treated as vectors of length 1, so all arithmetic is actually vector arithmetic. The arithmetic functions in R (even log()) are applied to each element of a vector.
x <- 1:10
x
x+1
x*2
x^2
x/3
log(x)
You can also do arithmetic operations on two multilength vectors. R will repeat the shorter vector so it is the same length as the longer vector.
x*x
x+x
y<-c(1,2)
x*y
If the length of the longer vector is not a multiple of the length of the shorter vector, you will still get output, but R will print a warning.
x<-rep(1,10)
y <- c(1,2,3)
x*y
Indexing
Elements in a vector are indexed, and can be referenced by their index. **Note** In R, unlike many languages, indexing begins with 1.
LETTERS
LETTERS[1]
LETTERS[1:4]
LETTERS[c(1,3,5)]
You can also access subvectors using a vector of TRUE and FALSE values as the index.
t.f<- rep(F,26)
t.f
LETTERS[t.f]
t.f[3]<- T
LETTERS[t.f]
x<-1:10
x
x>5
x[x>5]
If you use a vector of Boolean values as an index which is shorter than the vector it is indexing, R will repeat it so it is the same length.
x[c(T,F)]
x[c(F,T)]
As might be expected, any vector of Boolean values, if preceded by !, will be negated.
t.f
!t.f
LETTERS[t.f]
LETTERS[!t.f]
Handy Vector Functions
- Numeric Vectors: sum(), mean(), sd() (standard deviation), max(), min(), median(), range()
- Any Vector: rev() (reverse), unique() (list of unique vector values)
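As a quick illustration of the functions listed above, here they are applied to a small vector. The values in v are made up.

```r
v <- c(4, 8, 15, 16, 23, 42, 4)   # made-up values, with one repeat

sum(v)       # 112
mean(v)      # 16
median(v)    # 15
sd(v)        # standard deviation
range(v)     # 4 42 (minimum and maximum together)
rev(v)       # the same values in reverse order
unique(v)    # 4 8 15 16 23 42 (the repeated 4 is dropped)
```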
Logical Operators
The following operators will return a vector of TRUE and FALSE values.
Operator | Meaning |
---|---|
== | exactly equal to |
!= | not equal to |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
Multilength vectors behave the same for these logical operators as for arithmetic operators.
x <- 1:10
y <- x+1
x < y
x == x
These operate over vectors of TRUE and FALSE values.
- ! : not
- & : and
- | : or
- all() : TRUE only if every value is TRUE
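Here is a small sketch of how the logical operators combine vectors of TRUE and FALSE values: ! negates, & and | work element by element, and all() and any() collapse a whole vector to a single value. The names small and big are only for illustration.

```r
x <- 1:10
small <- x < 4          # TRUE for 1, 2, 3
big   <- x > 8          # TRUE for 9, 10

x[small | big]          # 1 2 3 9 10
x[!(small | big)]       # 4 5 6 7 8
x[x > 2 & x < 6]        # 3 4 5
all(x > 0)              # TRUE
any(x > 10)             # FALSE
```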
%in%
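I found out about the %in% operator, which checks whether each element of the vector on its left occurs anywhere in the vector on its right.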
3 %in% 1:10
1:10 %in% 3
-3:3 %in% 1:10
(-3:3)[-3:3 %in% 1:10]
(-3:3)[!(-3:3 %in% 1:10)]
Data Input-Output
Here, we'll talk about how to load data into R, and how to write data back out to a data file. When loading a data file into R, you are just loading it into the R workspace. Any alterations or modifications you make to the data frame will not be reflected in the file in your system, just in the copy in the R workspace.
Input
For the most part, data that will be loaded into R will be in the form of data tables. Ideally, they will be plain text, and delimited either by tabs or other whitespace, or by commas (.csv files).
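The two most likely commands to be used will be read.delim(), for tab-delimited files, and read.csv(), for comma-delimited files.
These are both versions of read.table() with the delimiter and header arguments already filled in.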
For example, we could read in the data in cherokeeVOT from Johnson 2008 (here again [ZIP]) as follows.
vot <- read.delim("PATH/cherokeeVOT.txt")
#or
vot <- read.table("PATH/cherokeeVOT.txt",sep = "\t",header = T)
On Windows machines, use "/" as the path separator inside R, even though a copied path will be delimited by "\". R treats "\" inside a string as an escape character, so if you're copying a path into the prompt with the "\" delimiter, either change them all to "/" or double each one ("\\") to successfully load the file.
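If you would rather not retype the path by hand, the slashes can be swapped inside R. This is a minimal sketch; the path is hypothetical, and note that each "\" has to be doubled to type it inside an R string at all.

```r
# Hypothetical Windows path. Inside an R string, one backslash is typed as "\\".
win_path <- "C:\\Users\\me\\data\\cherokeeVOT.txt"

# Replace every backslash with a forward slash (fixed = TRUE: no regular expressions).
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path   # "C:/Users/me/data/cherokeeVOT.txt"
```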
There is also the very handy file.choose(), which opens a file browser and returns the path to the file you select.
vot <- read.delim(file.choose())
I'm not actually a huge fan of the one-line version. Saving the path to a variable first keeps a record of which file was loaded:
votPATH <- file.choose()
vot <- read.delim(votPATH)
Output
If you want to save a data frame that you've edited or enriched, use write.table(). These are the arguments I usually set:
- sep: If using write.table(), I set this to "\t", which writes a tab-delimited file. The default is a single space: " "
- quote: I set this to F. By default, character strings and the character labels for factors will be enclosed in quotes. I don't like this so much, and it makes the file harder to read by eye or for some other script.
- row.names: I set this to F. By default, it will write a column for the row names in the data frame. These are usually meaningless, just being the row number.
write.table(vot,file = "PATH/vot.txt",sep = "\t",quote = F,row.names = F)
Again, on a Windows machine, use "/" even though the standard file delimiter is "\".
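To see what these settings do without touching your own files, here is a small round trip, offered as a sketch. It uses write.table() from base R, a temporary file in place of "PATH/vot.txt", and a toy data frame whose column names follow the VOT data but whose values are invented.

```r
# Toy data frame: invented values, not the Johnson 2008 data.
toy <- data.frame(VOT = c(67, 127, 79),
                  year = c(1971, 1971, 2001),
                  Consonant = c("k", "k", "t"))

out_file <- tempfile(fileext = ".txt")   # stands in for "PATH/vot.txt"
write.table(toy, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

readLines(out_file)            # a header line, then one tab-delimited line per row
back <- read.delim(out_file)   # read the file back in
dim(back)                      # 3 3
```

With quote = FALSE and row.names = FALSE the file contains nothing but the column names and the values, which is what most other programs expect.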
Data Frames
Data Frames are the format you are most likely to access data from in R.
Let's focus on the Cherokee VOT data from Johnson 2008. To see how many rows and columns it has, use
ncol(vot)
nrow(vot)
There are 3 variables coded for 44 observations in the data. To see the column names, which are the same as the variable names here, use
colnames(vot)
To see the first 6, or last 6 observations, use
head(vot)
tail(vot)
To get 5 number summaries of numerical data in a data frame and counts for factors, use
summary(vot)
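If you don't have the data file to hand, the same functions can be tried on a data frame built inside R. This is only a sketch: the column names follow the VOT data, but the four rows are invented.

```r
# Toy stand-in for vot: invented values, not the Johnson 2008 data.
toy <- data.frame(VOT = c(67, 127, 79, 150),
                  year = c(1971, 1971, 2001, 2001),
                  Consonant = factor(c("k", "k", "t", "t")))

nrow(toy)        # 4
ncol(toy)        # 3
colnames(toy)    # "VOT" "year" "Consonant"
head(toy, 2)     # first two rows only
summary(toy)     # numeric summaries for VOT and year, counts for Consonant
```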
Data Frame Indexing and Subsetting
Indexing in data frames can work almost exactly like it does in matrices. To see the first 6 rows of a data frame, you can use
vot[1:6,]
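Likewise, you can see a column either by using vot$VOT or by giving its name as the column index. A vector of names selects several columns at once, in whatever order you list them.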
vot[ ,"VOT"]
head(vot[ ,c("VOT","Consonant","year")])
head(vot[ ,c("year","VOT","Consonant")])
Frequently, you might want to pull out subsets of the data frame, or subsets of variables based on certain conditions. There are a few ways to do this. First, you could generate a vector of Boolean values, and use this as an index for the data frame.
vot$VOT > 100
vot[vot$VOT > 100,]
vot[vot$Consonant == "t",]
If you just want to see variable values for a given factor level, there are at least two ways to do this.
vot[vot$year == 1971,]$VOT
#or
vot$VOT[vot$year == 1971]
The first selects a subset of the data frame, then takes the VOT variable from that subset. The second takes the variable first, then takes a subset of that. I'm not sure if one approach is more efficient, more common, or preferable to the other.
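Whatever their relative efficiency, the two approaches return the same values, which can be checked with identical(). The sketch below uses an invented toy data frame in place of vot.

```r
# Toy stand-in for vot: invented values.
toy <- data.frame(VOT = c(67, 127, 79, 150),
                  year = c(1971, 1971, 2001, 2001))

a <- toy[toy$year == 1971, ]$VOT   # subset the data frame, then take the column
b <- toy$VOT[toy$year == 1971]     # take the column, then subset it

a                  # 67 127
identical(a, b)    # TRUE
```

The second form only ever indexes a single vector, so it never builds an intermediate data frame.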
You could also use subset().
subset(vot,VOT>100 & Consonant == "k")
subset(vot,VOT>100 & Consonant == "k",select = c("VOT","year"))
subset(vot,VOT>100 & Consonant == "k")$VOT
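To make the connection between subset() and Boolean indexing concrete, the sketch below selects the same rows both ways from an invented toy data frame. Inside subset() the column names can be used bare, without the toy$ prefix.

```r
# Toy stand-in for vot: invented values.
toy <- data.frame(VOT = c(67, 127, 79, 150),
                  year = c(1971, 1971, 2001, 2001),
                  Consonant = c("k", "k", "t", "k"))

keep <- toy$VOT > 100 & toy$Consonant == "k"   # Boolean index, one value per row
by_index  <- toy[keep, ]
by_subset <- subset(toy, VOT > 100 & Consonant == "k")

all.equal(by_index, by_subset)   # TRUE
by_subset$VOT                    # 127 150
```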
If you want to save time by not needing to do the
Factors
For the most part, if we're using data which has some variable coded as a character string, it is actually a factor with specific levels.
A good way to grasp the structure of factors is to build one from scratch. Say we had collected class data in a study, and coded lower working class, upper working class, and lower middle class with numeric codes 1, 2, and 3, respectively.
First, let's simulate a sample of 10 speakers.
class <- sample(1:3, 10, replace = T)
class
Your numbers will probably be different. Right now, class is just a vector of integers. To turn it into a factor, use factor().
factor(class)
We can also add meaningful labels to the factor levels.
class.f<-factor(class,labels = c("LWC","UWC","LMC"))
class.f
In essence, the factor vector has a representation like this:
Vector | Dictionary |
---|---|
3 | 1 = "LWC" |
1 | 2 = "UWC" |
1 | 3 = "LMC" |
2 | |
3 | |
3 | |
3 | |
2 | |
3 | |
3 |
Compare the following printed output
class
class.l<-c("LWC","UWC","LMC")
class.l
class.l[class]
class.f
The factor vector that we built is unordered. That is, the factor levels define groups rather than an ordering. It is possible to also define ordered factors, which will strongly affect model fitting. I haven't used ordered factors much myself, so I don't really understand how they work (but I should).
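For what it's worth, here is a minimal sketch of what an ordered factor looks like. It only shows the ordering itself, not its effect on model fitting, and the four values are made up.

```r
# ordered = TRUE makes the order of the levels meaningful: LWC < UWC < LMC.
class.o <- factor(c("LWC", "LMC", "UWC", "LMC"),
                  levels = c("LWC", "UWC", "LMC"),
                  ordered = TRUE)

class.o                # prints with Levels: LWC < UWC < LMC
is.ordered(class.o)    # TRUE
class.o < "LMC"        # TRUE FALSE TRUE FALSE
max(class.o)           # LMC
```

Comparisons like < and functions like max() are defined for ordered factors, but give an error or NA for the unordered kind.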
levels()
The levels() function returns the levels of a factor, and can also be used to rename them.
levels(class.f)
old.levels<-levels(class.f)
new.levels <- c("LowerWorkingClass","UpperWorkingClass","LowerMiddleClass")
levels(class.f)<-new.levels
class.f
relevel()
Don't use levels() to change the order of a factor's levels.
x <- factor(1:3, labels = c("one","two","three"))
x
levels(x) <- c("two","one","three")
x
All we've done is change the labels of each level, which ends up scrambling their meaningful labels. If we really wanted "two" to be the first level (this will be important for fitting regression models), we should use relevel().
x <- factor(1:3, labels = c("one","two","three"))
x
x <- relevel(x, "two")
x
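The difference between the two is easiest to see by looking at the values as character strings afterwards. In this sketch (with made-up data and illustrative names relabelled and releveled), assigning to levels() changes what the data say, while relevel() only changes which level comes first.

```r
x <- factor(c("one", "two", "three", "two"),
            levels = c("one", "two", "three"))

relabelled <- x
levels(relabelled) <- c("two", "one", "three")   # renames the levels in place
as.character(relabelled)    # "two" "one" "three" "one" (the data have changed)

releveled <- relevel(x, "two")                   # moves "two" to the front
as.character(releveled)     # "one" "two" "three" "two" (the data are intact)
levels(releveled)           # "two" "one" "three"
```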
reorder()
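There is also a pretty nifty function called reorder(), which reorders the levels of a factor according to a summary of some other variable, here the mean of value within each level.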
x <- as.factor(c(rep("A",10),rep("B",10)))
value <- c(rnorm(10,mean = 1), rnorm(10, mean = -1))
tapply(value, x, mean)
x <- reorder(x, value, mean)
tapply(value, x, mean)
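Because the example above uses random numbers, the result changes from run to run. Here is the same idea as a sketch with fixed, made-up values, so the new level order can be predicted from the group means.

```r
grp <- factor(c("A", "A", "B", "B", "C", "C"))
val <- c(5, 7, 1, 3, 9, 11)        # made-up values: means are A = 6, B = 2, C = 10

levels(grp)                        # "A" "B" "C" (alphabetical by default)
grp2 <- reorder(grp, val, mean)    # order levels by the mean of val, ascending
levels(grp2)                       # "B" "A" "C"
tapply(val, grp2, mean)            # means now appear in increasing order
```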