Week 2

Contents

  1. References
  2. Factors
    1. levels()
    2. relevel()
  3. Converting Types
    1. Converting Characters to...
      1. ...factors
      2. ...numerical
    2. Converting Factors to...
      1. ...character
      2. ...numerical
    3. cut()
  4. Data Input and Output
    1. Input
    2. Output
  5. Data Frames
    1. Data Frame Indexing and Subsetting
    2. More Summary Techniques

References

Introductory Statistics with R: Chapter 1
Analyzing Linguistic Data: A practical introduction to statistics: Chapter 1

While we settle

Install and load the ISwR and languageR libraries.

> install.packages(c("ISwR","languageR"),dependencies = T)
> library(ISwR)
> library(languageR)

Also, download the zip file containing data from Johnson 2008's chapter on Phonetics: here [ZIP]

Factors

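For the most part, if the data we're using has some variable coded as a character string, R will treat it as a factor with specific levels. (Since R 4.0.0, functions like read.table() and data.frame() no longer convert character strings to factors by default, so you may need to convert them yourself with factor(), or pass stringsAsFactors = TRUE.)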

A good way to grasp the structure of factors is to build one from scratch. Say we had collected class data in a study, and coded lower working class, upper working class, and lower middle class with the numeric codes 1, 2, and 3, respectively. First, let's simulate a sample of 10 speakers with sample(). See ?sample for more details.

> class <- sample(1:3, 10, replace = T)
> class
[1] 3 1 1 2 3 3 3 2 3 3

Your numbers will probably be different. Right now, class is a numerical vector. The factor() function will convert it to a factor vector.

> factor(class)
[1] 3 1 1 2 3 3 3 2 3 3
Levels: 1 2 3

We can also add meaningful labels to the factor levels.

> class.f<-factor(class,labels = c("LWC","UWC","LMC"))
> class.f
[1] LMC LWC LWC UWC LMC LMC LMC UWC LMC LMC
Levels: LWC UWC LMC

In essence, the factor vector has a representation like this:

Vector   Dictionary
3        1 = "LWC"
1        2 = "UWC"
1        3 = "LMC"
2
3
3
3
2
3
3

Compare the following printed output:

> class
[1] 3 1 1 2 3 3 3 2 3 3
> class.l<-c("LWC","UWC","LMC")
> class.l
[1] "LWC" "UWC" "LMC"
> class.l[class]
[1] "LMC" "LWC" "LWC" "UWC" "LMC" "LMC" "LMC" "UWC" "LMC" "LMC"
> class.f
[1] LMC LWC LWC UWC LMC LMC LMC UWC LMC LMC
Levels: LWC UWC LMC

The factor vector that we built is unordered. That is, the factor levels define groups rather than an ordering. It is possible to also define ordered factors, which will strongly affect model fitting. I haven't used ordered factors much myself, so I don't really understand how they work (but I should).
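For what it's worth, here is a minimal sketch of an ordered factor. It uses a short, made-up vector of class codes (class.demo is a hypothetical name, not the simulated sample above) and assumes the levels should be ranked in the order the labels are given. The ordered = TRUE argument to factor() is what turns the levels into a ranking.

```r
# Hypothetical class codes, fixed so the result is reproducible
class.demo <- c(3, 1, 1, 2, 3)

# ordered = TRUE ranks the levels in the order the labels are listed
class.o <- factor(class.demo, labels = c("LWC", "UWC", "LMC"), ordered = TRUE)

class.o              # levels print as LWC < UWC < LMC
is.ordered(class.o)  # TRUE
class.o < "LMC"      # comparisons are defined for ordered factors
```

With an unordered factor, the same comparison returns NA with a warning, since the levels have no ranking.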

levels()

The levels() command accesses the levels of a factor. It will either produce a character vector of the current levels, or allow you to edit the levels by assigning a new character vector to it.

> levels(class.f)
[1] "LWC" "UWC" "LMC"
> old.levels<-levels(class.f)
> old.levels
[1] "LWC" "UWC" "LMC"
> new.levels <- c("LowerWorkingClass","UpperWorkingClass","LowerMiddleClass")
> levels(class.f)<-new.levels
> class.f
[1] LowerMiddleClass LowerWorkingClass LowerWorkingClass UpperWorkingClass LowerMiddleClass LowerMiddleClass LowerMiddleClass UpperWorkingClass LowerMiddleClass LowerMiddleClass
Levels: LowerWorkingClass UpperWorkingClass LowerMiddleClass

relevel()

Don't use levels() to try to change the order of levels. Here's a demonstration of what will happen:

> x <- factor(1:3, labels = c("one","two","three"))
> x
[1] one two three
Levels: one two three
> levels(x) <- c("two","one","three")
> x
[1] two one three
Levels: two one three

All we've done is change the labels of each level, which ends up scrambling their meaningful labels. If we really wanted "two" to be the first level (this will be important for fitting regression models), we should use relevel().

> x <- factor(1:3, labels = c("one","two","three"))
> x
[1] one two three
Levels: one two three
> x <- relevel(x, "two")
> x
[1] one two three
Levels: two one three

relevel() has reordered the levels, making "two" the first one, but has kept the labels for each level the same.
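One way to convince yourself of the difference is to look at the underlying values with as.character() after each approach. This is just a sketch; x.bad and x.good are names made up for the comparison.

```r
x <- factor(1:3, labels = c("one", "two", "three"))

# Overwriting levels() relabels the data: the first value is now called "two"
x.bad <- x
levels(x.bad) <- c("two", "one", "three")
as.character(x.bad)    # "two" "one" "three"

# relevel() only moves the chosen level to the front; the values keep their labels
x.good <- relevel(x, ref = "two")
as.character(x.good)   # "one" "two" "three"
levels(x.good)         # "two" "one" "three"
```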

Converting Data Types

Converting Characters to...

...factors

Character vectors can't be used directly as grouping variables in most statistical functions in R, since they don't define levels. If a character vector is given to one of these functions, it will usually be converted to a factor automatically. Here's what conversion of a character vector to a factor looks like:

> class.c <- sample(c("LWC","UWC","LMC"),10,replace = T)
> class.c
[1] "UWC" "UWC" "LWC" "LWC" "LMC" "UWC" "LMC" "UWC" "UWC" "UWC"
> as.factor(class.c)
[1] UWC UWC LWC LWC LMC UWC LMC UWC UWC UWC
Levels: LMC LWC UWC

It's basically the same kind of result as above except for the ordering of levels. Here, the levels are ordered alphabetically, whereas above they were ordered as we specified them. For our purposes, this doesn't matter too much until we get to model fitting.
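If the alphabetical order isn't the one you want, one option is to call factor() with an explicit levels argument instead of as.factor(). A small sketch with made-up values (class.demo is a hypothetical vector):

```r
# Hypothetical character data
class.demo <- c("UWC", "LWC", "LMC", "UWC")

# Default: levels are sorted alphabetically
levels(factor(class.demo))   # "LMC" "LWC" "UWC"

# levels = ... fixes the order of the levels by hand
class.demo.f <- factor(class.demo, levels = c("LWC", "UWC", "LMC"))
levels(class.demo.f)         # "LWC" "UWC" "LMC"
```

Note that levels here names the values already in the data, while labels (used earlier) renames them.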

...numerical

To convert a character vector to a numeric vector, use as.numeric(). It is important to do this before using the vector in any statistical functions, since R will otherwise treat character data as a factor. Be careful that the strings contain no non-numeric characters (like "%"), since as.numeric() will return NA for these.

> age.c<-as.character(sample(18:30,10,replace = T))
> age.c
[1] "29" "18" "21" "24" "30" "23" "26" "27" "19" "28"
> as.numeric(age.c)
[1] 29 18 21 24 30 23 26 27 19 28
>
> percentages.c<-c(1,0.2,"5%",0.05)
> percentages.c
[1] "1" "0.2" "5%" "0.05"
> as.numeric(percentages.c)
[1] 1.00 0.20 NA 0.05
Warning message:
NAs introduced by coercion

Converting Factors to...

Factors behave differently when being converted to characters vs. numbers.

...character

When a factor is passed to as.character(), it returns the character label for the level.

> yob.f<-as.factor(sample(c(1969,1979,1989),10,replace = T))
> yob.f
[1] 1989 1969 1979 1969 1969 1989 1969 1989 1969 1979
Levels: 1969 1979 1989
> as.character(yob.f)
[1] "1989" "1969" "1979" "1969" "1969" "1989" "1969" "1989" "1969" "1979"

...numerical

However, when passed to as.numeric(), it returns the numerical index of the level.

> as.numeric(yob.f)
[1] 3 1 2 1 1 3 1 3 1 2

To correctly convert a factor to numbers, you could first convert it to a character vector, then convert that to a numeric vector.

> as.numeric(as.character(yob.f))
[1] 1989 1969 1979 1969 1969 1989 1969 1989 1969 1979

Or, you could do this, which the help page (?factor) says is slightly more efficient, but is a little more confusing conceptually.

> as.numeric(levels(yob.f))[yob.f]
[1] 1989 1969 1979 1969 1969 1989 1969 1989 1969 1979

cut()

This is technically not a conversion of types, but a useful way to reorganize continuous numerical data into factor data. Occasionally, you may want to split a continuous variable into a categorical or ordered one (ages into age groups, years into epochs, etc.). cut() is the tool for this job. Let's generate a sample again:

> x <- sample(1:100, 100, replace = T)
> x

So now we have a random sample of 100 values between 1 and 100. What if the exact value of x didn't actually matter, and all that mattered was whether the value was low, medium or high? We could split the data into three groups this way:

> x.f <- cut(x, breaks = 3)
> x.f

We've created a new factor based on the numeric value of x. The argument breaks defines how many bins to create. The cut points for the bins were determined by dividing the range of x into that many equal-width intervals.

The labels of each bin indicate how each boundary was defined.

Name    Definition
(A,B]   x > A and x ≤ B
[A,B]   x ≥ A and x ≤ B
[A,B)   x ≥ A and x < B

By default, cut() uses the (A,B] definition of levels. If you want to reverse it, to [A,B), pass FALSE to the argument right. To have the definition [A,B] for the lowest level, pass TRUE to the argument include.lowest.

> x.f <- cut(x, breaks = 3,right = F)
> x.f
> x.f <- cut(x, breaks = 3,include.lowest = T)
> x.f

We can also give meaningful names to each level.

> x.f <- cut(x, breaks = 3,labels = c("low","medium","high"))
> x.f

If you had theoretical reasons to assume where the cutpoints should be, you can define those by hand. (include.lowest will probably be very handy here.)

> x.f <- cut(x, breaks = c(1,25,44,100), include.lowest= T)
> x.f
> x.f <- cut(x, breaks = c(1,25,44,100), include.lowest= T,labels = c("low","medium","high"))
> x.f
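To see why include.lowest matters with hand-picked cutpoints, here is a small sketch using made-up values that sit right on the boundaries (x.demo, f.default and f.lowest are hypothetical names). Without include.lowest, the smallest value falls outside the first (A,B] interval and comes back as NA.

```r
# Hypothetical values placed on and around the cutpoints
x.demo <- c(1, 25, 26, 44, 45, 100)

# Default: the first interval is (1,25], so the value 1 belongs to no bin
f.default <- cut(x.demo, breaks = c(1, 25, 44, 100))
f.default              # the first element is NA

# include.lowest = TRUE closes the first interval, [1,25], so 1 is kept
f.lowest <- cut(x.demo, breaks = c(1, 25, 44, 100), include.lowest = TRUE)
table(f.lowest)        # two values in each bin
```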

Data Input-Output

Here, we'll talk about how to load data into R, and how to write data back out to a data file. When loading a data file into R, you are just loading it into the R workspace. Any alterations or modifications you make to the data frame will not be reflected in the file in your system, just in the copy in the R workspace.

Input

For the most part, data that will be loaded into R will be in the form of data tables. Ideally, they will be plain text, and delimited either by tabs or other whitespace, or by commas (.csv files). I'll also briefly mention how to load Excel tables (.xls) into R.

The two most likely commands to be used will be read.delim() or read.csv(). read.delim() is used to read in files delimited by tabs, and read.csv() for files delimited by commas. Both of these functions assume that the first line is a header. If this is not the case, then set the option header = F. If the file you are looking at is delimited neither by tabs nor by commas, you should try read.table() (which the first two are based on) with the option sep given the appropriate delimiter. As always see ?read.table for more info on all of these functions.

These read.table commands will read in the data as a data frame, a particular data structure in R that we'll talk about below. Be sure to assign the output of read.table() to a variable, or else it will just print the file, and that's no good to no one.

For example, we could read in the data in cherokeeVOT from Johnson 2008 (here again [ZIP]) as follows.

> vot <- read.delim("PATH/cherokeeVOT.txt")
> #or
> vot <- read.table("PATH/cherokeeVOT.txt",sep = "\t",header = T)

On Windows machines, use "/" as the file delimiter inside R, even though a copied path will be delimited by "\". R treats "\" inside a string as an escape character, so if you paste a path containing "\" into the prompt, change each one to "/" (or double it, "\\") to successfully load the file.

To load an Excel file into R, you'll have to install and load the gdata library for access to the read.xls() function. Once you've installed and loaded gdata, see ?read.xls for more details. One thing to note is that read.xls() will require that Perl is properly installed and configured on your system. All in all, it's probably easier just to save your Excel file as a .csv file for loading into R. It's also probably advisable to store your data, long term, in a non-proprietary format, like raw text.

Output

If you want to save a data frame that you've edited or enriched, use write.csv() or write.table(). As the name suggests, write.csv() writes the data frame to a comma-separated file. Note, if one of your variables has a comma in it (like "City,State"), do not write to a csv with quoting turned off, because it will ruin your life: the extra comma will be read back as a column break. I personally don't like a few of the default settings to these commands, so here's what I change:

sep
-- If using write.table(), I set this to "\t", which writes a tab-delimited file. The default is a single space: " "
quote
-- I set this to F. By default, character strings, and the character labels for factors, will be enclosed in quotes. I don't like this so much, and it makes the file harder to read by eye or with some other script.
row.names
-- I set this to F. By default, it will write a column for the row names in the data frame. These are usually meaningless, just being the row number.
The final command will usually look something like this:

> write.table(vot,file = "PATH/vot.txt",sep = "\t",quote = F,row.names = F)

Again, on a Windows machine, use "/" even though the standard file delimiter is "\".
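If you'd like to try the write and read steps without the Cherokee file, here is a self-contained round trip. It is only a sketch: demo is a made-up data frame, and tempfile() stands in for a real path on your system.

```r
# A small made-up data frame standing in for real data
demo <- data.frame(VOT = c(67, 127, 79), Consonant = c("k", "k", "t"))

# tempfile() gives a throwaway path; substitute your own path in practice
demo.path <- tempfile(fileext = ".txt")
write.table(demo, file = demo.path, sep = "\t", quote = FALSE, row.names = FALSE)

# Look at the raw text that was written, then read it back in
demo.lines <- readLines(demo.path)
demo.lines
demo2 <- read.delim(demo.path)
demo2

# Clean up the temporary file
unlink(demo.path)
```

The first line of the file is the header, with the column names separated by a tab, which is what read.delim() expects by default.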

There does not appear to be a write.xls() command of any variety.
**Oops! Spoke too soon. As with almost anything R, if you ever say "There is no command for X," That probably just means you didn't Google for it.

Data Frames

Data Frames are the format you are most likely to access data from in R. Columns of a data frame represent different variables, and ideally every row will represent a different observation.

A note on data formatting: R really prefers to have data in a long format. I was recently working with someone on managing and visualizing data they had from an experiment. They had it formatted so that every row was a trial, and the columns indicated subject and trial information, and then a long list of columns to the side recording observations for every timebin of the trial. This is not an ideal format. Every row should represent a different observation, and in this case every timebin counted as a different observation.
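As a rough illustration of the difference, here is one way to turn a small wide table into a long one by hand. The data and the column names (subject, bin1, bin2, bin3) are invented for the example, and this is just one of several ways to do the reshaping.

```r
# Hypothetical wide data: one row per trial, one column per timebin
wide <- data.frame(subject = c("s1", "s2"),
                   bin1 = c(10, 20),
                   bin2 = c(11, 21),
                   bin3 = c(12, 22))

bins <- c("bin1", "bin2", "bin3")

# Long format: one row per subject-by-timebin observation.
# unlist() stacks the timebin columns one after another, so the subject
# column is repeated once per timebin and each bin name once per row.
long <- data.frame(subject = rep(wide$subject, times = length(bins)),
                   timebin = rep(bins, each = nrow(wide)),
                   value   = unlist(wide[, bins], use.names = FALSE))
long
```

The wide table had 2 rows and 3 timebin columns, so the long table has 6 rows, one per observation.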

Let's focus on the Cherokee VOT data from Johnson 2008. To see how many rows and columns it has, use nrow() and ncol().

> ncol(vot)
[1] 3
> nrow(vot)
[1] 44

There are 3 variables coded for 44 observations in the data. To see the column names, which are the same as the variable names here, use colnames().

> colnames(vot)
[1] "VOT" "year" "Consonant"

To see the first 6 or last 6 observations, use head() and tail().

> head(vot)
  VOT year Consonant
1  67 1971         k
2 127 1971         k
3  79 1971         k
4 150 1971         k
5  53 1971         k
6  65 1971         k
> tail(vot)
   VOT year Consonant
39 106 2001         t
40  54 2001         t
41  49 2001         t
42  56 2001         t
43  58 2001         t
44  97 2001         t

To get 5 number summaries of numerical data in a data frame and counts for factors, use summary(). (summary() is a general purpose function, and generates unique and interesting output for many different objects.)

> summary(vot)
      VOT              year       Consonant
 Min.   : 45.00   Min.   :1971   k:21
 1st Qu.: 71.50   1st Qu.:1971   t:23
 Median : 81.50   Median :2001
 Mean   : 96.45   Mean   :1989
 3rd Qu.:120.25   3rd Qu.:2001
 Max.   :193.00   Max.   :2001

It looks like the year variable has been read into R as a numeric variable. On further inspection, this doesn't seem appropriate. To access any variable in a data frame, you can use $ followed by the variable name.

> vot$year
[1] 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 1971 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001

It looks like measurements were taken at two dates, one in 1971 and one in 2001 (this is, in fact, the description of the data in Johnson 2008, Ch. 1). This data would be more appropriately represented as a factor. Convert types this way.

> vot$year <- as.factor(vot$year)
> summary(vot)

You'll see that the summary for the year variable is now a count, rather than a 5 number summary.

The juul dataset in the ISwR library is another great exercise in properly formatting data, as discussed in Dalgaard 2008. If you look at a summary of juul, you'll see that a number of categorical variables have been coded with numeric codes, specifically menarche (which should be No/Yes/NA), sex (which should be male/female/NA) and tanner (which should be I/II/III/IV/V/NA). The following code will convert them to meaningfully named factors.

> juul$sex<-factor(juul$sex,labels = c("male","female"))
> juul$menarche<-factor(juul$menarche,labels = c("No","Yes"))
> juul$tanner <- factor(juul$tanner, labels = c("I","II","III","IV","V"))
> summary(juul)

It's also a good idea to inspect data frames to see if tokens which should be coded with NA are actually coded 0. This is pretty bad for categorical variables, and just awful for numerical variables.
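Here is a minimal sketch of that kind of cleanup, assuming a made-up data frame d in which missing heights were entered as 0. Whether 0 really means "missing" depends on your data, so check before recoding.

```r
# Hypothetical data where missing heights were entered as 0
d <- data.frame(height = c(172, 0, 165, 0, 180))

mean(d$height)                  # misleading: the zeros drag the mean down

# Recode the zeros as missing values
d$height[d$height == 0] <- NA

summary(d$height)               # the NA's are now counted separately
mean(d$height, na.rm = TRUE)    # mean of the three real measurements
```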

Data Frame Indexing and Subsetting

Indexing in data frames can work almost exactly like it does in matrices. To see the first 6 rows of a data frame, you can use head() -or- data[1:6,]

> vot[1:6,]
> juul[1:6,]

Likewise, you can see a column either by using $, or giving the column index.
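For instance, with a small made-up data frame (demo is a hypothetical stand-in for vot), the following all return the same column:

```r
# Made-up values, standing in for a real data frame
demo <- data.frame(VOT = c(67, 127, 79), year = c(1971, 1971, 2001))

demo$VOT        # by name, with $
demo[, 1]       # by column position
demo[, "VOT"]   # by name, inside the brackets
demo[["VOT"]]   # by name, list-style
```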

Frequently, you might want to pull out subsets of the data frame, or subsets of variables, based on certain conditions. There are a few ways to do this. First, you could generate a vector of boolean values, and use this as an index for the data frame.

> vot$VOT > 100
> vot[vot$VOT > 100,]
> vot[vot$Consonant == "t",]

If you just want to see variable values for a given factor level, there are at least two ways to do this.

> vot[vot$year == 1971,]$VOT
> #or
> vot$VOT[vot$year == 1971]

The first selects a subset of the data frame, then takes the VOT variable from that subset. The second takes the variable first, then takes a subset of that. I'm not sure if one approach is more efficient, common or preferable to the other.

You could also use the subset() function. It takes a data frame as its first argument, a conditional expression as its second, and an optional select argument, which takes a vector of the variable names to return.

> subset(vot,VOT>100 & Consonant == "k")
> subset(vot,VOT>100 & Consonant == "k",select = c("VOT","year"))
> subset(vot,VOT>100 & Consonant == "k")$VOT
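To make the connection between the two approaches concrete, here is a sketch on a made-up data frame (demo, keep, by.index and by.subset are hypothetical names). The boolean vector is built explicitly first, then used as a row index, and subset() is given the same condition.

```r
# Made-up values, standing in for the VOT data
demo <- data.frame(VOT = c(67, 127, 150, 53),
                   Consonant = c("k", "k", "t", "t"))

# Step 1: a boolean vector, one TRUE/FALSE per row
keep <- demo$VOT > 100 & demo$Consonant == "k"
keep                    # FALSE TRUE FALSE FALSE

# Step 2: use it as a row index
by.index <- demo[keep, ]
by.index

# subset() evaluates the same condition inside the data frame
by.subset <- subset(demo, VOT > 100 & Consonant == "k")
by.subset
```

Inside subset(), the variable names are looked up in the data frame, which is why the demo$ prefix isn't needed there.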

If you want to save time by not needing to do the $ reference for a data frame you're working with a lot, you can use the attach() command, which will allow you to directly reference the variable names. Note, however, that any type conversions you make to a data frame's variable will not be reflected in the attached version. Once you've finished using the data frame, you can remove it from the search path with detach().
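The caveat about type conversions is easy to trip over, so here is a small sketch of it. The data frame d.att and its column year.demo are made up for the example.

```r
# Hypothetical data frame with a numeric year column
d.att <- data.frame(year.demo = c(1971, 2001, 2001))

attach(d.att)
class(year.demo)          # "numeric": read from the attached copy

# Convert the column in the data frame itself
d.att$year.demo <- as.factor(d.att$year.demo)

class.in.frame <- class(d.att$year.demo)   # "factor"
class.attached <- class(year.demo)         # still "numeric"

# Remove the attached copy from the search path
detach(d.att)
```

attach() works on a copy of the data frame as it was at the time of attaching, so you would need to detach() and attach() again to pick up the change.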

More Summary Techniques

There are a few more kinds of summary statistics you may want to know, involving counts, crosstabulations and means.

The table() function will produce counts for any factor it is given.

> table(class.f)
> table(vot$year)
> table(vot$Consonant)

You can also use table() to generate crosstabulations. The function xtabs() does the same thing, but takes a formula (which we haven't talked about) as an argument.

> table(vot$Consonant,vot$year)
> xtabs(~vot$Consonant+vot$year)

For multiway tables, the formatting of table() and xtabs() gets kind of ugly. You could try ftable() for these.

> table(juul$sex,juul$menarche,juul$tanner)
> xtabs(~juul$sex+juul$menarche+juul$tanner)
> ftable(juul$sex,juul$menarche,juul$tanner)

If you'd like to see a percentage breakdown of a crosstabulation, pass the output of table() to prop.table(). You'll also need to define along which dimension you want the proportions to be calculated: 1 = row-wise, 2 = column-wise.

> prop.table(table(vot$Consonant, vot$year),1)
> prop.table(table(vot$Consonant, vot$year),2)
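A tiny made-up example shows what the dimension argument does (cons, yr and tab are hypothetical names, and the counts are invented): with 1, each row sums to 1, and with 2, each column sums to 1.

```r
# Made-up observations: 3 tokens of k, 1 token of t
cons <- c("k", "k", "k", "t")
yr   <- c("1971", "2001", "2001", "2001")

tab <- table(cons, yr)
tab

by.row <- prop.table(tab, 1)   # proportions within each consonant
by.col <- prop.table(tab, 2)   # proportions within each year

rowSums(by.row)   # each row sums to 1
colSums(by.col)   # each column sums to 1
```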

If there is some other metric you would like to see, like mean VOT per consonant or per year, use tapply(). It takes a numeric vector as its first argument, a factor as its second, and a function to apply within each group as its third.

> tapply(vot$VOT,vot$year,mean)
> tapply(vot$VOT,vot$year,median)
> tapply(vot$VOT,vot$year,sd)
> tapply(vot$VOT,vot$year,max)
> tapply(vot$VOT,vot$Consonant,mean)
> tapply(vot$VOT,vot$Consonant,median)
> tapply(vot$VOT,vot$Consonant,sd)
> tapply(vot$VOT,vot$Consonant,max)

If you give a list (another data structure we haven't talked about) as the second argument to tapply, you'll get back a tabular response.

> tapply(vot$VOT,list(vot$Consonant,vot$year),mean)

It looks like something strange happened to the VOT of /k/ between 1971 and 2001.
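If you don't have the Cherokee data loaded, the shape of what tapply() returns can be seen with a small made-up data frame. The numbers below are invented for illustration and say nothing about the real VOT measurements.

```r
# Invented values, only meant to show the structure of the output
demo <- data.frame(VOT = c(60, 80, 100, 120, 50, 70),
                   Consonant = c("k", "k", "k", "k", "t", "t"),
                   year = c(1971, 1971, 2001, 2001, 1971, 2001))

# One grouping factor: a named vector of group means
m1 <- tapply(demo$VOT, demo$Consonant, mean)
m1                     # k = 90, t = 60

# A list of two grouping factors: a table of means, consonant by year
m2 <- tapply(demo$VOT, list(demo$Consonant, demo$year), mean)
m2
m2["k", "2001"]        # 110
```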