Review

Yesterday, we started learning about R’s data structures when we talked about vectors. We learned how to create a vector, and how to access an item in a vector using indices.
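As a quick refresher, here is a minimal sketch of both ideas. The vector names and values are only illustrative.

```r
# An illustrative numeric vector of test scores
scores <- c(57, 68, 44, 63)

# R indexes from 1, so this is the first item
first_score <- scores[1]

# A vector of indices pulls out several items at once
middle_scores <- scores[2:3]

print(first_score)
print(middle_scores)
```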

Data frames

Yesterday we said that one way to think of data frames is that they’re like a table. While this is sort of true, data frames are a little more specific. A data frame is technically a list of column vectors. This matters for two reasons. First, every value within a column has to be the same type (numeric, character, logical), but different columns can have different types. Second, every column has to have the same length.
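To make the "list of column vectors" idea concrete, here is a small sketch. The data frame students and everything in it are invented for illustration.

```r
# A small invented data frame: each column is a vector of one type
students <- data.frame(
  id     = c(70, 121, 86),          # numeric
  name   = c("Ana", "Ben", "Cy"),   # character (made-up names)
  passed = c(TRUE, TRUE, FALSE),    # logical
  stringsAsFactors = FALSE          # keeps text as character in older R versions
)

is.list(students)          # a data frame is a list under the hood
sapply(students, class)    # one type per column, different types across columns
sapply(students, length)   # every column has the same length

# Columns of length 3 and 2 cannot be lined up into rows,
# so data.frame() raises an error here
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```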

Long format

Most analyses in R need data to be in long format. Hadley Wickham’s paper on tidy data is a great read for principles of data formatting. The basic important takeaways are:


  1. Each variable forms a column

  2. Each observation forms a row

  3. Each type of observational unit forms a table.


We’ll talk about each principle separately, using the hsb2 data set from UCLA. The data file contains 200 observations from a sample of high school students. It includes demographic information on the students, scores on standardized tests, and school type. First, let’s load in the data, and look at the first few rows. I usually use head() for this, but we can also use indices to look at any number of rows from the top. Although UCLA calls this dataset hsb2, I don’t find this name very informative, so I’m going to rename it when I read it into R. I’m opting for the efficient hs_tests, because high_school_tests is a little verbose, but you should name it whatever makes most sense to you!

hs_tests <- read.table('http://www.ats.ucla.edu/stat/r/faq/hsb2.csv', header = TRUE, sep = ",")
hs_tests[1:10,]
##     id female race ses schtyp prog read write math science socst
## 1   70      0    4   1      1    1   57    52   41      47    57
## 2  121      1    4   2      1    3   68    59   53      63    61
## 3   86      0    4   3      1    1   44    33   54      58    31
## 4  141      0    4   3      1    3   63    44   47      53    56
## 5  172      0    4   2      1    2   47    52   57      53    61
## 6  113      0    4   2      1    2   44    52   51      63    61
## 7   50      0    3   2      1    1   50    59   42      53    61
## 8   11      0    1   2      1    2   34    46   45      39    36
## 9   84      0    4   2      1    1   63    57   54      58    51
## 10  48      0    3   2      1    2   57    55   52      50    51
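head() and indexing can be compared side by side on a small stand-in. This is only a sketch: hs_small is a name I made up, and its values are the first three rows printed above, re-typed by hand (with only some columns kept) so nothing has to be downloaded.

```r
# First three rows of the printout above, re-typed by hand
hs_small <- data.frame(
  id     = c(70, 121, 86),
  female = c(0, 1, 0),
  read   = c(57, 68, 44),
  write  = c(52, 59, 33),
  math   = c(41, 53, 54)
)

head(hs_small, 2)   # first two rows via head()
hs_small[1:2, ]     # same rows via indices; nothing after the comma = all columns

dim(hs_small)       # number of rows, then number of columns
str(hs_small)       # column names and types at a glance
```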

This data set is in wide format, meaning that it contains a single row for each student, NOT a single row for each observation. We can think about each student having a few different measurements that belong to the same variable. They take a reading test, a math test, a writing test, and so on. This is actually one kind of variable (test score), with different variants (math, reading, writing). In other words, the tests belong in the same column, to adhere to Principle 1.

At the same time, each type of score can be thought of as a single observation of the variable test measurement. Each of these scores should be on its own line. The score is the measurement, and the type of test is an attribute of each specific observation – so we’ll need one column to represent the scores and another column to represent the attribute test type.

We want to change the data frame so that it puts each of these scores on its own row, like so:

##     id female race ses schtyp prog variable value
## 99   1      1    1   1      1    3     read    34
## 299  1      1    1   1      1    3    write    44
## 499  1      1    1   1      1    3     math    40
## 699  1      1    1   1      1    3  science    39
## 899  1      1    1   1      1    3    socst    41
## 139  2      1    1   2      1    3     read    39
## 339  2      1    1   2      1    3    write    41
## 539  2      1    1   2      1    3     math    33
## 739  2      1    1   2      1    3  science    42
## 939  2      1    1   2      1    3    socst    41

In the data frame above, we can see that there are 5 rows per student id: one row for each subject they were tested in. This data frame now adheres to principles 1 and 2 of tidy data. Principle 3 is one that I’m going to flout for now, because I usually need data in one table when I analyze it. For data analysis, Principles 1 and 2 are the most important. Principle 3 is primarily important for storing data.

So how do we get there?

Reshaping data

One of the most powerful tools for getting our data into the shape we want is the reshape2 package. First, we need to install the reshape2 package by using install.packages(). Once it’s installed, we then tell R to open that package by using library(). Note that the package name has to be in quotes for install.packages(), but the quotes are optional for library(). Packages only need to be installed once, but library() has to be run every time we start a new R session.

install.packages("reshape2")
library(reshape2)
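If you keep these two lines in a script, the install step will run every time. One possible way around that is sketched below. The repos URL is my assumption (any CRAN mirror should work), and the check relies on requireNamespace() from base R.

```r
# Install reshape2 only when it is missing, so re-running the script is cheap.
# The repos URL is an assumption: any CRAN mirror works.
if (!requireNamespace("reshape2", quietly = TRUE)) {
  install.packages("reshape2", repos = "https://cloud.r-project.org")
}

library(reshape2)     # unquoted name
library("reshape2")   # quoted name works too; loading twice is harmless
```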

We’ll be using the melt() function, which takes 3 arguments (and some possible optional arguments):


  1. The data frame you’re adjusting

  2. The id variables (columns that should remain columns)

  3. The measure variables (columns that should be turned into rows)


hs_tests_long <- melt(hs_tests, id.vars = c("id", "female", "race", "ses", "schtyp", "prog"), measure.vars = c("read", "write", "math", "science", "socst"))
head(hs_tests_long)
##    id female race ses schtyp prog variable value
## 1  70      0    4   1      1    1     read    57
## 2 121      1    4   2      1    3     read    68
## 3  86      0    4   3      1    1     read    44
## 4 141      0    4   3      1    3     read    63
## 5 172      0    4   2      1    2     read    47
## 6 113      0    4   2      1    2     read    44
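The default column names variable and value aren’t very descriptive. As a sketch of one of the optional arguments, melt() for data frames accepts variable.name and value.name to rename them. The data frame hs_small and the new column names test and score are my own choices; the values are re-typed from the first three students shown earlier.

```r
library(reshape2)

# Three students re-typed by hand; only two id columns and two test
# columns are kept so the result is easy to read
hs_small <- data.frame(
  id     = c(70, 121, 86),
  female = c(0, 1, 0),
  read   = c(57, 68, 44),
  write  = c(52, 59, 33)
)

hs_small_long <- melt(
  hs_small,
  id.vars       = c("id", "female"),
  measure.vars  = c("read", "write"),
  variable.name = "test",    # replaces the default column name "variable"
  value.name    = "score"    # replaces the default column name "value"
)

hs_small_long
nrow(hs_small_long)          # 3 students x 2 tests
table(hs_small_long$id)      # one row per test for each student
```

Counting rows per id with table() is a cheap sanity check that every student ended up with one row per test.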

We’re keeping "id" the same, because the students’ id is obviously an id variable. "female", "race", "ses", "schtyp", and "prog" are kept the same, because they don’t vary within a student either. Student #70 is male, and will continue to be male no matter which test we look at. His race and socio-economic status won’t vary either. Because this data set only contains one time measurement (students were only measured at one point in their life), each student only has one school type and program.

We can see that melt() stacks the data one test at a time: all the read rows come first, then all the write rows, and so on. I prefer my data sorted by student id (it’s just easier for me to think about it that way). This is easy to achieve with order(). Let’s break this command down piece by piece. On its own, order(hs_tests_long$id) returns the row positions that would put the id column in ascending order. When we nest that command inside the row argument of hs_tests_long[ , ], we’re telling R to arrange the rows according to what we’ve put into order(). Don’t forget the comma before the closing bracket! Leaving the column argument empty tells R that we want the new data frame to have all the columns in it. The last step is to assign this re-ordered data frame to a new name, using <-.

hs_sorted <- hs_tests_long[order(hs_tests_long$id), ]
head(hs_sorted)
##     id female race ses schtyp prog variable value
## 99   1      1    1   1      1    3     read    34
## 299  1      1    1   1      1    3    write    44
## 499  1      1    1   1      1    3     math    40
## 699  1      1    1   1      1    3  science    39
## 899  1      1    1   1      1    3    socst    41
## 139  2      1    1   2      1    3     read    39
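It can help to see what order() returns on its own before using it inside the brackets. The sketch below uses small hand-typed objects (ids, long_small, by_id, and by_id_desc are names I made up), with values taken from the first three students in the data.

```r
ids <- c(70, 121, 86)   # ids of the first three students in the data

order(ids)              # positions, not values: 1 3 2
ids[order(ids)]         # indexing with those positions gives 70 86 121

# A tiny long-format data frame, re-typed from the wide rows shown earlier
long_small <- data.frame(
  id       = c(70, 121, 86, 70, 121, 86),
  variable = c("read", "read", "read", "write", "write", "write"),
  value    = c(57, 68, 44, 52, 59, 33),
  stringsAsFactors = FALSE
)

# Sort by id first; ties are broken by variable
by_id <- long_small[order(long_small$id, long_small$variable), ]
by_id

# Largest id first
by_id_desc <- long_small[order(long_small$id, decreasing = TRUE), ]
by_id_desc
```

Passing a second column to order() is handy when you also want the tests in a predictable order within each student.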