# Summer 2010 — R: Summarization and Aggregation

<- R: Basics | Home | R: Functions-> |

## Contents

## Intro

Raw data is usually useful in its static state. Therefore, it's important to have a good toolset for summarizing and aggregating data.

## Basics

There are quite a few good summarization functions in base R. The simplest are

Briefly, I read 93 strings of 7 numbers formatted as telephone numbers (e.g. 123-4567).

The data is tab delimited, so we can load it with the following code.

dur <- read.delim("http://www.ling.upenn.edu/~joseff/rstudy/data/Joe.Durr1.txt")

### table()

The

## How many of each number?

table(dur$Label)

##How many of each position?

table(dur$Pos)

##How many of each number in each position?

table(dur$Label, dur$Pos)

##Add some meaningful names to that table?

table(Label = dur$Label, Position = dur$Pos)

To get proportions in each cell of a table, wrap the

table(Label = dur$Label, Position = dur$Pos)

## Row-wise

prop.table(table(Label = dur$Label, Position = dur$Pos),1)

## Column-wise

prop.table(table(Label = dur$Label, Position = dur$Pos),2)

## Grand proporiton

prop.table(table(Label = dur$Label, Position = dur$Pos))

### tapply()

The

## Geometric mean duration by number

exp(tapply(log(dur$Duration), dur$Label, mean))

## Median duration by number

tapply(dur$Duration, dur$Label, median)

## Total seconds spent on each number

tapply(dur$Duration, dur$Label, sum)

The nifty thing about

## Geometric mean duration by number by position

exp(tapply(log(dur$Duration), list(dur$Label, dur$Pos), mean))

## Sum of durations by number by position

tapply(dur$Duration, list(dur$Label, dur$Pos), sum)

## Split Apply Combine

Frequently, you'll want to do something a bit more complex than

As the **split** the data up into meaningful subsets, **apply** the same function or operation to every subset, then
**combine** the results of every application into a single output. There already exist a bunch of functions in base R for this
approach to aggregation (like

The core of

Structure | |

array | |

data frame | |

list |

Most of the data structures we work with are data frames, and usually we'll want to keep working with data frames, so

data - The data frame containing the data.
variables - The variables to split the data frame up by.
fun - The function to apply to the data frame subsets.

The simplest function we can apply to an entire data frame and get meaningful results is

## Split by Label

ddply(dur, .(Label), nrow)

## Split by Pos

ddply(dur, .(Pos), nrow)

## Split by label and Pos

ddply(dur, .(Label, Pos), nrow)

You can also create variables on the fly to split the data frame by. For example, I could split the data on various binnings
of

## cut() Duration into 10 bins

ddply(dur, .(DurBins = cut(Duration, 10)), nrow)

## round() Duration to the nearest 1/10th second

ddply(dur, .(DurBins = round(Duration, 1)), nrow)

## transform() and summarize()

Some other useful out-of-the-box functions to use with

### transform()

## Subset of just "zero" in first position

sub01 <- subset(dur, Label == 0 & Pos == 1)

## transform duration to log centered

sub01 <- transform(sub01, Dur_center = log(Duration) - mean(log(Duration)))

After using

What if you wanted to apply this transformation to every subset of the data defined by unique combinations of

## Log center every subset defined by Label:Duration

dur <- ddply(dur, .(Label,Pos), transform, Dur_center = log(Duration) - mean(log(Duration)))

Having applied this transformation, we could see how atypical a given recitation of a number in a given position is compared to some other variable (ideally, we'd use z-scores, or residuals for this). For example, how did my speech rate vary over the length of the recording?

### summarize()

## summarise dur

summarise(sub01, Dur_mean = mean(Duration), Dur_sum = sum(Duration), Dumb = mean(end) - mean(Begin))

Rather than returning the entire original data frame with additional columns,

dur_summary <- ddply(dur, .(Label, Pos), summarize, Dur_mean = mean(Duration), Dur_sum = sum(Duration), Dumb = mean(end) - mean(Begin))

## Arbitrary Functions

### Repetition Example

You can also write your own arbitrary functions that take data frames as arguments, and pass them as arguments to

What if I wanted to test the hypothesis that if I read a number sequence where one of the numbers is immediately repeated (e.g. 123-4566),
the second repetition of the number will be shorter. Let's write a special function to be passed to

codeRepeat <- function(df){ ## Make sure the dataframe is properly ordered df <- df[order(df$Pos),] ## If the difference between two adjacent labels ## is 0, they are the same label. The first label ## cannot have been a repeat. df <- transform(df, Repeat = c(NA, diff(df$Label) == 0)) return(df) }

Now, pass the

## apply the codeRepeat() function for every

## string of numbers, identified by Num

dur <- ddply(dur, .(Num), codeRepeat)

Now that we've got the data coded, it's only a matter of running the proper analysis.

mod <- lm(log(Duration) ~ factor(Pos)*factor(Label)*Repeat, data = dur)

anova(mod)

Florian Schwarz had an excellent suggestion for doing this same analysis quickly with data
that is coded as a factor. If

### Normalization Example

I wrote this function to normalize formant data according to the ANAE method (there is also an ANAE normalization function in the

AnaeNormalize <- function(df, G = 6.896874, formants = c("F1","F2")){ m <- length(formants) n <- nrow(df) S <- sum( log(df[ , formants]) ) / (m*n) F <- exp(G - S) norm_formants <- paste(formants, "norm", sep = "_") for(i in seq(along = formants)){ df[[ norm_formants[i] ]] <- df[[ formants[i] ]] * F } return(df) }

We can test this function out using some data from the SLAAP project's NORM sample data sets.

## read in the data

samp <- read.delim("http://ncslaap.lib.ncsu.edu/tools/norm/downloads/CentralOhioAndTyrone.txt")

## apply the function

samp_norm <- ddply(samp, .(speaker), AnaeNormalize)

## Examine normalization efficacy

## Step 1: Produce means

samp_means <- ddply(samp_norm, .(speaker, vowel.frame), summarize, MF1 = mean(F1), MF2 = mean(F2), MF1_norm = mean(F1_norm), MF2_norm = mean(F2_norm))

## Step 2: Plot them

unnorm <- qplot(-MF2, -MF1, data = samp_means, geom = "text", label = vowel.frame, size = I(2), main = "Unnormalized")+ facet_wrap(~speaker)

norm <- qplot(-MF2_norm, -MF1_norm, data = samp_means, geom = "text", label = vowel.frame, size = I(2), main = "Normalized")+ facet_wrap(~speaker)

print(unnorm)

print(norm)

### Models example

As a demonstration of the various things you can do with

First, I need to define a function.

fitBuckModels <- function(df){ mod <- glm(DepVar ~ Freq.z, data = df, family = binomial) return(mod) }

Next, load the buckeye data.

buck <- read.csv("http://www.ling.upenn.edu/~joseff/rstudy/data/sbuck.csv")

## zscore the frequency

buck$Freq.z <- (buck$Log_Freq - mean(buck$Log_Freq)) / sd(buck$Log_Freq)

I'll store every model fit in a list, so I'll need to use

buck_models <- dlply(buck, .(Speaker), fitBuckModels, .progress = "text")

buck_models[1]

Now, I'll access each of those models, and extract their coefficients.

buck_coefs <- ldply(buck_models, coef)

### Hadley Wickham's Suggestions

Hadley Wickham (author of he

- Extract a subset of the data for which it is easy to solve the problem.
- Solve the problem by hand, checking results as you go.
- Write a function that encapsulates the solution.
- Use the appropriate plyr function to split up the original data, apply the function to each piece and join the pieces back together.