This is how I went about answering the challenge questions from yesterday. If you got the same answer in a different way, that’s totally fine.

1. Style and Gender

Create a plot to see whether style has an effect on the proportion of “In”. Do men and women behave differently in different speech styles?

To do this, I’ll first re-load the ing data from yesterday, and the packages that I’ll need to manipulate and plot the data.

library(devtools)
library(grammarOfVariationData) 
library(dplyr)
library(ggplot2)
head(ing)
##       Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 54    going     In   careful progressive         vowel   f   2     Irish
## 55   giving     In   careful progressive         vowel   f   2     Irish
## 56 upcoming    Ing   careful   adjective         vowel   f   2     Irish
## 57    going     In   careful progressive         vowel   f   2     Irish
## 58 fighting    Ing   careful  participle        apical   f   2     Irish
## 59    going    Ing narrative progressive         vowel   f   2     Irish
##    Prop      prop
## 54  0.5 0.5065847
## 55  0.5 0.5065847
## 56  0.5 0.5065847
## 57  0.5 0.5065847
## 58  0.5 0.5065847
## 59  0.5 0.5065847

First, I’ll create a summary data frame using dplyr. Here, I’m taking the ing data frame, splitting it up according to style and gender, and calculating the proportion of In tokens in each style for each gender.

style_gend_ing <- ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = mean(DepVar == "In"))
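The `mean(DepVar == "In")` idiom works because the comparison returns a logical vector, and R counts TRUE as 1 and FALSE as 0 when averaging. Here is a minimal sketch on a made-up data frame (`toy_ing` and its values are invented for illustration, not taken from the real ing data):

```r
library(dplyr)

# Hypothetical toy data, not from the real ing data set
toy_ing <- data.frame(
  Style  = c("careful", "careful", "careful", "careful", "narrative", "narrative"),
  Sex    = c("f", "f", "m", "m", "f", "f"),
  DepVar = c("In", "Ing", "In", "In", "Ing", "Ing"),
  stringsAsFactors = FALSE
)

# DepVar == "In" is TRUE/FALSE; its mean is the proportion of TRUE values
toy_summary <- toy_ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = mean(DepVar == "In")) %>%
  ungroup()

toy_summary
```

Each row of `toy_summary` holds the share of “In” tokens for one Style and Sex combination, which is the same shape as style_gend_ing.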

Next, I’ll plot it. Yesterday we didn’t assign the outputs of our ggplot() functions to a variable, but in general practice, I prefer to do so. This allows me to call up graph objects later without having to re-run the code. I’ll save this plot as style_gend_plot, since it plots style and gender.

As you can see, I’ve added some totally unnecessary cosmetic arguments to my plot. Within geom_bar, I set the opacity to 70% with alpha = .7 for no reason at all, and added a black border around the bars with color = "black" because I like it better. Note that if you’re saving your graphs as objects in your workspace, you have to call them in order to see them in the plotting window.

style_gend_plot <- ggplot(style_gend_ing, aes(x = Style, y = Prop*100, fill = Sex)) + 
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender") +   ## legend comes from fill, so label fill
  theme_bw()

style_gend_plot      ## have to call the plot separately

One point that this plot brings up about geom_bar() defaults is that if there is only one fill category for a given value on the x-axis (e.g., only females talking about language, as we see here), then ggplot will plot that bar twice as wide, which looks terrible.

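A solution to this problem is to manually add an “m” row to the summary data frame, with a proportion value of 0. This way ggplot will plot a bar with height 0, instead of the weirdly proportioned single bar that we see above. Note that c() coerces the mixed values in the new row to character, which is why the plot call below wraps Prop in as.numeric().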

style_gend_ing <- ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = (mean(DepVar == "In")))

style_gend_ing[16,] <- c("language", "m", 0)       ## add in row for males talking about language

style_gend_plot <- ggplot(style_gend_ing, aes(x = Style, y = 100*(as.numeric(Prop)), fill = Sex)) + 
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender", 
       title = "Percent 'IN' by Speech Style and Gender") +
  theme_bw() 

style_gend_plot

This seems to have fixed the problem. Of course, it’s worth remembering that a bar of height 0 is actually a little misleading here, since it implies that males used “In” 0% of the time when they were talking about language. In reality there is no value there at all (an NA), because we don’t have any tokens from males talking about language. ggplot’s default makes this gap apparent by widening the remaining bar. I don’t like that option because I think it looks bad, but creating a dummy row isn’t necessarily ideal either, since it is misleading.
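Adding the dummy row by a hard-coded index (row 16) only works if the summary has exactly 15 rows. As an alternative, here is a sketch that uses complete() from the tidyr package, which is not used elsewhere in this post, so treat it as an assumption about your setup. The data frame `toy_summary` is made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical summary with one missing Style/Sex combination
toy_summary <- data.frame(
  Style = c("careful", "careful", "language"),
  Sex   = c("f", "m", "f"),
  Prop  = c(0.5, 0.4, 0.2),
  stringsAsFactors = FALSE
)

# complete() adds a row for every Style x Sex combination that is absent.
# fill = list(Prop = 0) reproduces the dummy-zero approach;
# leaving fill out would give NA instead.
toy_filled <- toy_summary %>%
  complete(Style, Sex, fill = list(Prop = 0))

toy_filled
```

Because no character vector is assigned into the data frame, Prop stays numeric and needs no as.numeric() in the plot call. If the summary is still grouped after summarise(), calling ungroup() before complete() is probably the safer route.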

Some solutions might include:

  1. Adding “NA” above the male-language bar, so readers know that there is no data there
  2. Adding total n’s above each bar, so readers know how many tokens these proportions are based on in the first place

I prefer 2, since I think it’s most informative, but there’s always a trade-off between informativity and unnecessarily cluttering the graph. This takes another step in the dplyr chain, to add the number of tokens for each category. We can accomplish this with length(), which we saw in base R, or with n(), a dplyr-specific function that returns the number of rows (here, tokens) in each group.
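To see that the two approaches agree, here is a small sketch on invented data (`toy_ing` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, not from the real ing data set
toy_ing <- data.frame(
  Style  = c("careful", "careful", "careful", "narrative"),
  DepVar = c("In", "Ing", "In", "Ing"),
  stringsAsFactors = FALSE
)

toy_counts <- toy_ing %>%
  group_by(Style) %>%
  summarise(Count_n      = n(),             # dplyr helper: rows in the current group
            Count_length = length(DepVar))  # base R: length of the group's column

toy_counts
```

Both columns hold the same token counts per style; n() just saves you from naming a column.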

style_gend_ing <- ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = (mean(DepVar == "In")),
            Count = n())                              ## add in token count

style_gend_ing[16,] <- c("language", "m", 0, 0)       ## add in row for males talking about language

style_gend_plot <- ggplot(style_gend_ing, aes(x = Style, 
                                              y = 100*(as.numeric(Prop)), 
                                              fill = Sex, 
                                              ## define label for geom_text
                                              label = paste("n=", Count, sep = ""))) +  
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  ## dodge to match columns, vertical adjust to be above them.
  geom_text(position = position_dodge(width = .9), vjust = -0.5, size = 3) + 
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender", 
       title = "Percent 'IN' by Speech Style and Gender") +
  theme_bw() 

style_gend_plot
## ymax not defined: adjusting position using y instead

Okay so that was supposed to be the easy challenge - but of course the problem with making nice graphs is that there’s endless room for tinkering!

2. Add error bars

Add error bars to the grammatical category by gender plot.

This documentation of geom_errorbar and this Stack Overflow question on calculating standard error will be helpful. Hint: answers with more upvotes are generally better.

This problem required a little bit of online searching for solutions. Just adding geom_errorbar() to your original ggplot call will give you an error message - so that means you have to check out the documentation of geom_errorbar() to find out what’s missing. You can see that geom_errorbar() requires x, ymax, and ymin to be specified. Furthermore, we can see in the example plot that ymax is specified by already having the values for standard error in the data frame.
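As a minimal sketch of what geom_errorbar() expects, here is a plot built from made-up numbers (`toy_means` and its columns are hypothetical), with ymin and ymax already stored in the data frame:

```r
library(ggplot2)

# Hypothetical values; ymin/ymax are precomputed columns in the data frame
toy_means <- data.frame(
  group = c("a", "b"),
  value = c(10, 20),
  se    = c(1, 3)
)
toy_means$ymin <- toy_means$value - toy_means$se
toy_means$ymax <- toy_means$value + toy_means$se

toy_plot <- ggplot(toy_means, aes(x = group, y = value)) +
  geom_bar(stat = "identity") +
  # geom_errorbar() needs x (inherited here), ymin, and ymax
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2)

toy_plot
```

Here x is inherited from the main ggplot() call, so only ymin and ymax have to be mapped inside geom_errorbar().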

Okay - so this means we need to add the standard error into our summary data frame. No problem, we’ll just ask Google how to find the standard error of the mean in R. That will take us to the Stack Overflow thread about this problem, where we learn that there is no se() function in base R, but that we can easily make our own function. (This brings up an excellent point about Stack Overflow, by the way: it’s not always the first answer that’s the best answer for you!)

So I’ll copy that function that Ian Fellows supplied, and use it to calculate the standard error in my summary data frame.

std <- function(x) sd(x)/sqrt(length(x))     ## define the function std()

gram_gend_ing2 <- ing %>%
  group_by(GramStatus, Sex) %>%
  summarise(Prop = mean(DepVar == "In"),
            SE = std(DepVar == "In")) %>%    ## use std() to calculate standard error
  mutate(ymax = Prop + SE,                   ## add ymax and ymin into data frame
         ymin = Prop - SE)

This code adds another layer onto the dplyr chain, with mutate. One of the amazing features of dplyr is that because of chaining, you can calculate a new variable in your summary data frame with summarise (as I did above with Prop and SE), and then turn around and immediately use that new variable to calculate another variable (here, calculating ymax and ymin using Prop and SE).

head(gram_gend_ing2)
## Source: local data frame [6 x 6]
## Groups: GramStatus
## 
##   GramStatus Sex       Prop         SE      ymax       ymin
## 1  adjective   f 0.07692308 0.03731317 0.1142363 0.03960990
## 2  adjective   m 0.18750000 0.10077822 0.2882782 0.08672178
## 3     during   f 0.71428571 0.18442778 0.8987135 0.52985794
## 4     during   m 1.00000000 0.00000000 1.0000000 1.00000000
## 5     gerund   f 0.27500000 0.07149951 0.3464995 0.20350049
## 6     gerund   m 0.23287671 0.04981147 0.2826882 0.18306524
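As a quick sanity check on what std() returns for TRUE/FALSE data, this sketch compares it with the algebraically equivalent expression sqrt(p * (1 - p) / (n - 1)), where p is the proportion of TRUE values. The vector `toy_tokens` is invented for illustration:

```r
# Same definition as above, repeated so this snippet runs on its own
std <- function(x) sd(x) / sqrt(length(x))

# Hypothetical tokens: TRUE = "In", FALSE = "Ing"
toy_tokens <- c(TRUE, FALSE, TRUE, FALSE)

p <- mean(toy_tokens)      # proportion of "In"
n <- length(toy_tokens)    # number of tokens

se_from_std     <- std(toy_tokens)
se_from_formula <- sqrt(p * (1 - p) / (n - 1))

all.equal(se_from_std, se_from_formula)
```

This also explains the SE of 0 in the during/m row: when every token in a group has the same value, sd() is 0, so the error bar collapses even though the group may be tiny.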

Looks great, now we just have to plot it! I prefer percentages to proportions, because I think they’re easier to read, so I just need to multiply my proportions by 100 to get the percent. I could have changed this in the underlying data frame (preferable, if I were to continue working on ing), but it’s also possible just to adjust it in the ggplot call:

gram_ing_plot <- ggplot(gram_gend_ing2, aes(x = GramStatus, y = Prop*100, fill = Sex)) + 
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymax = ymax*100, ymin = ymin*100), 
                position=position_dodge(width=.9), width=.5) +
  labs(x = "\nGrammatical Category", y = "Percent 'In'", fill = "Gender",
       title = "Rates of 'In' by Gender and Grammatical Category \n") + 
  scale_fill_brewer(breaks = c("f", "m"),
                    labels = c("Female", "Male"),
                    palette = "Paired") +
  theme_bw()

gram_ing_plot

3. Plot categorical values and add a summary statistic

Plot the rate of “In” over time (using Age as a proxy for DOB)

age_ing <- ing %>%
  group_by(Age) %>%
  summarise(Prop = mean(DepVar == "In"))

age_ing_plot <- ggplot(ing, aes(x = Age, y = DepVar, color = Sex, group = Sex)) + 
  #geom_bar(stat = "identity", position = "dodge") +
  geom_point(position=position_jitter(width=.1, height=.02), alpha=.05) + 
  stat_smooth()

age_ing_plot
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
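The age_ing summary is computed above but the plot itself is drawn from the raw tokens. As one possible alternative, here is a sketch of plotting a per-age summary directly. It uses invented data (`toy_ing` is hypothetical) and assumes Age is a small integer code, as the head(ing) output suggests:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; Age is assumed to be a small integer code
toy_ing <- data.frame(
  Age    = c(1, 1, 2, 2, 3, 3, 3, 3),
  DepVar = c("In", "In", "In", "Ing", "Ing", "Ing", "In", "Ing"),
  stringsAsFactors = FALSE
)

# One row per age group, with the proportion of "In" tokens
toy_age <- toy_ing %>%
  group_by(Age) %>%
  summarise(Prop = mean(DepVar == "In"))

toy_age_plot <- ggplot(toy_age, aes(x = Age, y = Prop * 100)) +
  geom_line() +
  geom_point() +
  labs(x = "Age group", y = "Percent 'In'") +
  theme_bw()

toy_age_plot
```

This trades the jittered raw tokens and smoother for one point per age group, which is easier to read but hides how many tokens each point is based on.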