Week 5-6

# Week 5-6

 Home <- Week 4 Week 7->

## References

Dalgaard 2008
Johnson 2008
Gelman and Hill 2008

## Continuous Data

The topic this week will be what to do with continuous data as the response, or dependant variable. We'll look at correlation and simple regression to see what to do when your predictor, or independant variable is also continuous, and at ANOVA and Multiple Regression for when they are categorical.

## Correlation

I'm going to walk through the description of the Pearson product-moment correlation coefficient (r) from Johnson 2008, because it makes it clear why standard correlation measures rely upon assumptions of linearity and normality.

First, load the ISwR library (or install it if you haven't already), and attach the rmr data frame. If you look at it with summary(), you can see that it data on the body weight and energy consumption of a few subjects. Let's plot it with body weight on the x-axis.

library(ISwR)
attach(rmr)
summary(rmr)
plot(body.weight,metabolic.rate)

So, they look very correlated. To measure correlation, we want to see if the magnitude of changes in x are closely related to the magnitude of changes in y. To do this by hand, first subtract out the mean from the x and y vectors:

c.body.weight <- body.weight-mean(body.weight)
c.metabolic.rate <- metabolic.rate-mean(metabolic.rate)

Now we have the distance of every point from the mean of their distribution. Next, we'll take a z-score by dividing their distances from the mean by the standard deviations of their distributions. This now gives us a score of every point's relative location within their distribution, making the data from these two rather different distributions more comparable (z-scores are very useful)

z.body.weight <- c.body.weight / sd(body.weight)
z.metabolic.rate <- c.metabolic.rate / sd(metabolic.rate)

Now, we multipy the z-score of xi by the z-score of yi. If x and y are positively correlated, then this should produce a vector of large positive numbers. If they are negatively correlated, this should produce a vector of large negative numbers. We'll sum this vector, and then divide it by n-1, producing r.

sum(z.body.weight*z.metabolic.rate)/(nrow(rmr)-1)

Viola! Of course, R has a function for this: cor().

cor(body.weight, metabolic.rate)

But keep in mind that calculations of means and standard deviations goes into determining the correlation coefficient. If one or both distributions are not normal, then you may come back with a weaker correlation than there really is. In this case, by inspecting the density, or by comparing the mean and median of body weights, we can see that the data is a little left skewed. Apparently body weight is sometimes lognormal. The correlation of log(body weight) and metabolic rate isn't much higher, but it is a little.

cor(body.weight, metabolic.rate) ## == 0.7442379
cor(log(body.weight), metabolic.rate) ## == 0.7493597 ### Correlation Matrix

If you give cor() a matrix of values, it will return a matrix of correlations between the columns. We can do this with the data we were just looking at.

cor(rmr)

This is very boring since there are only two values. Columns are, of course, prefectly correlated with themselves, and the two sides of the matrix will be symmetrical.

I've been playing around with ways to visualize correlation matrices, and here's a result. The following graph is based on mean F1 values for the vowels listed along the axes for 61 speakers from the Inland North. The darker the square, the larger the absolute correlation. ### Correlation Significance

To determine whether a correlation value is significantly different from 0, you can use cor.test().

cor.test(body.weight, metabolic.rate)

### Non-parametric Tests

Two non-parametric measures of correlation are Spearman's &rho and Kendall's &tau . There aren't separate tests for these. Rather you specify the method argument in cor.test() as either "spearman" or "kendall".

cor.test(body.weight, metabolic.rate,method = "spearman")
cor.test(body.weight, metabolic.rate,method = "kendall")

## Simple Regression

If you want to know what the relationship between two continuous variables is, then you want to do a linear regression. The output of a linear regression is essentially a formula:

yi = &alpha + &beta xi + &epsiloni
which can be read as "a token of the response variable y can be given as a constant (or intercept) &alpha plus &beta times the predictor value, plus some residual error term." The linear regression process trys to minimize the size of the residual error term, maximizing how much of y can be described by x plus some constant.

One example of a formula like this that I find really tractable is temperature conversion. To convert Celcius to Farenheit, the formula is

α=32
β=1.8
ε=0
F° = 32 + (1.8*C°) + 0
For any temperature in Celcius, this formula will return the temperature in Farenheit. Since the conversion is deterministic, the error term is 0.

Now, think about the conversion formula as a prediction instead. In a sense, the formula predicts what temperature a thermometer should give in Farenheit if you know what it gives in Celcius. Similar predictive formulas can be made to relate other things together, like height to weight, or age to fundamental frequency. Of course, there is a little bit of noise in the real world, so the formula prediction won't always be exactly right. How big will the difference be between the predicted (or fitted) value and the observed value be? This difference is given by ε, and it is assumed that the values of ε for every observation will be normally distributed around 0.

To do this in R, we will use the lm() function. Let's look at a linear regression of the body weight and metabolic rate data.

lm(metabolic.rate~body.weight)

The first argument to lm() is a formula. Here, it means "explain metabolic rate in terms of body weight". The simple printed output of lm() is pretty sparse, with just the intercept (&alpha = 811.23) and body weight effect (&beta = 7.06). However, lm() actually produces a complex data object on which quite a few different functions can be performed, and from which quite a few elements can be extracted.

mod <- lm(metabolic.rate~body.weight)
summary(mod)

Using summary() on the object produces a much richer printed output. From top to bottom:

Call: lm(formula = metabolic.rate ~ body.weight)

This is the command that was given to lm()

Residuals: Min 1Q Median 3Q Max -245.74 -113.99 -32.05 104.96 484.81

This is a 5 number summary of the regression residuals. The residuals are the ε for every data point. That is, the size of the difference from the observed value to the value given by the formula intercept+(slope*body.weight). The residuals should be normally distributed. Here, it looks like they're somewhat left skewed.

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 811.2267 76.9755 10.539 2.29e-13 *** body.weight 7.0595 0.9776 7.221 7.03e-09 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

These are the regression coefficients along with their standard errors, and their significance as determined by the t-distribution. The way to understand these significance values is that the intercept and effect of body weight are significantly non-zero. For this model, it's not too surprising that the intercept is significantly different from 0, but occasionally, you may come across a model where the intercept is not significantly different from 0. There, you may consider eliminating the intercept from the model (forcing the fitted line to pass through (0,0)) by adding -1 to the model formula.

The significance of the predictor variable also means that its coefficient is significantly non-zero.

Residual standard error: 157.9 on 42 degrees of freedom

This is a measure of the variance of the residuals.

Multiple R-squared: 0.5539, Adjusted R-squared: 0.5433

The multiple R-squared is simply the square of Pearson's r, or the output of cor(). The adjusted R-squared accounts for the sample size and some other things, and can be read as "the % of variance reduction."

F-statistic: 52.15 on 1 and 42 DF, p-value: 7.025e-09

This F statistic tells you whether or not there is a significant predictor in your data. For this simple regression, it's kind of pointless. However, in a larger regression model with more predictors, none of their coefficients may be significant, but if the F test returns a significant result, then there is some significant effect being accounted for.

You can plot a simple regression line with abline().

plot(body.weight, metabolic.rate)
abline(mod)

The fitted regression line can be given by the formula

metabolic.rate = 811.2267 + 7.0595*body.weight
Residual error terms can be understood as the distance from the data points to the line along the y-axis. We can add this in.

segments(body.weight,fitted(mod),body.weight,metabolic.rate) You would hope that the size of residual errors would be independant of the predictor variable, and that they would be normally distributed. We can do both of these checks by plotting the fitted values by residual values, and by producing a Q-Q plot of the residuals.

plot(fitted(mod),resid(mod)
qqnorm(resid(mod))
qqline(resid(mod))

### Confidence and Prediction Lines

It's possible to produce confidence and prediction bands for the regression line using predict(). Using predict() with no other arguments will give the same output as fitted(). We can also produce two kinds of confidence intervals for these fitted points:

• confidence: our confidence in the current line
• prediction: our confidence of the range where future measurements from this population will lie.
The prediction intervals will be wider than the confidence intervals.

The tricky thing is that the intervals will be calculated for the x-values as they were in the original data frame. In this case, we were lucky enough that the x values were ordered, but they typically won't be. Here's what do do about that.

##create a new data frame with ordered values
##and the same variable name
pred.frame <- data.frame(body.weight = seq(min(body.weight),max(body.weight),length = 100))
##give pred.frame to newdata
conf.int <- predict(mod, newdata = pred.frame, int = "c")

You can see that we've gotten back a matrix with 3 columns. The first is the fitted values, the next is lower end of the interval, and the third is the higher. We can add this to a plot of the regression line with matlines(). When you give matlines() two matrices, it treats the first as the x matrix and the second as the y matrix. It plots the ith column of the x matrix by the ith column of the y matrix. If one of them is just a vector, then it plots all the columns of the matrix against the vector. Here, we'll give the body weight vector from pred.frame as x, and the different interval matrices as the y. matlines() has some pretty strange default col and lty settings, so we'll adjust them by hand here.

plot(body.weight, metabolic.rate)
abline(mod)
matlines(pred.frame\$body.weight, conf.int, lty = c(1,2,2),col = "black")
pred.int <- predict(mod, newdata = pred.frame, int = "p")
matlines(pred.frame\$body.weight, pred.int, lty = c(1,2,2), col = "black")

## Categorical Predictors / ANOVA

If you have a continuous response variable, and categorical predictors, you will want to do a regression with these categorical predictors, and an Analysis of Variance, or ANOVA.

### ANOVA

If you have a sample of continuous data labeled for two or more groups, you want to know whether there are any significant differences between groups in the data. It can be shown that if there is no difference between groups, that is, they have the same μ and σ, then the variance within these groups ≈ the variance between the groups ≈ the sample variance. If the variance between groups is significantly > the variance within groups, then there is a significant group effect on the response variable.

ANOVA's are stupid easy to do in R. I'm sorry I don't have some other, better data immediately available. For now we'll look at the red.cell.folate data from Dalgaard.

summary(red.cell.folate)
anova(lm(folate~ventilation, data = red.cell.folate))
Analysis of Variance Table Response: folate Df Sum Sq Mean Sq F value Pr(>F) ventilation 2 15516 7758 3.7113 0.04359 * Residuals 19 39716 2090

And there you go. In the printed output, the named row is the between groups variance, and the row named Residuals is the within groups variance. Looks as if there is a just about significant group effect here.

### Regression and Contrasts

As you can see, ANOVA in R is carried out on a linear regression object. We can use the lm object itself to look at the size of the differences between groups.

levels(red.cell.folate\$ventilation)
summary(lm(folate~ventilation, data = red.cell.folate))
... Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 316.62 16.16 19.588 4.65e-14 *** ventilationN2O+O2,op -60.18 22.22 -2.709 0.0139 * ventilationO2,24h -38.62 26.06 -1.482 0.1548 ...

As you can see from the levels of ventilation, there are three groups in the data, which I will just refer to as A, B, and C. A is the default, or control group, and B and C are the experimental, or treatment groups. The intercept in the lm summary is the mean of group A, which can be confirmed with

g.means<-tapply(red.cell.folate\$folate, red.cell.folate\$ventilation,mean)
g.means

The effect sizes for B and C are the differences of the mean of B and C from A. The p values for B and C indicate whether they are significantly different from A.

g.means-g.means
g.means-g.means

It's important to understand the use of contrasts in linear regressions to interpret the output. By default, R uses treatment contrasts, where the first level of a factor is taken to be the defaut or control, and then the effect sizes reported are the differences of the treatments from the default. Note, this is really only acceptable when there is some level which can be considered the default, or at least only when you are concerned how some levels differ from a baseline. Here is what treatment contrasts look like:

contr.treatment(c("A","B","C"))
B C A 0 0 B 1 0 C 0 1

This means that when doing a linear regression where a factor with levels A, B, and C is given as a predictor, two dummy variables are generated. Dummy variable B is 1 where the factor == B, 0 elsewhere, and dummy variable C is 1 where the factor == C, 0 elsewhere.

Frequently in linguistic research, there is no level which could be considered the default. For example, when looking at the effect of different following contexts on vowel height, it's not always possible to determine which context is "untreated." Rather, it's possible that all contexts have some effect on the intended vowel target. For this, kind of situation, you might want to use sum contrasts:

contr.sum(c("A","B","C"))
[,1] [,2] A 1 0 B 0 1 C -1 -1

This changes the dummy variables used in the regression. Here, variable [,1] is 1 when the factor == A, -1 when the factor == C, and 0 elsewhere, and variable [,2] is 1 when the factor == B, -1 when the factor == C, and 0 elsewhere.

Although it's probably inappropriate, let's do a regression on the red.cell.folate data with sum contrasts.

summary(lm(folate~ventilation, data = red.cell.folate,contrasts = list(ventilation = "contr.sum")))
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 283.69 10.06 28.188 <2e-16 *** ventilation1 32.94 13.73 2.400 0.0268 * ventilation2 -27.25 13.37 -2.038 0.0557 .

The interpretation of regression output with sum contrasts is a little different. Here, the intercept will be some kind of abstract mean, and the regression coefficients will represent the difference of the factor level from that mean. The first coefficient is the size of the effect of the first level, and the second the size of the second level. The effect size of the third level is the sum of the first two *-1

mod<-lm(folate~ventilation, data = red.cell.folate,contrasts = list(ventilation = "contr.sum"))
mod\$coef
sum(mod\$coef[2:3])*-1

The p-values here indicate whether the coefficient is significantly different from the mean given in the intercept.

**Note** See my recontrast.r script page for usage notes on a script I wrote to manage contrasts. It is a lot easier than what I've done here, and it'll return meaningful names in the model output, not numbered variables.

#### The Lesson

Make sure your levels are in the right order if you're going to do a regression, with the default or control group first. If you can't assume there is some untreated group, consider using sum contrasts.

##### Why the fuss?

Remember above, where we thought about linear regression in terms of temperature conversion? That formula involved multiplying degrees in Celcius by 1.8. If we were looking at data on the pitch of people's voices, and we wanted to see how the person's sex affected their pitch, how would we write the prediction formula? What is (2*Male)? Categorical predictors like this are only meaningful when compared to something. If you haven't thought through how you are making these comparisons (setting up the contrasts), then you cannot really understand what the coefficients for each categorical predictor mean.

#### Pairwise Testing

If you want to test all possible differences between groups, then you want to use pairwise.t.test(). Note: you don't want to do a test for every pair of groups, because the more tests you do, the more likely you are to get an random p < 0.05 result (Type I error). pairwise.t.test() corrects for the fact repeated tests are being done.

pairwise.t.test(red.cell.folate\$folate, red.cell.folate\$ventilation)
Pairwise comparisons using t tests with pooled SD data: red.cell.folate\$folate and red.cell.folate\$ventilation N2O+O2,24h N2O+O2,op N2O+O2,op 0.042 - O2,24h 0.310 0.408 P value adjustment method: holm

### Multiple Regression

Frequently when you draw data from some other source than an experiment, the range of possible predictors has not been constrained. Regression analyses can be used to filter through the possible predictors to see which are reliable. Gelman and Hill 2008 is a rather comprehensive guide to using regression models this way. We'll just play around with some toy examples here.

Dalgaard demonstrates how to do multiple regression model fitting with data on cystic fibrosis.

summary(cystfibr)

pemax is the response variable we're interested in. To begin with, let's fit a regression with all the predictor variables just in the order they're in the dataframe.

lm(pemax~age+sex+height+weight+bmp+fev1+rv+frc+tlc,data = cystfibr)->mod
summary(mod)
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 176.0582 225.8912 0.779 0.448 age -2.5420 4.8017 -0.529 0.604 sex -3.7368 15.4598 -0.242 0.812 height -0.4463 0.9034 -0.494 0.628 weight 2.9928 2.0080 1.490 0.157 bmp -1.7449 1.1552 -1.510 0.152 fev1 1.0807 1.0809 1.000 0.333 rv 0.1970 0.1962 1.004 0.331 frc -0.3084 0.4924 -0.626 0.540 tlc 0.1886 0.4997 0.377 0.711 Residual standard error: 25.47 on 15 degrees of freedom Multiple R-squared: 0.6373, Adjusted R-squared: 0.4197 F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

There is no single predictor with a significant coefficient, but the F test at the bottom has a significant value. This means that this model is significantly better than a model with no predictors. That is, these predictors significantly reduce variance. To see how the predictors reduce variance directly, we'll use anova()

anova(mod)
Analysis of Variance Table Response: pemax Df Sum Sq Mean Sq F value Pr(>F) age 1 10098.5 10098.5 15.5661 0.001296 ** sex 1 955.4 955.4 1.4727 0.243680 height 1 155.0 155.0 0.2389 0.632089 weight 1 632.3 632.3 0.9747 0.339170 bmp 1 2862.2 2862.2 4.4119 0.053010 . fev1 1 1549.1 1549.1 2.3878 0.143120 rv 1 561.9 561.9 0.8662 0.366757 frc 1 194.6 194.6 0.2999 0.592007 tlc 1 92.4 92.4 0.1424 0.711160 Residuals 15 9731.2 648.7

What this ANOVA table says is that after age has been added as a predictor, no other predictor significantly reduced the variance of the model. It's possible to compare two models, one of which is a subset of the other, to see if one is significantly better than another.

lm(pemax~age,data = cystfibr)->mod1
anova(mod1,mod)
Analysis of Variance Table Model 1: pemax ~ age Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc Res.Df RSS Df Sum of Sq F Pr(>F) 1 23 16734.2 2 15 9731.2 8 7002.9 1.3493 0.2936

What this ANOVA table tells is is that after adding age, none of the other predictors put together significantly improve the model.

However, this is partially because age was added to the model first. It may be possible that some other predictor would reduce the variance of the model even more if added first, but since age was added, this other predictor doesn't reach significance. This introduces the need for model search. There are two ways to go about it. First, you could add or remove variables by hand and check models against eachother. This is what Dalgaard goes through. The benefit of this is that you can apply some logical order to the process according to what you know about the variables you're looking at.

The other option is to do a stepwise regression using step(). They way step works is it first fits a model with no predictors. Then, it sees which predictor would significantly improve the model the most. Then it adds it. Then, it iterates through this process, adding and subtratcting predictors until neither adding nor subtracting predictors would significantly improve the model.

step(lm(pemax~1,data = cystfibr),pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc)->mod2 summary(mod2)

Apparently, stepwise regressions were very in vogue a number of years ago, but now within the statistics community they're less so.

### Polynomial Regressions and Interactions

We'll do these during logistic regression I think.