Homework #1: A note on correlation

In the instructions for homework #1, we asked you to take a look at the r2 coefficient from using multiple regression to predict second-formant values in the TIMIT data as a (linear, additive) function of sex and vowel identity.

The recipe that we gave you went like this:

>> [c,cint,r,rint,stats]=regress(b,A);
>> stats(1)
ans =
    0.8534

Some people with their own copies of Matlab found that the regress() function is only available via the statistics toolbox, which costs extra.

In general, you can get access to the various additional toolboxes by using the versions of Matlab installed in the IRCS IGERT lab, in the CETS labs, etc. And in the future, we'll try to warn you when a homework problem requires or suggests a function from one of the add-on toolboxes.

In this case, though, the use of the regress() function is not adding anything that you can't easily do yourself without it -- and you can learn something useful by doing it yourself!

The (in)famous r2 value is just the square of (Pearson's) r, which is just the correlation between the observed values (here of the second formant) and the predicted values.

There are various ways to think about the correlation, but in this context, the easiest way is to recognize it as the inner product of mean-corrected, length-normalized vectors.

The observed values were in our vector b, and (given the predictor matrix A) we can put the predicted values in a vector c by the simple sequence

>> x=A\b;
>> c=A*x;

Now the mean-corrected versions of b and c are just:

>> bb=b-mean(b);
>> cc=c-mean(c);

And then we normalize to length 1:

>> bb=bb/sqrt(bb'*bb);
>> cc=cc/sqrt(cc'*cc);

And then r and r2 are just

>> r=bb'*cc
r =
    0.9238
>> r^2
ans =
    0.8534
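(Not part of the homework recipe, but for reference: the same sequence of steps can be sketched in Python with NumPy. The A, b, and coefficient values below are invented stand-ins, since we don't have the TIMIT vectors here -- only the structure of the computation matters.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins for the predictor matrix A (intercept, sex, vowel)
# and the observed second-formant values b
A = np.column_stack([np.ones(100),
                     rng.integers(0, 2, 100),
                     rng.integers(0, 2, 100)]).astype(float)
b = A @ np.array([1200.0, 300.0, 500.0]) + rng.normal(0, 100, 100)

x, *_ = np.linalg.lstsq(A, b, rcond=None)   # Matlab's x = A\b
c = A @ x                                   # predicted values, c = A*x

bb = b - b.mean()                           # mean-correct
cc = c - c.mean()
bb = bb / np.sqrt(bb @ bb)                  # normalize to length 1
cc = cc / np.sqrt(cc @ cc)

r = bb @ cc                                 # correlation = inner product
print(r, r**2)
```

The inner product of the two unit-length, zero-mean vectors is exactly Pearson's r, so this agrees with what a library correlation function would return.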

As we'll discuss later, r2 is also the "proportion of variance accounted for". If we define the residual d as c-b -- that is, the difference between the predictions and the observations -- then one plausible measure of the quality of the prediction is the relationship between the variance of the residual and the variance of the original data.

If the prediction is perfect, then the residual will have zero variance. But such predictions are normally not perfect -- there's some noise and/or some meaningful variation that is not accounted for by our model. It's natural to scale this residual variance as a proportion of the original variance. This is the proportion of variance NOT accounted for -- one minus this quantity is the proportion of variance accounted for.

We can approach r2 for this example in Matlab as follows, treating the variance as the mean squared difference from the mean:

>> d=c-b;
>> bv=((b-mean(b))'*(b-mean(b)))/length(b);
>> dv=((d-mean(d))'*(d-mean(d)))/length(d);
>> pvuna=dv/bv
pvuna =
      0.1466
>> pva=1-pvuna
pva =
      0.8534

Note that we get the same result by using the built-in variance function var():

>> var(d)/var(b)
ans =
      0.1466

This works even though the built-in var() returns a slightly different value than our manual computation: it divides the sum of squared differences from the mean by N-1 instead of N (for reasons to be discussed later), and that factor cancels when we take the ratio.
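To see that the choice of N vs. N-1 really does cancel in the ratio, here's a small check (in Python/NumPy rather than Matlab, with arbitrary made-up vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(size=50)   # made-up "observed" vector
d = rng.normal(size=50)   # made-up "residual" vector

# Variance dividing by N (like our manual computation)
def var_n(v):
    u = v - v.mean()
    return (u @ u) / len(v)

# Variance dividing by N-1 (like Matlab's var() or NumPy's default ddof=1 in var-like uses)
def var_nm1(v):
    u = v - v.mean()
    return (u @ u) / (len(v) - 1)

# The normalization factor cancels: both ratios are the same
print(var_n(d) / var_n(b), var_nm1(d) / var_nm1(b))
```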

And also note that the function corr() -- which, like regress(), comes from the statistics toolbox -- will compute the correlation r for us directly (in basic Matlab, corrcoef(b,c) does the same job, returning a 2x2 matrix of correlations):

>> corr(b,c)
ans =
      0.9238

However, you should remember that the correlation r is just the inner product of mean-corrected, length-normalized vectors; and that if the two vectors in question are the observed values and the values predicted by a model, then r2 is the "proportion of variance accounted for" by the model.
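This connection -- r2 equals the proportion of variance accounted for -- is exact whenever the predictions come from a least-squares fit that includes an intercept term. A quick numerical check (a Python/NumPy sketch with invented data, not the homework vectors):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented design matrix with an intercept column, plus noisy observations
A = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
b = A @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=200)

x, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares coefficients
c = A @ x                                  # predicted values
d = c - b                                  # residual, as defined above

r = np.corrcoef(b, c)[0, 1]                # correlation of observed and predicted
pva = 1 - d.var() / b.var()                # proportion of variance accounted for
print(r**2, pva)                           # these agree for a least-squares fit
                                           # with an intercept
```

For predictions that do not come from such a fit, the two quantities can differ, which is one reason it's worth knowing what each one actually measures.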

A correlation of 0.92, with 85% of the variance accounted for, is excellent.

However, there is something deeply wrong with this model, at least from a scientific point of view. One way to see this graphically would be a plot like this one:

>> % x(1) = intercept (male /aa/ mean), x(2) = female offset, x(3) = /iy/ offset
>> fmodel=[x(1)+x(2) x(1)+x(2)+x(3)];  % model's predicted female means for /aa/, /iy/
>> freal=[mean(f_aa) mean(f_iy)];      % observed female means
>> mmodel=[x(1) x(1)+x(3)];            % model's predicted male means for /aa/, /iy/
>> mreal=[mean(m_aa) mean(m_iy)];      % observed male means
>> plot(b,c,'ko')                      % observed vs. predicted, all data points
>> hold on
>> plot(freal,fmodel,'rx-', 'LineWidth', 2)
>> plot(mreal,mmodel,'b^-', 'LineWidth', 2)

with results that look like this:

[plot of observed vs. predicted F2 values, with the model's and the data's category means overlaid -- figure not shown]

What does this mean?