Problem Set #1 -- COGS 501

Due Monday 9/21/2011

This is an simple exercise in linear regression. The data that we'll use are formant frequencies taken from the analysis of human speech. Formants are concentrations
of energy in frequency and time, which show up as dark bands on speech
spectrograms, and convey information about the size and shape of a speaker's
mouth and throat. (We'll learn a bit more about spectrograms and formants later in the course.)

In particular, formants vary both with the overall size of the speaker's vocal tract and with the speech sound being produced. Our dataset (from the TIMIT database) consists of second formant values at the midpoints of two vowels - /iy/ as in "beet" and /aa/ as in "hot" - spoken by both men and women. We expect the formant values to differ systematically because the mouth shapes for /iy/ and /aa/ are different, and also because, on average, men have larger vocal tracts than women do. (The values may also vary for other reasons, such as dialect differences, the effects of phonetic context, speech rate, measurement error -- these values come from an automatic formant tracker, so there are definitely some measurement errors -- and the phase of the moon...)

Download the file f2data.mat to your computer. You can do this by right-clicking on the link in this browser window (control-clicking on a Mac), and choosing "Save link as" or the equivalent. Put it somewhere that is accessible to Matlab (or Octave). (You can make a folder accessible by using File>>Set Path in Matlab.)

Follow the instructions below, creating a Matlab m-file that executes the needed instructions. This should be a file named something like PS1.m, in a folder accessible to Matlab. Matlab's built-in editor is probably a good way to create and modify this m-file. You can execute it in Matlab by the command PS1 (or whatever your file's name is).

You may want to start your file with something like

format short;
format compact;
clc
echo on

which will allow you to watch your commands as they execute.

You can use Matlab comments -- lines starting with % -- to put in section headings, as well as for adding observations and questions.

If you want to arrange for the execution of your file to pause at a certain point, use the pause command, which will wait for you to type a key before continuing.

(When you send in your solution by email as an attached file, it'll be helpful if you name the file something like MyNamePS1.m, so that we can keep track of whose solutions are whose.)

Start by loading f2data.mat into Matlab:

load 'f2data.mat'

You should have four column vectors f_iy, f_aa, m_iy and m_aa containing second formant values ('f' for female, 'm' for male).

1. Plot histograms of the four vectors to see how the values are distributed. (If you don't know how to plot a histogram in Matlab, open a help window and search for "histogram".)

2. Calculate the means of the distributions. What is the relation between
the /iy/ means and the /aa/ means? What is the relation between the women's
and men's means?

We will now try to predict the formant values on the basis of the speaker's
sex and the vowel being produced. Obviously the prediction can't be perfect,
because we are only using two pieces of information male vs. female and /iy/ vs. /aa/, and we know that each of the four vectors contains a range of values.

3. First concatenate the four vectors into a single long vector b:

b = [f_iy; f_aa; m_iy; m_aa];

Look at the histogram of b. Can you still distinguish the four underlying
distributions in the combined histogram?

4. Now set up the prediction matrix A. It will have one row corresponding to
each formant value in b, and three columns. One of the columns is for the
male-vs.-female coefficient, and another is for the /iy/-vs.-/aa/ coefficient. In each of these columns, we'll call one of the values 0 (male or /aa/) and the other value 1 (female or /iy/). This is an essentially arbitrary choice -- you can explore this for yourself by trying it the other way around -- but in this case, we've assigned 1 to the feature value that tends to make F2 bigger, so that we'll get positive values rather than negative ones.

Given that we've set things up this way, we need another column for a constant term, because male /aa/ vowels don't have a second formant value of 0. (Of course, we could have set things up so that male, female, /aa/ and /iy/ are all separate features, which just happen to be distributed in a funny way; or assigned 1 to females and 2 or males; or etc. You should think about the consequences, if any, of these other choices -- and a good way to think about such things is to try them out...)

Each row of A represents one second-format value. If we use the first column for the constant term, the second column for the speaker's sex, and the third column for the choice of vowel, we'll have:

male /aa/ 1 0 0
male /iy/ 1 0 1
female /aa/ 1 1 0
female /iy/ 1 1 1

There are lots of ways to set up A. For example, the command

A = ones(length(b),3);

creates a matrix of the right dimensions, full of ones.

We want the column for the constant term to be all ones, so we can leave the first column alone.

In the second column, we want to assign 1 for female and 0 for male. That means we need to change the last length(m_iy)+length(m_aa) values in the second column to zeros (remember the order in which we concatenated the vectors?). Here's one way to express this in Matlab-ese:

A(length(A)-(length(m_iy)+length(m_aa))+1:length(A),2) =
zeros(length(m_iy)+length(m_aa),1);

Perform the corresponding operation for the third column. You might want to
do it in two (or more) steps, assigning various lengths and sums of lengths to separate variables, so that it's clear to you what's going on.

A different approach would be to assemble A out of vectors of ones and zeros of the appropriate length. However you create it, scan A when you're done to be sure that you got it right!

5. Now we want to solve for the effects of sex and vowel choice.

Our trivial little model is trying to predict the F2 values in b by adding up three values: the constant term, the effect of being female (or not), and the effect of being /iy/ (or not). If we call the vector of predicted F2 values c, then we want to find a three-element column vector x such that

c = Ax

creates a vector c that is as close to b as possible. Each element of c will be created as the inner product of the corresponding three-element row of A and x -- that is, the sum of the constant term, the effect of being female, and the effect of being /iy/.

Obviously c can't be exactly the same as b, unless all the values coded in each category were the same -- and the histograms you plotted earlier showed you that this isn't true here.

But if the relationship were exact, we'd have a set of equations like

2300 = constant + female + iy
1500 = constant + female
2000 = constant + iy

And we all learned to solve such sets of equations back in junior high school, say by a sequence of manipulations like

constant = 1500 - female
2300 = (1500 - female) + female + iy = 1500 + iy
iy = 800

and so on, to learn that the constant term is 1200, the female effect is 300, and the /iy/ effect is 800 (in this hypothetical example!).

In matrix form, we can represent this as the single equation

Or more compactly,

r = Sx

This looks like an equation that we can solve for x just by dividing both sides by S, and in fact in Matlab, we can do exactly that:

r = [2300 1500 2000]';
S = [1 1 1; 1 1 0; 1 0 1];
S\r
ans =
        1200
         300
         800 

The reason we need to use the backslash (called "matrix left divide") is that matrix multiplication is not commutative (i.e. AB is not in general the same as BA), and so we need to distinguish "dividing on the left" from "dividing on the right", and in this case we need to "divide on the left".

This is neat, but the trouble is, our original data was not so simple. The first couple of values for each of the categories in our little sub-example were actually:

2571 = constant + female + iy  # female /iy/
2190 = constant + female + iy
1397 = constant + female        # female /aa/
1355 = constant + female
2108 = constant + iy               # male /iy/
1591 = constant + iy

So there's no assignment of values to the coefficients x that will actually make

c = Ax

come out with c = b , because the 270 elements of c can only take on four possible different values -- we have what is called an "overdetermined" equation.

However, some choices of x produce better approximations to our data than others do. If we define a metric for the approximation error -- say, the sum of the squares of the differences between corresponding elements of c and b -- then we can find the value for x that minimizes this error.

Happily, the same Matlab "matrix left divide" operator that solves the exact matrix equation will also give us the "best" value for x, in the least-squares sense, if we just execute:

x=A\b;

We'll learn more later about some other ways of thinking about what's happening here.

For now, let's also observe that we can do the same thing with a function call

x=regress(b,A)

The function regress() will also tell us more about how well the prediction fits the data:

[c,cint,r,rint,stats]=regress(b,A);

Stats is a four element vector whose first element is the r-squared coefficient that you may have encountered elsewhere (and you'll encounter it again in this course!) The closer to 1, the better the fit.

The third element is the "p value" that is often quoted in statistical analyses. Values close to zero are good. Look at stats(1) and stats(3). By these standards, the prediction appears to work pretty well. (If your results aren't so great, maybe you set A up wrong!)

However, that doesn't mean that the linear model is "the truth" about the
data. Let's look at prediction errors.

6. The predicted value for the data in f_iy is

[1,1,1]*c

Plot a histogram and take the mean of f_iy - [1,1,1]*c. Does the prediction
seem to be off in a systematic way? Look at the corresponding distributions
for the other three cases. In what ways are they similar or different? Why do you think that is?

Extra Credit

If the preceding exercise was (too) easy for you, try this one.

Download pb.Table, which contains the data from the famous paper G.E. Peterson and H.L. Barney, "Control Methods Used in a Study of the Vowels", JASA 23 (1), 1951.

The data table contains 9 tab-separated columns, which represent F0 and formant measurements taken from 10 vowels pronounced twice (in the context h__d) by each of 33 men, 28 women and 15 children:

1: m = man, w=woman, c=child
2: m=male, f=female
3: N=speaker number
4: V=vowel name (heed/iy, hid/ih, head/eh, had/ae, hod/aa, hawed/ao, hood/uh, who'd/uw, hud/ah, heard/er)
5: IPA
6: F0 in Hz
7: F1
8: F2
9: F3

After stripping off the first line of the file (which gives column labels), you can read this data into Matlab using something like:

fid = fopen('pb.Table');
PB = textscan(fid, '%s %s %d %s %s %f %f %f %f',10000);
fclose(fid);

What sorts of (regression) modeling might you use to try to figure out what is going on in this data set?

Try some of them.