Problem Set #5 -- COGS501

Read in the Peterson-Barney vowel data and compute the Fisher discriminant for the /ih/-/iy/ contrast, as in the lecture notes.
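
In case you want a reminder of that computation, here is one minimal sketch. It assumes Xih and Xiy are matrices holding the /ih/ and /iy/ measurements (one row per token, e.g. F0 F1 F2 F3); the version in the lecture notes may differ in details such as which columns you use:

% Fisher discriminant direction for /ih/ vs. /iy/
mih = mean(Xih); miy = mean(Xiy);       % class means
Sw = cov(Xih) + cov(Xiy);               % pooled within-class scatter (up to a scale factor)
w = (Sw \ (mih - miy)')';               % Fisher weighting (row vector)
Wih = Xih*w'; Wiy = Xiy*w';             % projections onto the Fisher dimension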

1. Extract a subset of this data that you know to be linearly separable, i.e. a subset for which a threshold on the Fisher dimension separates the two classes. If w is the weighting for the Fisher discriminant, and thresh is the threshold value that you found produces the minimum classification error, then you might do this as follows:

% Which class is above the threshold?
Vih = mean(Wih)>mean(Wiy);
%
if(Vih)
    whichih = Wih>thresh;
    whichiy = Wiy<thresh;
else
    whichih = Wih<thresh;
    whichiy = Wiy>thresh;
end

Then the separable subset of Xiy will be Xiy(whichiy,:), and the separable subset of Xih will be Xih(whichih,:).

Since the perceptron "bias" is equivalent to an input that is always on, add a column of ones to your input matrices. Concatenate your data into a single N-by-5 matrix (where N is the number of points in your linearly separable subset of Peterson-Barney /ih/ and /iy/).

Make a binary vector of target values, representing the true classification of the rows of your data matrix. It doesn't matter which is 0 and which is 1.

Initialize a five-element weight vector with zeros.
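
Putting the previous three steps together, one possible sketch (using the variable names from above, and arbitrarily coding /ih/ as 1 -- any consistent choice will do; the value of niter here is just a placeholder):

% separable subsets, with a column of ones appended for the bias input
Sih = [Xih(whichih,:) ones(sum(whichih),1)];
Siy = [Xiy(whichiy,:) ones(sum(whichiy),1)];
XX = [Sih; Siy];                                    % N-by-5 data matrix
TT = [ones(size(Sih,1),1); zeros(size(Siy,1),1)];   % targets: /ih/=1, /iy/=0
Nxx = size(XX,1);                                   % number of separable points
W = zeros(1,5);                                     % initial weight vector
niter = 100000;                                     % maximum number of passes through the data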

(At this point, you might want to read the section in the MATLAB help files on the perceptron learning rule, or the Wikipedia entry on perceptron learning, though this is not crucial).

Now if your data matrix is XX, your target vector is TT, your weight vector is W, and there are Nxx points in the separable subset of your data, the "classical" perceptron learning algorithm looks like this:

for iter=1:niter
    ngood = 0;
    for(d=1:Nxx)
       p = XX(d,:);
       a = (W*p' > 0);
       ngood = ngood+(a==TT(d));
       % update weights
       W = W + (TT(d)-a)*p;
    end
    if(mod(iter,10)==0)
       fprintf('%d correct on iteration %d\n',ngood, iter);
    end
    if(ngood == Nxx)
       break
    end
end

Note that TT(d)-a is the "error" -- it'll be -1, 0, or 1 depending on the outcome of the attempt at classification.

I've set this loop up so that it prints out the number of correct classifications every 10th time through the loop. (It might be better to print the number of errors, since then you don't have to know what the number of cases is...)

Run the loop for a while and watch what happens. The usual thing is that it gets close pretty fast, but may take quite a long time to finally converge. (You may want to change the code to print the results every 100th or every 1000th time through the loop.)

There are various things that you could try. One might be to present the training examples in a randomized order, rather than all the members of one class followed by all the members of the other class. Another might be to normalize the length of the description vectors, to avoid problems with outliers. With both of those innovations, the loop might look like this:

order = randperm(Nxx);
for iter=1:niter
    ngood = 0;
    for(n=1:Nxx)
       d = order(n);
       p = XX(d,:);
       a = (W*p' > 0);
       ngood = ngood+(a==TT(d));
       % normalized weight update:
       W = W + (TT(d)-a)*(p/sqrt(p*p'));
    end
    if(mod(iter,100)==0)
       fprintf('%d correct on iteration %d\n',ngood, iter);
    end
    if(ngood == Nxx)
       break
    end
end

Does this converge faster in the case at hand?

2. Now apply the perceptron that you have trained to the data that was left out when you created the separable set. How well does it work?
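
One way to do this, again assuming the variable names from problem 1 (the left-out points are the rows where whichih and whichiy are false):

% left-out points, with the bias column appended
Lih = [Xih(~whichih,:) ones(sum(~whichih),1)];
Liy = [Xiy(~whichiy,:) ones(sum(~whichiy),1)];
Xout = [Lih; Liy];
Tout = [ones(size(Lih,1),1); zeros(size(Liy,1),1)];
aout = (Xout*W' > 0);                 % perceptron decisions on the left-out data
pctout = 100*mean(aout == Tout)       % percent correct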

3. Pick half of your separable data at random, and train a perceptron to classify it. How well does the result work on the other half of the data?
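
A minimal sketch of the split, assuming XX, TT, and Nxx from problem 1:

% random half split of the separable data
perm = randperm(Nxx);
half = floor(Nxx/2);
Xtrain = XX(perm(1:half),:);      Ttrain = TT(perm(1:half));
Xheld = XX(perm(half+1:end),:);   Theld = TT(perm(half+1:end));
% train W as in problem 1, with Xtrain/Ttrain in place of XX/TT, then:
pctheld = 100*mean((Xheld*W' > 0) == Theld)   % percent correct on the held-out half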

4. (Extra Credit) Try out the ideas in

Tristrom Cooke, "Two Variations on Fisher's Linear Discriminant for Pattern Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2): 268-273, 2002.

In particular, compare generalization performance on split halves of the original (full) data set.

5. (Extra Extra Credit) An innovation that helps deal with noisy non-separable data, and may sometimes speed up convergence on separable data, is the "pocket algorithm" of Gallant (1990). Implement it and try it on this problem (as in 1 above). Does it converge faster?

Try it on the original data (i.e. the non-separable set), and compare its performance (in terms of convergence rate and classification accuracy) with the basic perceptron algorithm, and also with the Fisher Discriminant.
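
For reference, here is one rough sketch of the pocket idea: keep a "pocket" copy of the weights that has survived the longest run of consecutive correct classifications seen so far, and report the pocketed weights rather than the current ones. This is the basic version, without Gallant's "ratchet" refinement, and it assumes the XX, TT, Nxx, and niter variables from problem 1 (note that each iteration here processes a single randomly chosen example rather than a full pass through the data):

W = zeros(1,5);  Wpocket = W;
run = 0;  runpocket = 0;
for iter=1:niter
    d = ceil(Nxx*rand);            % pick a training example at random
    p = XX(d,:);
    a = (W*p' > 0);
    if(a == TT(d))
        run = run + 1;
        if(run > runpocket)        % current weights have the longest run yet: pocket them
            Wpocket = W;
            runpocket = run;
        end
    else
        W = W + (TT(d)-a)*p;       % ordinary perceptron update
        run = 0;
    end
end
% evaluate the pocketed weights, e.g.:  100*mean((XX*Wpocket' > 0) == TT)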