Classifying multivariate data

Let's start with the Peterson/Barney vowel formant data:

fid = fopen('pb.Table1');
PB = textscan(fid, '%s %s %d %s %s %f %f %f %f', 10000);
fclose(fid);
%
% 1: m = man, w=woman, c=child
% 2: m=male, f=female
% 3: N=speaker number
% 4: V=vowel name (heed/iy, hid/ih, head/eh, had/ae, hod/aa, hawed/ao, hood/uh, who'd/uw, hud/ah, heard/er)
% 5: IPA
% 6: F0 in Hz
% 7: F1
% 8: F2
% 9: F3
%% Convert column 4 (vowel ID) to integers 1:10 in VID array
% Note that options are 'iy','ih','eh', 'ae','aa','ao','uh','uw','ah','er'
VID = zeros(length(PB{4}),1);
VID = VID + strcmp('iy',PB{4});
VID = VID + 2*strcmp('ih',PB{4});
VID = VID + 3*strcmp('eh',PB{4});
VID = VID + 4*strcmp('ae',PB{4});
VID = VID + 5*strcmp('aa',PB{4});
VID = VID + 6*strcmp('ao',PB{4});
VID = VID + 7*strcmp('uh',PB{4});
VID = VID + 8*strcmp('uw',PB{4});
VID = VID + 9*strcmp('ah',PB{4});
VID = VID + 10*strcmp('er',PB{4});
%
% Make matrix whose rows are vowels, and
% columns are F0, F1, F2, F3:
F0123 = [PB{6} PB{7} PB{8} PB{9}];
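
As a quick sanity check on the conversion (a sketch -- it assumes the whole table loaded cleanly), every row should have received a nonzero vowel ID, and each of the ten vowels should occur equally often:

any(VID==0)        % should be 0: every row matched one of the ten vowel codes
histc(VID, 1:10)   % token counts per vowel -- these should all be equal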

And let's compare two adjacent (and similar) vowels:

Xeh = F0123(VID==3,:);
Xae = F0123(VID==4,:);

On each individual dimension, the two categories overlap heavily. F0 shows very little difference:

XF0 = [Xeh(:,1), Xae(:,1)];
hist(XF0);
legend('/eh/', '/ae/')
xlabel('F0')
title('F0 comparison')

F1 gives us slightly better information, but still not a lot of separation:

XF1 = [Xeh(:,2), Xae(:,2)];
hist(XF1);
legend('/eh/', '/ae/')
xlabel('F1')
title('F1 comparison')

The best threshold on F1 gives us 20% classification error (where 50% is the score you get for guessing one or the other class all the time):

allvals = [Xeh(:,2);Xae(:,2)];
minval = min(allvals); maxval = max(allvals);
tvals = minval:(maxval-minval)/100:maxval;
NP = length(Xeh(:,2))+length(Xae(:,2));
N = length(tvals);
for n=1:N
    cerrorn(n) = sum(Xeh(:,2) > tvals(n)) + sum(Xae(:,2) < tvals(n));
    cerrorp(n) = cerrorn(n)/NP;
end
plot(tvals,cerrorp,'bo-')
title('P-B /eh/ vs. /ae/: Classification Error vs. F1 Threshold')
xlabel('F1 Threshold')
ylabel('Proportion error')

F2 shows a smaller difference:

XF2 = [Xeh(:,3), Xae(:,3)];
hist(XF2);
legend('/eh/', '/ae/')
xlabel('F2')
title('F2 comparison')

And we don't get much from F3:

XF3 = [Xeh(:,4), Xae(:,4)];
hist(XF3)
legend('/eh/', '/ae/')
xlabel('F3')
title('F3 comparison')

But in multivariate space -- especially F1/F2 -- things are better separated:
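
A scatterplot of F1 against F2 (columns 2 and 3 of the formant matrix) shows this; here's a sketch, along the same lines as the /iy/-/ih/ plot later in this section:

plot(Xeh(:,2),Xeh(:,3),'rx',Xae(:,2),Xae(:,3),'bo')
legend('/eh/', '/ae/')
xlabel('First Formant'); ylabel('Second Formant');
title('/eh/ vs. /ae/ from Peterson-Barney');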

How should we find the "right" way to dissect the space into an /eh/ part and an /ae/ part?

In "The Use of Multiple Measures in Taxonomic Problems" (Annals of Eugenics, 1936), Ronald Fisher suggested a way to find a new dimension -- as a linear combination of old dimensions -- that will (in some sense) optimally separate two such clouds of points. His equation for the weighting vector w is

where Σy=0 is the covariance matrix for category 0, and μy=0 is the mean vector for category 0, etc.

The new dimension is "optimal" in the sense that it maximizes the ratio of between-class variance to within-class variance:

S(w) = (w'(μ_{y=0} − μ_{y=1}))^2 / (w'(Σ_{y=0} + Σ_{y=1})w)

(This doesn't mean that it's really the "best" way to separate any two clouds of points, in this case or in general, as we'll learn later. But it's simple and quick, and worth learning about.)

In Matlab terms, applied to the Peterson-Barney /eh/-/ae/ data:

w = inv(cov(Xeh)+cov(Xae))*(mean(Xeh)-mean(Xae))';
Weh = Xeh*w;
Wae = Xae*w;
hist([Weh Wae]);
legend('/eh/','/ae/');
title('Fisher''s Linear Discriminant applied to P-B /eh/ vs. /ae/')
xlabel('Discriminant dimension');
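
As a check on the sense in which this new dimension is "optimal" (a sketch, reusing w, Xeh, and Xae from above), we can compare the between/within variance ratio S for the Fisher direction with the ratio for the F1 axis alone:

% Fisher criterion: S(v) = (v'(mu_eh - mu_ae))^2 / (v'(Sigma_eh + Sigma_ae)v)
S = @(v) (v'*(mean(Xeh)-mean(Xae))')^2 / (v'*(cov(Xeh)+cov(Xae))*v);
S(w)             % the Fisher direction
S([0 1 0 0]')    % the F1 axis alone -- smaller, since w maximizes S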

Here's the plot of classification error vs. threshold:

allvals = [Weh;Wae];
minval = min(allvals); maxval = max(allvals);
tvals = minval:(maxval-minval)/100:maxval;
NP = length(Wae)+length(Weh);
N = length(tvals);
for n=1:N
    cerrorn(n) = sum(Weh < tvals(n)) + sum(Wae > tvals(n));
    cerrorp(n) = cerrorn(n)/NP;
end
plot(tvals,cerrorp,'bo-')
title('P-B /eh/ vs. /ae/: Classification Error vs. Threshold')
xlabel('Threshold')
ylabel('Proportion error')

This gets us down to about 8% classification error, instead of 20% for F1, the best single dimension. (Note that we aren't looking at generalization to new examples yet, just "optimal" classification of examples of known category, based on F0/F1/F2/F3 values.)

Let's apply the same idea to /iy/ vs. /ih/:

Xiy = F0123(VID==1,:);
Xih = F0123(VID==2,:);
plot(Xiy(:,2),Xiy(:,3),'rx',Xih(:,2),Xih(:,3),'bo')
legend('/iy/', '/ih/')
xlabel('First Formant'); ylabel('Second Formant');
title('/iy/ vs. /ih/ from Peterson-Barney');


w = inv(cov(Xiy)+cov(Xih))*(mean(Xiy)-mean(Xih))';
Wiy = Xiy*w;
Wih = Xih*w;
hist([Wiy Wih],30);
legend('/iy/','/ih/');
title('Fisher''s Linear Discriminant applied to P-B /iy/ vs. /ih/')
xlabel('Discriminant dimension');

allvals = [Wiy;Wih];
minval = min(allvals); maxval = max(allvals);
tvals = minval:(maxval-minval)/100:maxval;
NP = length(Wiy)+length(Wih);
N = length(tvals);
for n=1:N
    cerrorn(n) = sum(Wiy < tvals(n)) + sum(Wih > tvals(n));
    cerrorp(n) = cerrorn(n)/NP;
end
plot(tvals,cerrorp,'bo-')
title('P-B /iy/ vs. /ih/: Classification Error vs. Threshold')
xlabel('Threshold')
ylabel('Proportion error')

In this case, we get to about 5% classification error.

In 1958, Frank Rosenblatt introduced an important new way of looking at -- and computing with -- linear discriminants ("The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain", Psychological Review 65(6):386-408). It popularized a striking twist on the semantics of the inner product: the idea was that really, it's a neuron!

Rosenblatt started with the basic equation for a linear discriminant:

f(x) = 1 if w·x + b > 0, and 0 otherwise

He interpreted the summing-up part of the equation as a neuron's cell body summing the inputs from its dendritic arborization; the weights w as synaptic strengths; the bias b as the cell's initial level of activation; and the output non-linearity (which turns the graded value into a binary choice) as the generation of an action potential on the cell's axonal output.
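
In Matlab terms, a minimal sketch of this "neuron" view, reusing the /eh/-vs-/ae/ Fisher weights w and projections Weh, Wae from above, and assuming a bias b halfway between the projected class means (any other threshold would do as well):

b = -(mean(Weh) + mean(Wae))/2;   % assumed bias: midpoint of the projected class means
x = Xeh(1,:)';                    % one /eh/ token, as an F0/F1/F2/F3 column vector
y = (w'*x + b) > 0                % the "action potential": usually 1 for /eh/ tokens, 0 for /ae/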

 

And crucially, he didn't propose to "program" the synaptic weights into the system at the start -- he specified an algorithm to "learn" the weighting vector w (and the bias b) by a trial-and-error process. His method starts with random values, and uses them to classify "training" examples, one at a time, adjusting the perceptron's parameters whenever an example is wrongly classified.

After seeing each example, the weights are adjusted one at a time, according to this equation:

w(j)' = w(j) + α (δ − y) x(j)

where x(j) is the jth input; w(j) is the jth weight; w(j)' is its updated value; y is the neuron's actual output; δ is the correct output; and α is a rate-of-learning parameter.

(The bias parameter is updated in a similar way, with a constant 1 as input.)

In other words, on each step, if the example is correctly classified, the algorithm does nothing; if the classification is wrong, the weights are updated by adding a positively or negatively scaled copy of the input, depending on the direction of the error.
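
Here's a minimal sketch of that procedure in Matlab, applied to the /eh/-vs-/ae/ formant data from above. The learning rate, the number of passes, and the random starting values are assumptions (Rosenblatt's rule only specifies the update itself), and with unscaled formant values the trial-and-error search may or may not settle on a good boundary -- which is part of the point:

X = [Xeh; Xae];                                    % training inputs (F0/F1/F2/F3)
d = [ones(size(Xeh,1),1); zeros(size(Xae,1),1)];   % correct outputs: 1 = /eh/, 0 = /ae/
alpha = 0.01;                                      % assumed rate-of-learning parameter
wp = randn(4,1); bp = randn(1);                    % the perceptron's own weights and bias (not the Fisher w), started at random
for pass = 1:100                                   % assumed number of passes through the data
    for i = randperm(size(X,1))                    % examples in random order
        y = (X(i,:)*wp + bp) > 0;                  % the neuron's actual output
        wp = wp + alpha*(d(i)-y)*X(i,:)';          % w(j)' = w(j) + alpha*(delta - y)*x(j)
        bp = bp + alpha*(d(i)-y);                  % bias updated with a constant 1 as input
    end
end
mean(((X*wp + bp) > 0) ~= d)                       % proportion of training examples misclassified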

An interactive applet to demonstrate the linear perceptron's behavior is here.

This algorithm, in its simplest form, doesn't always converge; and when it converges, it doesn't always generalize very well. But the idea is a powerful one.

(To be discussed: XOR and the Minsky/Papert critique. The non-linear perceptron as a response, and the PDP revolution.  The "kernel trick", SVM, MIRA and other generalizations of discriminant analysis.)

Tristrom Cooke, "Two Variations on Fisher's Linear Discriminant for Pattern Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2): 268-273, 2002.

Koby Crammer, Jaz Kandola, Yoram Singer, "Online Classification on a Budget", NIPS 2004.

David Olmsted, "History and Principles of Neural Networks to 1960".