Problem set #7 -- COGS501

Due 11/23/2011

1. K-means clustering

Download KM1.m, and put it somewhere that Matlab can find it. Execute it to see a simple demo, in which three random clouds of points are generated, mixed, and then separated again by k-means clustering.

Make sure that you understand what is being done at each step, and how Matlab is being made to do it.

1.1. Overlapping sets

What will happen to the recovered clusters as the underlying clouds of points start to overlap? Modify the demo to make this happen (i.e. increase the diagonal terms in the covariance matrices sigma). Demonstrate the results graphically.
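
For example, you might start from something like the sketch below. It assumes that KM1.m builds its clouds with mvnrnd() and keeps the covariance matrices in a cell array sigma, much as in the multivariate setup of section 2.2 below; the particular names and values here are assumptions, not the actual contents of KM1.m.

% Sketch: three clouds whose covariances have been inflated so they overlap.
N = 100;
mu{1} = [50 100]; mu{2} = [100 50]; mu{3} = [70 70];
% Much larger diagonal (variance) terms than in the original demo:
sigma{1} = [400 10; 10 400];
sigma{2} = [400 0; 0 400];
sigma{3} = [400 20; 20 400];
D = [mvnrnd(mu{1},sigma{1},N); mvnrnd(mu{2},sigma{2},N); mvnrnd(mu{3},sigma{3},N)];
idx = kmeans(D, 3);               % recovered cluster for each point
gscatter(D(:,1), D(:,2), idx);    % plot the points colored by cluster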

1.2. Varying k

What if k is not the same as the number of underlying clouds of points? Modify the demo to generate four random clouds rather than three, and then apply the k-means algorithm with k = 3,4,5.

If you like, use the Matlab function kmeans() rather than your own implementation.
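
For instance, here's a sketch of running kmeans() with several values of k on the same mixed data; D is a placeholder name for the two-column matrix of mixed points that your modified demo produces.

% Sketch: cluster the same mixed data with k = 3, 4, 5 and compare the partitions.
figure;
for k = 3:5
    idx = kmeans(D, k);                 % cluster index (1..k) for each point
    subplot(1, 3, k-2);
    gscatter(D(:,1), D(:,2), idx);      % color points by recovered cluster
    title(sprintf('k = %d', k));
end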

2. Expectation-Maximization for Gaussian Mixture Models

Often we think we know the statistical model that generated some observations, or at least a model close enough to the truth to be useful; but the model involves a number of different distributions, and we don't know how to divide up the individual observations among them.

One simple and common example is a "gaussian mixture model" (GMM), where we assume that we've mixed together data drawn from more than one gaussian ("normal") distribution.

If we knew which observation came from which gaussian, we'd just estimate the means and (co)variances in the usual empirical way, and we'd be done. But what if we lack this apparently crucial piece of information?

There's a general approach to solving problems of this type, where the "indicator variable" that assigns observations to distributions is hidden. It's called Expectation-Maximization (EM). The general idea is to start with an arbitrary (or heuristically-estimated) set of parameters for the model, and allow this model to divide up each observation among the available distributions, in proportion to the probability that each distribution assigns to it. We then re-estimate the parameters of the distributions, using a weighted form of the usual empirical estimation method. And then we do the whole thing over again, and again, until the process converges in the sense that the estimated parameters stop changing.

Here's a Matlab example of the simplest possible case: two univariate gaussians.

We generate 100 points from a distribution with mean 10 and SD 20, and 100 points from a distribution with mean 100 and SD 10, and mix them together:

amu(1) = 10; amu(2) = 100; asigma(1) = 20; asigma(2) = 10;
aD = [random('Normal',amu(1),asigma(1),100,1); random('Normal',amu(2),asigma(2),100,1)];
aDp = randperm(200); aD1 = aD(aDp);
hist(aD1, 20);

[Figure: histogram of the mixed data aD1 (20 bins)]

Let's start from some random guesses about the distributions' parameters:

% means:
eamu(1) = random('unif',1,100,1); eamu(2) = random('unif',1,100,1);
% standard deviations:
easigma(1) = random('unif',1,30,1); easigma(2) = random('unif',1,30,1);

And we'll set the "mixing parameters" (the overall proportion of the data that comes from each distribution) to [0.5 0.5].

epi(1) = 0.5; epi(2) = 0.5;

This gives us the following situation:

              Mean 1    Mean 2    SD 1     SD 2     Mix 1    Mix 2
  True         10        100       20       10       0.5      0.5
  Estimated    21.18      55.97     7.21    16.93    0.5      0.5

Now for the "E(xpectation) step": we start by calculating the probability of each data point under each distribution, given the current parameter estimates:

for c=1:2
    paD1{c} = pdf('norm', aD1, eamu(c), easigma(c));
end

And then the weight or degree of responsibility that each distribution has for each data point:

for c=1:2
    w{c} = (epi(c)*paD1{c})./(epi(1)*paD1{1}+epi(2)*paD1{2});
end

This allows us to do the "M(aximization) step", where we use these weights to re-estimate the parameters:

% New means:
for c=1:2
    eamu(c) = sum(w{c}.*aD1)/sum(w{c});
end
% New sigmas:
for c=1:2
    x = aD1-eamu(c);
    easigma(c) = sqrt(sum(w{c}.*(x.*x))/sum(w{c}));
end
% New mixing proportions:
for c=1:2
    epi(c) = sum(w{c})/200;
end

This gives us:

              Mean 1    Mean 2    SD 1     SD 2     Mix 1    Mix 2
  True         10        100       20       10       0.5      0.5
  Estimated    14.09      75.45    11.43    43.14    0.32     0.68

And from the next few iterations we get:

               Mean 1    Mean 2    SD 1     SD 2     Mix 1    Mix 2
  True          10        100       20       10       0.5      0.5
  Step 2        11.18      75.54    13.5     41.9     0.29     0.71
  Step 3         8.47      76.6     15.0     39.3     0.30     0.70
  Step 4         6.62      80.13    16.2     35.6     0.33     0.67
  Step 5         6.14      84.28    16.8     31.0     0.36     0.64
  Step 6         7.05      88.68    17.2     26.1     0.40     0.60
  Step 7         8.75      93.14    17.7     20.4     0.44     0.56
  Step 8        10.81      97.14    18.7     13.9     0.48     0.52
  Step 9        12.35      99.39    19.6      8.9     0.50     0.50
  ...            ...        ...      ...      ...      ...      ...
  Final (12)    12.50      99.54    19.8      8.5     0.50     0.50

Though the final parameter values are not exactly the same as the true generating parameters, they actually assign a slightly higher likelihood to the observations than the "true" parameters do.
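
You can check that claim directly by comparing the log likelihood of the mixed data under the two parameter sets, along these lines (a sketch using the variables defined above; the true mixing proportions are 0.5 and 0.5):

% Log likelihood of the mixed data under the true generating parameters:
llTrue = sum(log(0.5*pdf('norm',aD1,amu(1),asigma(1)) + ...
                 0.5*pdf('norm',aD1,amu(2),asigma(2))));
% Log likelihood under the final EM estimates:
llEst = sum(log(epi(1)*pdf('norm',aD1,eamu(1),easigma(1)) + ...
                epi(2)*pdf('norm',aD1,eamu(2),easigma(2))));
[llTrue llEst]    % in this run, llEst comes out slightly higher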

That's really all there is to it, in essence.

Except for a lot of traps and tricks...

You have to be careful that the probability calculations and the weighted re-estimation formulas are correct, obviously.

And the iteration will often reach a local optimum, which in the case of complex models may be quite far away from the global optimum, so that choice of an appropriate starting point -- or sampling a number of randomly-chosen starting points -- may be important.
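
For reference, here's one way the E and M steps above might be wrapped in a single loop, starting from the random initial guesses and running until the estimated means stop moving. This is a sketch only; the convergence test, tolerance, and iteration cap are arbitrary choices, not part of the demo above.

maxiter = 100; tol = 1e-6;
for iter = 1:maxiter
    oldmu = eamu;
    % E step: how responsible is each gaussian for each data point?
    for c=1:2
        paD1{c} = pdf('norm', aD1, eamu(c), easigma(c));
    end
    for c=1:2
        w{c} = (epi(c)*paD1{c})./(epi(1)*paD1{1}+epi(2)*paD1{2});
    end
    % M step: weighted re-estimation of means, SDs, and mixing proportions
    for c=1:2
        eamu(c) = sum(w{c}.*aD1)/sum(w{c});
        x = aD1-eamu(c);
        easigma(c) = sqrt(sum(w{c}.*(x.*x))/sum(w{c}));
        epi(c) = sum(w{c})/length(aD1);
    end
    if max(abs(eamu-oldmu)) < tol   % converged: the means have stopped changing
        break
    end
end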

2.1 Extension to 3 univariate gaussians

To be sure that you understand how the process works, extend the example above to the case of a mixture of three (rather than two) one-dimensional "normal" distributions. That is, generate new fake data as a mixture of three gaussians, and estimate their parameters using EM.
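
For the data-generation step, something along these lines would do (the means, SDs, and sample sizes below are just placeholder choices):

% Sketch: fake data from a mixture of three univariate gaussians.
bmu = [10 60 120]; bsigma = [15 10 20]; n = 100;
bD = [random('Normal',bmu(1),bsigma(1),n,1); ...
      random('Normal',bmu(2),bsigma(2),n,1); ...
      random('Normal',bmu(3),bsigma(3),n,1)];
bD1 = bD(randperm(3*n));    % mix the three samples together
hist(bD1, 30);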

2.2 Extension to multivariate gaussians

Apply EM to learn the parameters of a mixture of three two-dimensional Gaussians, using the same underlying parameters as in the k-means example in (1).

You might set it up like this:

% Generate 100 random datapoints from each of 3 multivariate Gaussians:

N = 100;

mu{1} = [50 100]; sigma{1} = [40 10; 10 70];
D1 = mvnrnd(mu{1},sigma{1},N);

mu{2} = [100 50]; sigma{2} = [50 0; 0 40];
D2 = mvnrnd(mu{2},sigma{2},N);

mu{3} = [70 70]; sigma{3} = [50 20; 20 30];
D3 = mvnrnd(mu{3},sigma{3},N);

% Combine into one matrix, and permute the rows randomly

Dall = [D1;D2;D3]; Dp = randperm(3*N); Dall1 = Dall(Dp,:);

If we knew the true category (i.e. distribution) of each datapoint (and we do!), here's how we'd estimate the underlying parameters:

X = [ones(100,1); 2*ones(100,1); 3*ones(100,1)];
Cats = X(Dp);

for c=1:3
    emu{c} = mean(Dall1(Cats==c,:));
    esigma{c} = cov(Dall1(Cats==c,:));
end

And we could compare the estimated values to the original ones, say this way:

for c=1:3
    [mu{c} emu{c}]
    [sigma{c} esigma{c}]
end

and expect to see something like this:

for c=1:3
   [mu{c} emu{c}]
   ans =
        50.0000 100.0000 50.8483 100.1348
   [sigma{c} esigma{c}]
   ans =
        40.0000 10.0000 31.5826 -5.8977
        10.0000 70.0000 -5.8977 66.8212

   [mu{c} emu{c}]
   ans =
         100.0000 50.0000 98.9983 49.9481
   [sigma{c} esigma{c}]
   ans =
        50.0000 0 55.5541 -7.4531
        0 40.0000 -7.4531 43.0792

   [mu{c} emu{c}]
   ans =
        70.0000 70.0000 69.3533 68.9214
   [sigma{c} esigma{c}]
   ans =
         50.0000 20.0000 69.4983 27.2761
         20.0000 30.0000 27.2761 31.9697
 end

Now estimate the parameters of the model (three 2D gaussians and three mixing proportions) using EM, without knowing which datapoint comes from which distribution.

One key step in the process is estimating the probability of a particular datapoint given an assumed multivariate gaussian -- Matlab has a convenient function for doing this, mvnpdf(), which can be used in place of pdf() in the univariate case.
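
For instance, the E step for three estimated 2D gaussians might look roughly like this. This is a sketch: here emu, esigma, and epi stand for your current guesses at the means, covariances, and mixing proportions, not the known-category estimates computed above.

% Sketch of the multivariate E step, with mvnpdf() in place of pdf():
for c=1:3
    p{c} = mvnpdf(Dall1, emu{c}, esigma{c});   % density of each point under gaussian c
end
denom = epi(1)*p{1} + epi(2)*p{2} + epi(3)*p{3};
for c=1:3
    w{c} = (epi(c)*p{c}) ./ denom;             % responsibility of gaussian c for each point
end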

Here's an example of using mvnpdf() to produce an interesting plot:

for c=1:3
   Y{c} = mvnpdf(Dall1, mu{c}, sigma{c});
end
plot3(Y{1},Y{2},Y{3},'rx')

Here we've used the parameters of the true distributions, the ones we used to generate the random data.

2.3. Comparison to k-means

It should be clear that (in the simple demo case, anyhow) the k-means classification of each point is closely analogous to the most-probable mixture component assigned by the Gaussian Mixture Model (GMM) to that point.

Will these classifications always be congruent? [Hint: no.]

Work out a case where there should be a large difference between k-means classification and EM "responsibility", and demonstrate it, assuming that the parameters of the GMM are known in advance and do not need to be estimated. [Hint: the k-means algorithm classifies points by plain Euclidean distance to the class centroids; but the gaussians in a GMM might have very different covariances.]
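
As a starting point, here is a sketch of one such setup (the particular means and covariances are arbitrary assumptions): two gaussians whose means are a fixed distance apart, but whose covariances differ wildly.

% Sketch: a tight gaussian next to a very diffuse one.
N = 200;
cmu{1} = [0 0];  csigma{1} = [1 0; 0 1];       % tight, isotropic
cmu{2} = [10 0]; csigma{2} = [100 0; 0 100];   % broad, diffuse
D = [mvnrnd(cmu{1},csigma{1},N); mvnrnd(cmu{2},csigma{2},N)];
% A point about 4 units out from [0 0] toward [10 0] is still nearer the
% first centroid, but is far more probable under the second, broader
% gaussian -- which is exactly the kind of disagreement to demonstrate.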

2.4. Effects of overlapping categories

Modify the simple demo case to make the categories overlap (a little, or a lot). What happens to the EM GMM estimates?

2.5 Varying number of gaussians

Modify the simple demo case to make the number of gaussians used in generating the data different from the number estimated by EM. Explore the results graphically.

3. Theory

Review Devdatt Dubhashi's lecture notes on k-means and GMMs. Be sure you understand the relationship to what you've done in these exercises.