More on Subspace Methods

Background

In the discussion of early color vision, we explored the way that multi-dimensional optical spectra are projected by the human visual system into a three-dimensional subspace. Human retinal cones have three light-sensitive pigments, each with a different spectral sensitivity, and thus the encoding of colors in the retina by the responses of these three pigments can only span a three-dimensional space. Despite the fact that the cell responses are highly non-linear, the overall system behaves in a remarkably linear way, at least when examined in classical color-matching experiments.

We can focus on two different aspects of this situation. One is dimensionality reduction: human perception of colors "lives" in a space with only three dimensions, even though electromagnetic radiation (in the visible-light region of the spectrum) has many more degrees of freedom than that, so that lights with very different spectra can be "metamers". Another is change of coordinates: the natural dimensions of early color vision are not the frequencies of visible light, but the spectral sensitivities of the retinal pigments.

In dealing with any sort of multi-dimensional data, it is often helpful to find a new coordinate system, different from the one in which the data initially presents itself. This new coordinate system might be chosen because its basis vectors have particularly useful mathematical properties, independent of the specific properties of the data set being analyzed. As we saw earlier, this is the case for Fourier coefficients: because sinusoids are eigenfunctions of linear shift-invariant systems, a set of coordinates with sinusoids as basis vectors reduces the effect of an LSI system to element-wise multiplication rather than convolution. "Wavelets" have similar properties, adding the benefit of locality in time (and/or space) as well as frequency.

Alternatively, we might choose our new coordinate system in a data-dependent way, so as to optimally match some structure found in the data itself. We might do this because the result allows us (or more likely, a computer algorithm) to "see" patterns in the data more clearly; we might also do it because in the new coordinate system, we can represent the data more succinctly, throwing away many or even most of the apparent degrees of freedom in the original representation. (See here for some examples with synthetic data.)

This lecture presents a general framework for thinking about data-dependent choices of coordinate systems.

One warning: although these techniques are widely applicable and can be very effective, they are also very limited. Given a body of observations described as vectors, they can provide a new description in terms of vectors whose elements are linear combinations of the elements of the original vectors. (There are also a few pre-processing steps that can sometimes help to make such techniques work better, such as subtracting averages to center the data on the origin, or "sphering" the data to remove correlations.)

There are many ways to do this, and often the techniques are spectacularly effective. It's easy, all the same, to create or find data sets whose intrinsic structure is not revealed by choosing a new set of basis vectors, and indeed in some cases the result of the kind of analysis we are considering here may be to introduce unpleasant artifacts that obscure what is really going on. We'll close the lecture with an example of this type.

The "Fourier Expansion" equation

We start with a set of observations expressed as vectors, typically with a large number of dimensions (i.e. each vector has a lot of elements). This could be anything: pictures of faces, EEG recordings, acoustic signals from different microphones, etc. The general idea is to find a "natural" set of coordinates for this body of data, and to use this to produce a reduced-dimensional representation by linear projection into a subspace. Location in the subspace is then used as a basis for measuring similarity among observations, assigning them to categories, or whatever. The motivation for the dimension reduction may be efficiency (less memory, faster calculations), or better statistical estimation (e.g. of distributions by observation type), or both.

A new set of coordinates (for an N-dimensional vector space) is equivalent to a set of N orthonormal basis vectors b1 ... bN. We can transform a vector x to the new coordinate system by multiplying x by a matrix B whose rows are the basis vectors: c = Bx. We can think of the elements of c as the values of x on the new dimensions -- or we can think of them as "coefficients" telling us how to get x back as a linear combination of the basis vectors b1 ... bN.

In matrix terms, x = B'c, since the basis vectors will be columns in B', and the expression B'c will give us a linear combination of the columns of B' using c as the weights. (Note: to keep the gifs down, we're using ' to mean transpose in the body of the text here).

So far this is completely trivial -- we've shown that we can represent x as x = B'Bx, where B is an orthogonal matrix (and hence its inverse is its transpose).
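In code, here is a minimal numpy sketch of this change of coordinates, using a randomly chosen orthonormal basis purely for illustration:

    import numpy as np

    # A minimal sketch: build a random orthonormal basis B for R^N via QR,
    # transform a vector x into the new coordinates, and transform back.
    rng = np.random.default_rng(0)
    N = 5

    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))  # Q has orthonormal columns
    B = Q.T                                           # rows of B are the basis vectors b_1 ... b_N

    x = rng.standard_normal(N)

    c = B @ x          # new coordinates ("coefficients"): c = Bx
    x_back = B.T @ c   # x = B'c -- reconstruction from the coefficients

    assert np.allclose(x, x_back)   # B'Bx = x, since B is orthogonal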

We can re-express this in non-matrix terms as

       x = \sum_{i=1}^{N} (x'b_i) b_i        (Equation 1)

This equation is known as the Fourier expansion. In actual Fourier analysis, the basis vectors bi are sinusoids, established independently of the data being modeled -- but nothing about equation 1 requires this choice. In this lecture, we're interested in cases where we choose the basis vectors in a way that depends on the data we're trying to model -- and therefore (we hope) tells us something interesting about the structure of the phenomenon behind the measurements.

For a given orthonormal basis, the coefficients are unique -- there is a unique way to represent an arbitrary x as a linear combination of the new basis vectors B. The representation is also exact -- ignoring round-off errors, transforming to the new coordinates and then transforming back gets us exactly the original data.

Let's spend a minute on the relationship of equation 1 to the matrix form x = B'Bx. The subexpression Bx, which gives us the vector of "coefficients" (i.e. the new coordinates), is equivalent to the subexpression x'b_i in equation 1, which gives us a single coefficient for each value of the index i. Spend a couple of minutes to make sure that you understand why these two expressions mean the same thing.

If we truncate Equation 1 after p terms (p < N), we get

       \hat{x} = \sum_{i=1}^{p} (x'b_i) b_i        (Equation 2)

We are thus approximating x as a linear combination of some subset of orthogonal basis vectors -- we are projecting it into a p-dimensional subspace. Obviously each choice of p orthonormal vectors gives a different approximation. Within this general range of approaches, there are a lot of alternative criteria that have been used for choosing the set of basis vectors. Oja (Subspace Methods of Pattern Recognition, 1983) gives six:

  1. minimum approximation error (least average distance between x and x hat)
  2. least representation entropy
  3. uncorrelated coefficients (with each other)
  4. maximum variances of the coefficients
  5. maximum separation of two classes in terms of distance in a single subspace
  6. maximum separation of two classes in terms of approximation errors in two subspaces

Criterion no. 4 gives rise to Principal Components Analysis (PCA). Criterion no. 1 gives rise to the Karhunen-Loeve Transform.

There are others in the more recent literature as well, for example statistical independence of the coefficients, which gives rise to Independent Components Analysis (ICA). And Canonical Correlation Analysis, originally proposed by Hotelling in 1936, has recently been revived at Penn, e.g. in Lyle Ungar, "A vector space model of semantics using Canonical Correlation Analysis", 2013.

Principal Components Analysis

For Principal Components Analysis, the orthonormal basis set is given by the eigenvectors of the covariance matrix of the observation vectors x. Since a covariance matrix is real and symmetric, it's guaranteed that the eigenvectors can be chosen orthonormal.

Here is a sketch of why this works. For the first basis vector, we want to find a linear combination of the elements of observation vectors x that will have maximum variance, in other words to find a fixed weighting vector u such that

E((x'u - E(x'u))^2) = maximum (across all observations x)

With a little algebra, this develops as:

E((u'(x-Ex))^2) = u'E((x-Ex)(x-Ex)')u = u'Ru

where R is the covariance matrix of x, and we want u'u =1 (so that u is of unit length). The solution will be the unit eigenvector of R with the highest eigenvalue (for which the value of u'Ru will obviously be that eigenvalue).
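Here is a hedged numpy sketch of this recipe on made-up data: form the covariance matrix R of the centered observations, take its unit eigenvector with the largest eigenvalue, and check that the variance of the projection x'u is that eigenvalue:

    import numpy as np

    # A sketch of PCA as described above: the first principal direction is the
    # unit eigenvector of the covariance matrix R with the largest eigenvalue.
    rng = np.random.default_rng(1)

    # Some synthetic correlated data: m observations in N dimensions.
    m, N = 1000, 3
    A = rng.standard_normal((N, N))
    X = rng.standard_normal((m, N)) @ A.T        # rows are observation vectors x

    Xc = X - X.mean(axis=0)                      # subtract E(x) to center the data
    R = (Xc.T @ Xc) / m                          # covariance matrix R = E((x-Ex)(x-Ex)')

    vals, vecs = np.linalg.eigh(R)               # eigh: R is real and symmetric
    u = vecs[:, np.argmax(vals)]                 # unit eigenvector with the largest eigenvalue

    # The variance of the projection x'u is u'Ru, which equals that eigenvalue.
    proj_var = np.var(Xc @ u)
    print(proj_var, vals.max())                  # these two numbers should (nearly) agree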

Eigenfaces

This is a simple (but surprisingly effective) application of PCA by Turk and Pentland ("Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 3(1), 1991).

We start with m q x q pixel images, each arranged as an n = q^2-element vector by stringing the pixel values out in row order, starting from the upper left of the image. This is the standard way of treating images as vectors. Call these image vectors G1 ... Gm. Note that n will be fairly large: e.g. if q = 256, n = 65,536.

Calculate the "average face" F -- just the element-wise average pixel value -- and subtract it out from all the images, producing a set of mean-corrected image vectors H1 ... Hm = G1-F ... Gm-F.

Calculate the covariance matrix C of the mean-corrected images. If L is an n x m matrix whose columns are the images H1 ... Hm, then C = (1/m) LL' (an n-by-n matrix). The normalization by 1/m is not essential -- look through the development below and see if you can figure out why.

Calculate the SVD of C = USU'. The columns of U will be the n-element eigenvectors ("eigenfaces") of the face space. The coefficients of an image x will be calculated by taking the inner product of the columns of U with x. This is an instance of equation 2 above, given that we truncate the expansion at p < n eigenvectors.

Digression: it may be too costly to do this directly, since if n = 65,536, then C is 65,536 x 65,536. As a practical expedient, we can take another route to getting (some of) the eigenvectors. Basically, instead of taking the SVD of C = LL', we take the SVD of L'L (which is only of size m x m, where m is the number of images in our "training set," say a few thousand). Then consider the following:

L'L = WTW'
LL'LL' = LWTW'L'
(LL')^2 = (LW)T(LW)'

In other words, if w is a column of W (an eigenvector of L'L with eigenvalue t), then Lw is an eigenvector of LL' with the same eigenvalue, since LL'(Lw) = L(L'L)w = t(Lw). So the columns of LW, normalized to unit length, are the eigenvectors of LL' that we need -- and we only had to decompose an m x m matrix to get them.

Finally, form a face subspace by taking the p most significant eigenvectors. The projection coefficients for any face Gi are just the inner products of the (mean-corrected) face with these eigenvectors. A compact representation of Gi is just its p projection coefficients.

We can then define a similarity metric as Euclidean distance in the subspace, and use this metric for recognition.
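Putting the pieces together, here is a hedged numpy sketch of the pipeline described above, using the m x m trick from the digression (the array `faces` and the function names are hypothetical, not Turk and Pentland's code):

    import numpy as np

    def eigenfaces(faces, p):
        # `faces` is assumed to be an (m, n) array: m training images, each
        # flattened to an n-element vector in row order (n = q*q).
        m, n = faces.shape
        F = faces.mean(axis=0)                      # the "average face"
        L = (faces - F).T                           # n x m matrix of mean-corrected images H_i

        # Decompose the small m x m matrix L'L instead of the n x n matrix LL'.
        # (The 1/m normalization is omitted: it doesn't change the eigenvectors.)
        T, W = np.linalg.eigh(L.T @ L)              # eigenvalues in ascending order
        order = np.argsort(T)[::-1][:p]             # keep the p largest
        U = L @ W[:, order]                         # L w_i are eigenvectors of LL'
        U /= np.linalg.norm(U, axis=0)              # unit-length "eigenfaces"

        coeffs = (faces - F) @ U                    # m x p projection coefficients
        return F, U, coeffs

    def nearest_face(probe, F, U, coeffs):
        # Project a new image and find the nearest training face in the subspace.
        c = (probe - F) @ U
        d = np.linalg.norm(coeffs - c, axis=1)      # Euclidean distance in the subspace
        return np.argmin(d)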

Take a look at photobook, an interactive demonstration of (a slightly improved version of) this eigenfaces recognition method.

An alternative method, called fisherfaces, also projects the image space linearly into a low-dimensional subspace.

Dan Lee and H.S. Seung have proposed a related subspace method, non-negative matrix factorization, which works much like PCA but subject to the constraint that none of the coefficients (or basis elements) can be negative. This corresponds better to the intuitive notion of building up the whole (a face, for instance) by adding up parts, and (they argue) produces better pattern recognition performance as well as a better model of biological pattern learning.

Latent Semantic Analysis

"Latent semantic analysis" (or "latent semantic indexing") is exactly analogous to the eigenfaces method, but relies on orthogonalizing a "term by document" matrix, that is, a matrix whose rows represent words and whose columns represent documents, with the number in cell i,j indicating how many times word i occurred in document j. One of the original LSA papers was Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. . Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407. (1990), so LSA emerged slightly before Eigenfaces.

Other related terminology/methods

Factor analysis, introduced by Thurstone in 1931, is closely related to PCA (or identical to it in some versions). Multidimensional scaling, developed in the 1950s and 60s, denotes a set of techniques for situating entities in a low-dimensional space based on some set of pair-wise similarity (or distance) measures. Factor analysis (PCA) does this based on the correlation or covariance matrix of a body of data, while MDS may use human similarity judgments, confusion matrices, or other sources of information. MDS may also be applied in a non-metric form, in which the spatial arrangement of entities is constrained by the ordering of their pairwise similarities ("A is more similar to B than it is to C") rather than by similarities viewed as numerical quantities.
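For concreteness, here is a small sketch of metric MDS using scikit-learn's implementation on a made-up dissimilarity matrix (the numbers below are hypothetical):

    import numpy as np
    from sklearn.manifold import MDS

    # The input is not a data matrix but a matrix of pairwise dissimilarities.
    D = np.array([
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 1.5, 2.5],
        [2.0, 1.5, 0.0, 1.0],
        [3.0, 2.5, 1.0, 0.0],
    ])

    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coords = mds.fit_transform(D)     # a 2-D arrangement whose distances approximate D
    # Non-metric MDS (using only the ordering of the similarities) is the same
    # call with metric=False.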

Independent Components Analysis (ICA)

The criterion used to pick the new basis vectors in PCA is a very simple one: choose basis vectors that maximize the variance of the observed data points, i.e. spread them out as much as possible, considering the new dimensions one at a time. A more subtle criterion might look at the relationship between or among the (projections of the data onto the) new dimensions, optimizing some criterion defined on two or more of them at once. Independent Components Analysis (ICA) is a method of this type. The idea originated in a 1995 paper by Bell and Sejnowski ("An information-maximization approach to blind separation and blind deconvolution", Neural Computation, 7:1129-1159), with a neat application to the separation of sound sources. A demo of the method can be found at a site in Finland, where you can also get code in Matlab and other languages, along with a more extensive page of related resources.

The application shown in the demo cited above is also a good example of a case where the point is not necessarily to reduce the number of dimensions, but to re-express the data in a coordinate system that is more "natural" or more meaningful in some way. In this case, the original dimensions are mixed signals from two sound sources recorded at two microphones; the new dimensions are the original sounds, magically un-mixed!
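Here is a hedged sketch of that un-mixing idea, with made-up source signals and mixing matrix, using scikit-learn's FastICA rather than the original Bell-Sejnowski code:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two source signals are linearly mixed (as if recorded at two microphones),
    # then recovered by ICA up to scale and ordering.
    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                        # source 1: a sinusoid
    s2 = np.sign(np.sin(3 * t))               # source 2: a square wave
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.5],                 # mixing matrix: each "microphone"
                  [0.4, 1.0]])                # hears a different blend of the sources
    X = S @ A.T                               # observed (mixed) signals

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)              # estimated sources, up to scale and order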

Fast algorithms for low-rank approximation of very-large-scale dense matrices

See e.g. these slides.

Some cases that don't work
(at least not so easily!)

Suppose we have data points in a three-dimensional space that happen to fall on the surface of a sphere.

Then by construction, the data really only has two degrees of freedom -- say, latitude and longitude -- as opposed to three. But there is no way to choose new basis vectors B in equation (1) above that will reveal this fact to us.

Likewise, if we have points in 2-space that all happen to fall on the circumference of a unit circle centered on the origin, then all the points really live in a 1-dimensional space (as given, say, by the angle from [1 0]). But again, none of these analytic techniques will find this structure for us. The necessary transformation to a new representational basis is a non-linear one.
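A few lines of numpy make the point concrete: for points uniformly distributed on the unit circle, the covariance matrix is approximately 0.5 times the identity, so PCA finds no preferred direction and no useful one-dimensional projection:

    import numpy as np

    # Points on the unit circle have one degree of freedom (the angle), but PCA
    # sees two: both eigenvalues of the covariance matrix come out the same.
    rng = np.random.default_rng(2)
    theta = rng.uniform(0, 2 * np.pi, 5000)
    X = np.c_[np.cos(theta), np.sin(theta)]

    R = np.cov(X, rowvar=False)
    print(np.linalg.eigvalsh(R))     # both eigenvalues are approximately 0.5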

In many such cases, the data is well approximated as linear within a small region of the original space, although a given linear approximation becomes arbitrarily bad as we move away from the original neighborhood. In this sort of case, a piecewise linear representation may work well. See Sam Roweis & Lawrence Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", for an approach to discovering a lower-dimensional piecewise-linear approximation, and then unwrapping the result into a single linear coordinate space. Here is a picture from that paper that shows how data lying along a sort of cylindrical spiral can be "unwrapped" by such a method:
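In code terms, here is a hedged sketch of that kind of unwrapping, using scikit-learn's locally linear embedding on its built-in "swiss roll" data rather than the data from the paper:

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    # 3-D points lying on a rolled-up 2-D sheet are "unwrapped" into 2 dimensions.
    X, color = make_swiss_roll(n_samples=1500, random_state=0)

    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    Y = lle.fit_transform(X)      # Y is 1500 x 2: the unrolled coordinates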

A somewhat less obvious case is the set of points in the picture created by a "spirograph" function, which can be predicted as a function of the setting of four variables. If you run the applet on the linked page, you'll see that (setting the color to black, the background to white, and the display type to dots rather than lines) you can adjust the four sliders that control "fixed circle radius", "moving circle radius", "moving circle offset", and "revolutions in radians". Each choice of four numbers gives you a spirograph pattern. For example, the one in the lower-right-hand corner below was created with the vector [66 21 87 143] (in the cited order of coordinates):

We could express each such spirograph dot-pattern as a vector with a pair of values for each of the dots in the picture, for example by putting the x values for successive dots in the odd positions and the y values in the even positions. By varying the four settings, we can generate thousands of such vectors, one for each choice of the four settings. The resulting data, by construction, in fact "lives" in a four-dimensional space, but we are never going to discover this by looking for new linear axes.

These are cases where the data has a very simple structure underlyingly, one that can easily be expressed in vector space terms, but where "you can't get there from here". At least not by a linear transformation.

There can also be cases where there is simple structure in the data, but not of a kind that can helpfully be expressed in vector space terms even after a non-linear transformation.

Despite these caveats, subspace methods can be very effective, often surprisingly so.

A more topical approach to modeling complex patterns in data is to use some sort of "deep learning", about which we'll learn more later in the course. But note that in many cases, the effect of such methods is to increase the dimensionality of the data (in the interests of classification, prediction, clustering, etc.) rather than to decrease it.

Fast random-sampling or random-projection algorithms for approximate matrix decomposition

With the advent of really big "Big Data", there's been growing interest in fast algorithms for finding subspace structure in really large matrices. For some details and links to software, see e.g. N. Halko et al., "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions", SIAM Review 2011; or the slides from recent presentations at Penn by Per-Gunnar Martinsson ("Randomized Algorithms for Very Large-Scale Linear Algebra") and Dean Foster ("Linear Methods for Large Data"). There's also some relevant material in the NAACL-2013 tutorial by Shay Cohen et al. on "Spectral Learning Algorithms for Natural Language Processing".
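To give a flavor of the randomized approach, here is a minimal numpy sketch of the basic range-finder / approximate SVD idea described by Halko et al. (the matrix size, rank, and oversampling value below are made-up illustrative choices):

    import numpy as np

    def randomized_svd(A, k, oversample=10, seed=0):
        # Project the big matrix onto a small random subspace, then do an
        # exact SVD on the much smaller projected matrix.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        Omega = rng.standard_normal((n, k + oversample))   # random test matrix
        Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for the range of A*Omega
        B = Q.T @ A                                        # small (k+oversample) x n matrix
        U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ U_small)[:, :k], s[:k], Vt[:k]         # approximate rank-k SVD of A

    # Example: a 2000 x 500 matrix of (approximate) rank 20.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((2000, 20)) @ rng.standard_normal((20, 500))
    U, s, Vt = randomized_svd(A, k=20)
    print(np.allclose(A, (U * s) @ Vt, atol=1e-6))         # the low-rank factors recover A closely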