An Introduction to Sparse PCA

Wednesday, October 12, 2016 - 4:30pm

LIDS & Stats Tea

Speaker: Jerry Li

Location: LIDS Lounge


A very powerful tool for dimension reduction in many applications is principal component analysis (PCA), which simply means projecting your data matrix onto its most significant directions, i.e., its top singular vectors, and hoping those are the most meaningful features. But the features PCA gives are usually not really interpretable: they're some weird linear combination of all the features you started out with. Moreover, to get statistical guarantees for PCA you'll usually need a number of samples that depends at least linearly on the dimension of your data, which can be prohibitive if you're doing "big data" stuff or whatever and your data is extremely high dimensional.
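As a concrete illustration (not from the talk), "projecting onto the top singular vectors" can be sketched in a few lines of numpy; the data matrix and the choice of two components here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^10, correlated so the variance
# concentrates along a few hidden directions.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
X -= X.mean(axis=0)  # center the data before PCA

# PCA via the SVD: the top right-singular vectors of the
# centered data matrix are its most significant directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = X @ Vt[:k].T  # project onto the top-k directions

print(X_reduced.shape)  # (200, 2)
```

Note that each column of `Vt[:k]` is generically a dense linear combination of all ten original coordinates, which is exactly the interpretability complaint above.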

This motivates the study of sparse PCA, where you are somehow guaranteed that the meaningful directions you are searching for are a sparse combination of the features you started out with. For simplicity, we will consider a very simple toy model, namely, the "spiked covariance" model of sparse PCA. Here we assume our data points X_1, ..., X_n are generated i.i.d. from N(0, I + θvv^T), where v is a k-sparse unit vector, and our job is to recover v as well as possible. Even in this simple setting, there are some very interesting statistical and computational tradeoffs. For constant θ, one can show that O(k log d) samples suffice to recover v to constant accuracy; however, the only efficient algorithms we know of require on the order of k^2 samples. Moreover, assuming widely-believed hardness assumptions about another average-case problem, the "planted clique" problem, one can show that roughly k^2 samples are required for efficient sparse PCA. In this talk I'll go over the upper bounds, and will try to motivate at a high level why it seems hard to beat k^2 samples with an efficient algorithm.
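One simple efficient algorithm in this regime is diagonal thresholding: coordinates in the support of v have inflated variance (1 + θv_j^2 instead of 1), so you keep the coordinates with the largest empirical variance and run vanilla PCA on that small submatrix. A minimal sketch, with all parameter values (d, k, n, θ) and the choice of spike purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

d, k, n, theta = 200, 5, 2000, 1.0

# A k-sparse unit spike v (illustrative choice: uniform on the
# first k coordinates).
v = np.zeros(d)
v[:k] = 1.0 / np.sqrt(k)

# Samples X_i ~ N(0, I + theta * v v^T), written as isotropic
# noise plus a Gaussian multiple of the spike.
X = rng.normal(size=(n, d)) + np.sqrt(theta) * rng.normal(size=(n, 1)) * v

# Diagonal thresholding: keep the k coordinates with largest
# empirical variance...
variances = (X ** 2).mean(axis=0)
support = np.argsort(variances)[-k:]

# ...then take the top eigenvector of the empirical covariance
# restricted to those coordinates.
cov_sub = X[:, support].T @ X[:, support] / n
eigvals, eigvecs = np.linalg.eigh(cov_sub)
v_hat = np.zeros(d)
v_hat[support] = eigvecs[:, -1]

print(abs(v_hat @ v))  # close to 1 when recovery succeeds
```

The variance gap being detected is θ/k per support coordinate, which is why this style of algorithm ends up needing on the order of k^2 samples rather than the information-theoretically sufficient k log d.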