Wednesday, October 12, 2016 - 4:30pm
Event Calendar Category
LIDS & Stats Tea
A very powerful tool for dimension reduction in many applications is principal component analysis (PCA), which really just means projecting your data matrix onto its most significant directions, i.e. its top singular vectors, and hoping those are the most meaningful features. But usually the features that PCA gives are not really interpretable: they're some weird linear combination of all the features you started out with. Moreover, to get statistical guarantees for PCA you'll usually need a number of samples that depends at least linearly on the dimension of your data, which can be prohibitive if you're doing "big data" stuff and your data is extremely high dimensional.
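As a concrete illustration of the PCA step described above (project the centered data onto its top singular vectors), here is a minimal NumPy sketch; the matrix sizes and the number of components r are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # 200 samples, 50 features (toy data)
X = X - X.mean(axis=0)              # center each feature

# The top-r right singular vectors of X are the principal directions.
r = 5
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:r]                 # (r, 50): each row mixes ALL 50 features,
                                    # which is why PCA directions are hard to interpret
projected = X @ components.T        # (200, r): the reduced representation

# Reconstructing from r components gives the best rank-r approximation of X.
X_hat = projected @ components
```

Note that each row of `components` is dense: every original feature contributes, which is exactly the interpretability problem that sparse PCA addresses.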
This motivates the study of sparse PCA, where you are somehow guaranteed that the meaningful directions you are searching for are a sparse combination of the features you started out with. For simplicity, we will consider a very simple toy model, namely, the "spiked covariance" model of sparse PCA. Here we assume our data points are generated i.i.d. from N(0, I + beta vv^T), where v is a k-sparse unit vector, and our job is to recover v as well as possible. Even in this simple setting, there are some very interesting statistical and computational tradeoffs. For constant beta, one can show that O(k log d) samples suffice to recover v to constant accuracy; however, the only efficient algorithms we know of require on the order of k^2 samples. Moreover, assuming widely believed hardness assumptions about another average-case problem, the "planted clique" problem, one can show that roughly k^2 samples are required for efficient sparse PCA. In this talk I'll go over the upper bounds, and will try to motivate at a high level why k^2 seems hard to beat with an efficient algorithm.
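To make the spiked-covariance setup concrete, the sketch below samples from N(0, I + beta vv^T) and recovers v with diagonal thresholding, a simple efficient algorithm of the kind alluded to above: support coordinates have inflated variance 1 + beta/k, so keep the k coordinates with the largest sample variance and run PCA on those. All parameter values (d, k, n, beta) are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n, beta = 200, 5, 4000, 2.0   # dimension, sparsity, samples, spike strength

# A k-sparse unit vector v.
v = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
v[support] = 1.0 / np.sqrt(k)

# Sample x = (I + c vv^T) g with g ~ N(0, I); choosing c = sqrt(1+beta) - 1
# makes the covariance (I + c vv^T)^2 = I + beta vv^T, the spiked model.
c = np.sqrt(1 + beta) - 1
G = rng.normal(size=(n, d))
X = G + c * np.outer(G @ v, v)

# Diagonal thresholding: support coordinates have variance 1 + beta/k > 1,
# so take the k largest diagonal entries of the sample covariance.
diag = (X ** 2).mean(axis=0)
est_support = np.argsort(diag)[-k:]

# Ordinary PCA restricted to the estimated support.
Xs = X[:, est_support]
w = np.linalg.eigh(Xs.T @ Xs / n)[1][:, -1]   # top eigenvector
v_hat = np.zeros(d)
v_hat[est_support] = w
overlap = abs(v_hat @ v)                       # near 1 when recovery succeeds
```

The variance gap being only beta/k per coordinate is the intuition for why this naive approach burns roughly k^2 samples rather than the k log d that suffice information-theoretically.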