When Your Big Data Seems Too Small: Accurate Inferences Beyond the Empirical Distribution

Tuesday, February 23, 2016 - 4:00pm to Wednesday, February 24, 2016 - 3:55pm

Event Calendar Category

LIDS Seminar Series

Speaker Name

Gregory Valiant

Affiliation

Stanford

Building and Room Number

32-155

We discuss two problems related to the general challenge of making accurate inferences about a complex distribution, in the regime in which the amount of data (i.e the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution. The first problem is the task of accurately estimating the eigenvalues of the covariance matrix of a distribution--the "population spectrum". (These eigenvalues contain basic information about the distribution, including the presence/lack of low-dimensional structure in the distribution and the applicability of many higher-level machine learning and multivariate statistical tools.) As we show, even in the regime where the sample size is linear or sublinear in the dimensionality of the distribution, and hence the eigenvalues and eigenvectors of the empirical covariance matrix are misleading, accurate approximations to the true population spectrum are possible. In the second portion of the talk, we discuss the problem of recovering a low-rank approximation to a matrix of probabilities P, given access to an observed matrix of "counts" obtained via independent samples from the distribution defined by P. This problem can be viewed as a generalization of "community detection", and is relevant to several recent machine learning efforts, including the work on constructing "word embeddings".

Gregory Valiant is an Assistant Professor at Stanford. Prior to his life among the palm trees, he was a postdoc at Microsoft, New England, and obtained his PhD from Berkeley.