Model Agnostic High-Dimensional Error-In-Variable Regression

Wednesday, March 6, 2019 - 3:00pm to Thursday, March 7, 2019 - 3:55pm

Event Calendar Category

LIDS & Stats Tea

Speaker Name

Anish Agarwal



Building and Room Number

LIDS Lounge


We consider the problem of high-dimensional error-in-variable regression where instead of directly observing the covariates, we observe its sparse, noisy version, a common thread of modern datasets. For this setup, we propose an algorithm that utilizes matrix estimation on the corrupted data as a key subroutine and then performs ordinary least squares regression with the de-noised covariates. If we instantiate the matrix estimation subroutine with hard singular value thresholding (HSVT), then our results indicate that as long as the number of samples scales as ρ−4r log4(p), then our in- and out-of-sample prediction error decays to 0 as p → ∞; here, ρ represents the fraction of observed (noisy) covariates, r is the (approximate) rank of the true covariate matrix, and p denotes the ambient dimension (number of predictors). As an important byproduct of our approach, we demonstrate that HSVT with regression acts as a form of implicit l_0-regularization since HSVT aims to find a low- rank structure within the covariance matrix. Thus, we can view the sparsity of the regression vector as a consequence of the covariate structure rather than a model assumption as is often considered in the literature. Moreover, under HSVT, our finite sample bounds match (up to log3(p) factors) the best-guaranteed sample complexity results in the literature for algorithms that require precise knowledge of the underlying model; we highlight that our approach, in contrast, is model agnostic. In the process of our analysis, we obtain two technical results of independent interest: first, we provide a spectral norm bound for random matrices with independent sub-exponential rows with randomly missing data; second, we bound a nonstandard error metric, the max column sum error, for HSVT. Our setting enables us to apply our results to applications such as synthetic control for causal inference, time series analysis, and regression with privacy. It is important to note that the existing inventory of methods is unable to analyze these applications.

Co-Authors: Devavrat Shah, Dennis Shen, Dogyoon Song


Anish Agarwal is a 3rd year PhD student in EECS at MIT co-advised by Munther Dahleh and Devavrat Shah. He received his BS and MS degree in Computer Science from Caltech. His research interests are in high-dimensional statistics and the design of data marketplaces.