Wednesday, February 2, 2022 - 4:00pm to 4:30pm
Event Calendar Category
LIDS & Stats Tea
Sung Min (Sam) Park
Building and Room Number
Current supervised machine learning models rely on an abundance of training data. Yet understanding the underlying structure and biases of this data, and how they impact models, remains challenging. We present a new conceptual framework, datamodels, for directly modeling predictions as functions of training data. We instantiate our framework with simple parametric models (e.g., linear) and apply it to deep neural networks trained on standard vision datasets. Despite the complexity of the underlying process (e.g., SGD on overparameterized neural networks), the resulting datamodels can accurately predict model outputs as linear functions of the presence of different training examples. These datamodels, in turn, give rise to powerful tools for analyzing the ML pipeline: a predictive model for the counterfactual impact of removing different training data; a data similarity metric that is faithful to the model class under study; and a rich embedding and graph that give a principled way to study latent structure in the data.
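To make the idea of a linear datamodel concrete, here is a minimal sketch on synthetic data. It is not the speaker's implementation: the "model output" is simulated as a hidden linear signal plus noise rather than obtained by training networks, the fit uses plain ridge-regularized least squares as a stand-in estimator, and every name and constant (`n_train`, `n_subsets`, `alpha`, `fit_linear_datamodel`, `predict`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 50      # number of training examples (hypothetical)
n_subsets = 2000  # number of random training subsets sampled (hypothetical)
alpha = 0.5       # fraction of the training set kept in each subset (hypothetical)

# Each row is a 0/1 mask: entry i is 1 if training example i is in the subset.
masks = (rng.random((n_subsets, n_train)) < alpha).astype(float)

# Stand-in for "train a model on each subset and record its output on one
# fixed test example". Here the outputs are synthetic (assumption).
true_weights = rng.normal(size=n_train)
outputs = masks @ true_weights + 0.1 * rng.normal(size=n_subsets)


def fit_linear_datamodel(masks, outputs, ridge=1e-3):
    """Fit outputs ~= masks @ w + b by ridge-regularized least squares."""
    # Append a column of ones so the last coefficient acts as the bias.
    X = np.hstack([masks, np.ones((masks.shape[0], 1))])
    reg = ridge * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    theta = np.linalg.solve(X.T @ X + reg, X.T @ outputs)
    return theta[:-1], theta[-1]


weights, bias = fit_linear_datamodel(masks, outputs)


def predict(mask):
    """Predicted output for a model trained on the subset given by mask."""
    return float(mask @ weights + bias)


# Counterfactual estimate: predicted change from dropping training example 0.
full = np.ones(n_train)
without_0 = full.copy()
without_0[0] = 0.0
effect = predict(full) - predict(without_0)
```

Because the datamodel is linear in the inclusion mask, the estimated effect of removing one example is simply that example's weight, which is what makes counterfactual estimates of this kind cheap once the weights are fit.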
Sung Min (Sam) Park is a PhD student in CSAIL, advised by Prof. Aleksander Madry. His research interests include machine learning foundations, with a focus on making deployment of models more robust and reliable.