Datamodels: Understanding Model Predictions as Functions of Data

Wednesday, February 2, 2022 - 4:00pm to 4:30pm

Event Calendar Category

LIDS & Stats Tea

Speaker Name

Sung Min (Sam) Park

Affiliation

CSAIL

Building and Room Number

LIDS Lounge

Current supervised machine learning models rely on an abundance of training data. Yet understanding the underlying structure and biases of this data, and how they impact models, remains challenging. We present a new conceptual framework, datamodels, for directly modeling predictions as functions of training data. We instantiate our framework with simple parametric models (e.g., linear) and apply it to deep neural networks trained on standard vision datasets. Despite the complexity of the underlying process (e.g., SGD on overparameterized neural networks), the resulting datamodels can accurately predict model outputs as linear functions of the presence of different training examples. These datamodels, in turn, give rise to powerful tools for analyzing the ML pipeline: a predictive model for the counterfactual impact of removing different training data; a data similarity metric that is faithful to the model class under study; and a rich embedding and graph that give a principled way to study latent structure in the data.
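
As a rough illustration of the idea in the abstract, the sketch below fits a linear datamodel for a single test input: it regresses a model's output on 0/1 indicators of which training examples were included. Everything here is an assumption made for illustration. The function `train_and_evaluate` is a hypothetical stand-in that simulates "train on this subset and report the output" with a noisy linear rule instead of training a real network, and ordinary least squares is used for brevity, which is not necessarily the estimator used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 20      # size of the toy training pool
n_subsets = 500   # number of random training subsets to sample
alpha = 0.5       # probability that each example is included in a subset

# Hypothetical ground truth: the effect of each training example on the
# output for one fixed test input. In a real study this is unknown.
true_influence = rng.normal(size=n_train)


def train_and_evaluate(mask):
    """Hypothetical stand-in for training a model on the subset given by
    `mask` and returning its output on the fixed test input."""
    noise = rng.normal(scale=0.1)
    return float(mask @ true_influence + noise)


# 1. Sample random training subsets, encoded as 0/1 inclusion masks.
masks = (rng.random((n_subsets, n_train)) < alpha).astype(float)

# 2. Record the (simulated) model output for each subset.
outputs = np.array([train_and_evaluate(m) for m in masks])

# 3. Fit the linear datamodel: output ~= masks @ theta + bias.
design = np.hstack([masks, np.ones((n_subsets, 1))])
coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
theta, bias = coef[:-1], coef[-1]

# 4. Counterfactual query: predicted change in output if training
#    example 0 is removed from the full training set.
full = np.ones(n_train)
without_0 = full.copy()
without_0[0] = 0.0
predicted_change = float((without_0 - full) @ theta)
```

In this toy setting the fitted weights `theta` play the role of per-example influence estimates, and `predicted_change` is the datamodel's answer to "what happens to this prediction if example 0 is dropped?". With a real pipeline, step 2 would require actually training one model per sampled subset, which is where most of the cost would lie.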

Sung Min (Sam) Park is a PhD student in CSAIL, advised by Prof. Aleksander Madry. His research interests include machine learning foundations, with a focus on making deployment of models more robust and reliable.