Thesis Defense: Collaborative, Open, and Automated Data Science

Wednesday, July 21, 2021 - 1:30pm to 3:00pm

Event Calendar Category

LIDS Thesis Defense

Speaker Name

Micah J. Smith

Affiliation

LIDS

Join Zoom meeting

http://zoom.com

Abstract

Thesis Committee: Kalyan Veeramachaneni (MIT, Chair), Saman Amarasinghe (MIT, Reader), Robert C. Miller (MIT, Reader)
 
Data science and machine learning have already revolutionized many industries and organizations, and are increasingly being used in an open-source setting to address important societal problems. However, many challenges remain in developing predictive machine learning models in practice, such as the complexity of the steps in the modern data science development process, the involvement of many different people with varying skills and roles, and the need for, and difficulty of, collaborating across steps and people. In this thesis, I describe progress in two directions for supporting the development of predictive models.
 
First, I propose to focus the effort of data scientists and support structured collaboration on the most challenging steps in a data science project. In the Ballet framework, we create a new approach to collaborative data science development, based on adapting and extending the open-source software development model for the collaborative development of feature engineering pipelines. Using Ballet as a probe, we conduct a detailed case study analysis of an open-source personal income prediction project in order to better understand data science collaborations.
 
Second, I propose to supplement human collaborators with advanced automated machine learning within end-to-end data science and machine learning pipelines. In the Machine Learning Bazaar, we create a flexible and powerful framework for developing machine learning and automated machine learning systems. In our approach, experts annotate and curate components from different machine learning libraries, which can be seamlessly composed into end-to-end pipelines using a unified interface. We build into these pipelines support for automated model selection and hyperparameter tuning. We use these components to create an open-source, general-purpose, automated machine learning system, and describe several other applications.

The defense will be held via Zoom. Contact micahs[at]mit[dot]edu or henrymi[at]mit[dot]edu for details.