Thesis Defense: Data Efficient Reinforcement Learning

Friday, May 7, 2021 - 2:00pm to 3:00pm

Event Calendar Category

LIDS Thesis Defense

Speaker Name

Zhi Xu

Affiliation

LIDS

Zoom meeting id

660342

Join Zoom meeting

https://mit.zoom.us/j/92814551473

Abstract

Thesis Committee: Devavrat Shah (thesis advisor), Leslie Kaelbling, John Tsitsiklis

Reinforcement learning (RL) has recently emerged as a generic yet powerful solution for learning complex decision-making policies, providing the key foundational underpinnings of recent successes in various domains. However, many state-of-the-art algorithms are data-hungry and computationally expensive, requiring large amounts of data to succeed. While this is possible for certain scenarios, in applications arising in social sciences and healthcare for example, where available data is sparse, this naturally can be costly or infeasible. With the surging interest in applying RL to broader domains, it is imperative to develop an informed view about the usage of data involved in its algorithmic design.

This thesis hence focuses on studying the data efficiency of RL, through a structural perspective. Advancement along this direction naturally requires us to understand when and why algorithms are successful to begin with; and building upon such understanding, further improve the data efficiency of RL. To this end, this thesis begins by taking inspiration from the empirical successes. We consider the popular use of simulation-based Monte Carlo Tree Search (MCTS) in RL, as exemplified by the remarkable achievement of AlphaGo Zero, and probe the data efficiency of incorporating such a key ingredient. Specifically, we investigate the correct form to utilize such a tree structure for estimating values and characterize the corresponding data complexity. These results further enable us to analyze the data complexity of an RL algorithm that combines MCTS with supervised learning.

Having developed our understanding, as a next step, we improve the algorithmic designs of simulation-based data-efficient RL algorithms that have access to a generative model. We provide such improvements for both bounded and unbounded spaces. Our first contribution is a structural framework through a novel lens of low-rank representation of the Q-function. The proposed data-efficient RL algorithm exploits the low-rank structure to perform pseudo-exploration via a new matrix estimation technique. Remarkably, this leads to a significant (exponential) improvement in data complexity. Moving to our endeavor with unbounded spaces, one must first address the unique conceptual challenges incurred by the unbounded domains. Inspired by classical queueing systems, we propose an appropriate notion of stability for quantifying “goodness" of policies. Subsequently, by leveraging the stability structure of the underlying systems, we design efficient, adaptive algorithms with a modified, efficient Monte Carlo oracle that guarantee the desired stability with a favorable data complexity that is polynomial with respect to the parameters of interest.

Altogether, through new analytical tools and structural frameworks, this thesis contributes to the design and analysis of data-efficient RL algorithms.