Provably More Efficient Q-Learning in the One-Sided-Feedback/Full-Feedback Settings

Wednesday, April 28, 2021 - 4:00pm to 4:30pm

Event Calendar Category

LIDS & Stats Tea

Speaker Name

Xiao-Yue Gong

Affiliation

ORC

Zoom meeting id

922 8352 7745

Join Zoom meeting

https://mit.zoom.us/j/92283527745

Abstract

Motivated by the classical inventory control problem, we propose a new Q-learning-based algorithm called Elimination-Based Half-Q-Learning (HQL) that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. In this setting, once an action is taken and the environment randomness has been realized, we assume that we can learn not only the reward for the action taken, but also the rewards for all actions that lie "on one side" of the action taken. We establish that HQL incurs Õ(H^3 √T) regret, and that a simpler variant of our algorithm, FQL, incurs Õ(H^2 √T) regret in the special case of the full-feedback setting, in which, once an action is taken and the environment randomness has been realized, we can learn the reward associated with every action. Here H is the episode length and T is the horizon length. We remove the regret dependence on the possibly huge state-action space by leveraging the extra feedback available in many operations research problems. Numerical experiments confirm the efficiency of HQL and FQL, and show the potential of combining reinforcement learning with richer feedback models.
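As a concrete illustration of the feedback models described in the abstract (not taken from the talk itself), consider a newsvendor-style inventory step: once the demand realization is observed, the reward of every candidate order quantity can be computed counterfactually (full feedback), while under lost-sales censoring only quantities on one side of the chosen order can be evaluated exactly (one-sided feedback). The sketch below is purely illustrative; the cost parameters, the counterfactual_rewards helper, and the censoring rule are assumptions, not the authors' implementation.

```python
import numpy as np

def counterfactual_rewards(demand, actions, price=5.0, cost=3.0, holding=1.0):
    """Illustrative newsvendor reward for every order quantity, given one
    realized demand. (Parameters are hypothetical, not from the paper.)"""
    actions = np.asarray(actions, dtype=float)
    sales = np.minimum(actions, demand)           # units actually sold
    leftover = np.maximum(actions - demand, 0.0)  # unsold units incur holding cost
    return price * sales - cost * actions - holding * leftover

# Full feedback: the demand realization is observed directly, so the reward
# of every action is known after a single step.
demand = 7.0
actions = np.arange(0, 11)
print(counterfactual_rewards(demand, actions))

# One-sided feedback (lost-sales flavor): if we ordered q and demand exceeded q,
# we only observe sales censored at q, so only actions on one side of q
# (here, q' <= q) can be evaluated exactly.
q = 5
observable = actions[actions <= q]
print(counterfactual_rewards(min(demand, q), observable))
```

The point of the sketch is only that such counterfactual reward information, when available, lets an algorithm update estimates for many actions per step, which is the structure HQL and FQL exploit to remove the regret dependence on the state-action space.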

Biography

Xiao-Yue is a PhD student in the Operations Research Center at MIT, advised by Prof. David Simchi-Levi. She currently works on online algorithms and reinforcement learning for supply chain optimization and revenue management problems, with applications including inventory control and assortment optimization.