Wednesday, April 28, 2021 - 4:00pm to 4:30pm
Event Calendar Category
LIDS & Stats Tea
Zoom meeting id
922 8352 7745
Join Zoom meeting
Motivated by the classical inventory control problem, we propose a new Q-learning-based algorithm called Elimination-Based Half-Q-Learning (HQL) that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. In this setting, once an action is taken and the environment randomness has been realized, we learn not only the reward for the action taken, but also the rewards for all actions "on one side" of it. We establish that HQL incurs Õ(H^3 T^(1/2)) regret, and that FQL, a simpler variant of our algorithm, incurs Õ(H^2 T^(1/2)) regret in the special case of the full-feedback setting, where the realized randomness reveals the reward of every action. Here H is the episode length and T is the horizon length. By leveraging the extra feedback available in many operations research problems, we remove the regret's dependence on the possibly huge state-action space. Numerical experiments confirm the efficiency of HQL and FQL, and show the potential of combining reinforcement learning with richer feedback models.
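As a concrete illustration of the one-sided-feedback model, consider a newsvendor-style inventory problem: once the demand for a period is realized, the profit of any order quantity at or below the quantity actually ordered can be computed from that same demand. The sketch below is hypothetical (the price/cost parameters and function names are illustrative, not from the talk), and shows only the feedback structure, not the HQL algorithm itself.

```python
# Hypothetical newsvendor illustration of one-sided feedback:
# after ordering q and observing realized demand d, the reward of
# every action a <= q is also computable from the same realization.

def reward(order, demand, price=5.0, cost=3.0):
    """Single-period newsvendor profit: sell min(order, demand) units
    at `price`, pay `cost` per unit ordered (illustrative parameters)."""
    return price * min(order, demand) - cost * order

def one_sided_feedback(q_taken, demand, actions):
    """Rewards revealed under one-sided feedback: all actions on the
    'lower side' of the action actually taken."""
    return {a: reward(a, demand) for a in actions if a <= q_taken}

# Order 4 units, demand turns out to be 3: rewards for orders 0..4
# are revealed; orders 5 and 6 remain unobserved.
feedback = one_sided_feedback(q_taken=4, demand=3, actions=range(7))
```

In the full-feedback special case (uncensored demand), the same realization determines the reward of every action, which is what FQL exploits.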
Xiao-Yue is a PhD student in the Operations Research Center at MIT, advised by Prof. David Simchi-Levi. She currently works on online algorithms and reinforcement learning for supply chain optimization and revenue management problems, with applications including inventory control and assortment optimization.