Lessons from AlphaZero for Control System Design and Discrete Optimization

Tuesday, October 18, 2022 - 4:00pm to 5:00pm

Event Calendar Category

LIDS Seminar Series

Speaker Name

Dimitri Bertsekas



Building and Room Number



We focus on a new conceptual framework for approximate Dynamic Programming (DP) and Reinforcement Learning (RL). It revolves around two algorithms that operate in synergy through the powerful mechanism of Newton's method, applied to Bellman’s equation for solving the underlying DP problem. These are the off-line training and the on-line play algorithms, and they are exemplified by the architectures of the AlphaZero program (which plays chess), the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon), and the model predictive control (MPC) architecture (which is central in control system design).

The player obtained through extensive off-line training in AlphaZero, is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained using experience with the off-line player. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it greatly improves the performance of the off-line player.

An important lesson from AlphaZero is that the performance of an off-line trained controller can be greatly improved by on-line approximation in value space, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. The performance enhancement is often dramatic and is due to a simple fact, which is the focal point of this lecture: On-line approximation in value space with one-step lookahead amounts to a single step of Newton’s method for solving Bellman’s equation, while the starting point for the Newton step is based on the results of off-line training, and is enhanced by longer on-line lookahead minimization and on-line rollout. We will aim to explain the beneficial effects of this synergism, and to provide a conceptual bridge between the sequential decision making cultures of artificial intelligence/RL, control theory/MPC, and discrete optimization/integer programming. In particular, we show through the unifying principles of abstract DP and the algorithmic ideas of Newton’s method that the AlphaZero/TD-Gammon ideas apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization and integer programming.

Note: The lecture is based on the 2022 book "Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control” (.pdf freely available on-line), and is supported by analysis from the earlier 2020 book “Rollout, Policy Iteration, and Distributed Reinforcement Learning”.


Dimitri Bertsekas' undergraduate studies were in engineering at the National Technical University of Athens, Greece. He obtained his MS in electrical engineering at the George Washington University, Wash. DC in 1969, and his PhD in system science in 1971 at MIT.

Dr. Bertsekas has held faculty positions at Stanford University (1971-1974) and the University of Illinois (1974-1979). From 1979 to 2019 he has a professor at the EECS Department of MIT. In 2019, he was appointed Fulton Professor of Computational Decision Making, at the School of Computing and Augmented Intelligence at Arizona State University (ASU). His research spans several fields, including optimization, control, and large-scale computation, and is closely tied to his teaching and book authoring activities. He has written numerous research papers, and nineteen books and research monographs, several of which are used as textbooks in MIT and ASU classes.

Recent Professor Bertsekas’ awards include the 2014 ACC Richard Bellman Control Heritage Award, the 2014 INFORMS Khachiyan Prize, the 2015 SIAM/MOS Dantzig Prize, and the 2022 IEEE Control Systems Award. Together with his coauthor John Tsitsiklis, he was awarded the 2018 INFORMS John von Neumann Theory Prize, for the contributions of the research monographs "Parallel and Distributed Computation" and "Neuro-Dynamic Programming". In 2001, he was elected to the United States National Academy of Engineering for "pioneering contributions to fundamental research, practice and education of optimization/control theory, and especially its application to data communication networks."

Dr. Bertsekas' recent books are "Convex Optimization Theory" (2009), "Dynamic Programming and Optimal Control," Vol. I, (2017), and Vol. II: (2012), "Convex Optimization Algorithms" (2015), "Reinforcement Learning and Optimal Control" (2019), "Rollout, Policy Iteration, Distributed Reinforcement Learning" (2020), "Abstract Dynamic Programming" (2022, 3rd edition), and "Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control" (2022), all published by Athena Scientific.

This talk is part of the LIDS Seminar Series. The 2021-22 LIDS Seminar Series is sponsored by LIDS alumnus, George H. Polychronopoulos, MS ’88, PhD ’92