Is Gradient Descent Doing Gradient DESCENT in Deep Learning?

Wednesday, February 24, 2021 - 11:00am to Thursday, February 25, 2021 - 11:55am

Event Calendar Category

Other LIDS Events

Speaker Name

Yuanzhi Li

Affiliation

Carnegie Mellon University

Abstract

Gradient descent and its variants are the most widely used methods for training large-scale neural networks in practice. On the theory side, traditional optimization theory typically attributes the success of gradient descent to the simple "gradient descent lemma": with a proper learning rate (typically smaller than the inverse of the local smoothness of the objective function), following the gradient direction ensures that the objective decreases monotonically. This work argues that gradient descent in deep learning operates in a completely different regime, called "the edge of stability": we show empirically that with standard learning rates, gradient descent on deep learning objectives quickly enters a region where the local smoothness of the loss hovers right above 2/(learning rate). After that, the loss changes largely non-monotonically, although it still decreases in the long run. This empirical observation holds when using gradient descent to train standard convolutional nets (with/without BN) on image data sets, transformer models on language modeling, and even simple deep linear networks.
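To make the 2/(learning rate) threshold concrete, here is a minimal, illustrative sketch (not code from the talk or paper): on a one-dimensional quadratic f(x) = (lam/2) x^2 the sharpness (local smoothness) is simply the constant lam, and gradient descent with learning rate lr decreases the loss monotonically exactly when lam < 2/lr. The names run_gd, lam, and lr are hypothetical.

```python
# Illustrative sketch (assumed setup, not the authors' experiments):
# gradient descent on the quadratic f(x) = (lam / 2) * x**2, whose second
# derivative -- the "sharpness" / local smoothness -- is the constant lam.
# The update is x <- x - lr * lam * x = (1 - lr * lam) * x, so the loss
# shrinks exactly when lam < 2 / lr; beyond that threshold the iterates
# oscillate with growing magnitude and the loss blows up.

def run_gd(lam, lr, x0=1.0, steps=10):
    """Return the loss trajectory of gradient descent on f(x) = lam/2 * x^2."""
    x, losses = x0, []
    for _ in range(steps):
        losses.append(0.5 * lam * x * x)
        x -= lr * lam * x          # gradient step: f'(x) = lam * x
    return losses

lr = 0.1                           # stability threshold: 2 / lr = 20
for lam in (15.0, 19.0, 21.0):     # below, near, and above the threshold
    traj = run_gd(lam, lr)
    print(f"sharpness {lam:5.1f} vs 2/lr = {2 / lr:4.1f}: "
          f"loss {traj[0]:.3f} -> {traj[-1]:.3f}")
```

With lr = 0.1, the loss decreases for sharpness 15, decreases slowly with sign-flipping iterates at 19 (just under the threshold of 20), and diverges at 21, mirroring the role the 2/(learning rate) threshold plays in the abstract above.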

We also give a proof of this phenomenon in a special case of two-layer neural networks. We will also discuss the edge of stability for stochastic gradient descent and momentum methods.


Biography

Yuanzhi Li is an Assistant Professor in the Machine Learning Department at CMU (2019.9-present). Previously, he was a postdoc in the Computer Science Department at Stanford University (2018.9-2019.8). He obtained his Ph.D. in computer science at Princeton University (2014.9-2018.8), advised by Sanjeev Arora, and his B.S.E. in computer science and mathematics at Tsinghua University (2010.9-2014.8). His research focuses on machine learning, with the aim of designing efficient and provable algorithms for practical machine learning problems. He also works on convex/non-convex optimization.