Smoothness and Adaptivity in Nonlinear Optimization for Machine Learning Applications

Monday, May 6, 2024 - 10:00am to 11:30am

Event Calendar Category

LIDS Thesis Defense

Speaker Name

Haochuan Li

Affiliation

LIDS

Building and Room number

E18-304


Abstract

Nonlinear optimization has become the workhorse of machine learning. However, our theoretical understanding of optimization in machine learning is limited. For example, classical optimization theory relies on assumptions, such as Lipschitz smoothness of the loss function, that are rarely met in machine learning. Moreover, existing theory cannot adequately explain why adaptive methods outperform gradient descent in certain machine learning tasks, such as training transformers. To bridge this gap, this thesis proposes more general smoothness conditions that are closer to machine learning practice and studies the convergence of popular classical and adaptive methods under these conditions. Our convergence results improve over existing ones and provide new insights into the role of adaptivity in optimization for machine learning applications.


First, inspired by recent works and by insights from deep neural network training, we propose a generalized non-uniform smoothness condition in which the Hessian norm is bounded by a function of the gradient norm almost everywhere. We develop a simple yet powerful analysis technique that bounds the gradients along the optimization trajectory, leading to stronger results for both convex and non-convex optimization problems. In particular, we recover the classical convergence rates of gradient descent (GD), stochastic gradient descent (SGD), and Nesterov's accelerated gradient method (NAG) in both convex and non-convex settings under this general smoothness condition.
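
As a concrete illustration (the notation below is ours and may differ from the exact statement in the thesis), such a condition can be written as a bound of the Hessian norm by a function of the gradient norm:

```latex
% Illustrative form of the generalized non-uniform smoothness condition:
% for some non-decreasing function \ell, the Hessian norm is controlled
% by the gradient norm almost everywhere.
\[
  \big\|\nabla^2 f(x)\big\| \;\le\; \ell\big(\|\nabla f(x)\|\big)
  \qquad \text{for almost every } x .
\]
% Taking \ell \equiv L recovers standard L-smoothness, while
% \ell(u) = L_0 + L_1 u recovers the (L_0, L_1)-smoothness condition
% studied in earlier work on gradient clipping.
```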


In addition, the new analysis technique allows us to obtain an improved convergence result for the Adaptive Moment Estimation (Adam) method. Despite Adam's popularity and effectiveness in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to establish convergence to stationary points. In this thesis, we show that Adam provably converges to stationary points under far more realistic conditions. In particular, we dispense with the strong assumptions made in previous works and also handle the generalized smoothness condition.
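
For reference, here is a minimal sketch of the standard Adam update that such convergence analyses concern; the hyperparameter values (lr, beta1, beta2, eps) are common defaults for illustration, not values taken from the thesis.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update (Kingma & Ba, 2015).

    x: current iterate; grad: stochastic gradient at x;
    m, v: first- and second-moment estimates; t: step count starting at 1.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise adaptive step
    return x, m, v
```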


However, the above results cannot explain why adaptive methods like Adam significantly outperform SGD in machine learning applications such as training transformers, since the convergence rate we obtain for Adam is no faster than that of SGD. To better understand the role of adaptivity in optimization for machine learning, we propose a directional smoothness condition motivated by transformer experiments. Under this condition, we obtain faster convergence rates for certain adaptive methods, including memoryless Adam and RMSProp, than for SGD.
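
For concreteness, here is a minimal sketch of the standard RMSProp update, one of the adaptive methods named above; the precise "memoryless Adam" variant and the directional smoothness condition analyzed in the thesis are not specified in this abstract, so this is only an illustrative reference implementation with assumed default hyperparameters.

```python
import numpy as np

def rmsprop_step(x, grad, v, lr=1e-3, beta2=0.99, eps=1e-8):
    """One step of RMSProp: Adam-style second-moment scaling without momentum."""
    v = beta2 * v + (1 - beta2) * grad**2   # running average of squared gradients
    x = x - lr * grad / (np.sqrt(v) + eps)  # per-coordinate adaptive step size
    return x, v
```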