Wednesday, October 14, 2020 - 11:00am to 11:30am
Event Calendar Category
LIDS & Stats Tea
Zoom meeting ID
921 4123 0377
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD on important tasks, such as training NLP models. The settings under which SGD performs poorly compared to adaptive methods are not yet well understood. In fact, recent theoretical progress shows that SGD is minimax optimal under canonical settings. In this talk, we provide empirical and theoretical evidence that a different smoothness condition or a heavy-tailed noise distribution could each cause SGD's poor performance. Based on this observation, we study clipped variants of SGD that circumvent this issue; we then analyze their convergence and show that adaptive methods can be provably faster than SGD.
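As a rough illustration of the clipped SGD variant mentioned in the abstract, the sketch below (function and parameter names are hypothetical, not from the talk) rescales each stochastic gradient to a fixed norm threshold before the update, and runs it on a toy quadratic objective with heavy-tailed (Cauchy) gradient noise, the kind of setting where unclipped SGD can struggle:

```python
import math
import random

def clip(grad, threshold):
    """Rescale a gradient vector so its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return grad

def clipped_sgd_step(w, grad, lr=0.05, threshold=1.0):
    """One clipped SGD update: w <- w - lr * clip(grad)."""
    g = clip(grad, threshold)
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Toy objective f(w) = 0.5 * ||w||^2, whose true gradient is w itself.
# Cauchy noise has no finite mean, a stand-in for heavy-tailed gradient noise.
random.seed(0)
w = [5.0, -3.0]
for _ in range(1000):
    noise = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(2)]
    stochastic_grad = [wi + ni for wi, ni in zip(w, noise)]
    w = clipped_sgd_step(w, stochastic_grad)
```

Because the clipped step size is bounded by `lr * threshold`, a single heavy-tailed noise sample cannot throw the iterate arbitrarily far, which is the intuition behind the convergence guarantees discussed in the talk.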
Jingzhao is a PhD student working with Suvrit Sra and Ali Jadbabaie. His research interests are broadly in the analysis and design of fast optimization algorithms.