Modeling and Learning Deep Representations, in Theory and in Practice

Thursday, October 12, 2017 - 2:30pm

Event Calendar Category

LIDS Seminar Series

Speaker Name

Stefano Soatto

Affiliation

University of California, Los Angeles and Amazon AI

Building and Room number

34-401 (Grier Room)

A few things about Deep Learning I find puzzling: 1) How can deep neural networks, optimized by stochastic gradient descent (SGD) that is agnostic to concepts of invariance, minimality and disentanglement, somehow manage to learn representations that exhibit those traits? 2) How do these networks, which can overfit random labels, somehow manage to generalize? 3) How can the "flatness" of minima of the training loss be related to generalization, when flatness is coordinate-dependent?

To tackle these questions I will 1) describe a tight bound relating the amount of information in the weights of the network to the total correlation (a measure of disentanglement), minimality and invariance of the resulting representation; 2) show that if complexity is measured by the information in the parameters (not their number), deep networks follow the bias-variance tradeoff faithfully and there is no need to "rethink" generalization; 3) show that the nuclear norm of the Hessian (a measure of flatness) bounds the information in the weights, which is the regularizer that guarantees that the representation is minimal, sufficient, invariant and maximally disentangled. The resulting information-theoretic framework can predict a sharp phase transition between overfitting and underfitting for random labels, and quantify the amount of information needed to overfit a given dataset with a given network to within a fraction of a nat. The theory has connections with variational Bayesian inference, the Information Bottleneck principle, and PAC-Bayes bounds.

Once a regularized loss function is in place, learning a representation amounts to solving a high-dimensional, non-convex optimization problem. In the second part of the talk, I will highlight some peculiarities of the geometry of the loss surface, and describe Entropy-SGD, an algorithm designed to exploit them using insights from statistical physics. As it turns out, Entropy-SGD computes the solution of a viscous Hamilton-Jacobi partial differential equation (PDE), which leads to a stochastic optimal control version of SGD that is faster than the vanilla version. In the non-viscous limit, the PDE leads to the classical proximal point iteration via the Hopf-Lax formula. The analysis establishes connections between statistical physics, non-convex optimization and the theory of PDEs. Moreover, Entropy-SGD includes as special cases distributed algorithms popular in Deep Learning, e.g., Elastic-SGD, which it outperforms, leading to state-of-the-art generalization error with optimal convergence rates and no extra hyper-parameters to tune.
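For readers unfamiliar with the proximal point iteration mentioned above, the following is a minimal sketch on a toy one-dimensional problem; it is not the Entropy-SGD algorithm from the talk. Each outer step replaces x by the minimizer over y of f(y) + (y - x)^2 / (2t), which is the same minimization that appears in the Hopf-Lax formula. The function names, the toy objective, and the step sizes are assumptions chosen for illustration only.

```python
def prox_step(grad_f, x, t, inner_steps=200, inner_lr=0.05):
    """Approximate argmin_y f(y) + (y - x)^2 / (2 t) by plain gradient descent on y.

    grad_f, inner_steps and inner_lr are hypothetical names/values for this sketch.
    """
    y = x
    for _ in range(inner_steps):
        # Gradient of the regularized objective: f'(y) plus the pull back toward x.
        g = grad_f(y) + (y - x) / t
        y -= inner_lr * g
    return y


def proximal_point(grad_f, x0, t=1.0, outer_steps=50):
    """Run the proximal point iteration x <- prox_{t f}(x) for a fixed number of steps."""
    x = x0
    for _ in range(outer_steps):
        x = prox_step(grad_f, x, t)
    return x


# Toy objective (an assumption, not from the talk): f(x) = 0.5 * (x - 3)^2,
# whose unique minimizer is x = 3.
def grad_f(x):
    return x - 3.0


x_star = proximal_point(grad_f, x0=-5.0)
```

On this quadratic each proximal step moves x halfway toward the minimizer when t = 1, so the iterate approaches 3. For a non-convex loss the inner problem is only solved approximately, which is where stochastic methods would come in.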

Joint work with Alessandro Achille, Pratik Chaudhari and others: https://arxiv.org/abs/1706.01350, https://arxiv.org/abs/1704.04932, https://arxiv.org/abs/1707.00424.

Professor Soatto received his Ph.D. in Control and Dynamical Systems from the California Institute of Technology in 1996; he joined UCLA in 2000 after being Assistant and then Associate Professor of Electrical and Biomedical Engineering at Washington University, and Research Associate in Applied Sciences at Harvard University. Between 1995 and 1998 he was also Ricercatore in the Department of Mathematics and Computer Science at the University of Udine, Italy. He received his D.Ing. degree (highest honors) from the University of Padova, Italy, in 1992.

His general research interests are in Computer Vision and Nonlinear Estimation and Control Theory. In particular, he is interested in ways for computers to use sensory information (e.g. vision, sound, touch) to interact with humans and the environment.

Dr. Soatto is the recipient of the David Marr Prize (with Y. Ma, J. Kosecka and S. Sastry of U.C. Berkeley) for work on Euclidean reconstruction and reprojection up to subgroups. He also received the Siemens Prize with the Outstanding Paper Award from the IEEE Computer Society for his work on optimal structure from motion (with R. Brockett of Harvard). He received the National Science Foundation Career Award and the Okawa Foundation Grant. He is Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and a Member of the Editorial Board of the International Journal of Computer Vision (IJCV) and Foundations and Trends in Computer Graphics and Vision.