Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

2024-03-20

Abstract: We study the convergence of gradient flow for the training of deep neural networks. Although Residual Neural Networks (ResNets) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably because of the non-convexity and non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a "mean-field" model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean-field models have been shown to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.
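The local Polyak-Łojasiewicz (PL) argument mentioned in the abstract follows a standard pattern, which can be sketched in a simplified Euclidean setting. This is only an illustration: the parameter \theta_t, the risk L \ge 0, the constant c > 0, and the radius R are symbols assumed here, and the paper's actual statement is formulated in the conditional Optimal Transport metric rather than in a Euclidean space.

```latex
% Assumption (local PL): for all \theta in the ball B(\theta_0, R),
%   \|\nabla L(\theta)\|^2 \ge c \, L(\theta).
% Gradient flow: \dot{\theta}_t = -\nabla L(\theta_t).

% 1. Exponential decay of the risk while the flow stays in the ball:
\frac{d}{dt} L(\theta_t) = -\|\nabla L(\theta_t)\|^2 \le -c \, L(\theta_t)
\quad\Longrightarrow\quad
L(\theta_t) \le e^{-c t} \, L(\theta_0).

% 2. Bound on the length of the trajectory:
\frac{d}{dt} \sqrt{L(\theta_t)}
  = -\frac{\|\nabla L(\theta_t)\|^2}{2 \sqrt{L(\theta_t)}}
  \le -\frac{\sqrt{c}}{2} \, \|\dot{\theta}_t\|
\quad\Longrightarrow\quad
\int_0^{t} \|\dot{\theta}_s\| \, ds \le \frac{2}{\sqrt{c}} \sqrt{L(\theta_0)}.

% 3. Hence the flow never leaves B(\theta_0, R) as soon as
\frac{2}{\sqrt{c}} \sqrt{L(\theta_0)} < R .
```

Step 3 is where a condition of the form "the risk is sufficiently small at initialization" enters in this simplified picture: a small initial risk keeps the trajectory inside the region where the PL inequality holds, so the exponential decay of step 1 applies for all times.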

Machine Learning, Optimization and Control

What problem does this paper attempt to address?

This paper addresses the convergence of training for deep neural networks, specifically infinitely deep and arbitrarily wide residual networks (ResNets). The authors study a "mean-field" model of such networks, parameterized by a probability measure over the product set of layers and parameters with constant marginal on the set of layers, and train it by gradient flow with respect to the Conditional Optimal Transport (COT) distance.

The main contributions of the paper are:

1. A gradient flow method for infinitely deep and arbitrarily wide ResNets based on the COT distance, a restriction of the classical Wasserstein distance that enforces the marginal condition.
2. A proof that, for well-chosen initializations, the gradient flow converges to a global minimizer when the number of features is finite but sufficiently large and the initial risk is sufficiently small.
3. A thorough study of the COT distance, especially its dynamic form, with some results of independent interest.

The paper models the behavior of arbitrarily wide networks by taking probability measures as parameters, and it analyzes the loss landscape with the theory of gradient flows in metric spaces. The goal is to understand why simple optimization algorithms such as gradient descent succeed in finding global minimizers of non-convex and non-coercive objectives. The study also covers the training dynamics of ResNets as they are trained in practice, where the layer-wise L2 norm on the parameters is consistent with the COT distance.

Finally, the paper compares its results with previous work: other studies of the training of infinitely deep networks either do not provide convergence proofs or require additional regularization or assumptions, whereas this paper gives convergence results for a risk function without regularization.
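To make the COT distance more concrete, here is a minimal toy sketch. It assumes a discretized setting that is not taken from the paper: a finite number of layers with uniform weight, the same number of equally weighted scalar parameters ("particles") in each layer for both measures, and the squared COT distance taken as the average over layers of the squared 2-Wasserstein distance between the per-layer parameter distributions. The function names `w2_squared_1d` and `cot_squared` are hypothetical. In one dimension, the optimal coupling between two uniform empirical measures with the same number of atoms matches sorted values, which keeps the example free of any solver.

```python
def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between two uniform empirical
    measures on the real line with the same number of atoms.
    In 1D the optimal coupling pairs the sorted atoms."""
    if len(xs) != len(ys):
        raise ValueError("both measures need the same number of atoms")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)


def cot_squared(mu, nu):
    """Toy squared conditional OT distance (illustrative assumption).

    mu and nu are lists of layers; each layer is a list of scalar
    parameters. Both measures have the same uniform marginal over
    layers, so mass is only transported within a layer, never
    across layers."""
    if len(mu) != len(nu):
        raise ValueError("both measures need the same number of layers")
    per_layer = [w2_squared_1d(a, b) for a, b in zip(mu, nu)]
    return sum(per_layer) / len(mu)


# Two layers, two particles per layer.
mu = [[0.0, 1.0], [2.0, 3.0]]
nu = [[1.0, 0.0], [2.0, 5.0]]

# Layer 0 holds the same atoms in a different order: cost 0.
# Layer 1 moves the atom 3.0 to 5.0: cost (3 - 5)^2 / 2 = 2.
# Averaging over the two layers gives 1.0.
print(cot_squared(mu, nu))
```

The design point this illustrates is the marginal constraint: because transport happens layer by layer, a parameter in one layer can never be matched with a parameter in another layer, which is what distinguishes this distance from an unrestricted Wasserstein distance on the product space.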

Generalization of Scaled Deep ResNets in the Mean-Field Regime

Computing high-dimensional optimal transport by flow neural networks

Normalized gradient flow optimization in the training of ReLU artificial neural networks

Global Convergence in Training Large-Scale Transformers

Field theory for optimal signal propagation in ResNets

Speed Limits for Deep Learning

On Convergence of Training Loss Without Reaching Stationary Points

Learning with minibatch Wasserstein : asymptotic and gradient properties

The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program

Deep Limits of Residual Neural Networks

Scaling ResNets in the Large-depth Regime

Doubly infinite residual neural networks: a diffusion process approach

Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)

Absence of Closed-Form Descriptions for Gradient Flow in Two-Layer Narrow Networks

Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

An Optimal Transport Analysis on Generalization in Deep Learning

Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

Towards a General Theory of Infinite-Width Limits of Neural Classifiers

Topological obstruction to the training of shallow ReLU neural networks

Neural Sinkhorn Gradient Flow