Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard
2024-03-20
Abstract: We study the convergence of gradient flow for the training of deep neural networks. If Residual Neural Networks are a popular example of very deep architectures, their training constitutes a challenging optimization problem due notably to the non-convexity and the non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a "mean-field" model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have proven to benefit from simplified loss-landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper studies the convergence of training for deep neural networks, specifically infinitely deep and arbitrarily wide residual networks (ResNets). The authors propose a "mean-field" model of such networks, parameterized by probability measures over the product set of layers and parameters with a constant marginal on the set of layers, and train it with gradient flow with respect to the Conditional Optimal Transport (COT) distance. The main contributions of the paper are:
1. Proposing a gradient flow for infinitely deep and arbitrarily wide ResNets based on the COT distance, a restriction of the commonly used Wasserstein distance that enforces the marginal condition on the layers.
2. Proving that, for well-chosen initializations, the gradient flow converges to a global minimizer when the number of features is finite but sufficiently large and the risk at initialization is sufficiently small.
3. Conducting a thorough study of the COT distance, especially its dynamic form, with some results of independent interest.
The paper models arbitrarily wide networks by taking probability measures as parameters, and uses the theory of gradient flows in metric spaces together with a local Polyak-Łojasiewicz analysis to understand why simple optimization algorithms such as gradient descent succeed on a non-convex and non-coercive objective. The study also connects the COT gradient flow to how ResNets are trained in practice, with layer-wise updates measured in the L2 norm, and shows that the flow is consistent with the training of ResNets at finite width. Finally, the paper positions itself against previous work: other studies of infinitely deep networks either give no convergence proof or require additional regularization or assumptions, whereas this paper proves convergence for a risk without regularization.
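To make the marginal constraint concrete, here is a minimal Python sketch of a discretized conditional OT distance. It is an illustration under stated assumptions, not the paper's definition or code: it assumes the squared distance can be written as a layer average of squared 2-Wasserstein distances between the per-layer (conditional) parameter distributions, it uses scalar parameters so that each per-layer distance reduces to matching sorted samples, and the names `w2_squared_1d` and `cot_squared` are hypothetical.

```python
from typing import List


def w2_squared_1d(xs: List[float], ys: List[float]) -> float:
    """Squared 2-Wasserstein distance between two uniform empirical
    measures on the real line with the same number of atoms.
    In one dimension the optimal coupling pairs the sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("both measures need the same number of atoms")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)


def cot_squared(mu: List[List[float]], nu: List[List[float]]) -> float:
    """Squared conditional OT distance between two layer-indexed
    empirical measures (assumed form: average over layers of the
    per-layer squared W2 distance).

    mu[k] and nu[k] hold the scalar parameters of layer k. Mass is
    only transported within a layer, never across layers, which is
    how the constant layer marginal is enforced in this sketch.
    """
    if len(mu) != len(nu):
        raise ValueError("both measures need the same number of layers")
    per_layer = [w2_squared_1d(a, b) for a, b in zip(mu, nu)]
    return sum(per_layer) / len(mu)


# Two layers, two parameters ("particles") per layer.
mu = [[0.0, 1.0], [2.0, 3.0]]
nu = [[1.0, 0.0], [2.0, 5.0]]
print(cot_squared(mu, nu))
```

In this toy example the first layer contributes nothing, because the two parameter sets coincide up to reordering, and the whole distance comes from the second layer. An unconstrained Wasserstein distance on the product set of layers and parameters would also be allowed to move mass between layers, which is exactly what the conditional version rules out.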