Abstract: We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the Euclidean output space. Using tools from Riemannian optimization, we prove *last-iterate* convergence of the Riemannian gradient flow to the optimal in-class predictor at an *exponential rate* that is independent of the conditioning of the Gram matrix, *without* requiring explicit regularization. We further characterize the critical impacts of the neural network scaling factor and the initialization on the convergence behavior. In the overparameterized regime, we show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels, analogous to the underparameterized regime. These findings demonstrate the potential of Gauss-Newton methods for efficiently optimizing neural networks, particularly in ill-conditioned problems where kernel and Gram matrices have small singular values.
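
To make the first claim concrete, here is a short worked derivation in notation assumed for illustration (it is not taken from the paper, and it ignores the network scaling factor). Let \(f(\theta) \in \mathbb{R}^n\) collect the predictions on \(n\) training points, let \(J_t = \mathrm{D}f(\theta_t) \in \mathbb{R}^{n \times p}\) be the Jacobian with \(p < n\) and full column rank, and take the squared loss \(\tfrac{1}{2}\|f(\theta) - y\|^2\). The Gauss-Newton flow and the output dynamics it induces are then

\[
\dot{\theta}_t = -\left(J_t^\top J_t\right)^{-1} J_t^\top \left(f(\theta_t) - y\right),
\qquad
\frac{d}{dt} f(\theta_t) = J_t \dot{\theta}_t = -J_t \left(J_t^\top J_t\right)^{-1} J_t^\top \left(f(\theta_t) - y\right) = -P_t \left(f(\theta_t) - y\right).
\]

Here \(P_t\) is the orthogonal projector onto the column space of \(J_t\), which is the tangent space of the set of attainable outputs at \(f(\theta_t)\). The outputs therefore follow the gradient of \(\tfrac{1}{2}\|f - y\|^2\) projected onto that tangent space, and no eigenvalue of \(J_t^\top J_t\) appears in the output dynamics. Under these assumptions, that is one way to see why the rate can be independent of the conditioning of the Gram matrix.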

What problem does this paper attempt to address?

This paper addresses the convergence of Gauss-Newton methods in neural network training, in both the underparameterized and the overparameterized regimes. Specifically:

1. **Gauss-Newton Dynamics in the Underparameterized Regime**:
   - Using Riemannian optimization theory, the author proves that the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the output space (Theorem 4).
   - Using Riemannian optimization tools, the author establishes geodesic strong convexity and Lipschitz continuity results (Lemma 7 and Corollary 1) to analyze the behavior of the Gauss-Newton dynamics.
   - Finally, the author proves that the Gauss-Newton method converges to the optimal in-class predictor at an exponential rate without explicit regularization (Theorem 5). By contrast, first-order methods such as stochastic gradient descent require explicit regularization to guarantee convergence of the averaged iterate, and only at a sub-exponential rate.
2. **Levenberg-Marquardt Dynamics in the Overparameterized Regime**:
   - In the overparameterized regime, the Jacobian Gram matrix \(D_f^\top D_f\) is rank-deficient, so damping (regularization) must be added to the preconditioner, which yields the Levenberg-Marquardt dynamics.
   - The author proves convergence of this method in continuous time and in discrete time (Theorem 1 and Theorem 2), and shows that the convergence rate is independent of the minimum eigenvalue of the neural tangent kernel matrix, which greatly speeds up convergence when the neural tangent kernel is ill-conditioned.

In summary, the main contributions of this paper are:

- **Theoretical Innovation**: It studies the Gauss-Newton dynamics of underparameterized neural network training using tools from Riemannian geometry.
- **Algorithm Advantage**: It shows that Gauss-Newton methods can optimize neural networks effectively in both the underparameterized and the overparameterized regimes, with a particular advantage on ill-conditioned problems.
- **Mathematical Rigor**: It supports each conclusion with a detailed mathematical proof.

These findings offer a new perspective on efficient optimization of neural networks and highlight the advantages of Gauss-Newton methods on ill-conditioned problems.
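
The damped Gauss-Newton (Levenberg-Marquardt) update discussed above can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or experimental setup: the tiny tanh network, the toy data, the damping value, and the backtracking rule for the step size are all hypothetical choices made so that the example is self-contained.

```python
import numpy as np

def predict(theta, X):
    """One-hidden-layer tanh network; theta packs W (m x d), then output weights c (m)."""
    d = X.shape[1]
    m = theta.size // (d + 1)
    W = theta[: m * d].reshape(m, d)
    c = theta[m * d:]
    return np.tanh(X @ W.T) @ c / np.sqrt(m)

def jacobian(theta, X):
    """Analytic Jacobian of the n predictions with respect to theta, shape (n, p)."""
    n, d = X.shape
    m = theta.size // (d + 1)
    W = theta[: m * d].reshape(m, d)
    c = theta[m * d:]
    H = np.tanh(X @ W.T)                                  # (n, m) hidden activations
    dW = ((1.0 - H**2) * c)[:, :, None] * X[:, None, :]   # (n, m, d)
    return np.hstack([dW.reshape(n, m * d), H]) / np.sqrt(m)

def loss(theta, X, y):
    return 0.5 * np.sum((predict(theta, X) - y) ** 2)

def lm_direction(theta, X, y, damping):
    """Solve (J^T J + damping * I) d = -J^T r for the damped Gauss-Newton direction."""
    J = jacobian(theta, X)
    r = predict(theta, X) - y
    A = J.T @ J + damping * np.eye(theta.size)
    return -np.linalg.solve(A, J.T @ r)

def train(theta, X, y, damping=1e-3, iters=50):
    """Damped Gauss-Newton with step halving (the halving rule is an assumption)."""
    losses = [loss(theta, X, y)]
    for _ in range(iters):
        d = lm_direction(theta, X, y, damping)
        t = 1.0
        # Halve the step until the loss strictly decreases, or give up on this step.
        while t > 1e-8 and loss(theta + t * d, X, y) >= losses[-1]:
            t *= 0.5
        if t > 1e-8:
            theta = theta + t * d
        losses.append(loss(theta, X, y))
    return theta, losses

# Toy underparameterized problem: n = 20 samples, p = 9 parameters (m = 3, d = 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
theta0 = rng.normal(size=9)
theta_hat, losses = train(theta0, X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

Because `J.T @ J + damping * I` is positive definite for any positive damping, the computed direction is a descent direction whenever the gradient `J.T @ r` is nonzero, which is why the step-halving loop can always find an acceptable step. Setting `damping` to zero recovers the plain Gauss-Newton direction, which is only well defined when `J` has full column rank.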

Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective
