Abstract: First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have proven effective in training neural networks. In the over-parameterized setting, a line of work demonstrates that randomly initialized (stochastic) gradient descent converges to a globally optimal solution at a linear rate for the quadratic loss function. However, the learning rate of GD for training two-layer neural networks depends poorly on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for $L^2$ regression problems, the learning rate can be improved from $\mathcal{O}(\lambda_0/n^2)$ to $\mathcal{O}(1/\|\bm{H}^{\infty}\|_2)$, which implies that GD actually enjoys a faster convergence rate. Furthermore, we generalize the method to GD for training two-layer Physics-Informed Neural Networks (PINNs), showing a similar improvement in the learning rate. Although the improved learning rate depends only mildly on the Gram matrix, it must still be set small enough in practice because the eigenvalues of the Gram matrix are unknown. More importantly, the convergence rate is tied to the least eigenvalue of the Gram matrix, which can lead to slow convergence. To address this, we provide a convergence analysis of natural gradient descent (NGD) for training two-layer PINNs, demonstrating that the learning rate can be $\mathcal{O}(1)$ and that, at this rate, the convergence rate is independent of the Gram matrix.
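To give a rough sense of the gap between the two learning rates, the following is a minimal numerical sketch and not the paper's method. It rests on several assumptions not stated in the abstract: that $\lambda_0$ denotes the least eigenvalue of $\bm{H}^{\infty}$, that $\bm{H}^{\infty}$ takes the closed form commonly used for two-layer ReLU networks with unit-norm inputs, that all constants hidden in the $\mathcal{O}(\cdot)$ bounds equal 1, and that the residual follows the idealized linearized dynamics $r_{k+1} = (I - \eta \bm{H}^{\infty}) r_k$. The helper name `relu_gram` and the synthetic data are hypothetical.

```python
import numpy as np

def relu_gram(X):
    """Assumed closed form of the limiting Gram matrix for a two-layer ReLU
    network with unit-norm inputs:
    H_ij = <x_i, x_j> * (pi - arccos(<x_i, x_j>)) / (2 * pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)  # clip guards arccos against rounding error
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

# Synthetic unit-norm inputs (illustrative only, not data from the paper).
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

H = relu_gram(X)
eigs = np.linalg.eigvalsh(H)        # eigenvalues of symmetric H, ascending
lam0, spec_norm = eigs[0], eigs[-1]  # least eigenvalue and spectral norm

# The two learning-rate scalings from the abstract, with constants set to 1.
eta_old = lam0 / n**2
eta_new = 1.0 / spec_norm

# Under the linearized dynamics, the slowest mode of the residual contracts
# by (1 - eta * lam0) per GD step.
rate_old = 1.0 - eta_old * lam0
rate_new = 1.0 - eta_new * lam0

# Idealized NGD preconditions by the inverse Gram matrix, so the linearized
# residual contracts by (1 - eta) regardless of the spectrum of H.
eta_ngd = 0.5                        # an arbitrary O(1) choice for illustration
rate_ngd = 1.0 - eta_ngd

print(f"lambda_0 = {lam0:.4g}, ||H||_2 = {spec_norm:.4g}")
print(f"GD  eta = lambda_0/n^2 : per-step contraction {rate_old:.8f}")
print(f"GD  eta = 1/||H||_2    : per-step contraction {rate_new:.8f}")
print(f"NGD eta = {eta_ngd}        : per-step contraction {rate_ngd:.8f}")
```

In this idealized setting the larger step size yields a visibly smaller contraction factor, yet that factor still depends on the ratio $\lambda_0/\|\bm{H}^{\infty}\|_2$, whereas the NGD factor does not involve the Gram matrix at all. The actual theorems concern the nonlinear training dynamics and carry constants and width requirements that this sketch ignores.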

On the convergence of gradient descent for two layer neural networks

Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Stochastic Gradient Descent for Two-layer Neural Networks

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

A Comparative Analysis of Optimization and Generalization Properties of Two-Layer Neural Network and Random Feature Models under Gradient Descent Dynamics

Fast Convergence in Learning Two-Layer Neural Networks with Separable Data

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Training Over-parameterized Deep ResNet is Almost As Easy As Training a Two-layer Network

On the Convergence of Gradient Descent for Large Learning Rates

Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Convergence of continuous-time stochastic gradient descent with applications to linear deep neural networks

Leveraging the two timescale regime to demonstrate convergence of neural networks

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Analysis of Boundedness and Convergence of Online Gradient Method for Two-Layer Feedforward Neural Networks

A Geometric Approach of Gradient Descent Algorithms in Linear Neural Networks

On the Banach Spaces Associated with Multi-Layer ReLU Networks: Function Representation, Approximation Theory and Gradient Descent Dynamics

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network