Abstract: It is well understood that neural networks with carefully hand-picked weights provide powerful function approximation and that they can be successfully trained in over-parametrized regimes. Since over-parametrization ensures zero training error, these two theories are not immediately compatible. Recent work uses the smoothness that is required for approximation results to extend a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and show direct approximation bounds for networks trained by gradient flow. Since gradient flow is only an idealization of a practical method, this paper establishes analogous results for networks trained by gradient descent.
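
For orientation, the distinction between gradient flow and gradient descent drawn in the abstract can be written out as follows. The symbols \( \mathcal{L} \) (training loss) and \( h \) (step size) are notation assumed here for illustration and may differ from the paper's own. Gradient flow evolves the parameters in continuous time, while gradient descent is its explicit Euler discretization:

\[
\frac{d\theta(t)}{dt} = -\nabla \mathcal{L}(\theta(t))
\qquad \text{versus} \qquad
\theta_{k+1} = \theta_k - h \, \nabla \mathcal{L}(\theta_k), \quad k = 0, 1, 2, \dots
\]

In this notation, gradient flow is formally recovered in the limit \( h \to 0 \) with \( t = kh \) held fixed.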

What problem does this paper attempt to address?

This paper addresses the apparent incompatibility between approximation results and optimization results for neural networks. Specifically, it focuses on the following two aspects:

1. **Unified analysis of approximation error and optimization error**:
   - The approximation error measures how well a neural network \( f_{\theta} \) approximates the target function \( f \) in the \( L^2 \) norm.
   - The optimization error is the error incurred when training the network with methods such as gradient descent.
2. **Over-parameterized versus under-parameterized regimes**:
   - In the over-parameterized regime, a neural network can achieve zero training error, so results obtained in that regime are not directly compatible with classical approximation theory.
   - The paper extends previous results based on gradient flow (the idealized continuous-time limit) to the more practical gradient descent method and studies its performance in the under-parameterized regime.

### Specific problems

- **Shallow networks in the one-dimensional case**:
  - Train a shallow network \( f_{\theta}(x)=\frac{1}{\sqrt{m}}\sum_{r = 1}^{m}a_r\sigma(x - b_r) \) using gradient descent, where the weights \( a_r \) are fixed at random values \( \pm 1 \) and only the biases \( b_r \) are trained.
  - Study how the approximation error and the optimization error evolve during training in this setting.
- **Deep networks in the multi-dimensional case**:
  - Consider a fully connected deep network \( f(x)=f_{L + 1}(x) \) in which only the weights \( W_{L-1} \) of the second-to-last layer are trained; all other weights are randomly initialized and kept fixed.
  - Analyze the approximation error and the optimization error in this setting and verify that the neural tangent kernel (NTK) argument applies.

### Key conclusions

- For shallow networks, under appropriate conditions, gradient descent reduces the error exponentially until it reaches a final approximation error determined by the initial error and the network size.
- For deep networks, similar results hold, and all assumptions (except for the mandatory conditions on the NTK) are easy to verify.

Through these results, the paper aims to bridge the gap between theory and practice, and in particular to give a deeper understanding of how approximation and optimization interact in neural network training.
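
The shallow one-dimensional setup described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's experiment: the ReLU activation, the uniform grid on \([0, 1]\), the target \( \sin(2\pi x) \), the discrete loss \( \frac{1}{2n}\sum_i (f_\theta(x_i) - y_i)^2 \), the width, the step size, and all function names are assumptions made here. Only the network form, the fixed \( \pm 1 \) weights \( a_r \), and the trained biases \( b_r \) come from the summary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_net(x, a, b):
    """f_theta(x_i) = m^{-1/2} * sum_r a_r * relu(x_i - b_r) for every sample x_i."""
    m = a.shape[0]
    return relu(x[:, None] - b[None, :]) @ a / np.sqrt(m)

def loss_and_grad(x, y, a, b):
    """Discrete L2 loss 0.5 * mean((f - y)^2) and its gradient w.r.t. the biases b."""
    n, m = x.shape[0], a.shape[0]
    residual = shallow_net(x, a, b) - y                # shape (n,)
    active = (x[:, None] > b[None, :]).astype(float)   # relu'(x_i - b_r), shape (n, m)
    # d f(x_i) / d b_r = -a_r * relu'(x_i - b_r) / sqrt(m)
    grad_b = -(residual @ active) * a / (np.sqrt(m) * n)
    return 0.5 * np.mean(residual ** 2), grad_b

def train_biases(x, y, a, b0, step_size=0.5, num_steps=300):
    """Plain gradient descent on the biases only; the weights a stay fixed."""
    b = b0.copy()
    losses = []
    for _ in range(num_steps):
        loss, grad_b = loss_and_grad(x, y, a, b)
        losses.append(loss)
        b -= step_size * grad_b
    losses.append(loss_and_grad(x, y, a, b)[0])        # loss after the last step
    return b, losses

# Assumed toy problem: target sin(2*pi*x) sampled on a uniform grid in [0, 1].
rng = np.random.default_rng(0)
m = 500                                                # assumed network width
x = np.linspace(0.0, 1.0, 256)
y = np.sin(2.0 * np.pi * x)
a = rng.choice([-1.0, 1.0], size=m)                    # fixed random +/-1 weights
b0 = rng.uniform(0.0, 1.0, size=m)                     # initial biases (assumed distribution)

b_trained, losses = train_biases(x, y, a, b0)
print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Plotting `losses` on a logarithmic scale is one way to look for the behavior the summary describes, namely a fast initial decay that levels off at a residual error. The sketch itself makes no claim about matching the paper's rates or constants.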

Approximation and Gradient Descent Training with Neural Networks
