Approximation and Gradient Descent Training with Neural Networks

G. Welper
2024-05-20
Abstract: It is well understood that neural networks with carefully hand-picked weights provide powerful function approximation and that they can be successfully trained in over-parametrized regimes. Since over-parametrization ensures zero training error, these two theories are not immediately compatible. Recent work uses the smoothness that is required for approximation results to extend a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and show direct approximation bounds for networks trained by gradient flow. Since gradient flow is only an idealization of a practical method, this paper establishes analogous results for networks trained by gradient descent.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the apparent incompatibility between approximation results and optimization results for neural networks. Specifically, it focuses on two issues:

1. **Unified analysis of approximation error and optimization error**:
   - The approximation error measures how well a neural network \( f_{\theta} \) approximates the target function \( f \) in the \( L^2 \) norm.
   - The optimization error is the error incurred when the network is trained by a method such as gradient descent.
2. **Over-parameterized versus under-parameterized regimes**:
   - In the over-parameterized regime, a neural network can achieve zero training error, so the theoretical results obtained there are not immediately compatible with classical approximation theory.
   - The paper extends earlier results for gradient flow (the idealized continuous-time limit) to the more practical gradient descent method and studies its performance in the under-parameterized regime.

### Specific problems

- **Shallow networks in the one-dimensional case**:
  - Train a shallow network \( f_{\theta}(x)=\frac{1}{\sqrt{m}}\sum_{r = 1}^{m}a_r\sigma(x - b_r) \) by gradient descent, where the weights \( a_r \) are fixed at random values \( \pm1 \) and only the biases \( b_r \) are trained.
  - Study how the approximation error and the optimization error evolve during training in this setting.
- **Deep networks in the multi-dimensional case**:
  - Consider a fully connected deep network \( f(x)=f_{L + 1}(x) \) in which only the weights \( W_{L-1} \) of the second-to-last layer are trained, while all other weights are randomly initialized and kept fixed.
  - Analyze the approximation error and the optimization error in this setting and verify the effectiveness of the neural tangent kernel (NTK) method.
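The shallow one-dimensional setting can be illustrated with a minimal NumPy sketch. It is not the paper's experiment: the ReLU activation, the target function \( \sin(2\pi x) \), the uniform sample grid on \([0, 1]\), the uniform bias initialization, the step size, and the helper names `shallow_net`, `loss_and_grad`, and `train_biases` are all assumptions chosen for illustration. The \( L^2 \) error is replaced by a mean squared error over the sample grid.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_net(x, a, b):
    """f_theta(x) = (1/sqrt(m)) * sum_r a_r * sigma(x - b_r), with sigma = ReLU (assumed)."""
    m = a.shape[0]
    return relu(x[:, None] - b[None, :]) @ a / np.sqrt(m)

def loss_and_grad(x, y, a, b):
    """Discretized squared L2 loss and its gradient with respect to the biases b."""
    m = a.shape[0]
    pre = x[:, None] - b[None, :]               # shape (n, m)
    residual = relu(pre) @ a / np.sqrt(m) - y   # shape (n,)
    loss = 0.5 * np.mean(residual ** 2)
    # d f(x_i) / d b_r = -(a_r / sqrt(m)) * 1[x_i > b_r]
    dfdb = -(pre > 0).astype(float) * a[None, :] / np.sqrt(m)
    grad = residual @ dfdb / x.shape[0]         # shape (m,)
    return loss, grad

def train_biases(x, y, m=200, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on the biases only; the outer weights a_r stay fixed."""
    rng = np.random.default_rng(seed)
    a = rng.choice([-1.0, 1.0], size=m)   # fixed random signs
    b = rng.uniform(0.0, 1.0, size=m)     # trainable biases (assumed initialization)
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(x, y, a, b)
        losses.append(loss)
        b = b - lr * grad                 # discrete gradient descent step
    losses.append(loss_and_grad(x, y, a, b)[0])
    return a, b, np.array(losses)

# Illustrative target and sample grid (assumptions, not taken from the paper).
x = np.linspace(0.0, 1.0, 256)
y = np.sin(2 * np.pi * x)
a, b, losses = train_biases(x, y)
```

Plotting `losses` against the step index gives a rough picture of how the training error evolves. The sketch says nothing about the rates or constants proved in the paper.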
### Key conclusions

- For shallow networks, under appropriate conditions, gradient descent can exponentially reduce the error until a final approximation error determined by the initial error and network scale is reached.
- For deep networks, similar results also hold, and all assumptions (except for the mandatory conditions of NTK) are easy to verify.

Through these studies, the paper aims to bridge the gap between theory and practice, especially to provide a deeper understanding of approximation and optimization problems in neural network training.