Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma
DOI: https://doi.org/10.48550/arXiv.1810.05369
2020-04-25
Abstract: Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.
Machine Learning
What problem does this paper attempt to address?
### The problems the paper attempts to solve

The paper studies the role of regularization in the generalization of neural networks, in comparison with the kernel methods that neural networks induce (such as the Neural Tangent Kernel, NTK). Specifically, the authors study how explicit regularization affects the optimization and generalization of neural networks in the over-parameterized regime.

### Main contributions

1. **Comparison of generalization performance**:
   - The authors construct a simple data distribution \(D\) on which a two-layer neural network that optimizes an explicitly regularized loss needs only \(O(d)\) samples to learn, while the NTK method requires \(\Omega(d^{2})\) samples. This shows that regularization can significantly improve sample efficiency.
2. **Optimization theory**:
   - The authors prove that for a two-layer neural network of infinite width, noisy gradient descent finds a global optimum in polynomial time. This improves on previous work, which studied optimization in the infinite-width limit but did not provide a polynomial convergence rate.
3. **Maximum-margin solution**:
   - The authors prove that under weak regularization, the optimized neural network attains the maximum normalized margin. This result applies not only to two-layer ReLU networks but also to feed-forward ReLU networks of any depth and width. The maximum-margin solution usually generalizes better.

### Specific content

#### 1. Analysis of generalization performance

- **Theorem 2.1**: Consider a two-layer neural network with ReLU activation, with the goal of achieving a small generalization error on distribution \(D\). Using \(o(d^{2})\) samples, no function in the RKHS induced by the NTK can successfully learn \(D\). On the other hand, the global optimum of the \(\ell_{2}\)-regularized logistic loss can learn \(D\) with \(O(d)\) samples.
- **Intuition**: Regularization allows the neural network to obtain a better margin than the fixed NTK kernel, and therefore better generalization.

#### 2. Optimization theory

- **Theorem 3.3**: For a two-layer network of infinite width, noisy gradient descent on the \(\ell_{2}\)-regularized loss finds a global optimum in polynomial time.
- **Perturbed Wasserstein gradient flow**: The authors propose modified dynamics. Adding a very small amount of uniform noise ensures that enough mass moves along the descent direction at each time step, so that the algorithm decreases the objective function in polynomial time.

#### 3. Maximum-margin solution

- **Theorem 4.1**: Suppose the training data can be separated by a network \(f(\cdot; \Theta^{*})\) with optimal normalized margin \(\gamma^{*}>0\). Then the normalized margin \(\gamma_{\lambda}\) of the global optimum of the weakly regularized objective (Equation 4.1) approaches the maximum margin \(\gamma^{*}\) as the regularization strength \(\lambda\) approaches zero.
- **Corollary 4.2**: Combined with existing Rademacher complexity bounds, this implies that the optimum of the weakly regularized logistic loss has a width-independent generalization bound that depends on the reciprocal of the maximum margin and on the network depth.

### Conclusion

Through theoretical analysis and experimental verification, this paper shows that explicit regularization improves both the generalization performance and the optimization efficiency of neural networks. In particular, regularization helps neural networks learn from a limited number of samples, and for a two-layer network of infinite width, noisy gradient descent finds a global optimum in polynomial time. These results offer a new perspective on over-parameterization in deep learning.
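
To make the normalized margin from Theorem 4.1 concrete, here is a minimal NumPy sketch. It rests on assumptions not spelled out in the summary above: a bias-free two-layer ReLU network \(f(x; \Theta) = \sum_j w_j \,\mathrm{relu}(u_j \cdot x)\) with \(\Theta = (U, w)\), labels in \(\{-1, +1\}\), and the Euclidean norm over all parameters. The function names and the toy data are hypothetical and chosen only for illustration. Because this network is 2-homogeneous in \(\Theta\), evaluating it at \(\Theta / \|\Theta\|_2\) is the same as dividing its output by \(\|\Theta\|_2^2\).

```python
import numpy as np


def two_layer_relu(X, U, w):
    """f(x) = sum_j w_j * relu(u_j . x) for each row x of X (no biases; illustrative)."""
    return np.maximum(X @ U.T, 0.0) @ w


def normalized_margin(X, y, U, w):
    """min_i y_i * f(x_i; Theta / ||Theta||_2), computed via 2-homogeneity of f."""
    theta_norm_sq = np.sum(U ** 2) + np.sum(w ** 2)  # ||Theta||_2^2 over all parameters
    return np.min(y * two_layer_relu(X, U, w)) / theta_norm_sq


# Hypothetical toy data: two points on opposite sides of the origin.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# Hypothetical parameters: one hidden unit per class.
U = np.array([[1.0, 0.0], [-1.0, 0.0]])
w = np.array([1.0, -1.0])

margin = normalized_margin(X, y, U, w)
rescaled_margin = normalized_margin(X, y, 3.0 * U, 3.0 * w)
print(margin, rescaled_margin)
```

Rescaling all parameters by a positive constant leaves the normalized margin unchanged. This is why the margin is compared across networks at unit norm, and why a vanishing \(\ell_2\) penalty can select a direction in parameter space rather than a scale.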