Abstract: A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first-order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions, including sparse polynomials. Recent works have thus aimed to identify settings where gradient-based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term, something neither the NTK nor the QuadNTK can do on its own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
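
As a rough sketch of the Taylor expansion the abstract refers to, consider a two-layer network of width $m$ with fixed second-layer weights $a_r$, a twice-differentiable activation $\sigma$, first-layer weights $w_{0,r}$ at initialization, and a displacement $w_r$ from initialization. This notation is an illustrative assumption and is not taken from the abstract itself.

```latex
\begin{align*}
f(x; W_0 + W) &= \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \,
  \sigma\big((w_{0,r} + w_r)^\top x\big) \\
&\approx f(x; W_0)
  + \underbrace{\frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \,
      \sigma'(w_{0,r}^\top x)\,(w_r^\top x)}_{\text{first-order (NTK) term}}
  + \underbrace{\frac{1}{2\sqrt{m}} \sum_{r=1}^{m} a_r \,
      \sigma''(w_{0,r}^\top x)\,(w_r^\top x)^2}_{\text{second-order (QuadNTK) term}}
\end{align*}
```

Under these assumptions, the lazy-training regime corresponds to the case where the first-order term dominates, while the QuadNTK approach studies the second-order term; the abstract's contribution concerns using both terms jointly.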

Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks

Fundamental computational limits of weak learnability in high-dimensional multi-index models

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

Learning Time-Scales in Two-Layers Neural Networks

Low-dimensional Intrinsic Dimension Reveals a Phase Transition in Gradient-Based Learning of Deep Neural Networks

Does SGD really happen in tiny subspaces?

Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics

SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks

Multi-scale Feature Learning Dynamics: Insights for Double Descent

Can Shallow Neural Networks Beat the Curse of Dimensionality? A mean field training perspective

Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

High-dimensional SGD aligns with emerging outlier eigenspaces

When and how epochwise double descent happens

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

Effective Rank and the Staircase Phenomenon: New Insights into Neural Network Training Dynamics