Abstract: Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and learns features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that (i) the two regimes are separated by an $\alpha^*$ that scales as $h^{-1/2}$; (ii) network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks; (iii) in both regimes, the fluctuations $\delta F$ induced on the learned function by initial conditions decay as $\delta F\sim 1/\sqrt{h}$, leading to a performance that increases with $h$, and the same improvement can also be obtained at an intermediate width by ensemble-averaging several networks; (iv) in the feature-training regime we identify a time scale $t_1\sim\sqrt{h}\alpha$, such that for $t\ll t_1$ the dynamics is linear.
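To make the last-layer scaling $\alpha h^{-1/2}$ concrete, here is a minimal sketch of what it implies at initialization. It is an illustration only, not the setup used in the experiments: the one-hidden-layer ReLU architecture, the Gaussian initialization, the fixed input, and the names `init_output` and `output_std` are all assumptions made for this example. With last-layer weights of standard deviation $\alpha/\sqrt{h}$, the spread of the output over random initializations is set by $\alpha$ and stays roughly independent of the width $h$.

```python
import math
import random
import statistics


def init_output(x, h, alpha, rng):
    """Output at initialization of a hypothetical one-hidden-layer ReLU
    network whose last-layer weights have std alpha / sqrt(h)."""
    d = len(x)
    out = 0.0
    for _ in range(h):
        # Hidden pre-activation with first-layer weights ~ N(0, 1/d)
        # (an assumed initialization, chosen so the pre-activation is O(1)).
        pre = sum(rng.gauss(0.0, 1.0) * xi for xi in x) / math.sqrt(d)
        # Last-layer weight scaled as alpha * h^{-1/2}.
        a = rng.gauss(0.0, alpha / math.sqrt(h))
        out += a * max(pre, 0.0)
    return out


def output_std(h, alpha, n_seeds=400, seed=0):
    """Standard deviation of the initial output over random initializations."""
    rng = random.Random(seed)
    x = [1.0] * 8  # fixed illustrative input
    samples = [init_output(x, h, alpha, rng) for _ in range(n_seeds)]
    return statistics.pstdev(samples)


if __name__ == "__main__":
    for h in (16, 64, 256):
        print(h, output_std(h, alpha=1.0), output_std(h, alpha=0.1))
```

For this toy input the pre-activation has unit variance, so the output variance at initialization is $\alpha^2/2$ for every $h$. The sketch covers initialization only; it does not reproduce the training dynamics, the crossover at $\alpha^*$, or the $\delta F\sim 1/\sqrt{h}$ decay of the learned function.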
