Abstract: We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
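
The learning rate scaling stated in the abstract can be illustrated with a minimal sketch. The function name `scaled_learning_rate`, the `base_lr` parameter, and the hard crossover at $\gamma = 1$ are assumptions made for illustration only; the abstract specifies just the two asymptotic regimes ($\gamma \ll 1$ and $\gamma \gg 1$), not the behavior in between or the proportionality constant.

```python
def scaled_learning_rate(gamma: float, depth: int, base_lr: float = 1.0) -> float:
    """Heuristic learning rate following the two asymptotic scalings.

    eta ~ gamma**2          for gamma << 1 (lazy side)
    eta ~ gamma**(2/depth)  for gamma >> 1 (ultra-rich side)

    Assumption: the two power laws are glued together at gamma == 1,
    where both equal base_lr. The true crossover is not given in the text.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Pick the exponent for the regime that gamma falls into.
    exponent = 2.0 if gamma < 1.0 else 2.0 / depth
    return base_lr * gamma ** exponent


if __name__ == "__main__":
    # Sweep gamma for a hypothetical depth-3 feed-forward network.
    for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(f"gamma={gamma:g}  eta={scaled_learning_rate(gamma, depth=3):.6g}")
```

In practice `base_lr` would be tuned once at a reference value of $\gamma$ and then transferred with this rule; the sketch only captures the exponents, so it is a starting point for a sweep rather than a replacement for one.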

Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Why (and When) does Local SGD Generalize Better than SGD?

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Noisy Truncated SGD: Optimization and Generalization

Understanding the Generalization Benefits of Late Learning Rate Decay

Revisiting the Noise Model of Stochastic Gradient Descent

An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks

On the different regimes of Stochastic Gradient Descent

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

The Optimization Landscape of SGD Across the Feature Learning Strength

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult

Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects