Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.

What problem does this paper attempt to address?

This paper investigates the loss spikes that commonly appear when neural networks are trained with stochastic gradient descent (SGD). It argues that these spikes are instances of "catapults", an optimization phenomenon first observed in gradient descent (GD) with large learning rates. Empirically, the catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD, and each catapult reduces the norm of the tangent kernel, consistent with earlier observations in GD.

The paper then offers an explanation for why catapults improve generalization: they promote feature learning, measured by the alignment between the average gradient outer product (AGOP) of the trained network and the AGOP of the true predictor. Extending previous work, it shows that test performance continues to improve as the number of catapults in GD increases. Smaller batch sizes in SGD induce more catapults, which in turn improves AGOP alignment and test performance. Experiments with different optimization algorithms across multiple network architectures and datasets support AGOP alignment as an effective indicator of generalization.

The main contribution is to connect three seemingly unrelated observations, namely loss spikes in SGD training, the catapult phenomenon in GD, and the better generalization of small-batch SGD, and to provide experimental evidence for that connection.
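To make the AGOP alignment measure concrete, the following is a minimal sketch, not the paper's implementation. It assumes the usual definition of the AGOP of a scalar predictor f as the average of the outer products of its input gradients over the data, and it measures alignment as the cosine similarity between two AGOP matrices. The function names `agop` and `agop_alignment`, the toy quadratic single-index target, and the use of finite differences in place of automatic differentiation are all illustrative assumptions.

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Average gradient outer product of a scalar function f over the rows of X.

    Gradients with respect to the input are estimated by central finite
    differences (an assumption made to keep the sketch dependency-free).
    """
    n, d = X.shape
    G = np.zeros((d, d))
    for x in X:
        grad = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            grad[j] = (f(x + e) - f(x - e)) / (2 * eps)
        G += np.outer(grad, grad)
    return G / n

def agop_alignment(A, B):
    """Cosine similarity of two matrices under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

# Hypothetical toy setup: the "true predictor" depends on one input direction.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
w_good = np.array([0.9, 0.1, 0.0, 0.0, 0.0])  # model that found nearly the right feature
w_bad = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # model that relies on an irrelevant feature

def make_model(w):
    return lambda x: (w @ x) ** 2

G_true = agop(make_model(w_true), X)
align_good = agop_alignment(agop(make_model(w_good), X), G_true)
align_bad = agop_alignment(agop(make_model(w_bad), X), G_true)
print(f"alignment (good features): {align_good:.3f}")
print(f"alignment (bad features):  {align_bad:.3f}")
```

In this toy setting, the model whose input gradients point along nearly the same direction as the target's gets an alignment close to 1, while the model that depends on an orthogonal input direction gets an alignment of 0. For a real network one would replace the finite-difference loop with automatic differentiation.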

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Loss Spike in Training Neural Networks

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Does SGD really happen in tiny subspaces?

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Enhancing Generalization of Universal Adversarial Perturbation Through Gradient Aggregation

Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning

Large Catapults in Momentum Gradient Descent with Warmup: an Empirical Study

The Optimization Landscape of SGD Across the Feature Learning Strength

Asymmetric Valleys: Beyond Sharp and Flat Local Minima

High-dimensional SGD aligns with emerging outlier eigenspaces

Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks

An Alternative View: When Does SGD Escape Local Minima?

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression

Stochastic Gradient Descent and Anomaly of Variance-flatness Relation in Artificial Neural Networks

Generalization for Least Squares Regression With Simple Spiked Covariances