Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
2024-06-06
Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
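The catapult mechanism referred to in the abstract can be illustrated with a minimal sketch. It is not the paper's experimental setup. It assumes a two-layer linear network f = v.u / sqrt(m) fitted to a single training point (x = 1, y = 0) with squared loss, in the spirit of the toy setting associated with Lewkowycz et al. The function name `run_gd`, the width `m`, the seed, and the step count are all illustrative choices. In this toy model the tangent kernel is the scalar lam = (|u|^2 + |v|^2) / m, and the sketch compares a learning rate below 2 / lam with one between 2 / lam and 4 / lam.

```python
import numpy as np

def run_gd(eta_times_lam0, m=512, steps=200, seed=0):
    """Full-batch GD on L = f^2 / 2 with f = v.u / sqrt(m) (toy model, assumed)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(m)
    v = rng.standard_normal(m)
    lam0 = (u @ u + v @ v) / m        # scalar tangent kernel at initialization
    eta = eta_times_lam0 / lam0       # learning rate expressed relative to 1 / lam0
    losses, lams = [], []
    for _ in range(steps):
        f = (v @ u) / np.sqrt(m)      # network output on the single training point
        losses.append(0.5 * f ** 2)
        lams.append((u @ u + v @ v) / m)
        # Simultaneous gradient step on both layers (right-hand side uses old u, v).
        u, v = u - eta * f * v / np.sqrt(m), v - eta * f * u / np.sqrt(m)
    return eta, np.array(losses), np.array(lams)

# Small learning rate (eta = 0.5 / lam0): the loss is expected to decay monotonically.
eta_small, loss_small, lam_small = run_gd(0.5)

# Large learning rate (eta = 3 / lam0, between 2 / lam0 and 4 / lam0):
# the loss is expected to spike first, then converge, while lam shrinks.
eta_large, loss_large, lam_large = run_gd(3.0)
```

Plotting `loss_large` against the step index should show a single spike followed by convergence, with `lam_large` ending below 2 / `eta_large`. The small-learning-rate run should show no spike.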
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper investigates the spikes in training loss that commonly appear when neural networks are trained with stochastic gradient descent (SGD). It argues that these spikes are instances of "catapults", an optimization phenomenon originally observed in gradient descent (GD) with large learning rates. The authors show empirically that, for both GD and SGD, catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, and that each catapult reduces the norm of the tangent kernel, consistent with earlier observations for GD.

The paper then connects catapults to generalization. It proposes that catapults promote feature learning, which can be quantified by the alignment between the Average Gradient Outer Product (AGOP) of the trained network and the AGOP of the true predictor. Extending previous work, it shows that test performance keeps improving as the number of catapults in GD increases. Smaller batch sizes in SGD induce more catapults, which in turn improves AGOP alignment and test performance. Experiments with several optimization algorithms across multiple network architectures and datasets indicate that AGOP alignment is a useful indicator of generalization.

The main contribution is to link three seemingly unrelated topics: loss spikes in SGD training, the catapult phenomenon in GD, and the better generalization of small-batch SGD. The paper provides experimental evidence for each of these connections.
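The AGOP alignment described above can be made concrete with a small sketch. It assumes the usual definitions: the AGOP of a predictor is the average over inputs of the outer product of its input gradient with itself, and alignment is the cosine similarity between two AGOP matrices under the Frobenius inner product. The helper names `agop` and `agop_alignment` and the toy predictors are hypothetical, chosen only so that the gradients can be written by hand.

```python
import numpy as np

def agop(grad_fn, X):
    """Average over rows x of X of the outer product grad_fn(x) grad_fn(x)^T."""
    G = np.stack([grad_fn(x) for x in X])      # shape (n, d): one input gradient per row
    return G.T @ G / len(X)

def agop_alignment(A, B):
    """Cosine similarity of two matrices under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

# Hypothetical "true" predictor f*(x) = x0 * x1: only the first two coordinates matter.
def grad_true(x):
    g = np.zeros_like(x)
    g[0], g[1] = x[1], x[0]
    return g

# Hypothetical model that learned the same features up to scale: f(x) = 2 * x0 * x1.
def grad_same(x):
    return 2.0 * grad_true(x)

# Hypothetical model that relies on the wrong coordinates: f(x) = x2 * x3.
def grad_wrong(x):
    g = np.zeros_like(x)
    g[2], g[3] = x[3], x[2]
    return g

# Hypothetical model that mixes relevant and irrelevant features: f(x) = x0*x1 + x2*x3.
def grad_mixed(x):
    return grad_true(x) + grad_wrong(x)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))              # synthetic inputs, for illustration only

A_true = agop(grad_true, X)
align_same = agop_alignment(A_true, agop(grad_same, X))
align_wrong = agop_alignment(A_true, agop(grad_wrong, X))
align_mixed = agop_alignment(A_true, agop(grad_mixed, X))
```

In this toy setup the alignment is 1 (up to rounding) for the model that uses the same features, 0 for the model that uses disjoint coordinates, and strictly between 0 and 1 for the mixed model. For a real network the input gradients would come from automatic differentiation rather than hand-written functions.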