Abstract: In this work, we investigate the mechanism underlying loss spikes observed during neural network training. When training enters a region with a lower-loss-as-sharper (LLAS) structure, it becomes unstable, and the loss increases exponentially once the loss landscape is too sharp, producing the rapid ascent of the loss spike. Training stabilizes when it finds a flat region. From a frequency perspective, we explain the rapid descent in loss as being primarily influenced by low-frequency components. We observe a deviation in the first eigendirection, which can be reasonably explained by the frequency principle: low-frequency information is captured rapidly, leading to the rapid descent. Inspired by our analysis of loss spikes, we revisit the link between the maximum eigenvalue of the loss Hessian ($\lambda_{\mathrm{max}}$), flatness, and generalization. We suggest that $\lambda_{\mathrm{max}}$ is a good measure of sharpness but not a good measure of generalization. Furthermore, we experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve towards the same direction. Our experiments also show a correlation (a similar trend) between $\lambda_{\mathrm{max}}$ and condensation. This observation may provide valuable insights for further theoretical research on the relationship between loss spikes, $\lambda_{\mathrm{max}}$, and generalization.
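
The exponential ascent described in the abstract can be illustrated with a toy model. The sketch below is not the paper's experiment: it assumes plain gradient descent on a one-dimensional quadratic loss $L(x) = \tfrac{1}{2}\lambda x^2$, where the curvature $\lambda$ stands in for the sharpness. For this toy model the update is $x \leftarrow (1 - \eta\lambda)x$, so the loss grows geometrically once $\lambda$ exceeds $2/\eta$. The function name `gd_on_quadratic` and all parameter values are hypothetical.

```python
def gd_on_quadratic(sharpness, lr, steps=20, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * sharpness * x**2.

    Returns the loss recorded before each update step.
    Toy model only; `sharpness` plays the role of lambda_max.
    """
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        x -= lr * sharpness * x  # gradient of L is sharpness * x
    return losses


lr = 0.1  # for this toy model the stability threshold is 2 / lr = 20

# Curvature below the threshold: the loss shrinks every step.
flat = gd_on_quadratic(sharpness=5.0, lr=lr)

# Curvature above the threshold: the loss is multiplied by
# (1 - lr * sharpness)**2 = 2.25 every step, i.e. it grows exponentially.
sharp = gd_on_quadratic(sharpness=25.0, lr=lr)

print("flat region, loss decreased:", flat[-1] < flat[0])
print("sharp region, loss increased:", sharp[-1] > sharp[0])
```

In a real network the curvature changes along the trajectory, which is what allows the loss to come back down once a flatter region is reached; the fixed-curvature toy model only captures the ascent.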

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores the mechanism behind the loss spikes that occur while training neural networks. Specifically, it focuses on the following points:

1. **Causes of loss spikes**: When training enters a region where lower loss coincides with a sharper landscape (the "lower-loss-as-sharper" structure, LLAS), it becomes unstable and the loss increases rapidly. Once the loss landscape becomes too sharp, the loss rises exponentially, forming a loss spike. The loss stabilizes only when training finds a flat region.
2. **Explanation from the frequency perspective**: The rapid decrease in the loss is mainly driven by low-frequency components. The paper finds that the deviation arising during the rising phase of the loss spike is dominated by low-frequency components. According to the frequency principle, low-frequency information converges faster than high-frequency information, which explains why the loss can decrease so rapidly.
3. **Relationship between the maximum eigenvalue and generalization**: The paper re-examines the relationship between the maximum eigenvalue of the loss Hessian (λmax), flatness, and generalization. Although λmax is a good indicator of the sharpness of the loss landscape, it is not a good indicator of generalization. Experiments show that loss spikes can promote condensation, in which the input weights of different neurons in the same layer evolve towards the same direction. This may reduce the effective size of the network and help generalization.
4. **Importance of low-frequency information**: In real datasets, low-frequency information usually dominates and is well captured by both the training data and the test data, so training can learn it well. Since the sharpest direction (indicated by the maximum eigenvalue of the loss Hessian) is closely related to low-frequency information, solutions with good and poor generalization performance differ little along the sharpest direction.

### Main contributions

1. **Analysis of the loss spike phenomenon and its frequency mechanism**: The paper analyzes loss spikes in detail and explains their mechanism from the frequency perspective.
2. **Proposing the LLAS structure**: The LLAS structure explains the mechanism of the rising phase of loss spikes.
3. **Re-examining the relationship between flatness and generalization from the frequency perspective**: The paper proposes a new perspective for understanding the relationship between flatness and generalization.
4. **Preliminarily revealing the correlation between loss spikes, the maximum eigenvalue, and condensation**: Experimental observations reveal connections between these phenomena.

### Conclusion

By studying loss spikes in depth, the paper provides new insight into their mechanism and proposes a new understanding of the relationship between flatness and generalization in neural network training. These findings are relevant to stabilizing neural network training and to improving the generalization performance of the model.
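
The summary treats λmax, the maximum eigenvalue of the loss Hessian, as the measure of sharpness. One common way to estimate it using only Hessian-vector products is power iteration. The sketch below is an illustration under stated assumptions, not the paper's procedure: the Hessian is a small fixed symmetric positive-definite matrix `H`, and the names `estimate_lambda_max` and `hvp` are hypothetical. In practice the Hessian-vector product would come from automatic differentiation rather than an explicit matrix.

```python
import numpy as np


def estimate_lambda_max(hvp, dim, iters=100, seed=0):
    """Estimate the largest eigenvalue via power iteration.

    `hvp` maps a vector v to H @ v for a symmetric positive
    semi-definite H (assumed here). Returns the Rayleigh quotient
    and the corresponding unit vector.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)  # renormalize to avoid overflow
    return float(v @ hvp(v)), v


# Toy Hessian (an assumption for illustration only).
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lambda_max, top_direction = estimate_lambda_max(lambda v: H @ v, dim=2)

# Cross-check against a dense eigendecomposition.
exact = np.linalg.eigvalsh(H)[-1]
print("power iteration:", lambda_max, "exact:", exact)
```

The returned unit vector is the estimated sharpest direction, the first eigendirection that the summary relates to low-frequency information.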

Loss Spike in Training Neural Networks

On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages

Spike No More: Stabilizing the Pre-training of Large Language Models

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin

Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment

Visualizing the Loss Landscape of Neural Nets

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

Generalization for Least Squares Regression With Simple Spiked Covariances

Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos

On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems

Asymmetric Valleys: Beyond Sharp and Flat Local Minima

A simple connection from loss flatness to compressed representations in neural networks

Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance

The instabilities of large learning rate training: a loss landscape view

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Exploring the Geometry and Topology of Neural Network Loss Landscapes

Stabilizing Spiking Neuron Training

Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training

Exploring Loss Functions for Time-based Training Strategy in Spiking Neural Networks