Abstract: A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. However, because it usually holds only in a very small neighborhood of the minimum, it cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss clearly shows several separate scales. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of training data are among the causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
What problem does this paper attempt to address?
The core problem that this paper attempts to solve is: **to study the structure of the neural network loss function, and its impact on the optimization process, beyond the local quadratic approximation**. Specifically, the authors focus on the multiscale structure that the loss function exhibits in regions where a quadratic approximation is no longer accurate, both in a neighborhood of the minima and in a larger surrounding region.
### Main Problem Background
1. **Limitations of Local Quadratic Approximation**:
- Existing analyses usually rely on a local quadratic approximation of the loss to study the optimization behavior of neural networks, and this approximation works well close to a local minimum.
- However, the quadratic approximation is only accurate within a very small neighborhood of the minimum and cannot explain many behaviors observed during optimization, such as Edge of Stability (EoS) and the effects of Learning Rate Decay (LRD).
2. **Edge of Stability (EoS) Phenomenon**:
- When training a neural network using Gradient Descent (GD), the largest eigenvalue of the Hessian matrix (i.e., the sharpness) increases until it reaches approximately \( \frac{2}{\eta} \), where \( \eta \) is the learning rate, and then stays near that value.
- Even after the sharpness stabilizes, the training loss continues to decline. This cannot be explained by a quadratic approximation, under which GD stops converging once the sharpness reaches \( \frac{2}{\eta} \).
3. **Effect of Learning Rate Decay (LRD)**:
- LRD not only helps to find parameters with lower training losses but may also improve generalization performance, especially when the decay occurs at an appropriate time.
- This phenomenon also cannot be explained by quadratic approximation because LRD in a quadratic loss function will only slow down convergence but will not change the final solution or generalization performance.
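The threshold \( \frac{2}{\eta} \) and the claim about LRD on a quadratic loss can both be checked on a one-dimensional model. As a sketch, assume the loss \( f(x) = \frac{\lambda}{2} x^2 \) with sharpness \( \lambda > 0 \); the symbols \( f \), \( x \), and \( \lambda \) are introduced here for illustration and are not taken from the paper. One GD step gives

\[
x_{t+1} = x_t - \eta f'(x_t) = (1 - \eta\lambda)\, x_t,
\qquad
|1 - \eta\lambda| < 1 \iff 0 < \lambda < \frac{2}{\eta}.
\]

Under this model, the iterates contract toward the minimum only while the sharpness is below \( \frac{2}{\eta} \); at \( \lambda = \frac{2}{\eta} \) they oscillate with constant amplitude, and above it they diverge. The minimizer \( x = 0 \) does not depend on \( \eta \), so lowering the learning rate changes the rate of contraction but not the point GD converges to.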
### Main Contributions of the Paper
1. **Visualizing the Loss Landscape**:
- The authors visualize the neural network loss landscape in regions that cannot be approximated by second-order Taylor polynomials and find a multiscale structure in the loss function: subquadratic growth near minima and several separate scales in larger regions.
2. **Explaining the Edge of Stability Phenomenon**:
- Using the subquadratic growth property, the authors give a theoretical explanation of the Edge of Stability phenomenon: when the learning rate is too large for the minimum to be stable, Gradient Descent oscillates within a bounded range rather than diverging.
3. **Understanding the Working Principle of Learning Rate Decay**:
- Using the separate-scales structure, the authors explain the mechanism by which Learning Rate Decay helps optimization, even for the deterministic Gradient Descent algorithm.
4. **Constructing a Simple Neural Network Model**:
- The authors construct a simple two-layer neural network problem and show that non-uniform training data (samples with different magnitudes) can produce the multiscale structure of the loss landscape, which points to one origin of complex loss landscapes.
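The difference between quadratic and subquadratic growth under a large learning rate can be seen in a one-dimensional toy problem. The sketch below is an illustration only, not the paper's construction: the loss \( \sqrt{1 + A x^2} - 1 \), the constants `A` and `LR`, and the helper `run_gd` are all assumptions chosen so that both losses have the same curvature `A` at the minimum, with `A` slightly above \( \frac{2}{\eta} \).

```python
import math

def run_gd(grad, lr, x0=1.0, steps=200):
    """Plain one-dimensional gradient descent; returns the list of all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs

# Hypothetical constants: curvature at the minimum and learning rate.
A = 21.0   # sharpness of both toy losses at x = 0
LR = 0.1   # stability threshold is 2 / LR = 20, which is below A

def quadratic_grad(x):
    # Gradient of f(x) = 0.5 * A * x**2 (quadratic growth everywhere).
    return A * x

def subquadratic_grad(x):
    # Gradient of f(x) = sqrt(1 + A * x**2) - 1:
    # curvature A at x = 0, but roughly linear (subquadratic) growth far from 0.
    return A * x / math.sqrt(1.0 + A * x * x)

quad_path = run_gd(quadratic_grad, LR)
sub_path = run_gd(subquadratic_grad, LR)

print("quadratic,    final |x|:", abs(quad_path[-1]))
print("subquadratic, final |x|:", abs(sub_path[-1]))
print("subquadratic, max   |x|:", max(abs(x) for x in sub_path))
```

With these settings the quadratic iterates grow without bound, while the subquadratic iterates never leave the starting radius and settle into an oscillation around the minimum instead of converging to it. The gradient of the subquadratic toy loss is bounded, so each step has bounded size, which is what keeps the oscillation from blowing up.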
### Conclusion
By going beyond the local quadratic approximation, the paper studies the multiscale structure of the neural network loss function and uses it to explain the Edge of Stability phenomenon and the effect of Learning Rate Decay. These findings offer a new perspective on, and theoretical support for, understanding the neural network optimization process.