Abstract: Stochastic gradients closely relate to both optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we mainly make two contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have power-law structures overlooked by previous studies, and we present their theoretical implications for the training of DNNs. While previous studies believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights on the structure of stochastic gradients in deep learning.
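To make the contrast between power-law heavy tails and Gaussian-like tails concrete, here is a minimal sketch on synthetic data. The abstract does not name the statistical tests used, so this example substitutes a Hill tail-index estimate as an illustrative stand-in; it is not the authors' procedure. The function name `hill_tail_index`, the sample size, the cutoff `k`, and the Pareto exponent 1.5 are all assumptions chosen for illustration.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest absolute values.

    A small estimate suggests a heavy power-law tail; a large estimate
    suggests a light (e.g. Gaussian-like) tail.
    """
    magnitudes = sorted((abs(x) for x in samples), reverse=True)
    threshold = magnitudes[k]  # the (k+1)-th largest magnitude
    mean_log_excess = sum(math.log(m / threshold) for m in magnitudes[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic stand-ins (assumptions, not real gradients):
# "heavy" mimics a power-law sample, "light" mimics a Gaussian sample.
rng = random.Random(0)
n, k = 20000, 1000
heavy = [rng.paretovariate(1.5) for _ in range(n)]
light = [rng.gauss(0.0, 1.0) for _ in range(n)]

alpha_heavy = hill_tail_index(heavy, k)
alpha_light = hill_tail_index(light, k)
print(f"tail index (power-law sample): {alpha_heavy:.2f}")
print(f"tail index (Gaussian sample):  {alpha_light:.2f}")
```

In this sketch the power-law sample should give an estimate close to its true exponent, while the Gaussian sample gives a much larger value. A point estimate like this is only a diagnostic; a formal test would also need a goodness-of-fit statistic and a significance threshold.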

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Statistical tests of stochastic gradient distributions**
   - The paper first examines the distribution of stochastic gradients and their noise, both across parameter dimensions and across iterations. Through formal statistical tests, the authors show that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradient noise (caused by minibatch training) usually does not exhibit power-law heavy tails and is closer to a Gaussian distribution.
   - This finding helps to reconcile the controversy in previous studies over whether stochastic gradient noise is heavy-tailed. Some studies argue that it is, while others provide contrary evidence. The paper resolves the contradiction by clearly distinguishing between dimension-wise gradients and iteration-wise gradient noise.
2. **Covariance spectrum structure of stochastic gradients**
   - The paper further finds that the covariance spectrum of stochastic gradients exhibits a power-law structure in deep learning, an important property overlooked in previous studies.
   - The authors also examine the relationship between the gradient covariance and the Hessian matrix and find that the two deviate significantly in some cases, challenging the traditional view that the gradient covariance approximates the Hessian near a minimum.
   - This finding not only explains why stochastic gradients exhibit power-law characteristics in deep learning but also provides new theoretical insights and explains the existence of low-dimensional and robust learning spaces.

### Main Contributions

1. **Results of statistical tests**
   - Through formal statistical tests, the authors find that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradient noise usually does not exhibit power-law heavy tails and is closer to a Gaussian distribution.
   - This finding helps to reconcile the controversy over whether stochastic gradient noise is heavy-tailed.
2. **Power-law covariance spectrum**
   - The covariance spectrum of stochastic gradients is found to exhibit a power-law structure in deep learning, an important property overlooked in previous studies.
   - The relationship between the gradient covariance and the Hessian matrix is examined, and the two are found to deviate significantly in some cases, challenging the traditional view.
3. **Low-dimensional and robust learning spaces**
   - Mathematical analysis is used to explain the existence of low-dimensional and robust learning spaces in deep learning.
   - The eigenvalue gaps (eigengaps) of the gradient covariance are proposed as an explanation for the robustness of the learning space.

### Experimental Verification

- The authors verify these findings through extensive experiments, including experiments with different batch sizes and different label-noise conditions.
- The experimental results show that the power-law covariance is ubiquitous across these conditions and is inversely related to the batch size.

In conclusion, through formal statistical tests and in-depth theoretical analysis, the paper reveals new characteristics of stochastic gradients in deep learning and provides a new perspective for understanding the success of stochastic optimization methods in deep learning.
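The idea of a power-law covariance spectrum can be illustrated with a short sketch on synthetic data. The code below draws artificial "gradients" whose covariance eigenvalues decay as a power of their rank, computes the sample covariance spectrum, and recovers the decay exponent with a least-squares fit in log-log space. The functions `covariance_spectrum` and `fit_power_law_exponent`, the exponent `true_s`, and all sizes are hypothetical choices for illustration; the log-log fit is a simple stand-in and not necessarily the procedure used in the paper.

```python
import numpy as np


def covariance_spectrum(grads):
    """Eigenvalues of the sample covariance of the rows of `grads`, descending.

    `grads` has shape (n_samples, dim): one gradient vector per row.
    """
    centered = grads - grads.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (grads.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)  # ascending order for symmetric matrices
    return eigvals[::-1]


def fit_power_law_exponent(eigvals, top_k):
    """Fit lambda_k ~ k^(-s) on the top_k eigenvalues; return the estimate of s."""
    ranks = np.arange(1, top_k + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals[:top_k]), 1)
    return -slope


# Synthetic gradients (an assumption, not real training data): independent
# Gaussian coordinates whose variances follow k^(-true_s).
rng = np.random.default_rng(0)
n_samples, dim, true_s = 20000, 50, 1.0
target_eigvals = np.arange(1, dim + 1, dtype=float) ** (-true_s)
grads = rng.standard_normal((n_samples, dim)) * np.sqrt(target_eigvals)

spectrum = covariance_spectrum(grads)
s_hat = fit_power_law_exponent(spectrum, top_k=20)
print(f"estimated power-law exponent: {s_hat:.3f}")
```

A straight line in the log-log plot of eigenvalue against rank is the signature of a power-law spectrum. With real networks the full covariance is usually too large to form explicitly, so one would typically restrict the analysis to a subset of parameters or to the leading eigenvalues.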

On the Overlooked Structure of Stochastic Gradients
