Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Nikita Puchkin, Eduard Gorbunov, Nikolay Kutuzov, Alexander Gasnikov
2024-04-17
Abstract: We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(\alpha - 1)/\alpha})$, when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.
Optimization and Control, Data Structures and Algorithms, Machine Learning, Statistics Theory
What problem does this paper attempt to address?
This paper addresses the convergence-rate bottleneck in stochastic optimization problems with heavy-tailed noise. When the stochastic gradients have a finite moment of order $\alpha \in (1, 2]$, existing methods can usually only achieve a convergence rate of $O(K^{-2(\alpha - 1)/\alpha})$, where $K$ is the number of iterations. This rate slows down significantly as $\alpha$ approaches 1, because the exponent $2(\alpha - 1)/\alpha$ tends to 0, and it gives no convergence guarantee at $\alpha = 1$. To overcome this, the authors stabilize the stochastic gradients using smoothed medians of means. The resulting estimates have negligible bias and controllable variance, so they can be incorporated into clipped stochastic gradient descent (clipped-SGD) and the accelerated clipped Stochastic Similar Triangles Method (clipped-SSTM) to derive new high-probability complexity bounds.

### Main Contributions

1. **New Assumption**: The authors introduce a new assumption (Assumption 2.1) that describes the structure of the noise. It allows the noise density to have a finite $\alpha$-th moment and to include an asymmetric part.
2. **Performance Analysis of the Smoothed Median**: The authors provide a non-asymptotic performance analysis of the smoothed median, proving that even under heavy-tailed noise it yields estimates with small bias and controllable variance.
3. **Improved Convergence Rate**: Using the smoothed median, the authors obtain faster convergence rates for clipped-SGD and clipped-SSTM. Specifically, for smooth strongly convex problems, the dominant term of the upper bound decays at a rate of $\widetilde{O}(K^{-1})$, which is better than $O(K^{-2(\alpha - 1)/\alpha})$ (when $\alpha < 4/3$).
4. **Symmetric Noise Distribution**: For symmetric noise distributions, the authors obtain a convergence rate that matches the latest results under the bounded variance assumption (up to a logarithmic factor).

### Paper Structure

- **Section 2**: Introduces notation and the problem setting.
- **Section 3**: Reviews related work.
- **Section 4**: Describes the smoothed median and its properties.
- **Section 5**: Presents the main results, including the convergence analysis of clipped-SGD and clipped-SSTM.
- **Section 6**: Evaluates the proposed algorithms experimentally.

### Conclusion

By introducing a new assumption on the noise structure and using the smoothed median technique, the authors break the convergence-rate barrier imposed by heavy-tailed noise in stochastic optimization, providing new theoretical guarantees and practical methods for such problems.
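
The stabilization idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that smoothing is done by adding small Gaussian noise (scale `theta`) to each group mean before taking a coordinate-wise median, and the paper's exact construction may differ. The function names, the step size, the clipping level `lam`, and the toy objective $\frac{1}{2}\|x\|^2$ with Cauchy noise are all hypothetical choices made for illustration.

```python
import numpy as np


def smoothed_median_of_means(samples, n_groups, theta, rng):
    """Coordinate-wise median of group means, each perturbed by Gaussian noise.

    samples: (n, d) array of stochastic gradients.
    theta:   smoothing scale (assumed form of smoothing, see lead-in).
    """
    samples = np.asarray(samples, dtype=float)
    groups = np.array_split(samples, n_groups)           # split along axis 0
    means = np.stack([g.mean(axis=0) for g in groups])   # shape (n_groups, d)
    means = means + theta * rng.standard_normal(means.shape)
    return np.median(means, axis=0)


def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    if norm <= lam:
        return g
    return g * (lam / norm)


def clipped_sgd(x0, grad_sampler, steps, lr, lam, batch, n_groups, theta, rng):
    """Clipped SGD whose gradient estimate is a smoothed median of means."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        grads = grad_sampler(x, batch, rng)               # (batch, d) samples
        g = smoothed_median_of_means(grads, n_groups, theta, rng)
        x = x - lr * clip(g, lam)
    return x


def cauchy_quadratic_grads(x, batch, rng):
    """Toy oracle: gradient of 0.5 * ||x||^2 plus symmetric Cauchy noise.

    Cauchy noise has no finite mean, so plain averaging is unreliable here.
    """
    return x + rng.standard_cauchy((batch, x.size))


rng = np.random.default_rng(0)
x_start = np.full(5, 10.0)
x_final = clipped_sgd(
    x0=x_start,
    grad_sampler=cauchy_quadratic_grads,
    steps=500,
    lr=0.1,
    lam=1.0,
    batch=15,
    n_groups=5,
    theta=0.01,
    rng=rng,
)
print("initial distance to optimum:", np.linalg.norm(x_start))
print("final distance to optimum:  ", np.linalg.norm(x_final))
```

The toy noise is symmetric, which corresponds to the easiest case in the summary. The sketch only shows how the median step slots in before clipping, and it makes no claim about the rates proved in the paper.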