Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar
2024-10-18
Abstract: We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art, who only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper addresses high-probability convergence in online learning in the presence of heavy-tailed noise. It proposes a general framework of nonlinear Stochastic Gradient Descent (SGD) methods that covers several popular nonlinearities, such as sign, quantization, component-wise clipping, and joint clipping. The nonlinearity is treated as a black box, so the paper can give unified guarantees for a wide range of nonlinear methods without making any assumptions about the noise moments.

### Main research questions:

1. **High-probability convergence**: Study the high-probability convergence of nonlinear SGD methods in the presence of heavy-tailed noise.
2. **Symmetric noise**: Pay special attention to the case of symmetric noise, because the paper's experiments indicate that gradient noise is often symmetric when training deep-learning models.
3. **A wide range of nonlinear methods**: Provide a general framework that covers multiple nonlinearities and establish convergence guarantees for all of them.

### Specific problem descriptions:

- **Non-convex cost functions**: For non-convex cost functions, the paper establishes convergence of the squared gradient norm at a rate of \( \widetilde{\mathcal{O}}(t^{-1/4}) \).
- **Strongly convex cost functions**: For strongly convex cost functions, the paper establishes convergence of the weighted average of the iterates at a rate of \( \mathcal{O}(t^{-1/4}) \), and convergence of the last iterate at a rate of \( \mathcal{O}(t^{-\zeta}) \), where \( \zeta \in (0, 1) \) depends on the noise, the nonlinearity, and other problem parameters.

### Main contributions of the paper:

1. **General framework**: Propose a general framework of nonlinear SGD methods for handling heavy-tailed noise, covering multiple nonlinearities.
2. **Convergence under symmetric noise**: Under symmetric noise, provide high-probability convergence guarantees for non-convex and strongly convex cost functions. The rate exponents are constant, whereas those of the state of the art depend on the noise moment \( p \) and vanish as \( p \rightarrow 1 \). They are strictly better whenever \( p < 6/5 \) for non-convex costs and \( p < 8/7 \) for strongly convex costs.
3. **No need for noise moment assumptions**: No assumptions on the noise moments are required, which makes the theoretical results more widely applicable.
4. **Numerical experiments**: Validate the theoretical results through experiments, showing that gradient noise on real data does exhibit symmetry and that clipping is not always the best-performing nonlinearity, which further underlines the value of a general framework.

### Theoretical results:

- **Non-convex cost functions**: For non-convex cost functions, the paper proves that the squared gradient norm converges at a rate of \( \widetilde{\mathcal{O}}(t^{-1/4}) \).
- **Strongly convex cost functions**: For strongly convex cost functions, the paper proves that the weighted average of the iterates converges at a rate of \( \mathcal{O}(t^{-1/4}) \), and that the last iterate converges at a rate of \( \mathcal{O}(t^{-\zeta}) \), where \( \zeta \) depends on the noise, the nonlinearity, and other problem parameters.

### Experimental verification:

- **Noise symmetry**: The experiments indicate that gradient noise is often symmetric when training deep-learning models.
- **Comparison of nonlinear methods**: The experimental results show that different nonlinearities perform differently in different settings, which further supports the value of the general framework.

In conclusion, this paper provides a general nonlinear SGD framework for online learning under heavy-tailed noise, and supports it with both theoretical analysis and experiments.
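To make the black-box treatment of the nonlinearity concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes the update \( x_{t+1} = x_t - \alpha_t \Psi(\nabla f(x_t) + z_t) \) with step size \( \alpha_t = a/(t+1)^{\delta} \), a toy strongly convex cost \( f(x) = \tfrac{1}{2}\|x - x^\star\|^2 \), and standard Cauchy noise (symmetric, with no finite mean). The function names, the clipping threshold `m`, and the values of `a` and `delta` are hypothetical choices made for this example. Quantization is omitted for brevity.

```python
import numpy as np


def sign_map(g):
    """Component-wise sign nonlinearity."""
    return np.sign(g)


def componentwise_clip(g, m=1.0):
    """Clip every coordinate to the interval [-m, m]."""
    return np.clip(g, -m, m)


def joint_clip(g, m=1.0):
    """Rescale the whole vector so that its Euclidean norm is at most m."""
    norm = np.linalg.norm(g)
    return g if norm <= m else g * (m / norm)


def nonlinear_sgd(psi, x0, x_star, steps=5000, a=1.0, delta=0.75, seed=0):
    """Run x <- x - alpha_t * psi(grad + noise) on f(x) = 0.5 * ||x - x_star||^2.

    psi is treated as a black box: any map from R^d to R^d can be passed in.
    The step-size schedule a / (t + 1)**delta is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(steps):
        grad = x - x_star                          # exact gradient of the toy cost
        noise = rng.standard_cauchy(size=x.shape)  # symmetric heavy-tailed noise
        step = a / (t + 1) ** delta
        x = x - step * psi(grad + noise)           # nonlinearity applied to the noisy gradient
    return x


if __name__ == "__main__":
    x_star = np.zeros(5)
    x0 = np.full(5, 5.0)
    for name, psi in [
        ("sign", sign_map),
        ("component-wise clipping", componentwise_clip),
        ("joint clipping", joint_clip),
    ]:
        x_final = nonlinear_sgd(psi, x0, x_star)
        print(f"{name}: final distance to optimum = {np.linalg.norm(x_final - x_star):.4f}")
```

Because `nonlinear_sgd` only ever calls `psi`, swapping one nonlinearity for another requires no change to the optimization loop, which mirrors the unified view described in the summary. The toy run is not a reproduction of the paper's experiments.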