Abstract: We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{4}}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{2}}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-\zeta} \right)$, where $\zeta \in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $\zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art works, which only consider clipping, require bounded noise moments of order $\eta \in (1,2]$, and establish convergence rates whose exponents go to zero as $\eta \rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and for noise with symmetric density, with convergence rates whose exponents are bounded away from zero, even when the noise has a finite first moment only. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.
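To make the nonlinear SGD framework concrete, the following is a minimal illustrative sketch, not the authors' implementation. It applies a component-wise nonlinearity (clipping or sign) to a noisy gradient before each update. The function names, the step-size schedule $a/(t+1)^{\delta}$, the quadratic cost, and the Student-t noise with 1.5 degrees of freedom (symmetric, finite first moment, infinite variance) are all assumptions chosen for illustration.

```python
import numpy as np


def clip_componentwise(g, m=1.0):
    """Clip every coordinate of g to the interval [-m, m] (hypothetical helper)."""
    return np.clip(g, -m, m)


def sign_componentwise(g):
    """Replace every coordinate of g by its sign (hypothetical helper)."""
    return np.sign(g)


def nonlinear_sgd(grad, x0, nonlinearity, steps=5000, a=1.0, delta=0.75, rng=None):
    """Run x_{t+1} = x_t - alpha_t * Psi(grad(x_t) + z_t) in a streaming fashion.

    Assumptions (not taken from the abstract): step size alpha_t = a / (t + 1)^delta
    and i.i.d. Student-t noise z_t with 1.5 degrees of freedom.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for t in range(1, steps + 1):
        # Heavy-tailed, symmetric noise: the mean exists, the variance does not.
        noise = rng.standard_t(df=1.5, size=x.shape)
        step = a / (t + 1) ** delta
        # Only the current noisy gradient is used; no past samples are stored.
        x = x - step * nonlinearity(grad(x) + noise)
    return x


if __name__ == "__main__":
    # Strongly convex toy cost f(x) = 0.5 * ||x - x_star||^2 (illustrative choice).
    x_star = np.array([1.0, -2.0, 3.0])
    grad = lambda x: x - x_star
    for name, phi in [("clip", clip_componentwise), ("sign", sign_componentwise)]:
        x_final = nonlinear_sgd(grad, np.zeros(3), phi, rng=np.random.default_rng(0))
        print(name, "final error:", np.linalg.norm(x_final - x_star))
```

Swapping the `nonlinearity` argument is the only change needed to compare clipping against other choices on the same problem, which mirrors the abstract's point that clipping is not always the preferred nonlinearity.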

Convergence of Gradient Method for A Fully Recurrent Neural Network

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

Novel Convergence Results of Adaptive Stochastic Gradient Descents

Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Analysis of Boundedness and Convergence of Online Gradient Method for Two-Layer Feedforward Neural Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

A Recipe for Global Convergence Guarantee in Deep Neural Networks

Convergence of Online Gradient Algorithm with Stochastic Inputs for Pi-Sigma Neural Networks

A global convergence theory for deep ReLU implicit networks via over-parameterization

On the convergence of gradient descent for two layer neural networks

On Convergence of Training Loss Without Reaching Stationary Points

Numerical Analysis for Convergence of a Sample-Wise Backpropagation Method for Training Stochastic Neural Networks

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Convergence Rates of Accelerated Markov Gradient Descent with Applications in Reinforcement Learning

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Analysis of Gradient Vanishing of RNNs and Performance Comparison