Abstract: We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.

What problem does this paper attempt to address?

The paper asks how to make Stochastic Gradient Descent (SGD) robust and efficient when the data stream contains heavy-tailed samples and outliers. It introduces quantile-clipped SGD (QC-SGD), which clips each stochastic gradient at a threshold given by a quantile of the gradient norm. The method applies to smooth objectives, whether convex or non-convex, and tolerates heavy-tailed samples (including infinite variance) together with a fraction of outliers in the data stream, as in the Huber contamination model.

### Main Contributions

1. **Geometric Ergodicity under Strongly Convex Objectives**: For a sufficiently small contamination proportion $\eta$ and an appropriately chosen quantile $p$, when the objective is smooth and strongly convex, QC-SGD converges geometrically to a limiting distribution whose deviation around the optimum has the optimal dependence on $\eta$.
2. **Sub-Gaussian Performance in the Absence of Contamination**: Without contamination ($\eta = 0$) and for a strongly convex objective, a coordinated choice of the step size $\beta$ and the quantile $p$ ensures that the limiting distribution is sub-Gaussian with constant $O(\sqrt{\beta})$. With contamination ($\eta > 0$), the limiting distribution is sub-exponential.
3. **Ergodicity under Non-convex Objectives**: For a smooth but non-convex objective whose gradient satisfies an identifiability condition, the total variation distance between the QC-SGD iterates and the limiting distribution vanishes at a sub-linear rate. In this case, the limiting distribution controls the deviation of the objective's gradient with optimal dependence on $\eta$.
4. **Experimental Verification**: Experiments show that QC-SGD is easy to implement efficiently: a rolling estimate of the quantile $Q_p(\|\tilde{G}(\theta_t,\zeta_t)\|)$ keeps the algorithm cheap in memory and computation. The experiments confirm that the iteration is robust to heavy-tailed distributions and contamination.

### Mathematical Background

- **Assumption 1**: The objective function $\mathcal{L}$ is $L$-smooth (its gradient is $L$-Lipschitz), that is:
  \[ \mathcal{L}(\theta') \leq \mathcal{L}(\theta)+\langle\nabla \mathcal{L}(\theta),\theta' - \theta\rangle+\frac{L}{2}\|\theta - \theta'\|^2 \]
  holds for all $\theta,\theta'\in\mathbb{R}^d$, where $L < +\infty$.
- **Assumption 2**: The objective function $\mathcal{L}$ is $\mu$-strongly convex, that is:
  \[ \mathcal{L}(\theta') \geq \mathcal{L}(\theta)+\langle\nabla \mathcal{L}(\theta),\theta' - \theta\rangle+\frac{\mu}{2}\|\theta - \theta'\|^2 \]
  holds for all $\theta,\theta'\in\mathbb{R}^d$, where $\mu > 0$.
- **Assumption 3**: The gradient samples $(G(\theta_t,\zeta_t))_{t\geq0}$ are contaminated with probability $\eta < 1/2$, that is:
  \[ G(\theta_t,\zeta_t)=U_t\check{G}(\theta_t)+(1 - U_t)\tilde{G}(\theta_t,\zeta_t) \]
  where the $U_t$ are independent and identically distributed Bernoulli random variables with parameter $\eta$, $\check{G}(\theta_t)$ is the corrupted (outlier) sample, and $\tilde{G}(\theta_t,\zeta_t)$ is the uncontaminated gradient sample.
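
The clipping rule and the contamination model of Assumption 3 can be illustrated with a short Python sketch. This is not the authors' implementation; it is a minimal version under stated assumptions. The names `qc_sgd` and `make_contaminated_oracle`, the buffer size, the fallback threshold `tau0`, and the toy objective $\frac{1}{2}\mathbb{E}\|\theta - X\|^2$ with Student-t noise are all hypothetical choices made for illustration. The rolling quantile is computed over the observed (possibly contaminated) gradient norms, since the uncontaminated ones are not available in a stream.

```python
from collections import deque

import numpy as np


def qc_sgd(grad_sample, theta0, step_size=0.02, p=0.2,
           buffer_size=100, tau0=1.0, n_iter=4000):
    """Sketch of quantile-clipped SGD with a rolling-quantile threshold.

    grad_sample(theta) returns one stochastic gradient. buffer_size and tau0
    are illustrative assumptions, not values taken from the paper.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    norms = deque(maxlen=buffer_size)  # most recent gradient norms
    for _ in range(n_iter):
        g = grad_sample(theta)
        norm = float(np.linalg.norm(g))
        # Threshold: p-quantile of past norms (tau0 while the buffer is empty).
        tau = float(np.quantile(list(norms), p)) if norms else tau0
        norms.append(norm)
        # Rescale the gradient so that its norm does not exceed tau.
        scale = min(1.0, tau / norm) if norm > 0 else 1.0
        theta -= step_size * scale * g
    return theta


def make_contaminated_oracle(theta_star, eta, rng, outlier_scale=1e3):
    """Gradient oracle for 0.5 * E||theta - X||^2 with X = theta_star + noise.

    With probability eta the sample is replaced by an arbitrary outlier
    (U_t = 1 in Assumption 3); otherwise the noise is Student-t with 1.5
    degrees of freedom, which has a finite mean but infinite variance.
    """
    d = theta_star.size

    def grad_sample(theta):
        if rng.random() < eta:
            return outlier_scale * np.ones(d)  # corrupted sample
        noise = rng.standard_t(df=1.5, size=d)
        return theta - theta_star - noise  # uncontaminated gradient sample

    return grad_sample


rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 3.0])
oracle = make_contaminated_oracle(theta_star, eta=0.05, rng=rng)
theta_hat = qc_sgd(oracle, theta0=np.zeros(3))
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```

Because every update has norm at most `step_size * tau`, a single outlier or heavy-tailed draw cannot move the iterate far, which is the mechanism behind the robustness claims above. The toy objective is 1-smooth and 1-strongly convex, so it falls under Assumptions 1 and 2.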

Robust Stochastic Optimization via Gradient Quantile Clipping

High Probability Analysis for Non-Convex Stochastic Optimization with Clipping

SGD with Clipping is Secretly Estimating the Median Gradient

Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise

On the Convergence of DP-SGD with Adaptive Clipping

A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random Perturbations for Stochastic Optimization

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

From Gradient Clipping to Normalization for Heavy Tailed SGD

Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping

Smoothed Gradient Clipping and Error Feedback for Decentralized Optimization under Symmetric Heavy-Tailed Noise

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tails

The Stochastic Steepest Descent Method for Robust Optimization in Banach Spaces

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails

Convergence and concentration properties of constant step-size SGD through Markov chains

A Stochastic Subgradient Method for Distributionally Robust Non-Convex Learning

Gradient Descent for Noisy Optimization

Beyond Convexity: Stochastic Quasi-Convex Optimization

High-Probability Bound for Non-Smooth Non-Convex Stochastic Optimization with Heavy Tails

Stochastic Gradient Descent Revisited

Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD