Robust Stochastic Optimization via Gradient Quantile Clipping

Ibrahim Merad, Stéphane Gaïffas
2024-10-12
Abstract: We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to make Stochastic Gradient Descent (SGD) robust and efficient when the data stream contains heavy-tailed samples and outliers. The paper introduces quantile-clipped SGD (QC-SGD), which clips each stochastic gradient at a threshold given by a quantile of the gradient norm. The method applies to smooth objective functions, convex or non-convex. It tolerates heavy-tailed samples (including cases with infinite variance) and a certain proportion of outliers in the data stream, similar to the Huber contamination model.

### Main Contributions

1. **Geometric Ergodicity under Strongly Convex Objectives**: For a sufficiently small contamination proportion $\eta$ and an appropriately selected quantile $p$, when the objective is smooth and strongly convex, QC-SGD converges geometrically to a limiting distribution whose deviation around the optimal solution has the optimal dependence on $\eta$.
2. **Sub-Gaussian Performance in the Absence of Contamination**: Without contamination ($\eta = 0$) and for a strongly convex objective, choosing the step size $\beta$ and the quantile $p$ jointly ensures that the limiting distribution is sub-Gaussian with a constant of $O(\sqrt{\beta})$. With contamination ($\eta > 0$), the limiting distribution is sub-exponential.
3. **Ergodicity under Non-convex Objectives**: For a smooth but non-convex objective, if the gradient satisfies a certain identifiability condition, the total variation distance between the QC-SGD iterates and the limiting distribution vanishes at a sub-linear rate. In this case, the limiting distribution controls the deviation of the objective gradient with optimal dependence on $\eta$.
4. **Experimental Verification**: Experiments show that QC-SGD is easy to implement and that a rolling estimate of the quantile $Q_p(\|\tilde{G}(\theta_t,\zeta_t)\|)$ keeps the algorithm efficient in memory and computation. They also show that the iteration is indeed robust to heavy-tailed distributions and contamination.

### Mathematical Background

- **Assumption 1**: The objective function $\mathcal{L}$ is $L$-smooth (its gradient is $L$-Lipschitz), that is:
  \[ \mathcal{L}(\theta') \leq \mathcal{L}(\theta)+\langle\nabla \mathcal{L}(\theta),\theta' - \theta\rangle+\frac{L}{2}\|\theta - \theta'\|^2 \]
  holds for all $\theta,\theta'\in\mathbb{R}^d$, where $L < +\infty$.
- **Assumption 2**: The objective function $\mathcal{L}$ is $\mu$-strongly convex, that is:
  \[ \mathcal{L}(\theta') \geq \mathcal{L}(\theta)+\langle\nabla \mathcal{L}(\theta),\theta' - \theta\rangle+\frac{\mu}{2}\|\theta - \theta'\|^2 \]
  holds for all $\theta,\theta'\in\mathbb{R}^d$, where $\mu > 0$.
- **Assumption 3**: The gradient samples $(G(\theta_t,\zeta_t))_{t\geq0}$ are contaminated with probability $\eta < 1/2$, that is:
  \[ G(\theta_t,\zeta_t)=U_t\check{G}(\theta_t)+(1 - U_t)\tilde{G}(\theta_t,\zeta_t) \]
  where the $U_t$ are independent and identically distributed Bernoulli random variables with parameter $\eta$, so that $\check{G}(\theta_t)$ is an outlier drawn with probability $\eta$ and $\tilde{G}(\theta_t,\zeta_t)$ is an uncontaminated gradient sample.
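
### Illustrative Sketch

To make the rolling-quantile clipping and the contamination model of Assumption 3 more concrete, here is a minimal Python sketch. It is not the authors' implementation. The toy objective $\frac{1}{2}\|\theta\|^2$, the gradient oracle `sample_gradient`, the outlier value `1e6`, the buffer size `window`, the prefill value `tau_init`, and the choices $\beta = 0.1$, $p = 0.5$, $\eta = 0.05$ are all assumptions made for illustration only.

```python
import math
import random
from collections import deque


def rolling_quantile(values, p):
    """Empirical p-quantile (an order statistic) of the buffered values."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]


def clip_to_norm(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in g))
    alpha = 1.0 if norm == 0.0 else min(1.0, tau / norm)
    return [alpha * x for x in g]


def sample_gradient(theta, rng, eta):
    """Hypothetical gradient oracle for the toy objective 0.5 * ||theta||^2.

    With probability eta the sample is an arbitrary outlier (U_t = 1).
    Otherwise it is the true gradient plus symmetric Pareto noise with
    shape 1.5, which has zero mean but infinite variance.
    """
    if rng.random() < eta:
        return [1e6 for _ in theta]
    return [x + rng.choice((-1.0, 1.0)) * (rng.paretovariate(1.5) - 1.0)
            for x in theta]


def qc_sgd(theta0, steps, beta, p, eta, window=100, tau_init=1.0, seed=0):
    """Constant-step SGD, clipped at a rolling quantile of gradient norms."""
    rng = random.Random(seed)
    theta = list(theta0)
    # Prefill the buffer (an assumption of this sketch) so that early
    # thresholds do not depend on a single, possibly corrupted, sample.
    norms = deque([tau_init] * (window - 1), maxlen=window)
    for _ in range(steps):
        g = sample_gradient(theta, rng, eta)
        norms.append(math.sqrt(sum(x * x for x in g)))
        tau = rolling_quantile(norms, p)  # data-driven clipping threshold
        theta = [x - beta * c for x, c in zip(theta, clip_to_norm(g, tau))]
    return theta


theta = qc_sgd([5.0, -5.0], steps=3000, beta=0.1, p=0.5, eta=0.05)
print("final distance to optimum:", math.hypot(*theta))
```

Because the threshold is a quantile of recent norms rather than a fixed constant, an outlier changes the direction of a step but not its scale, as long as fewer than a fraction $1 - p$ of the buffered norms are corrupted. Sorting the buffer at every step is the simplest option and costs $O(S \log S)$ for a buffer of size $S$. A sorted container could reduce that cost if it matters.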