Variance reduction techniques for stochastic proximal point algorithms

Cheik Traoré, Vassilis Apidopoulos, Saverio Salzo, Silvia Villa
2024-08-06
Abstract: In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as well as their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size. However, their variance-reduced versions are not as well studied as the gradient ones. In this work, we propose the first unified study of variance reduction techniques for stochastic proximal point algorithms. We introduce a generic stochastic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants. For this algorithm, in the smooth setting, we provide several convergence rates for the iterates and the objective function values, which are faster than those of the vanilla stochastic proximal point algorithm. More specifically, for convex functions, we prove a sublinear convergence rate of $O(1/k)$. In addition, under the Polyak-Łojasiewicz (PL) condition, we obtain linear convergence rates. Finally, our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts in terms of the stability with respect to the choice of the step size in most cases, especially for difficult problems.
Optimization and Control, Machine Learning
What problem does this paper attempt to address?
This paper brings the variance reduction techniques used in finite-sum optimization to the Stochastic Proximal Point Algorithm (SPPA). Specifically, it proposes a unified variance reduction framework for SPPA that can be specialized to proximal versions of SVRG (Stochastic Variance Reduced Gradient), SAGA, and their variants. In the smooth setting, the authors prove convergence rates faster than those of the vanilla SPPA: a sublinear rate of \(O(1/k)\) for convex functions and a linear rate under the Polyak-Łojasiewicz (PL) condition. In addition, numerical experiments show that in most cases, and especially on difficult problems, the proposed proximal variance reduction methods are less sensitive to the choice of step size than their gradient counterparts.

### Background of the Paper and Problem Definition

In machine learning and deep learning, a common optimization problem is Empirical Risk Minimization (ERM), whose goal is to minimize an objective function of the following form:

\[ \min_{x \in H} F(x)=\frac{1}{n} \sum_{i = 1}^{n} f_i(x), \]

where \(H\) is a separable Hilbert space, \(f_i: H\rightarrow\mathbb{R}\) is a loss function, \(n\) is the number of data points, and \(x\in H\) contains the model parameters. For large-scale data sets, traditional Gradient Descent (GD) is very expensive in terms of both computation and storage, because every iteration requires all \(n\) gradients. Therefore, in recent years, many variants of Stochastic Gradient Descent (SGD) have been proposed to solve this problem. However, SGD usually converges more slowly than deterministic GD and is very sensitive to the choice of step size.
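As a concrete instance of the finite-sum objective above, the following minimal sketch uses least-squares losses \(f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2\) on synthetic data. The choice of loss, the data, and all function names are illustrative assumptions and are not taken from the paper. The sketch contrasts the cost of a full gradient (all \(n\) samples) with a single SGD step (one sample), and checks that the average of the component gradients equals the full gradient, which is what makes the stochastic gradient an unbiased estimate.

```python
import numpy as np

def component_grad(x, a_i, b_i):
    # Gradient of the assumed loss f_i(x) = 0.5 * (a_i @ x - b_i)**2.
    return a_i * (a_i @ x - b_i)

def full_objective(x, A, b):
    # F(x) = (1/n) * sum_i f_i(x)
    return 0.5 * np.mean((A @ x - b) ** 2)

def full_grad(x, A, b):
    # Gradient of F: touches all n data points.
    return A.T @ (A @ x - b) / len(b)

def sgd_step(x, A, b, step, rng):
    # One SGD step: touches a single, uniformly sampled data point.
    i = rng.integers(len(b))
    return x - step * component_grad(x, A[i], b[i])

# Synthetic data (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = A @ rng.standard_normal(5)
x0 = np.zeros(5)

# Averaging the component gradients recovers the full gradient.
avg_component_grad = np.mean(
    [component_grad(x0, A[i], b[i]) for i in range(len(b))], axis=0
)
x1 = sgd_step(x0, A, b, step=0.1, rng=rng)
```

The step size passed to `sgd_step` is the quantity that SGD is sensitive to: the update is explicit, so a step that is too large for the sampled \(f_i\) can overshoot.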
### Stochastic Proximal Point Algorithm (SPPA)

As an alternative, the Stochastic Proximal Point Algorithm (SPPA) has attracted attention because it is more stable with respect to the choice of step size. At each iteration, SPPA applies the proximal operator of a randomly selected \(f_i\) instead of taking a gradient step. However, there are relatively few studies on variance reduction techniques for SPPA.

### Variance Reduction Techniques

Variance reduction techniques (such as SVRG and SAGA) reduce the variance of the stochastic gradient estimates, which lets stochastic methods recover the convergence rates of standard GD. These techniques have been widely studied for SGD, but their application to SPPA is relatively limited.

### Contributions of the Paper

1. **Unified Variance Reduction Technique**: The paper proposes a unified variance reduction framework applicable to SPPA. This framework can generate proximal versions of SVRG, SAGA, and L-SVRG.
2. **Improved Convergence Rates**: In the smooth setting, the paper proves rates faster than those of the vanilla SPPA. For convex functions, it proves a sublinear convergence rate of \(O(1/k)\); under the PL condition, it proves a linear convergence rate.
3. **Numerical Experiments**: The experimental results show that in most cases, and especially on difficult problems, the proposed proximal variance reduction methods are less sensitive to the choice of step size than their gradient counterparts.

### Conclusion

By proposing a unified variance reduction framework, this paper improves the convergence guarantees of SPPA while retaining its stability with respect to the step size. This not only expands the application range of variance reduction techniques but also provides a new tool for dealing with large-scale optimization problems.
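To make the idea of a variance-reduced proximal step more tangible, here is a minimal sketch of an SVRG-style proximal point loop for least-squares losses \(f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2\), whose proximal operator has a closed form. The update \(x^{k+1} = \mathrm{prox}_{\alpha f_{i_k}}(x^k + \alpha e^k)\) with \(e^k = \nabla f_{i_k}(\tilde{x}) - \nabla F(\tilde{x})\) is an assumption about how such a step can be written; the summary above does not state the paper's exact update rule. The loss, data, step size, and all function names are likewise hypothetical.

```python
import numpy as np

def prox_least_squares(v, a_i, b_i, step):
    # Closed-form prox of step * f_i at v, for the assumed loss
    # f_i(x) = 0.5 * (a_i @ x - b_i)**2.
    residual = (a_i @ v - b_i) / (1.0 + step * (a_i @ a_i))
    return v - step * residual * a_i

def full_grad(x, A, b):
    return A.T @ (A @ x - b) / len(b)

def objective(x, A, b):
    return 0.5 * np.mean((A @ x - b) ** 2)

def svrg_prox(A, b, step, n_epochs, inner_iters, seed=0):
    # SVRG-style proximal loop (illustrative sketch, not the paper's code).
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_epochs):
        # Outer loop: take a snapshot and compute its full gradient once.
        snapshot = x.copy()
        snapshot_grad = full_grad(snapshot, A, b)
        for _ in range(inner_iters):
            i = rng.integers(n)
            # Assumed correction: e = grad f_i(snapshot) - grad F(snapshot).
            e = A[i] * (A[i] @ snapshot - b[i]) - snapshot_grad
            # Implicit (proximal) step on f_i from the corrected point.
            x = prox_least_squares(x + step * e, A[i], b[i], step)
    return x

# Synthetic, consistent linear system with unit-norm rows (hypothetical data).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
b = A @ rng.standard_normal(5)
x_final = svrg_prox(A, b, step=0.1, n_epochs=20, inner_iters=100)
```

Because the step on \(f_i\) is implicit, the factor \(1/(1 + \alpha \lVert a_i \rVert^2)\) in `prox_least_squares` damps the update as the step size \(\alpha\) grows, which gives some intuition for why proximal methods tolerate a wider range of step sizes than explicit gradient steps.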