Abstract: We study three families of online convex optimization algorithms: follow-the-proximally-regularized-leader (FTRL-Proximal), regularized dual averaging (RDA), and composite-objective mirror descent. We first prove equivalence theorems that show all of these algorithms are instantiations of a general FTRL update. This provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that RDA is even more effective at producing sparsity. Our results demonstrate that FOBOS uses subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm can be seen as a hybrid of these two, and outperforms both on a large, real-world dataset. Our second contribution is a unified analysis which produces regret bounds that match (up to logarithmic terms) or improve the best previously known bounds. This analysis also extends these algorithms in two important ways: we support a more general type of composite objective and we analyze implicit updates, which replace the subgradient approximation of the current loss function with an exact optimization.

What problem does this paper attempt to address?

This paper examines how several Online Convex Optimization (OCO) algorithms relate to one another and why their performance differs. Specifically, the author studies three families of OCO algorithms, Follow-the-Proximally-Regularized-Leader (FTRL-Proximal), Regularized Dual Averaging (RDA), and Composite-Objective Mirror Descent (COMID), and addresses the following questions:

1. **Are these algorithms equivalent?**
   - The author proves that all of them are instances of a general FTRL update, which reveals how they relate to one another.
2. **Why do some algorithms produce sparser models than others?**
   - In particular, with L1 regularization, why does RDA produce sparse solutions more effectively than FOBOS, a composite mirror descent algorithm?
   - The theoretical analysis shows that FOBOS uses subgradient approximations to the L1 penalty from previous rounds, whereas RDA handles the cumulative L1 penalty in closed form, so RDA induces more sparsity.
3. **How can these algorithms be analyzed in a unified way, and can the existing regret bounds be improved?**
   - The author provides a unified analysis whose regret bounds match (up to logarithmic terms) or improve the best previously known bounds. The analysis also extends the algorithms to support implicit updates and a more general type of composite objective.
4. **How do these algorithms perform on practical problems, especially large-scale machine learning tasks?**
   - The experiments show FTRL-Proximal outperforming both RDA and FOBOS on a large, real-world dataset, a setting that matters when L1 or other non-smooth regularization is required.

In summary, the main contributions of this paper are:

- Proving equivalence among the three OCO algorithm families and revealing how they relate.
- Explaining why some algorithms produce sparser models than others.
- Proposing a unified analysis that matches or improves the existing regret bounds and extends the algorithms to a wider range of settings.
- Experimentally verifying the effectiveness of FTRL-Proximal in a practical application.

Regarding formulas, the L1 regularization term is defined as:

\[ \text{L1 regularization} = \lambda \| x \|_1 \]

where \(\lambda\) is the regularization parameter and \(\| x \|_1\) is the L1 norm of the vector \(x\), that is, the sum of the absolute values of its elements.

In addition, in its simplest form (quadratic regularization centered at the origin, with no L1 term), the FTRL update rule can be expressed as:

\[ x_{t + 1} = \arg \min_x \left( g_{1:t} \cdot x + \frac{\sigma_{1:t}}{2} \| x \|^2_2 \right) \]

where \(g_{1:t}\) is the sum of the loss gradients over the first \(t\) rounds and \(\sigma_{1:t}\) is the accumulated strong-convexity weight of the regularization. These formulas and analyses help explain the behavior and performance differences of the algorithms in online learning.
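The FTRL update written above has a simple closed form, and adding a cumulative L1 penalty of the kind attributed to RDA shows how exact zeros can arise. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name `ftrl_quadratic_update` and the parameter `l1_total` (standing for the accumulated L1 weight, for example \(t\lambda\)) are hypothetical, and the objective is assumed to be \(g_{1:t} \cdot x + \text{l1\_total} \, \| x \|_1 + \frac{\sigma_{1:t}}{2} \| x \|_2^2\). This objective separates per coordinate, so each coordinate is solved by soft-thresholding.

```python
from typing import List


def ftrl_quadratic_update(g_sum: List[float], sigma_sum: float,
                          l1_total: float = 0.0) -> List[float]:
    """Minimize g_sum . x + l1_total * ||x||_1 + (sigma_sum / 2) * ||x||_2^2.

    g_sum     -- accumulated gradient g_{1:t}, one entry per coordinate
    sigma_sum -- accumulated strong-convexity weight sigma_{1:t} (must be > 0)
    l1_total  -- accumulated L1 weight (hypothetical parameter; 0 disables L1)

    With l1_total = 0 this reduces to x_i = -g_i / sigma_sum, the minimizer
    of the quadratic-only update.
    """
    if sigma_sum <= 0:
        raise ValueError("sigma_sum must be positive")
    x = []
    for g in g_sum:
        # Soft-threshold the accumulated gradient by the accumulated L1 weight.
        shrunk = max(abs(g) - l1_total, 0.0)
        if shrunk == 0.0:
            # |g| <= l1_total: the minimizer for this coordinate is exactly zero.
            x.append(0.0)
        else:
            sign = 1.0 if g > 0 else -1.0
            x.append(-sign * shrunk / sigma_sum)
    return x


if __name__ == "__main__":
    g = [3.0, -0.5, 1.0]
    print(ftrl_quadratic_update(g, sigma_sum=2.0))                # no L1 term
    print(ftrl_quadratic_update(g, sigma_sum=2.0, l1_total=1.0))  # with L1 term
```

In this sketch, any coordinate whose accumulated gradient is no larger in magnitude than the accumulated L1 weight is set exactly to zero. This illustrates the mechanism by which handling the cumulative penalty in closed form can produce sparsity; it does not reproduce the FOBOS or FTRL-Proximal updates analyzed in the paper.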

A Unified View of Regularized Dual Averaging and Mirror Descent with Implicit Updates

Parameter-free Mirror Descent

Inexact Online Proximal Mirror Descent for time-varying composite optimization

Adaptive Mirror Descent Bilevel Optimization

A Mirror Descent Perspective of Smoothed Sign Descent

Equivalence Analysis between Counterfactual Regret Minimization and Online Mirror Descent

Non-convex online learning via algorithmic equivalence

Improving Dynamic Regret in Distributed Online Mirror Descent Using Primal and Dual Information

A generalization of regularized dual averaging and its dynamics

Boosting Data-Driven Mirror Descent with Randomization, Equivariance, and Acceleration

Investigating Variance Definitions for Mirror Descent with Relative Smoothness

Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization

Sparse Q-learning with Mirror Descent

A Unified Approach to Controlling Implicit Regularization via Mirror Descent

The Information Geometry of Mirror Descent

Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence

Mirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems

Faster Margin Maximization Rates for Generic and Adversarially Robust Optimization Methods

Efficient Adaptive Online Learning Via Frequent Directions

Analysis Accelerated Mirror Descent via High-resolution ODEs

Mirror Duality in Convex Optimization