H. Brendan McMahan
Abstract: We study three families of online convex optimization algorithms: follow-the-proximally-regularized-leader (FTRL-Proximal), regularized dual averaging (RDA), and composite-objective mirror descent. We first prove equivalence theorems that show all of these algorithms are instantiations of a general FTRL update. This provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that RDA is even more effective at producing sparsity. Our results demonstrate that FOBOS uses subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm can be seen as a hybrid of these two, and outperforms both on a large, real-world dataset.
Our second contribution is a unified analysis which produces regret bounds that match (up to logarithmic terms) or improve the best previously known bounds. This analysis also extends these algorithms in two important ways: we support a more general type of composite objective and we analyze implicit updates, which replace the subgradient approximation of the current loss function with an exact optimization.
What problem does this paper attempt to address?
This paper addresses how several online convex optimization (OCO) algorithms relate to one another and why their performance differs. Specifically, the author studies three families of OCO algorithms: Follow-the-Proximally-Regularized-Leader (FTRL-Proximal), Regularized Dual Averaging (RDA), and Composite-Objective Mirror Descent (COMID), and addresses the following questions:
1. **Is there an equivalence among these algorithms?**
- The author proves that all of these algorithms are instantiations of a general FTRL update, which makes the relationships among them explicit.
2. **Why do some algorithms perform better when dealing with sparse models?**
- Especially when using L1 regularization, why does RDA produce sparse solutions more effectively than FOBOS?
- The analysis shows that FOBOS uses subgradient approximations to the L1 penalty from previous rounds, whereas RDA handles the cumulative L1 penalty in closed form, so RDA induces more sparsity.
3. **How to uniformly analyze these algorithms and improve the existing regret bounds?**
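- The author provides a unified analysis whose regret bounds match (up to logarithmic terms) or improve the best previously known bounds. The analysis also extends the algorithms in two ways: it supports a more general type of composite objective, and it covers implicit updates, which replace the subgradient approximation of the current loss function with an exact optimization.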
4. **How do these algorithms perform in practice on large-scale data?**
- The author reports that FTRL-Proximal, which can be seen as a hybrid of FOBOS and RDA, outperforms both on a large, real-world dataset.
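The sparsity difference described in question 2 can be illustrated with a minimal per-coordinate sketch. It uses the standard soft-thresholding forms of the FOBOS and RDA updates with an L1 penalty, not code from the paper. The function names, the step size `eta`, and the regularization weight `sigma_sum` are assumptions chosen for illustration.

```python
def soft_threshold(z, tau):
    """Shrink z toward zero by tau; any |z| <= tau becomes exactly 0."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def fobos_l1_step(x, g, eta, lam):
    """One FOBOS coordinate update (illustrative form).

    Takes a gradient step, then shrinks by this round's penalty eta * lam.
    Penalties from earlier rounds are only reflected in the current x.
    """
    return soft_threshold(x - eta * g, eta * lam)


def rda_l1_step(g_sum, t, sigma_sum, lam):
    """One RDA coordinate update (illustrative form).

    Minimizes g_sum * x + t * lam * |x| + (sigma_sum / 2) * x**2 in closed
    form, so the summed gradient is thresholded by the cumulative penalty.
    """
    return -soft_threshold(g_sum, t * lam) / sigma_sum
```

As a hypothetical example, take `lam = 0.5`, `eta = 1.0`, `sigma_sum = 1.0`, a starting point of 0, and the gradient sequence 0.1, 0.1, 0.1, -2.0. The FOBOS step ends at 1.5, because only the final round's penalty is applied to the large last gradient. The RDA step returns 0, because the summed gradient (-1.7) stays below the cumulative threshold `4 * 0.5 = 2.0`.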
In summary, the main contributions of this paper are:
- Proving that the three families of OCO algorithms are all instantiations of a general FTRL update, which reveals how they relate to one another.
- Explaining why RDA produces sparser solutions than FOBOS under L1 regularization.
- Providing a unified analysis whose regret bounds match (up to logarithmic terms) or improve the best previously known bounds, and extending the algorithms to more general composite objectives and implicit updates.
- Showing experimentally that FTRL-Proximal outperforms both RDA and FOBOS on a large, real-world dataset.
For reference, the L1 regularization penalty is defined as:
\[ \text{L1 regularization} = \lambda \| x \|_1 \]
where \(\lambda\) is the regularization parameter and \(\| x \|_1 = \sum_i |x_i|\) is the L1 norm of the vector \(x\), that is, the sum of the absolute values of its elements.
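As a small illustration of this definition, the penalty can be computed directly. The function name `l1_penalty` is an assumption made for this sketch.

```python
def l1_penalty(x, lam):
    """Return lam * ||x||_1, i.e. lam times the sum of absolute values of x."""
    return lam * sum(abs(xi) for xi in x)
```

For example, `l1_penalty([1.0, -2.0, 0.0], 0.5)` evaluates to `0.5 * (1 + 2 + 0) = 1.5`.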
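In addition, in its simplest form (quadratic regularization centered at the origin, with no L1 term), the FTRL update rule can be expressed as:
\[ x_{t + 1} = \arg \min_x \left( g_{1:t} \cdot x + \frac{\sigma_{1:t}}{2} \| x \|^2_2 \right) \]
where \(g_{1:t} = \sum_{s=1}^{t} g_s\) is the sum of the (sub)gradients of the loss functions over the first \(t\) rounds, and \(\sigma_{1:t} = \sum_{s=1}^{t} \sigma_s\) is the cumulative strength of the quadratic regularization, which determines how strongly convex the objective is.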
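Because the objective in this update is a quadratic in \(x\), setting its gradient \(g_{1:t} + \sigma_{1:t} x\) to zero gives the closed form \(x_{t+1} = -g_{1:t} / \sigma_{1:t}\). The sketch below applies that closed form in a simple loop. The function names, the `gradient_fn` callback, and the schedule \(\sigma_{1:t} = \sqrt{t} / \eta\) are assumptions made for illustration, not details taken from the paper.

```python
import math


def ftrl_update(g_sum, sigma_sum):
    """Closed-form minimizer of g_sum . x + (sigma_sum / 2) * ||x||_2^2."""
    return [-g / sigma_sum for g in g_sum]


def run_ftrl(gradient_fn, dim, rounds, eta=1.0):
    """Run the quadratic FTRL update for a fixed number of rounds.

    gradient_fn(x) is a hypothetical callback returning the (sub)gradient
    of the current round's loss at x, as a list of length dim.
    """
    x = [0.0] * dim
    g_sum = [0.0] * dim
    for t in range(1, rounds + 1):
        g = gradient_fn(x)
        g_sum = [a + b for a, b in zip(g_sum, g)]
        # Assumed schedule: cumulative regularization grows like sqrt(t).
        sigma_sum = math.sqrt(t) / eta
        x = ftrl_update(g_sum, sigma_sum)
    return x
```

Only the running gradient sum and the scalar `sigma_sum` need to be stored between rounds, since the next point depends on the past only through those two quantities.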
These formulas and analyses help to understand the behavior and performance differences of different algorithms in online learning.