Regret Bound by Variation for Online Convex Optimization

Tianbao Yang, Mehrdad Mahdavi, Rong Jin, Shenghuo Zhu
DOI: https://doi.org/10.48550/arXiv.1111.6337
2012-06-14
Abstract: In [Hazan-2008-extract], the authors showed that the regret of online linear optimization can be bounded by the total variation of the cost vectors. In this paper, we extend this result to general online convex optimization. We first analyze the limitations of the algorithm in [Hazan-2008-extract] when applied to online convex optimization. We then present two algorithms for online convex optimization whose regrets are bounded by the variation of cost functions. We finally consider the bandit setting, and present a randomized algorithm for online bandit convex optimization with a variation-based regret bound. We show that the regret bound for online bandit convex optimization is optimal when the variation of cost functions is independent of the number of trials.
Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is how to bound the regret of Online Convex Optimization (OCO) by the variation of the cost functions rather than by the number of trials alone. Specifically, the authors develop algorithms whose regret depends on how much the cost functions change from round to round, so that the regret stays small when the environment changes slowly.

### Background and Motivation

Online convex optimization is an iterative decision-making process. In each round \( t \), the decision-maker selects a decision vector \( \mathbf{x}_t \in P \), then receives a convex cost function \( c_t(\mathbf{x}) \) and incurs the cost \( c_t(\mathbf{x}_t) \). The goal is to minimize the cumulative regret:

\[ \text{regret} = \sum_{t = 1}^T c_t(\mathbf{x}_t)-\min_{\mathbf{x}\in P}\sum_{t = 1}^T c_t(\mathbf{x}) \]

Previous research has mainly bounded the regret as a function of the number of trials \( T \). Such bounds are pessimistic when the cost functions change little between rounds, because they do not exploit this regularity. This paper therefore bounds the regret in terms of the amount of change in the cost functions.

### Main Contributions

1. **Analysis of the Limitations of the FTRL Algorithm**: The authors first analyze the limitations of the Follow the Regularized Leader (FTRL) algorithm when applied to online convex optimization, and show that applying it directly may not yield a variation-based bound. In particular, even when all cost functions are identical or change very little, the regret of FTRL may still grow at a rate of \( O(\sqrt{T}) \).
2. **Introduction of Sequential Variation**: To measure how the cost functions change, the authors introduce the sequential variation, defined as:

   \[ \text{VAR}_s^T=\sum_{t = 1}^{T - 1}\max_{\mathbf{x}\in P}\|\nabla c_{t + 1}(\mathbf{x})-\nabla c_t(\mathbf{x})\|_2^2 \]

   This quantity measures the change in the gradient of the cost function between consecutive rounds, taking the worst case over the feasible set \( P \).
3. **Two New Algorithms**:
   - **Improved FTRL Algorithm**: maintains two sequences (decision vectors and search vectors) and bounds the regret in terms of the extended sequential variation \( \text{EVAR}_s^T \).
   - **Prox Method (based on the mirror-prox method)**: also maintains two sequences and bounds the regret through a similar mechanism.
4. **Theoretical Results**: The authors prove that the regret of both new algorithms is bounded in terms of the (extended) sequential variation, specifically:

   \[ \sum_{t = 1}^T c_t(\mathbf{x}_t)-\min_{\mathbf{x}\in P}\sum_{t = 1}^T c_t(\mathbf{x})\leq O\left(\sqrt{\text{EVAR}_s^T}\right)+\text{constant} \]

### Conclusions and Prospects

This research proposes two new online convex optimization algorithms whose regret is bounded in terms of the variation of the cost functions, so the regret remains small when the cost functions change slowly. According to the abstract, the paper also covers the bandit (partial-feedback) setting, presenting a randomized algorithm for online bandit convex optimization with a variation-based regret bound that is optimal when the variation is independent of the number of trials. Future work can explore how to further reduce the dependence of the regret on the number of trials \( T \).
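To make the regret definition from the Background section concrete, here is a minimal sketch that evaluates it for a hypothetical family of quadratic costs \( c_t(\mathbf{x}) = \tfrac12\|\mathbf{x}-\theta_t\|_2^2 \) on a box \( P = [lo, hi]^d \). The cost family, the box, and the names `project_box`, `quad_cost`, and `regret` are assumptions for illustration; the paper allows any convex costs and feasible set. For this family the best fixed decision in hindsight has a closed form (the projection of the mean of the \( \theta_t \) onto \( P \)), so no inner optimizer is needed.

```python
from typing import List, Sequence

Vector = List[float]


def project_box(x: Sequence[float], lo: float, hi: float) -> Vector:
    """Euclidean projection onto the box P = [lo, hi]^d (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def quad_cost(x: Sequence[float], theta: Sequence[float]) -> float:
    """Hypothetical convex cost c_t(x) = 0.5 * ||x - theta_t||_2^2."""
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, theta))


def regret(xs: Sequence[Sequence[float]], thetas: Sequence[Sequence[float]],
           lo: float, hi: float) -> float:
    """sum_t c_t(x_t) - min_{x in P} sum_t c_t(x) for the quadratic costs above.

    sum_t c_t(x) = (T/2) * ||x - mean(theta)||^2 + const, and the box is
    separable, so the best fixed decision in hindsight is mean(theta)
    clipped to [lo, hi] coordinate-wise.
    """
    T = len(thetas)
    d = len(thetas[0])
    mean = [sum(th[i] for th in thetas) / T for i in range(d)]
    x_star = project_box(mean, lo, hi)
    played_cost = sum(quad_cost(x, th) for x, th in zip(xs, thetas))
    best_cost = sum(quad_cost(x_star, th) for th in thetas)
    return played_cost - best_cost
```

For example, with three identical costs centered at \( (0.5, 0.5) \) and decisions \( (0,0), (0.5,0.5), (0.5,0.5) \) on \( [0,1]^2 \), only the first round is suboptimal and the regret is \( 0.25 \).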
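The sequential variation \( \text{VAR}_s^T \) from contribution 2 can be approximated numerically. In general the inner maximum over \( P \) has no closed form, so this sketch (an assumption, not the paper's procedure) takes the maximum over a finite set of sample points from \( P \); the result is a lower bound on \( \text{VAR}_s^T \) unless a maximizer is among the points. Gradients are passed as Python callables, and `sequential_variation` and `sq_norm` are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Gradient = Callable[[Sequence[float]], Vector]


def sq_norm(v: Sequence[float]) -> float:
    """Squared Euclidean norm ||v||_2^2."""
    return sum(vi * vi for vi in v)


def sequential_variation(grads: Sequence[Gradient],
                         points: Sequence[Sequence[float]]) -> float:
    """Approximate VAR_s^T = sum_{t=1}^{T-1} max_{x in P} ||grad c_{t+1}(x) - grad c_t(x)||_2^2.

    grads[t] is the gradient map of c_{t+1} (0-based list). The max over P
    is replaced by a max over the non-empty sample set `points` (assumed to
    lie in P), which under-estimates the exact quantity in general.
    """
    total = 0.0
    for g_t, g_next in zip(grads[:-1], grads[1:]):
        total += max(
            sq_norm([a - b for a, b in zip(g_next(x), g_t(x))]) for x in points
        )
    return total
```

For linear costs \( c_t(\mathbf{x}) = \langle \mathbf{f}_t, \mathbf{x} \rangle \) the gradient is \( \mathbf{f}_t \) everywhere, so the sampled maximum is exact and \( \text{VAR}_s^T = \sum_t \|\mathbf{f}_{t+1}-\mathbf{f}_t\|_2^2 \). For instance, \( \mathbf{f} = (1,0), (1,0), (0,1) \) gives \( 0 + 2 = 2 \), and a constant sequence gives \( 0 \).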
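Both new algorithms maintain two sequences, decision vectors and search vectors. A simple way to see why this helps is optimistic online gradient descent, an extragradient-style method swapped in here for illustration; it is not the paper's Improved FTRL or Prox method. The box feasible set, Euclidean projection, fixed step size `eta`, and the names `optimistic_ogd` and `project_box` are assumptions. The decision \( \mathbf{x}_t \) is computed from the search point using the previous round's gradient as a guess; when consecutive gradients are close (small variation), that guess is accurate.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Gradient = Callable[[Sequence[float]], Vector]


def project_box(x: Sequence[float], lo: float, hi: float) -> Vector:
    """Euclidean projection onto the box P = [lo, hi]^d (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def optimistic_ogd(grad_fns: Sequence[Gradient], x0: Sequence[float],
                   eta: float, lo: float, hi: float) -> List[Vector]:
    """Optimistic online gradient descent on P = [lo, hi]^d (illustrative sketch).

    Maintains played decisions x_t and auxiliary search points y_t. The hint
    used in round t is the gradient observed in round t - 1 (zero in round 1).
    grad_fns[t] is only queried at x_t, after x_t has been chosen.
    """
    d = len(x0)
    y = project_box(x0, lo, hi)
    hint: Vector = [0.0] * d
    played: List[Vector] = []
    for grad in grad_fns:
        # Decision step: move from y_t along the predicted gradient (the hint).
        x = project_box([yi - eta * mi for yi, mi in zip(y, hint)], lo, hi)
        played.append(x)
        # The cost c_t is revealed; evaluate its gradient at the played point.
        g = grad(x)
        # Search step: update y_t with the true gradient of c_t at x_t.
        y = project_box([yi - eta * gi for yi, gi in zip(y, g)], lo, hi)
        hint = g
    return played
```

With identical costs \( c_t(\mathbf{x}) = \tfrac12\|\mathbf{x}-\theta\|_2^2 \), \( \theta = (0.3, 0.7) \), a start at the origin on \( [0,1]^2 \), and `eta = 0.5`, the played decisions converge to \( \theta \) within a few dozen rounds, because the hint matches the next gradient almost exactly once the iterates settle.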