Abstract: In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.
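
The phase transition stated in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it only evaluates the stated regret orders, and it assumes that constants, logarithmic factors, and any dependence on the number of actions are dropped. The function names `regret_with_extra_observations` and `regret_with_total_budget` are hypothetical and chosen for illustration.

```python
import math


def regret_with_extra_observations(T: int, B_ex: float) -> float:
    """Order of the minimax regret under bandit feedback plus B_ex extra observations.

    Assumption of this sketch: constants, log factors, and the dependence
    on the number of actions are ignored.
    """
    bandit_rate = T ** (2 / 3)
    if B_ex <= 0:
        # No extra observations: plain bandit feedback with switching costs.
        return bandit_rate
    # Below the T^(2/3) threshold the extra observations do not change the order;
    # above it the order is T / sqrt(B_ex).
    return min(bandit_rate, T / math.sqrt(B_ex))


def regret_with_total_budget(T: int, B: float) -> float:
    """Order of the minimax regret with a budget of B total observations."""
    if B <= 0:
        raise ValueError("B must be positive")
    return T / math.sqrt(B)
```

For instance, with $T = 10^6$ the threshold is $T^{2/3} = 10^4$: calling `regret_with_extra_observations(10**6, 100)` stays at roughly $10^4$, while `regret_with_extra_observations(10**6, 10**6)` drops to roughly $10^3 = \sqrt{T}$. The two branches agree at $B_{\mathrm{ex}} = T^{2/3}$, so the rate is continuous across the transition.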

Bounds on the price of feedback for mistake-bounded online learning

Bandit-Feedback Online Multiclass Classification: Variants and Tradeoffs

Understanding the Role of Feedback in Online Learning with Switching Costs

Online Learning with Feedback Graphs: Beyond Bandits

Combinatorial Bandits with Relative Feedback

Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

Improved Regret Bounds for Online Kernel Selection under Bandit Feedback

Online Stochastic Linear Optimization under One-bit Feedback

On the price of exact truthfulness in incentive-compatible online learning with bandit feedback: A regret lower bound for WSU-UX

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Bandits with Switching Costs: $T^{2/3}$ Regret

Near-Optimal Learning of Extensive-Form Games with Imperfect Information

Best-Case Lower Bounds in Online Learning

Online Learning with Set-Valued Feedback

On Adaptivity in Information-constrained Online Learning

On the Minimax Regret in Online Ranking with Top-k Feedback

Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning

Learning Thresholds with Latent Values and Censored Feedback

Cooperative Online Learning with Feedback Graphs

Near Optimal Memory-Regret Tradeoff for Online Learning

Improved Regret Bounds for Bandits with Expert Advice