Abstract: In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.
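
The rates stated in the abstract can be checked against one another with a short worked calculation. This is only a sketch: it ignores logarithmic factors and treats the number of actions as a constant (an assumption not spelled out in the abstract), and the symbol $R^{*}$ is introduced here purely for illustration.

$$
R^{*}(T, B_{\mathrm{ex}}) =
\begin{cases}
\widetilde{\Theta}\left(T^{2/3}\right), & B_{\mathrm{ex}} = O(T^{2/3}), \\
\widetilde{\Theta}\left(T/\sqrt{B_{\mathrm{ex}}}\right), & B_{\mathrm{ex}} = \Omega(T^{2/3}).
\end{cases}
$$

At the threshold $B_{\mathrm{ex}} = T^{2/3}$, the second expression gives $T/\sqrt{T^{2/3}} = T^{1 - 1/3} = T^{2/3}$, so the two regimes agree at the transition point. In the total-budget setting, a budget of order $B = T$ gives $T/\sqrt{T} = \sqrt{T}$, which is consistent with the full-information rate quoted above.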

Regret Analysis for Continuous Dueling Bandit

Dueling Bandits: From Two-dueling to Multi-dueling

Bandits with Switching Costs: $T^{2/3}$ Regret

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Dueling Bandits With Weak Regret

Non-Stationary Dueling Bandits Under a Weighted Borda Criterion

Adversarial Multi-dueling Bandits

Batched Dueling Bandits

An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem

Biased Dueling Bandits with Stochastic Delayed Feedback

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

Improved Regret for Bandit Convex Optimization with Delayed Feedback

Adversarial Combinatorial Bandits with Switching Costs

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Understanding the Role of Feedback in Online Learning with Switching Costs

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

Tight Rates for Bandit Control Beyond Quadratics

Copeland Dueling Bandits

Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits