Abstract: In $K$-armed dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality relative to a $\textit{winner}$ arm. The $\textit{non-stationary}$ variant of the problem, motivated by changing user preferences, has received recent interest (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal here is to design algorithms with low $\textit{dynamic regret}$, ideally without foreknowledge of the amount of change. The notion of regret is tied to a notion of winner arm, most typically taken to be a so-called Condorcet winner or a Borda winner. However, the aforementioned results mostly focus on the Condorcet winner. In comparison, the Borda version of this problem has received less attention and is the focus of this work. We establish the first optimal and adaptive dynamic regret upper bound $\tilde{O}(\tilde{L}^{1/3} K^{1/3} T^{2/3})$, where $\tilde{L}$ is the unknown number of significant Borda winner switches. We also introduce a novel $\textit{weighted Borda score}$ framework which generalizes both the Borda and Condorcet problems. This framework surprisingly allows a Borda-style regret analysis of the Condorcet problem and establishes improved bounds over the theoretical state of the art in regimes with a large number of arms or many spurious changes in Condorcet winner. Such a generalization was not previously known and could be of independent interest.
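To make the two winner notions concrete, the following is a minimal sketch under stated assumptions, not the algorithm of this work. It assumes the commonly used definitions: a Condorcet winner beats every other arm with probability above 1/2, and the Borda score of an arm is its average win probability against the other $K-1$ arms. The function names and the preference matrix `P` are hypothetical and chosen only for illustration.

```python
def borda_scores(P):
    """Average win probability of each arm against the other K-1 arms.

    P[a][b] is the assumed probability that arm a beats arm b.
    """
    K = len(P)
    return [sum(P[a][b] for b in range(K) if b != a) / (K - 1) for a in range(K)]


def condorcet_winner(P):
    """Return the arm that beats every other arm with probability > 1/2, or None."""
    K = len(P)
    for a in range(K):
        if all(P[a][b] > 0.5 for b in range(K) if b != a):
            return a
    return None


# Hypothetical 3-arm preference matrix (illustrative values only).
P = [
    [0.5, 0.6, 0.6],
    [0.4, 0.5, 0.9],
    [0.4, 0.1, 0.5],
]

scores = borda_scores(P)
borda_winner = max(range(len(P)), key=lambda a: scores[a])
print("Borda scores:", scores)
print("Borda winner:", borda_winner)
print("Condorcet winner:", condorcet_winner(P))
```

In this illustrative matrix the two notions disagree: arm 0 narrowly beats both other arms and is the Condorcet winner, while arm 1 has the highest average win probability and is the Borda winner. A change in preferences can therefore switch one winner without switching the other, which is why the two problems are analyzed separately.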

Dueling Bandits with Qualitative Feedback

Dueling Bandits: From Two-dueling to Multi-dueling

Advancements in Dueling Bandits

Multi-dueling Bandits with Dependent Arms

Biased Dueling Bandits with Stochastic Delayed Feedback

Neural Dueling Bandits

Dueling Bandits With Weak Regret

Graph Feedback Bandits with Similar Arms

Batched Dueling Bandits

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Dueling Bandits with Adversarial Sleeping

Non-Stationary Dueling Bandits Under a Weighted Borda Criterion

Adversarial Multi-dueling Bandits

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Conversational Dueling Bandits in Generalized Linear Models

Feel-Good Thompson Sampling for Contextual Dueling Bandits

Regret Analysis for Continuous Dueling Bandit

DP-Dueling: Learning from Preference Feedback without Compromising User Privacy

Preference-based Online Learning with Dueling Bandits: A Survey

Non-stationary Dueling Bandits for Online Learning to Rank

Kernelized Offline Contextual Dueling Bandits