Abstract: We obtain essentially tight upper bounds for a strengthened notion of regret in the stochastic linear bandits framework. The strengthening, referred to as Nash regret, is defined as the difference between the (a priori unknown) optimum and the geometric mean of the expected rewards accumulated by the linear bandit algorithm. Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies the performance of a bandit algorithm as the collective welfare it generates across rounds. NSW is known to satisfy fairness axioms and, hence, an upper bound on Nash regret provides a principled fairness guarantee. We consider the stochastic linear bandits problem over a horizon of $T$ rounds and with a set of arms $X$ in ambient dimension $d$. Furthermore, we focus on settings in which the stochastic reward associated with each arm in $X$ is a non-negative, $\nu$-sub-Poisson random variable. For this setting, we develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d\nu}{T}} \log(T |X|)\right)$. In addition, addressing linear bandit instances in which the set of arms $X$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^{\frac{5}{4}}\nu^{\frac{1}{2}}}{\sqrt{T}} \log(T)\right)$. Since bounded random variables are sub-Poisson, these results hold for bounded, positive rewards. Our linear bandit algorithm is built upon the successive elimination method with novel technical insights, including tailored concentration bounds and the use of sampling via the John ellipsoid in conjunction with the Kiefer-Wolfowitz optimal design.
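To make the definition of Nash regret concrete, the following minimal sketch computes it for a given sequence of per-round expected rewards and contrasts it with the usual average regret, which uses the arithmetic mean instead. The function names `nash_regret` and `average_regret`, and the assumption that the per-round expected rewards and the optimum are known to the caller, are illustrative choices and are not taken from the paper; in the bandit setting itself these quantities are unknown to the algorithm.

```python
import math


def nash_regret(expected_rewards, optimum):
    """Optimum minus the geometric mean of per-round expected rewards.

    Hypothetical helper for illustration; assumes non-negative inputs.
    """
    if not expected_rewards:
        raise ValueError("need at least one round")
    if any(r < 0 for r in expected_rewards):
        raise ValueError("expected rewards must be non-negative")
    if any(r == 0 for r in expected_rewards):
        # A single zero-reward round drives the geometric mean to zero.
        return optimum
    # Average the logs to avoid underflow in a long product.
    log_mean = sum(math.log(r) for r in expected_rewards) / len(expected_rewards)
    return optimum - math.exp(log_mean)


def average_regret(expected_rewards, optimum):
    """Optimum minus the arithmetic mean of per-round expected rewards."""
    if not expected_rewards:
        raise ValueError("need at least one round")
    return optimum - sum(expected_rewards) / len(expected_rewards)


# Two rounds: one optimal pull and one pull with expected reward 0.25.
rewards = [1.0, 0.25]
print(nash_regret(rewards, optimum=1.0))     # about 0.5, since sqrt(1.0 * 0.25) = 0.5
print(average_regret(rewards, optimum=1.0))  # 0.375
```

By the AM-GM inequality the geometric mean never exceeds the arithmetic mean, so Nash regret is always at least the average regret. This is the sense in which it is a strengthened notion: an upper bound on Nash regret implies the same bound on average regret.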

Lipschitz Bandits with Batched Feedback

Batched Lipschitz Bandits

Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards

Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

Understanding Bandits with Graph Feedback

Optimal Batched Linear Bandits

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

Batched Stochastic Bandit for Nondegenerate Functions

Towards Practical Lipschitz Bandits

Online Stochastic Linear Optimization under One-bit Feedback

Adversarial Combinatorial Bandits with Switching Costs

Tight Bounds for Bandit Combinatorial Optimization

Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Improved Regret for Bandit Convex Optimization with Delayed Feedback

Nash Regret Guarantees for Linear Bandits

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

Stochastic Bandits with Graph Feedback in Non-Stationary Environments

Batched Nonparametric Contextual Bandits

Improved Algorithms for Bandit with Graph Feedback via Regret Decomposition

Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback