Abstract: Upper Confidence Bound (UCB) algorithms are a widely-used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference in sequential data collection. This paper bridges these two fundamental aspects--precise regret analysis and adaptive statistical inference--through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. Our resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that the maximal regret of the UCB algorithm deviates from the minimax regret by a logarithmic factor, thereby settling its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.
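As a rough illustration of the objects the abstract refers to, the sketch below simulates a simple UCB index policy on Gaussian arms and compares the pull count of a sub-optimal arm with the classical Lai-Robbins prediction $2\sigma^2\log T/\Delta^2$. It is a minimal sketch under stated assumptions, not the paper's algorithm: the exploration bonus $\sigma\sqrt{2\log T/n_a}$, the Gaussian reward model, and the names `ucb_pull_counts` and `lai_robbins_pulls` are all hypothetical choices made for this example.

```python
import math
import random


def ucb_pull_counts(means, horizon, sigma=1.0, seed=0):
    """Run a simple UCB index policy on Gaussian arms; return pull counts.

    Assumption: the index is empirical mean + sigma * sqrt(2 log T / n_a).
    The exact index analysed in the paper may differ.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls of each arm
    sums = [0.0] * k   # cumulative reward of each arm
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each index is well defined
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + sigma * math.sqrt(2.0 * math.log(horizon) / counts[a]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def lai_robbins_pulls(gap, horizon, sigma=1.0):
    """Classical leading-order pull count of an arm with gap `gap` (Gaussian rewards)."""
    return 2.0 * sigma**2 * math.log(horizon) / gap**2


means = [0.5, 0.0]  # arm 0 is optimal, arm 1 has gap 0.5
T = 5000
counts = ucb_pull_counts(means, T)
predicted = lai_robbins_pulls(means[0] - means[1], T)
print("pull counts:", counts)
print("Lai-Robbins prediction for the sub-optimal arm:", round(predicted, 1))
```

Here the gap of 0.5 is well above $\sigma\sqrt{K\log T/T}\approx 0.06$, which is the regime in which the abstract states the classical formula is exact; shrinking the gap toward that threshold is one way to see the simulated counts and the classical prediction drift apart.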

Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits

uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs

Optimal Rates of (Locally) Differentially Private Heavy-tailed Multi-Armed Bandits

$(ε, u)$-Adaptive Regret Minimization in Heavy-Tailed Bandits

Multi-Fidelity Multi-Armed Bandits Revisited

Data-Driven Upper Confidence Bounds with Near-Optimal Regret for Heavy-Tailed Bandits

Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application

Swimming in curved space or The Baron and the cat

Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards

Algorithms for Differentially Private Multi-Armed Bandits

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

Finite Budget Analysis of Multi-Armed Bandit Problems

Best Arm Identification with Fixed Budget: A Large Deviation Perspective

Minimax-optimal trust-aware multi-armed bandits

Multiarmed Bandits Problem Under the Mean-Variance Setting

Breaking the Moments Condition Barrier: No-Regret Algorithm for Bandits with Super Heavy-Tailed Payoffs

UCB algorithms for multi-armed bandits: Precise regret and adaptive inference