Abstract: Upper Confidence Bound (UCB) algorithms are a widely used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference under sequential data collection. This paper bridges these two fundamental aspects--precise regret analysis and adaptive statistical inference--through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. Our resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that the maximal regret of the UCB algorithm deviates from the minimax regret by a logarithmic factor, thereby settling its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.
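For readers who want a concrete picture of the quantities the abstract refers to (the number of pulls of each arm and the resulting regret), the following is a minimal simulation sketch of a generic UCB index rule. It is not the paper's exact algorithm: the function name `run_ucb`, the Gaussian reward model, and the exploration bonus $\sigma\sqrt{2\log T/n_a}$ are all assumptions chosen for illustration, and the index studied in the paper may use a different constant or form.

```python
import math
import random


def run_ucb(means, horizon, sigma=1.0, seed=0):
    """Simulate a generic UCB index rule on Gaussian arms (illustrative only).

    means   -- true mean reward of each arm (unknown to the algorithm)
    horizon -- number of rounds T
    sigma   -- reward standard deviation, assumed known here
    Returns (pulls, pseudo_regret).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k    # n_a: number of times each arm has been pulled
    sums = [0.0] * k   # running sum of observed rewards per arm

    def index(a):
        # Empirical mean plus an exploration bonus. The constant 2 inside
        # the square root is an assumption, not taken from the paper.
        bonus = sigma * math.sqrt(2.0 * math.log(horizon) / pulls[a])
        return sums[a] / pulls[a] + bonus

    for t in range(horizon):
        if t < k:
            arm = t  # initialization: pull every arm once
        else:
            arm = max(range(k), key=index)
        reward = rng.gauss(means[arm], sigma)
        pulls[arm] += 1
        sums[arm] += reward

    # Pseudo-regret: sum over arms of (sub-optimality gap) x (number of pulls).
    best = max(means)
    pseudo_regret = sum((best - m) * n for m, n in zip(means, pulls))
    return pulls, pseudo_regret


if __name__ == "__main__":
    pulls, regret = run_ucb([0.0, 1.0], horizon=2000)
    print("pulls per arm:", pulls)
    print("pseudo-regret:", regret)
```

Because the pseudo-regret is a gap-weighted sum of the arm-pull counts, a characterization of those counts translates directly into a regret formula, which is the connection the abstract describes.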

Autoregressive Bandits

Non-Stationary Bandits with Auto-Regressive Temporal Dependency

Non-Stationary Latent Auto-Regressive Bandits

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

BOF-UCB: A Bayesian-Optimistic Frequentist Algorithm for Non-Stationary Contextual Bandits

Best Arm Identification for Stochastic Rising Bandits

Bandit Learning with Delayed Impact of Actions

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Swimming in curved space or The Baron and the cat

Posterior Sampling via Autoregressive Generation

Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits

Regulating Greed Over Time in Multi-Armed Bandits

Rising Rested Bandits: Lower Bounds and Efficient Algorithms

Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Multi-Armed Bandits with Network Interference

A Simple Approach For Non-Stationary Linear Bandits

UCB algorithms for multi-armed bandits: Precise regret and adaptive inference

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards

Leveraging Offline Data in Linear Latent Bandits

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback