Abstract: Upper Confidence Bound (UCB) algorithms are a widely used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference in sequential data collection. This paper bridges these two fundamental aspects--precise regret analysis and adaptive statistical inference--through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. The resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that the maximal regret of the UCB algorithm deviates from the minimax regret by a logarithmic factor, thereby settling its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.
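To make the objects in the abstract concrete, the following is a minimal simulation sketch of a UCB index policy on Gaussian arms. The exact index analyzed in the paper is not stated in the abstract, so the textbook form used here (empirical mean plus `sigma * sqrt(2 log T / n)`) is an assumption, and the names `run_ucb`, `gap_threshold`, and `pseudo_regret` are hypothetical helpers introduced only for illustration.

```python
import math
import random


def run_ucb(means, horizon, sigma=1.0, seed=0):
    """Run a textbook UCB index policy on Gaussian arms; return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls of each arm
    sums = [0.0] * k   # running sum of rewards of each arm
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each index is defined
        else:
            # Assumed index: empirical mean + sigma * sqrt(2 log T / n_a).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + sigma * math.sqrt(2.0 * math.log(horizon) / counts[a]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def gap_threshold(k, horizon, sigma=1.0):
    """Gap scale sigma * sqrt(K log T / T) mentioned in the abstract."""
    return sigma * math.sqrt(k * math.log(horizon) / horizon)


def pseudo_regret(means, counts):
    """Sum over arms of (sub-optimality gap) * (number of pulls)."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, counts))


if __name__ == "__main__":
    means, horizon = [1.0, 0.0], 2000
    counts = run_ucb(means, horizon)
    print("pull counts:", counts)
    print("pseudo-regret:", pseudo_regret(means, counts))
    print("gap scale:", gap_threshold(len(means), horizon))
```

Comparing the sub-optimality gaps in `means` against `gap_threshold` gives a rough way to see which side of the $\sigma\sqrt{K\log T/T}$ scale an instance sits on. The constants in the paper's statement are not given in the abstract, so this is only an order-of-magnitude check.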

UCB algorithms for multi-armed bandits: Precise regret and adaptive inference
