Abstract: We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent $\beta$ and coefficient $\lambda$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,\beta,\lambda$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $\beta,\lambda$. In contrast, even with parameter knowledge, upper bounds were previously known only for the limited regimes $\beta\leq 1$ and $\beta=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al., 2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates have long been known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model.
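
For readers unfamiliar with the terminology, the following is a sketch of one standard way to formalize the two central notions in the abstract; it is not taken from the paper, whose exact normalization may differ. The symbols $\mu_a$, $m$, $a_t$, and $R_T$, and the rescaling of time to $[0,1]$, are assumptions made here for illustration. Suppose the mean reward of arm $a$ at round $t$ is $\mu_a(t/T)$ for a function $\mu_a$ on $[0,1]$. A common Hölder condition with exponent $\beta$ and coefficient $\lambda$ asks that $\mu_a$ be $m$ times differentiable, with $m := \lceil\beta\rceil - 1$, and that

$$
\bigl|\mu_a^{(m)}(x) - \mu_a^{(m)}(y)\bigr| \;\le\; \lambda\,|x-y|^{\beta-m} \qquad \text{for all } x,y\in[0,1].
$$

For $\beta\leq 1$ this gives $m=0$ and reduces to the familiar bound $|\mu_a(x)-\mu_a(y)|\le\lambda|x-y|^{\beta}$, while $\beta=2$ corresponds to a $\lambda$-Lipschitz first derivative. Dynamic regret compares the learner's chosen arm $a_t$ against the best arm at each round, rather than against a single fixed arm:

$$
R_T \;:=\; \sum_{t=1}^{T}\Bigl(\max_{a\in[K]}\mu_a(t/T) \;-\; \mu_{a_t}(t/T)\Bigr).
$$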

Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

Bridging Adversarial and Nonstationary Multi-armed Bandit

Non-Stationary Latent Auto-Regressive Bandits

Learning Contextual Bandits in a Non-stationary Environment

On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

Adaptive Smooth Non-Stationary Bandits

Swimming in curved space or The Baron and the cat

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

Diminishing Exploration: A Minimalist Approach to Piecewise Stationary Multi-Armed Bandits

Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters

Non-Stationary Bandits with Auto-Regressive Temporal Dependency

Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits

A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback

Bandit Learning with Delayed Impact of Actions

A Change-Detection Based Thompson Sampling Framework for Non-Stationary Bandits