Abstract:Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal at al. [COLT'22]). On the other hand, however, the previous best regret upper bound is still $O(K^{1/3} T^{2/3}\log^{1/3}(T))$, which is achieved by the streaming implementation of the simple uniform exploration. The $O(K^{1/3}\log^{1/3}(T))$ gap leaves the open question of the tight regret bound in the single-pass MABs with sublinear arm memory. In this paper, we answer this open problem and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform exploration regret up to a logarithm factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary, and we can achieve $O(K^{1/3}T^{2/3})$ regret by finding an $\varepsilon$-best arm and committing to it in the rest of the trials. For regret minimization with high constant probability, we can apply the single-memory $\varepsilon$-best arm algorithms in Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for the expected regret minimization, we design an algorithm with a single-arm memory that achieves $O(K^{1/3} T^{2/3}\log(K))$ regret, and an algorithm with $O(\log^{*}(n))$-memory with the optimal $O(K^{1/3} T^{2/3})$ regret following the $\varepsilon$-best arm algorithm in Assadi and Wang [STOC'20]. We further tested the empirical performances of our algorithms. The simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin, and on occasion, reduce the regret by up to 70%.

Regret Bounds for Batched Bandits

Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem

Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

Nash Regret Guarantees for Linear Bandits

Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures

A General Recipe for the Analysis of Randomized Multi-Armed Bandit Algorithms

Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

Batched Lipschitz Bandits.

Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework

Budget-Constrained Bandits over General Cost and Reward Distributions

Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms

Refined Lower Bounds for Adversarial Bandits

Finite-Time Logarithmic Bayes Regret Upper Bounds

Lipschitz Bandits with Batched Feedback

Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits

Near-optimal Per-Action Regret Bounds for Sleeping Bandits

Tight Memory-Regret Lower Bounds for Streaming Bandits.

Optimal Regret Bounds for Collaborative Learning in Bandits

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

Understanding Bandits with Graph Feedback.