Abstract: We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension $d_\text{elu}$, a complexity measure of the function class, plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega(\sqrt{\min\{A,d_\text{elu}\}\Lambda}+d_\text{elu})$ is unavoidable when $d_\text{elu}\leq\sqrt{AT}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A\leq d_\text{elu}$ regime, we derive a nearly matching upper bound $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of $\Omega(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ is unavoidable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. In this setting, we provide an upper bound of order $\tilde{O}(d_\text{elu}\sqrt{\Lambda}+d_\text{elu})$. Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ established in their work is unimprovable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$.

Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency

Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP

Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments

How Does Variance Shape the Regret in Contextual Bandits?

Variance-Dependent Regret Bounds for Non-stationary Linear Bandits

Variance-Aware Sparse Linear Bandits

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

Noise-Adaptive Confidence Sets for Linear Bandits and Application to Bayesian Optimization

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures

Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes

Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Smooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes

Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates

Contextual Continuum Bandits: Static Versus Dynamic Regret

Second Order Bounds for Contextual Bandits with Function Approximation

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits