Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of each episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
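
As a rough sanity check on the linear-bandit bound, it can be specialized to the two regimes the abstract mentions. The uniform bound $\nu_t \le \nu$ below is an assumption introduced here for illustration (it is not stated in the abstract), and the deterministic case is read as $\nu_t = 0$ for all $t$. Under the uniform bound,

$$\sqrt{\sum_{t=1}^T \nu_t^2} \le \nu \sqrt{T} \quad\Longrightarrow\quad d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} \le d\, \nu\, T^{\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1}{2}} = d\, \nu\, T^{\frac{1}{1+\epsilon}},$$

since $\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1}{2} = \frac{(1-\epsilon) + (1+\epsilon)}{2(1+\epsilon)} = \frac{1}{1+\epsilon}$. The full bound then reads $\tilde{O}\big(d\, \nu\, T^{\frac{1}{1+\epsilon}} + d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, which for $\epsilon = 1$ (finite variance) reduces to $\tilde{O}(d\, \nu \sqrt{T} + d)$. When $\nu_t = 0$ for all $t$, the first term vanishes and only $\tilde{O}\big(d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$ remains.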

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds