Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.

A Priori Estimates for Deep Residual Network in Continuous-time Reinforcement Learning

Reinforcement Learning-Based Control for Nonlinear Discrete-Time Systems with Unknown Control Directions and Control Constraints

Investigating Generalisation in Continuous Deep Reinforcement Learning

LiFE: Deep Exploration Via Linear-Feature Bonus in Continuous Control

Understanding Deep Neural Function Approximation in Reinforcement Learning via $ε$-Greedy Exploration

Controlling Estimation Error in Reinforcement Learning via Reinforced Operation

Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System

Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Maximum Principle Based Algorithms for Deep Learning

Approximate Policy Iteration With Deep Minimax Average Bellman Error Minimization

A Deep Reinforcement Learning Approach to Rare Event Estimation

An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation

Broad Critic Deep Actor Reinforcement Learning for Continuous Control

Efficient Continuous Control with Double Actors and Regularized Critics

Over-parameterized Deep Nonparametric Regression for Dependent Data with Its Applications to Reinforcement Learning

Infinite-Horizon Reach-Avoid Zero-Sum Games via Deep Reinforcement Learning

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Continuous Control with Contexts, Provably

Deep Exploration with PAC-Bayes

Improving Performance in Reinforcement Learning by Breaking Generalization in Neural Networks