Tackling heavy-tailed rewards in reinforcement learning with function approximation: Minimax optimal and instance-dependent regret bounds

Jiayi Huang, Han Zhong, Liwei Wang, Lin Yang
2024-02-13
Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are heavy-tailed, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^{T} \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the first of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first computationally efficient instance-dependent $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*$, $\mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
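One way to read the bandit bound is to specialize it to the case of uniformly bounded moments. The short calculation below is an illustrative sketch, not taken from the paper: it assumes the leading regret term has the form $d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}}\sqrt{\sum_{t=1}^{T}\nu_t^2}$ and that $\nu_t \le \nu$ for every round, where the constant $\nu$ is introduced here only for illustration.

$$
d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}}\sqrt{\sum_{t=1}^{T}\nu_t^2}
\;\le\; d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}}\cdot \nu\sqrt{T}
\;=\; d\,\nu\,T^{\frac{1-\epsilon}{2(1+\epsilon)}+\frac{1}{2}}
\;=\; d\,\nu\,T^{\frac{1}{1+\epsilon}}
$$

The last step uses $\frac{1-\epsilon}{2(1+\epsilon)}+\frac{1}{2}=\frac{(1-\epsilon)+(1+\epsilon)}{2(1+\epsilon)}=\frac{1}{1+\epsilon}$. Under these assumptions, setting $\epsilon = 1$ (finite variance) gives the familiar $d\,\nu\sqrt{T}$ rate up to logarithmic factors, while smaller $\epsilon$ pushes the exponent toward $1$.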