Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
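
As a quick sanity check on the bandit bound, the special cases below follow from plain arithmetic on the stated exponents. The uniform bound $\nu_t \le \nu$ and the symbol $\nu$ are assumptions introduced here for illustration only; they are not notation from the abstract.

```latex
% Assumption (illustrative): \nu_t \le \nu for all t, so \sqrt{\sum_{t=1}^T \nu_t^2} \le \nu\sqrt{T}.
\begin{align*}
d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2}
  &\le d\,\nu\,T^{\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1}{2}}
   = d\,\nu\,T^{\frac{(1-\epsilon) + (1+\epsilon)}{2(1+\epsilon)}}
   = d\,\nu\,T^{\frac{1}{1+\epsilon}}, \\
\epsilon = 1:\quad
  & T^{\frac{1-\epsilon}{2(1+\epsilon)}} = T^{0} = 1
  \;\Longrightarrow\;
  \tilde{O}\Big(d \sqrt{\textstyle\sum_{t=1}^T \nu_t^2} + d\Big), \\
\nu_t = 0 \ \text{for all } t:\quad
  & \tilde{O}\Big(d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}}\Big).
\end{align*}
```

Under these assumptions, the first line gives a $T^{\frac{1}{1+\epsilon}}$ rate for uniformly bounded moments, the second gives $\sqrt{T}$-type scaling in the finite-variance case $\epsilon = 1$, and the third is the noiseless (deterministic) case. Likewise, at $\epsilon = 1$ the factor $K^{\frac{1}{1+\epsilon}}$ in the RL bounds becomes $\sqrt{K}$.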

Generalized linear function approximation

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

Provably Efficient Reinforcement Learning with Linear Function Approximation

Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Gradient Q(σ, λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Provably Efficient Reinforcement Learning with General Value Function Approximation

Non-stationary Reinforcement Learning under General Function Approximation

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Nonstationary Reinforcement Learning with Linear Function Approximation

Online Sub-Sampling for Reinforcement Learning with General Function Approximation

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

On Reward-Free Reinforcement Learning with Linear Function Approximation

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation

Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation

Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation