Abstract: Recently, reinforcement learning has gained prominence in modern statistics, with policy evaluation being a key component. Unlike traditional machine learning literature on this topic, our work places emphasis on statistical inference for the parameter estimates computed using reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, limiting their applicability, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation. Furthermore, we develop a fully-online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in real-world reinforcement learning experiments.

What problem does this paper attempt to address?

This paper addresses robust policy evaluation in reinforcement learning. Its focus is statistical inference: how to obtain reliable parameter estimates when rewards are contaminated by outliers or are heavy-tailed. Most existing analyses assume that random rewards follow standard distributions, which limits their applicability. Drawing on robust statistics, the paper proposes an online robust policy evaluation method and establishes the limiting distribution of its estimator. The authors also develop a fully online procedure for conducting statistical inference efficiently based on that asymptotic distribution, bridging the gap between robust statistics and statistical inference in reinforcement learning and providing a more versatile and reliable approach to policy evaluation.

The main contributions of the paper include:

- Proposing an online policy evaluation method that handles dependent samples and supports fully online statistical inference on the model parameters.
- Establishing the Bahadur representation of the proposed estimator, consisting of a main term that yields the asymptotic normal distribution and a higher-order remainder term.
- Showing that the proposed algorithm converges faster than typical first-order stochastic methods such as TD learning, a difference that is also visible in the numerical experiments.
- Requiring no step-size tuning while effectively handling outliers and heavy-tailed rewards, which makes the algorithm suitable for reinforcement learning environments with long time horizons.

Through these contributions, the paper aims to improve the reliability and credibility of reinforcement learning in practical applications, especially in fields such as autonomous driving, precision medicine, and autonomous robots, where uncertainty quantification is crucial for decision-making.
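
To make the robustness idea concrete, the sketch below compares ordinary tabular TD(0) with a variant whose TD error is clipped by the Huber score function. It is not the paper's estimator: the summary describes a method that needs no step-size tuning and converges faster than first-order methods, whereas this is a plain first-order update with a fixed step size. The two-state chain, the outlier model, and the names `huber_psi`, `robust_td0`, and `simulate` are all assumptions made for illustration.

```python
import random


def huber_psi(delta, tau):
    """Huber score: identity on [-tau, tau], clipped to +/-tau outside."""
    return max(-tau, min(tau, delta))


def robust_td0(transitions, n_states, gamma, tau, step=0.05):
    """Tabular TD(0) with the TD error passed through huber_psi.

    transitions: iterable of (state, reward, next_state) tuples.
    tau = float("inf") disables clipping and recovers ordinary TD(0).
    """
    v = [0.0] * n_states
    for s, r, s_next in transitions:
        delta = r + gamma * v[s_next] - v[s]   # TD error
        v[s] += step * huber_psi(delta, tau)   # bounded-influence update
    return v


def simulate(n_steps, outlier_prob, seed=0):
    """Hypothetical two-state chain: the next state is uniform at random.

    Mean reward is 1 in state 0 and 0 in state 1, with small Gaussian
    noise; with probability outlier_prob a gross outlier of +1000 is added.
    Without outliers and with gamma = 0.5 the values are v(0) = 1.5, v(1) = 0.5.
    """
    rng = random.Random(seed)
    s, out = 0, []
    for _ in range(n_steps):
        s_next = rng.randrange(2)
        r = (1.0 if s == 0 else 0.0) + rng.gauss(0.0, 0.1)
        if rng.random() < outlier_prob:
            r += 1000.0
        out.append((s, r, s_next))
        s = s_next
    return out


if __name__ == "__main__":
    data = simulate(20000, outlier_prob=0.01)
    print("plain TD(0):  ", robust_td0(data, 2, 0.5, float("inf")))
    print("clipped TD(0):", robust_td0(data, 2, 0.5, tau=2.0))
```

Because each clipped update moves a value by at most `step * tau`, a single gross outlier cannot throw the estimate far off, while an unclipped update scales with the outlier itself. The threshold `tau` is a tuning choice here, and under contamination the clipped iteration generally settles near, not exactly at, the uncontaminated values.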

Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning

Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Online Policy Optimization for Robust MDP

A Bayesian Approach to Robust Inverse Reinforcement Learning

Robust Offline Reinforcement Learning from Low-Quality Data

Estimation and Inference in Distributional Reinforcement Learning

Robust Offline Reinforcement learning with Heavy-Tailed Rewards

Robust Reinforcement Learning using Offline Data

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Robust Control of Uncertain Linear Systems Based on Reinforcement Learning Principles

Robust Offline Actor-Critic with On-Policy Regularized Policy Evaluation

Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning

Matrix Estimation for Offline Reinforcement Learning with Low-Rank Structure

Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Reliable Off-policy Evaluation for Reinforcement Learning

Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation

Online Policy Learning and Inference by Matrix Completion

Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation

Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data