Abstract: Recently, reinforcement learning has gained prominence in modern statistics, with policy evaluation being a key component. Unlike traditional machine learning literature on this topic, our work places emphasis on statistical inference for the parameter estimates computed using reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, limiting their applicability, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation. Furthermore, we develop a fully-online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments on real-world reinforcement learning tasks.
What problem does this paper attempt to address?
This paper addresses robust policy evaluation in reinforcement learning. Its focus is statistical inference: how to obtain reliable parameter estimates, together with valid uncertainty quantification, when rewards are contaminated by outliers or are heavy-tailed. Most existing analyses assume that random rewards follow standard distributions, which limits their applicability. Drawing on robust statistics, the paper proposes an online robust policy evaluation procedure and establishes the limiting distribution of the resulting estimator. The authors also develop a fully online procedure for conducting statistical inference efficiently from that asymptotic distribution. Together, these results bridge the gap between robust statistics and statistical inference in reinforcement learning and offer a more versatile and reliable approach to policy evaluation.
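To make the robustness issue concrete, here is a minimal sketch of why heavy-tailed or contaminated rewards hurt ordinary policy evaluation. It is not the paper's algorithm: it is plain tabular TD(0) with a fixed step size, optionally clipping the TD error with the derivative of the Huber loss. The names `huber_psi`, `td0`, and `demo`, the two-state chain, the threshold `tau`, and the contamination level are all assumptions chosen for illustration.

```python
import numpy as np

def huber_psi(x, tau):
    """Derivative of the Huber loss: identity on [-tau, tau], clipped outside."""
    return np.clip(x, -tau, tau)

def td0(rewards, states, next_states, n_states, gamma=0.9, step=0.05, tau=None):
    """Tabular TD(0) with a constant step size.

    If tau is given, each TD error is clipped by huber_psi before the update,
    so a single extreme reward can move the estimate by at most step * tau.
    """
    v = np.zeros(n_states)
    for r, s, s2 in zip(rewards, states, next_states):
        delta = r + gamma * v[s2] - v[s]      # temporal-difference error
        if tau is not None:
            delta = huber_psi(delta, tau)     # bounded influence of outliers
        v[s] += step * delta
    return v

def demo(n=20000, seed=0, gamma=0.9):
    """Hypothetical two-state chain with uniform transitions and noisy rewards."""
    rng = np.random.default_rng(seed)
    mean_reward = np.array([1.0, -1.0])
    path = rng.integers(0, 2, size=n + 1)     # i.i.d. uniform state sequence
    s, s2 = path[:-1], path[1:]
    # Heavy-tailed noise (Student t with 1.5 degrees of freedom, infinite variance).
    rewards = mean_reward[s] + rng.standard_t(1.5, size=n)
    # Assumed 5% contamination by gross outliers.
    rewards[rng.random(n) < 0.05] += 100.0
    # True values solve v = r + gamma * P v with P uniform over the two states.
    P = np.full((2, 2), 0.5)
    v_true = np.linalg.solve(np.eye(2) - gamma * P, mean_reward)
    return {
        "true": v_true,
        "plain": td0(rewards, s, s2, 2, gamma=gamma),
        "clipped": td0(rewards, s, s2, 2, gamma=gamma, tau=3.0),
    }
```

Calling `demo()` returns the true values alongside both estimates. One would typically expect the clipped variant to stay closer to the true values, although this is not guaranteed for every seed, and one-sided contamination still leaves some bias in the clipped estimate.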
The main contributions of the paper include:
- Proposing an online policy evaluation method that handles dependent samples and supports fully online statistical inference on the model parameters at the same time.
- Establishing the Bahadur representation of the proposed estimator, consisting of a leading term that yields the asymptotic normal distribution and a higher-order remainder term.
- Showing that the proposed algorithm converges faster than typical first-order stochastic methods such as TD learning, a difference that is also visible in the numerical experiments.
- Providing an algorithm that requires no step-size tuning, handles outliers and heavy-tailed rewards effectively, and suits reinforcement learning environments with long time horizons.
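For readers unfamiliar with the term, a Bahadur representation has the following generic shape. This is a textbook-style sketch rather than the paper's exact statement: the symbols $\hat{\theta}_n$, $\theta^*$, $\xi_i$, $R_n$, and $\Sigma$ are assumptions introduced here, and the precise conditions and remainder rate are those given in the paper.

```latex
% Generic form: a linear leading term plus a higher-order remainder.
\hat{\theta}_n - \theta^* = \frac{1}{n} \sum_{i=1}^{n} \xi_i + R_n,
\qquad \sqrt{n}\, R_n \xrightarrow{\;p\;} 0 .

% If a central limit theorem holds for the mean-zero terms \xi_i
% (with \Sigma their long-run covariance when samples are dependent):
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta^*\bigr) \xrightarrow{\;d\;} \mathcal{N}(0, \Sigma).

% Approximate (1 - \alpha) confidence interval for the j-th coordinate,
% given a consistent estimate \hat{\Sigma}_n of \Sigma:
\hat{\theta}_{n,j} \pm z_{1-\alpha/2} \sqrt{\hat{\Sigma}_{n,jj} / n}.
```

The leading term determines the limiting normal distribution, while the remainder term controls how quickly the normal approximation becomes accurate. This is why such a representation serves as the basis for the confidence intervals used in online inference.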
Through these contributions, the paper aims to make reinforcement learning more reliable and trustworthy in practice, especially in fields such as autonomous driving, precision medicine, and autonomous robotics, where uncertainty quantification is crucial to decision-making.