Abstract: In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. Assuming access to a generative model, we use the certainty-equivalence method to construct our estimator $\hat\eta^\pi$. In this setting, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on the Lipschitz function class, $\ell^\infty(\mathcal{F}_{\text{W}})$, and, when some mild conditions hold, also in the spaces of bounded functionals on the indicator function class, $\ell^\infty(\mathcal{F}_{\text{KS}})$, and on the bounded measurable function class, $\ell^\infty(\mathcal{F}_{\text{TV}})$. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
