Abstract: In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. Assuming a generative model is available, we construct our estimator $\hat\eta^\pi$ by the certainty-equivalence method. In this setting, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on the Lipschitz function class $\ell^\infty(\mathcal{F}_{\text{W}})$ and, when some mild conditions hold, also in the spaces of bounded functionals on the indicator function class $\ell^\infty(\mathcal{F}_{\text{KS}})$ and on the bounded measurable function class $\ell^\infty(\mathcal{F}_{\text{TV}})$. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
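
To make the certainty-equivalence idea concrete, the following is a minimal sketch and not the paper's implementation. It draws $n$ next-state samples per state-action pair from a generative model, forms the empirical transition kernel, and then treats that empirical model as if it were the truth. Several pieces are assumptions made only for illustration: the toy MDP, the deterministic state-dependent reward, the helper names (`empirical_model`, `sample_returns`, `wasserstein1`), and the use of truncated Monte Carlo rollouts to approximate the return distribution in each model.

```python
import numpy as np

def empirical_model(P, n, rng):
    """Query the generative model n times per (s, a); return the empirical kernel.

    P has shape (S, A, S); P[s, a] is the true next-state distribution.
    """
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = rng.multinomial(n, P[s, a]) / n
    return P_hat

def sample_returns(P, r, pi, gamma, s0, horizon, num_rollouts, rng):
    """Monte Carlo samples of the truncated discounted return from s0 under pi.

    Illustrative approximation only: rewards are r[s] (deterministic),
    pi[s] is a deterministic action, and the sum is cut off at `horizon`.
    """
    states = np.full(num_rollouts, s0)
    returns = np.zeros(num_rollouts)
    discount = 1.0
    for _ in range(horizon):
        returns += discount * r[states]
        cum = P[states, pi[states]].cumsum(axis=1)
        cum[:, -1] = 1.0  # guard against floating-point round-off
        u = rng.random(num_rollouts)
        states = (u[:, None] < cum).argmax(axis=1)  # inverse-CDF sampling
        discount *= gamma
    return returns

def wasserstein1(x, y):
    """1-Wasserstein distance between two equal-size empirical samples."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

# Hypothetical toy problem: 3 states, 2 actions, rewards in [0, 1].
rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(3), size=(3, 2))  # shape (3, 2, 3)
r = np.array([0.0, 0.5, 1.0])
pi = np.array([0, 1, 0])
gamma = 0.9

P_hat = empirical_model(P_true, n=200, rng=rng)
g_true = sample_returns(P_true, r, pi, gamma, 0, 50, 2000, rng)
g_hat = sample_returns(P_hat, r, pi, gamma, 0, 50, 2000, rng)
print("approximate W1 between return distributions:", wasserstein1(g_true, g_hat))
```

The printed distance mixes the model error that the abstract's bounds concern with Monte Carlo and truncation error from the rollouts, so it should be read only as a rough illustration of how the estimate tightens as `n` grows.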

On solutions of the distributional Bellman equation

Off-Policy Reinforcement Learning with High Dimensional Reward

A Distributional Perspective on Reinforcement Learning

Distributional Reinforcement Learning for Multi-Dimensional Reward Functions

Policy Evaluation in Distributional LQR (Extended Version)

Safe Distributional Reinforcement Learning

Estimation and Inference in Distributional Reinforcement Learning

On Policy Evaluation Algorithms in Distributional Reinforcement Learning

How Does Value Distribution in Distributional Reinforcement Learning Help Optimization?

Distributional Bellman Operators over Mean Embeddings

More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Foundations of Multivariate Distributional Reinforcement Learning

Distributional Reinforcement Learning With Quantile Regression

One-Step Distributional Reinforcement Learning

Value-Distributional Model-Based Reinforcement Learning

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

Bayesian Distributional Policy Gradients

Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

A Distributional Analogue to the Successor Representation

Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

How Does Return Distribution in Distributional Reinforcement Learning Help Optimization?