Abstract: In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. Assuming access to a generative model, we use the certainty-equivalence method to construct our estimator $\hat\eta^\pi$. In this setting, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in $\ell^\infty(\mathcal{F}_{\text{W}})$, the space of bounded functionals on the Lipschitz function class, and, when some mild conditions hold, also in $\ell^\infty(\mathcal{F}_{\text{KS}})$ and $\ell^\infty(\mathcal{F}_{\text{TV}})$, the spaces of bounded functionals on the indicator function class and the bounded measurable function class, respectively. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
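
To make the certainty-equivalence construction described in the abstract concrete, the following is a minimal sketch for a small tabular MDP, not the authors' implementation. It makes several assumptions that do not come from the abstract: rewards are deterministic and known, the policy is deterministic, and the return distribution of the empirical MDP is approximated by truncated Monte Carlo rollouts rather than computed exactly. All function names and the `sample_next_state` interface for the generative model are hypothetical.

```python
import random


def estimate_kernel(sample_next_state, n_states, n_actions, n, rng):
    """Query the (hypothetical) generative model n times per (s, a) pair
    and return the empirical transition kernel p_hat[s][a][s_next]."""
    p_hat = []
    for s in range(n_states):
        rows = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n):
                counts[sample_next_state(s, a, rng)] += 1
            rows.append([c / n for c in counts])
        p_hat.append(rows)
    return p_hat


def sample_return(p_hat, reward, policy, gamma, s0, horizon, rng):
    """Draw one discounted return, truncated at `horizon` steps, from the
    empirical MDP (assumption: deterministic rewards reward[s][a] and a
    deterministic policy given as a list policy[s])."""
    g, discount, s = 0.0, 1.0, s0
    states = range(len(p_hat))
    for _ in range(horizon):
        a = policy[s]
        g += discount * reward[s][a]
        discount *= gamma
        s = rng.choices(states, weights=p_hat[s][a])[0]
    return g


def wasserstein_1(xs, ys):
    """1-Wasserstein distance between two empirical samples of equal size,
    computed by matching sorted values."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2-state, 1-action chain: stay put with probability 0.9.
    true_model = lambda s, a, r: s if r.random() < 0.9 else 1 - s
    reward = [[0.0], [1.0]]
    policy = [0, 0]
    p_hat = estimate_kernel(true_model, 2, 1, n=200, rng=rng)
    draws = [sample_return(p_hat, reward, policy, 0.9, 0, 100, rng)
             for _ in range(500)]
    print("mean return under the empirical model:", sum(draws) / len(draws))
```

In this sketch, `n` plays the role of the number of generative-model queries per state-action pair, so the total dataset size is `n * n_states * n_actions`. The helper `wasserstein_1` can be used to compare samples drawn from the empirical model with samples drawn from the true model when the latter is available in simulation.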