Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One of the core tasks in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference (TD) learning, an extension of the classic TD algorithm from the RL literature, has been proposed for this task. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely the categorical temporal difference algorithm (CTD) and the quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate the theoretical analysis, we propose a non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest. In addition, we revisit CTD and show that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.
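To make the CTD update mentioned above concrete, the following is a minimal tabular sketch in the style of the categorical algorithm analyzed by \citet{rowland2018analysis}. It is an illustration under stated assumptions, not the paper's NTD algorithm or its exact setup: rewards are assumed to lie in $[0,1]$, the atoms are evenly spaced on $[0, 1/(1-\gamma)]$, and the names `categorical_projection`, `ctd_step`, `atoms`, and `alpha` are hypothetical.

```python
import numpy as np


def categorical_projection(locations, probs, atoms):
    """Project a discrete distribution onto evenly spaced atoms.

    Each point mass is split between its two neighbouring atoms by linear
    interpolation; mass outside [atoms[0], atoms[-1]] is clipped to the ends.
    """
    k = len(atoms) - 1
    delta = atoms[1] - atoms[0]
    pos = (np.clip(locations, atoms[0], atoms[-1]) - atoms[0]) / delta
    lower = np.minimum(np.floor(pos).astype(int), k)
    upper = np.minimum(lower + 1, k)
    w_upper = pos - lower
    out = np.zeros(len(atoms))
    # np.add.at accumulates correctly when several points share an atom.
    np.add.at(out, lower, probs * (1.0 - w_upper))
    np.add.at(out, upper, probs * w_upper)
    return out


def ctd_step(p, s, r, s_next, atoms, gamma, alpha):
    """One categorical TD update for state s from a transition (s, r, s_next).

    p[s] holds the probabilities that state s assigns to the atoms.
    """
    # Push the next-state estimate through x -> r + gamma * x, then project.
    target = categorical_projection(r + gamma * atoms, p[s_next], atoms)
    p[s] = (1.0 - alpha) * p[s] + alpha * target
    return p


# Toy example (assumed): one state, deterministic reward 1, gamma = 0.5,
# so the true return is the constant 1 / (1 - gamma) = 2.
gamma, alpha = 0.5, 0.5
atoms = np.linspace(0.0, 1.0 / (1.0 - gamma), 5)
p = np.full((1, len(atoms)), 1.0 / len(atoms))
for _ in range(200):
    p = ctd_step(p, s=0, r=1.0, s_next=0, atoms=atoms, gamma=gamma, alpha=alpha)
mean_estimate = float(p[0] @ atoms)
```

In this toy problem the estimate stays a valid probability vector throughout, and its mean approaches the true return because the interpolating projection preserves the mean of any distribution supported within the atom range.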

The surprising efficiency of temporal difference learning for rare event prediction

Statistical Efficiency of Distributional Temporal Difference Learning

On the Statistical Benefits of Temporal Difference Learning

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Reanalysis of Variance Reduced Temporal Difference Learning

Demystifying the Recency Heuristic in Temporal-Difference Learning

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Almost Sure Convergence of Average Reward Temporal Difference Learning

Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning

Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

A Deep Reinforcement Learning Approach to Rare Event Estimation

Temporal Difference Learning with Experience Replay

Per-decision Multi-step Temporal Difference Learning with Control Variates

Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning

Multikernel Recursive Least-Squares Temporal Difference Learning

An Analysis of Quantile Temporal-Difference Learning