Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning has accordingly been proposed as an extension of classic temporal difference (TD) learning. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD learning (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance for $p\geq 1$.
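
As a worked instance of the bound stated above, substituting $p=1$ (the case the abstract identifies as minimax optimal up to logarithmic factors) gives the following iteration count. This is only a direct substitution into the abstract's formula, with constants and logarithmic factors left hidden in $\tilde{O}$:

$$
\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)\Bigg|_{p=1}
= \tilde{O}\left(\frac{1}{\varepsilon^{2}(1-\gamma)^{3}}\right).
$$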

Reanalysis of Variance Reduced Temporal Difference Learning

Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning

Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability

A Variance Minimization Approach to Temporal-Difference Learning

Almost Sure Convergence of Average Reward Temporal Difference Learning

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Per-decision Multi-step Temporal Difference Learning with Control Variates

An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Temporal Difference Learning with Experience Replay

Revisiting a Design Choice in Gradient Temporal Difference Learning

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

An Analysis of Quantile Temporal-Difference Learning

Target-Based Temporal Difference Learning

The surprising efficiency of temporal difference learning for rare event prediction

Statistical Efficiency of Distributional Temporal Difference Learning

A Convergent Off-Policy Temporal Difference Algorithm

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

On the Statistical Benefits of Temporal Difference Learning