Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning (distributional TD), an extension of classic temporal difference learning (TD), has been proposed for this task. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD learning (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD under the $p$-Wasserstein distance for $p\geq 1$.
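
As an illustration of how the stated bound scales, substituting $p=1$ into the iteration complexity gives the rate below. The numerical values $\gamma = 0.9$ and $\varepsilon = 0.1$ are chosen here only as an example and do not come from the paper; constants and logarithmic factors hidden by $\tilde{O}$ are ignored.

\[
\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)\Bigg|_{p=1}
= \tilde{O}\left(\frac{1}{\varepsilon^{2}(1-\gamma)^{3}}\right),
\qquad
\frac{1}{(0.1)^{2}\,(1-0.9)^{3}} = \frac{1}{10^{-2}\cdot 10^{-3}} = 10^{5}.
\]

For $p=2$ the same substitution gives $\tilde{O}\left(\frac{1}{\varepsilon^{4}(1-\gamma)^{5}}\right)$, so the dependence on both $\varepsilon$ and the effective horizon $\frac{1}{1-\gamma}$ worsens as $p$ grows.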

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability

Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

On the Statistical Benefits of Temporal Difference Learning

High-probability sample complexities for policy evaluation with linear function approximation

Reanalysis of Variance Reduced Temporal Difference Learning

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Exact Formulas for Finite-Time Estimation Errors of Decentralized Temporal Difference Learning with Linear Function Approximation

Federated Temporal Difference Learning with Linear Function Approximation under Environmental Heterogeneity

A Variance Minimization Approach to Temporal-Difference Learning

Temporal Difference Learning as Gradient Splitting

A Convergent Off-Policy Temporal Difference Algorithm

Investigating practical linear temporal difference learning

Statistical Efficiency of Distributional Temporal Difference Learning

Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

Gradient Descent Temporal Difference-Difference Learning

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning