Abstract: We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). The proposed distributional dynamic programming algorithms are suitable for underlying Markov decision processes (MDPs) having an arbitrary probabilistic reward mechanism, including continuous reward distributions with unbounded support being potentially heavy-tailed. For a plain instance of our proposed class of algorithms we prove error bounds, both within Wasserstein and Kolmogorov--Smirnov distances. Furthermore, for return distributions having probability density functions the algorithms yield approximations for these densities; error bounds are given within supremum norm. We introduce the concept of quantile-spline discretizations to come up with algorithms showing promising results in simulation experiments. While the performance of our algorithms can rigorously be analysed they can be seen as universal black box algorithms applicable to a large class of MDPs. We also derive new properties of probability metrics commonly used in DRL on which our quantitative analysis is based.

What problem does this paper attempt to address?

This paper addresses how to efficiently approximate the unknown return distribution in distributional reinforcement learning (DRL). Specifically, it introduces a new class of algorithms that approximate the return distribution in Markov decision processes (MDPs). These algorithms apply to MDPs with arbitrary probabilistic reward mechanisms, including continuous, unbounded-support, and possibly heavy-tailed reward distributions.

### Main Problems

1. **Approximating Return Distribution**:
   - How can the unknown return distribution in distributional reinforcement learning be approximated efficiently?
   - How can algorithms be designed and analyzed when the reward distribution does not have finite support?

2. **Algorithm Performance and Complexity**:
   - How do these algorithms perform, and can they be analyzed rigorously?
   - What are the time and space complexities of the algorithms, and how can they be reduced while preserving approximation quality?

3. **Scope of Application**:
   - Do these algorithms apply to a wide range of MDPs, including those with continuous, unbounded-support, or heavy-tailed reward distributions?
   - Is it possible to provide a general "black-box" algorithm applicable to a large class of MDPs?

### Solutions

The paper proposes the following methods to address these problems:

1. **Distributional Dynamic Programming (DDP)**:
   - A new class of DDP algorithms is introduced. These algorithms approximate the return distribution by iterating the distributional Bellman operator (DBO) and combining each application with a projection step.
   - The projection step is allowed to depend on the update step, and the size of the finite-support representation is allowed to increase with each iteration.

2. **Quantile-Spline Discretizations**:
   - A quantile-spline discretization method is proposed, which performs well in simulation experiments.
   - The quantile values are approximated by linear spline interpolation, so the method remains effective for a wider range of reward distribution types.

3. **Error Analysis**:
   - A rigorous theoretical analysis of the algorithm's error is carried out, including error bounds in Wasserstein distance and Kolmogorov-Smirnov distance.
   - Error bounds for approximating the density function are provided for some cases, and sufficient conditions for the return distribution to have a density are discussed.

### Application Background

These algorithms are relevant to practical applications, especially pricing and trading problems in finance and insurance, which usually involve MDPs with complex reward distributions. For example, the studies by Krasheninnikova et al. (2019) and Kolm and Ritter (2020) point to applications of this type of algorithm in these fields.

### Summary

The main contribution of the paper is a new class of DDP algorithms. These algorithms apply not only to MDPs with finite-support reward distributions but also to MDPs with continuous, unbounded-support, or heavy-tailed reward distributions. Rigorous theoretical analysis and simulation experiments support the effectiveness and robustness of these algorithms in approximating the return distribution.
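The distributional dynamic programming iteration described under Solutions (apply the distributional Bellman operator, then project back onto a finite representation) can be illustrated with a minimal sketch. This is not the paper's quantile-spline algorithm: it uses a plain quantile projection onto a fixed number of equally weighted particles, and it assumes Gaussian rewards that depend only on the current state, under a fixed policy. All function names (`midpoints`, `weighted_quantiles`, `ddp_step`, `evaluate`) and parameters are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def midpoints(n):
    """Quantile levels (2i + 1) / (2n) for i = 0, ..., n - 1."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def weighted_quantiles(atoms, taus):
    """Lower quantiles of a weighted particle set at increasing levels taus.

    atoms is a list of (value, weight) pairs whose weights sum to 1.
    """
    atoms = sorted(atoms)
    out, cum, k = [], 0.0, 0
    for tau in taus:
        # cum holds the total weight of all atoms strictly before index k.
        while k < len(atoms) - 1 and cum + atoms[k][1] < tau:
            cum += atoms[k][1]
            k += 1
        out.append(atoms[k][0])
    return out


def ddp_step(eta, P, reward_atoms, gamma, taus):
    """One update: apply the Bellman operator, then project onto quantiles."""
    new_eta = []
    for s in range(len(P)):
        atoms = []
        for s_next, p in enumerate(P[s]):
            if p == 0.0:
                continue
            w = p / (len(reward_atoms[s]) * len(eta[s_next]))
            for r in reward_atoms[s]:
                for z in eta[s_next]:
                    # Particle of the law of R + gamma * G(s_next).
                    atoms.append((r + gamma * z, w))
        # Projection step: keep only the quantiles at the levels taus.
        new_eta.append(weighted_quantiles(atoms, taus))
    return new_eta


def evaluate(P, mu, sigma, gamma, n_atoms=40, n_iter=60):
    """Approximate the return distribution of every state by n_atoms particles.

    Assumption: the reward in state s is N(mu[s], sigma[s]^2), discretized
    by its quantiles at the midpoint levels.
    """
    taus = midpoints(n_atoms)
    reward_atoms = [
        [NormalDist(m, sd).inv_cdf(t) for t in taus] for m, sd in zip(mu, sigma)
    ]
    eta = [[0.0] * n_atoms for _ in P]  # start from a point mass at 0
    for _ in range(n_iter):
        eta = ddp_step(eta, P, reward_atoms, gamma, taus)
    return eta


if __name__ == "__main__":
    # Hypothetical two-state chain induced by a fixed policy.
    P = [[0.9, 0.1], [0.2, 0.8]]
    eta = evaluate(P, mu=[1.0, -1.0], sigma=[1.0, 2.0], gamma=0.9, n_iter=100)
    for s, particles in enumerate(eta):
        print(f"state {s}: approximate mean return {sum(particles) / len(particles):.3f}")
```

In the single-state case with N(mu, sigma^2) rewards, the exact return distribution is N(mu / (1 - gamma), sigma^2 / (1 - gamma^2)), which gives a simple sanity check for the particle approximation. The midpoint quantile projection trims the tails, so the approximate variance is expected to fall somewhat below the exact value.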

On Policy Evaluation Algorithms in Distributional Reinforcement Learning

Policy Evaluation in Distributional LQR (Extended Version)

Estimation and Inference in Distributional Reinforcement Learning

Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Distributional Reinforcement Learning With Quantile Regression

Normality-Guided Distributional Reinforcement Learning for Continuous Control

Distributional Policy Gradient with Distributional Value Function

Distributional reinforcement learning with epistemic and aleatoric uncertainty estimation

Value-Distributional Model-Based Reinforcement Learning

Bayesian Distributional Policy Gradients

Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

A Distributional Perspective on Reinforcement Learning

Off-Policy Reinforcement Learning with High Dimensional Reward

Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation

Distributional Method for Risk Averse Reinforcement Learning

Distributional Reinforcement Learning for Multi-Dimensional Reward Functions

On solutions of the distributional Bellman equation

Foundations of Multivariate Distributional Reinforcement Learning

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Assessing the Impact of Distribution Shift on Reinforcement Learning Performance

Sample-based Distributional Policy Gradient