On Policy Evaluation Algorithms in Distributional Reinforcement Learning

Julian Gerstenberg, Ralph Neininger, Denis Spiegel
2024-07-19
Abstract: We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). The proposed distributional dynamic programming algorithms are suitable for underlying Markov decision processes (MDPs) having an arbitrary probabilistic reward mechanism, including continuous reward distributions with unbounded support being potentially heavy-tailed. For a plain instance of our proposed class of algorithms we prove error bounds, both within Wasserstein and Kolmogorov-Smirnov distances. Furthermore, for return distributions having probability density functions the algorithms yield approximations for these densities; error bounds are given within supremum norm. We introduce the concept of quantile-spline discretizations to come up with algorithms showing promising results in simulation experiments. While the performance of our algorithms can rigorously be analysed, they can be seen as universal black box algorithms applicable to a large class of MDPs. We also derive new properties of probability metrics commonly used in DRL on which our quantitative analysis is based.
Machine Learning, Probability
What problem does this paper attempt to address?
The paper addresses how to efficiently approximate the unknown return distribution in distributional reinforcement learning (DRL). Specifically, it introduces a new class of algorithms that approximate the return distribution in Markov decision processes (MDPs). These algorithms apply to MDPs with arbitrary probabilistic reward mechanisms, including continuous reward distributions with unbounded support that may be heavy-tailed.

### Main Problems

1. **Approximating the Return Distribution**:
   - How can the unknown return distribution in distributional reinforcement learning be approximated efficiently?
   - How should algorithms be designed and analyzed when the reward distribution does not have finite support?
2. **Algorithm Performance and Complexity**:
   - How do these algorithms perform, and can they be analyzed rigorously?
   - What are their time and space complexities, and how can these be reduced while preserving approximation quality?
3. **Scope of Application**:
   - Do these algorithms apply to a wide range of MDPs, including those with continuous, unbounded-support, or heavy-tailed reward distributions?
   - Is it possible to provide a general "black-box" algorithm applicable to a large class of MDPs?

### Solutions

The paper proposes the following methods to address these problems:

1. **Distributional Dynamic Programming (DDP)**:
   - A new class of DDP algorithms is introduced. These algorithms approximate the return distribution by iterating the distributional Bellman operator (DBO) and combining each update with a projection step.
   - The projection step is allowed to depend on the update step, and the size of the finite-support representation is allowed to grow with each iteration.
2. **Quantile-Spline Discretizations**:
   - Quantile-spline discretizations are proposed, which perform well in simulation experiments.
   - The quantile values are approximated by linear spline interpolation, so the method remains effective for a wider range of reward distribution types.
3. **Error Analysis**:
   - A rigorous theoretical analysis of the algorithm's error is carried out, including error bounds in Wasserstein distance and Kolmogorov-Smirnov distance.
   - Error bounds for approximating the density function are provided for some cases, and sufficient conditions for the return distribution to have a density are discussed.

### Application Background

These algorithms are relevant to practical applications, especially pricing and trading problems in finance and insurance, which usually involve MDPs with complex reward distributions. For example, the studies by Krasheninnikova et al. (2019) and Kolm and Ritter (2020) indicate that algorithms of this type have broad application prospects in these fields.

### Summary

The main contribution of the paper is a new class of DDP algorithms. These algorithms apply not only to MDPs with finite-support reward distributions but also to MDPs with continuous, unbounded-support, or heavy-tailed reward distributions. The error bounds and simulation experiments support the effectiveness of these algorithms in approximating the return distribution.
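
The iterate-then-project scheme listed under Solutions can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a finite state space with the policy already folded into a transition matrix `P`, Gaussian rewards that depend only on the current state, and a plain quantile-midpoint projection onto `m` equally weighted atoms in place of the paper's quantile-spline discretization. All function and parameter names (`ddp_policy_evaluation`, `quantile_projection`, `m`, `k`, `n_iter`) are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate
from statistics import NormalDist, fmean


def midpoint_levels(n):
    """Quantile levels (2i - 1) / (2n) for i = 1..n."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def quantile_projection(atoms, weights, m):
    """Project a weighted discrete distribution onto m equally weighted
    atoms placed at its (lower) quantiles of the midpoint levels."""
    pairs = sorted(zip(atoms, weights))
    xs = [x for x, _ in pairs]
    cdf = list(accumulate(w for _, w in pairs))
    total = cdf[-1]  # normalise against floating-point drift
    projected = []
    for tau in midpoint_levels(m):
        idx = min(bisect_left(cdf, tau * total), len(xs) - 1)
        projected.append(xs[idx])
    return projected


def ddp_policy_evaluation(P, reward_dists, gamma, m=100, k=20, n_iter=60):
    """Approximate the return distribution of every state.

    P[s][t]         transition probability s -> t under the fixed policy
    reward_dists[s] NormalDist of the reward collected in state s (assumption)
    Returns a list eta with eta[s] = m sorted, equally weighted atoms.
    """
    n = len(P)
    # Assumption: each reward law is discretised by k quantile atoms.
    reward_atoms = [[d.inv_cdf(t) for t in midpoint_levels(k)]
                    for d in reward_dists]
    eta = [[0.0] * m for _ in range(n)]
    for _ in range(n_iter):
        new_eta = []
        for s in range(n):
            atoms, weights = [], []
            # Update step: law of R + gamma * G(S') as a finite mixture.
            for s_next, p in enumerate(P[s]):
                if p == 0.0:
                    continue
                w = p / (k * m)
                for r in reward_atoms[s]:
                    for z in eta[s_next]:
                        atoms.append(r + gamma * z)
                        weights.append(w)
            # Projection step: back to m atoms.
            new_eta.append(quantile_projection(atoms, weights, m))
        eta = new_eta
    return eta


# Example: a two-state chain with Gaussian rewards.
P = [[0.5, 0.5], [0.2, 0.8]]
dists = [NormalDist(1.0, 1.0), NormalDist(-1.0, 0.5)]
eta = ddp_policy_evaluation(P, dists, gamma=0.7)
print([round(fmean(e), 2) for e in eta])  # approximate state values
```

The means of the resulting atoms can be compared with the ordinary value function as a sanity check. Here the representation size is fixed at `m`, whereas the paper allows it to grow with each iteration.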