Abstract: Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the distribution of returns with the distributional Bellman operator in a Euclidean space, leading to highly flexible choices for utility. This paper establishes robust theoretical foundations for DRL. We prove the contraction property of the Bellman operator even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we demonstrate that the behavior of high- or infinite-dimensional returns can be effectively approximated using a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems which have been previously intractable using conventional reinforcement learning approaches.
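For orientation, the contraction property mentioned in the abstract can be written in the form commonly used in the distributional RL literature. This is a sketch in assumed notation, not the paper's own statement: the symbols $Z$, $R$, $P$, $\pi$, $\gamma$ and the metric $d$ are not defined in the abstract, and the specific metric used for the Banach-space result is not given here.

```latex
% Assumed notation: Z(s,a) is the random return from state-action pair (s,a),
% R(s,a) the random reward, P the transition kernel, \pi the policy,
% \gamma the discount factor, and d a metric on return distributions
% (for example a p-Wasserstein distance).

% Distributional Bellman operator (equality in distribution):
(\mathcal{T}^{\pi} Z)(s, a) \overset{D}{=} R(s, a) + \gamma\, Z(S', A'),
\qquad S' \sim P(\cdot \mid s, a),\; A' \sim \pi(\cdot \mid S')

% Contraction in the supremum version of d:
\sup_{s,a} d\big((\mathcal{T}^{\pi} Z_1)(s,a),\, (\mathcal{T}^{\pi} Z_2)(s,a)\big)
\le \gamma \sup_{s,a} d\big(Z_1(s,a),\, Z_2(s,a)\big),
\qquad 0 \le \gamma < 1
```

A contraction of this kind implies that repeated application of the operator converges to a unique fixed point, which is what makes iterative policy evaluation on return distributions well defined.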

What problem does this paper attempt to address?

This paper aims to address the limitations of traditional reinforcement learning methods when dealing with high-dimensional or infinite-dimensional reward spaces. Specifically, traditional methods mainly focus on maximizing the expected value of a scalar reward, which may not fully capture the complexity of the decision-making process in many practical applications. For example, in financial trading, an analyst may want a strategy that maximizes the first quartile of future assets rather than the average return; in clinical research, a researcher may want to maximize the effect of a drug while limiting side effects. In these cases, it is very challenging to design a scalar reward function that accurately reflects multi-dimensional objectives.

In addition, the reward space in some practical problems may itself be infinite-dimensional. Brain-imaging data is one example: its continuous 3D structure and dynamic activity essentially form an infinite-dimensional reward space. Traditional reinforcement learning methods perform poorly in such scenarios.

To solve these problems, the paper proposes a new distributional reinforcement learning (DRL) algorithm that can handle high-dimensional and even infinite-dimensional reward spaces. By using the distributional Bellman operator and proving its contraction property in infinite-dimensional separable Banach spaces, the paper shows that high-dimensional or infinite-dimensional return distributions can be effectively approximated in lower-dimensional Euclidean spaces. This enables the algorithm to be applied to a wider range of problems, including those that require maximizing complex utility functions.

### Main contributions of the paper:

1. **Theoretical basis**: Established the theoretical basis of distributional reinforcement learning in infinite-dimensional separable Banach spaces and proved the contraction property of the distributional Bellman operator.
2. **Approximation method**: Proposed a method to approximate high-dimensional or infinite-dimensional return distributions in low-dimensional Euclidean spaces, thereby improving computational efficiency.
3. **Algorithm design**: Based on the above theory, designed a new distributional reinforcement learning algorithm that can optimize any real-valued utility function in multi-dimensional reward spaces.
4. **Application verification**: Verified the effectiveness of the proposed algorithm through simulation experiments and demonstrated its advantages in dealing with complex multi-dimensional reward problems.

### Key technical points:

- **Distributional Bellman operator**: Considers the entire return distribution, not just its expected value, providing a more comprehensive picture of the return landscape.
- **Max-sliced Wasserstein distance**: Used to quantify the distance between multi-dimensional return distributions, with good computational efficiency and theoretical properties.
- **Projection method**: Projects random variables in infinite-dimensional Banach spaces onto finite-dimensional Euclidean spaces to achieve efficient approximation.

In summary, by establishing a theoretical basis and proposing effective approximation methods, this paper significantly expands the application range of distributional reinforcement learning, enabling it to handle more complex multi-dimensional reward problems.
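The max-sliced Wasserstein distance listed among the key technical points can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names, the random search over directions, and the added mean-difference candidate direction are all assumptions made for illustration. It also assumes two empirical samples of equal size, so that each one-dimensional Wasserstein distance reduces to comparing sorted projections. Because only finitely many directions are tried, the result is a lower bound on the true maximum.

```python
import numpy as np


def wasserstein_1d(x_proj, y_proj, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical samples."""
    xs = np.sort(x_proj)
    ys = np.sort(y_proj)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))


def max_sliced_wasserstein(X, Y, n_directions=256, p=2, seed=0):
    """Approximate the max-sliced p-Wasserstein distance between samples X and Y.

    X, Y: arrays of shape (n, d) holding n samples of d-dimensional returns.
    The maximum over unit directions is approximated by random search
    (a hypothetical, deliberately simple choice), so this is a lower bound.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("this sketch assumes equal sample sizes and dimensions")

    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_directions, X.shape[1]))

    # Heuristic extra candidate: the direction of the difference in means.
    mean_diff = Y.mean(axis=0) - X.mean(axis=0)
    if np.linalg.norm(mean_diff) > 0:
        dirs = np.vstack([dirs, mean_diff])

    # Normalize every candidate to a unit vector.
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Project both samples onto each direction and keep the largest 1D distance.
    return max(wasserstein_1d(X @ theta, Y @ theta, p) for theta in dirs)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    Y = X + np.array([3.0, 4.0, 0.0])  # pure translation of the sample
    print(max_sliced_wasserstein(X, Y))
```

For a pure translation of the sample, every one-dimensional projection shifts by a constant, so the distance along a direction equals the absolute projected shift and the maximum is the length of the shift vector. In practice the direction is often optimized by gradient ascent rather than random search.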

Off-Policy Reinforcement Learning with High Dimensional Reward
