Abstract: We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distinguish distributional Q($\lambda$) from other existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($\lambda$) and validate theoretical insights with tabular experiments. We show how distributional Q($\lambda$)-C51, a combination of Q($\lambda$) with the C51 agent, exhibits promising results on deep RL benchmarks.


### What problem does this paper attempt to solve?

This paper addresses a problem in reinforcement learning (RL): **developing an effective off-policy distributional Q(λ) algorithm that does not use importance sampling (IS)**. Specifically, its goals are:

1. **Avoid the limitations of importance sampling**
   - Importance sampling is a commonly used technique in off-policy learning, but it has key limitations: it introduces high variance, and it cannot be applied when the action probabilities of the data-collection policy are unavailable.
   - The paper proposes a new multi-step distributional RL algorithm, off-policy distributional Q(λ), which does not use importance sampling.

2. **Offer an alternative to existing distributional RL algorithms**
   - Existing algorithms such as distributional Retrace rely on importance sampling to correct for the mismatch between the data-collection policy and the target policy.
   - Off-policy distributional Q(λ) handles off-policy learning differently: it drops importance sampling, and as a consequence its updates interact with signed measures.

3. **Theoretical analysis and experimental verification**
   - The paper analyzes the algorithmic properties of off-policy distributional Q(λ), including its fixed-point and contraction properties, and validates these insights with tabular experiments.
   - Combined with the C51 agent, the resulting distributional Q(λ)-C51 shows promising results on deep RL benchmarks.

4. **Explore the role of signed measures**
   - A distinctive feature of distributional Q(λ) is that its iterates can be signed measures. Intermediate iterates may therefore leave the space of probability distributions, while the iteration still converges to the target distribution.

### Summary

The main contribution of this paper is a new off-policy distributional Q(λ) algorithm. It performs off-policy learning without importance sampling, is supported by theoretical analysis, and shows promising experimental results. Its distinctive properties, in particular its interaction with signed measures, offer a new perspective on distributional RL.
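The summary does not state the update rule, so the following is only an illustrative sketch of how signed measures can arise once importance weights are dropped. It assumes a single-trajectory target built by analogy with the expected-value Q(λ) return: the current estimate plus λ-weighted, discounted distributional TD differences. The function names (`pushforward`, `add_scaled`, `policy_mixture`, `q_lambda_target`), the dictionary representation of measures, and the toy numbers are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def pushforward(measure, shift, scale):
    """Map every atom z to shift + scale * z; weights are unchanged."""
    out = defaultdict(float)
    for z, w in measure.items():
        out[shift + scale * z] += w
    return dict(out)


def add_scaled(m1, m2, c):
    """Return the signed measure m1 + c * m2."""
    out = defaultdict(float, m1)
    for z, w in m2.items():
        out[z] += c * w
    return dict(out)


def policy_mixture(eta, pi, x):
    """Return sum_a pi(a|x) * eta(x, a), the return measure under the target policy."""
    out = {}
    for a, p in enumerate(pi[x]):
        out = add_scaled(out, eta[(x, a)], p)
    return out


def q_lambda_target(eta, pi, trajectory, gamma, lam):
    """Assumed Q(lambda)-style distributional target; no importance weights appear."""
    x0, a0, _, _ = trajectory[0]
    target = dict(eta[(x0, a0)])
    partial_return = 0.0  # discounted reward accumulated before step t
    for t, (x, a, r, x_next) in enumerate(trajectory):
        # One-step distributional backup under the target policy.
        backup = pushforward(policy_mixture(eta, pi, x_next), r, gamma)
        # Distributional TD difference: a signed measure with zero total mass.
        delta = add_scaled(backup, eta[(x, a)], -1.0)
        # Express the difference in the return coordinates of step 0.
        shifted = pushforward(delta, partial_return, gamma ** t)
        target = add_scaled(target, shifted, lam ** t)
        partial_return += gamma ** t * r
    return target


# Toy problem: two states, two actions; measures are {atom: weight} dicts.
eta = {
    (0, 0): {0.0: 0.5, 1.0: 0.5},
    (0, 1): {2.0: 1.0},
    (1, 0): {0.0: 1.0},
    (1, 1): {1.0: 1.0},
}
pi = {0: [1.0, 0.0], 1: [1.0, 0.0]}  # target policy always picks action 0
# Transitions (state, action, reward, next_state); the second step is off-policy.
trajectory = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]

target = q_lambda_target(eta, pi, trajectory, gamma=0.5, lam=0.8)
total_mass = sum(target.values())
mean = sum(z * w for z, w in target.items())
```

In this toy run the off-policy step subtracts the estimate for an action the target policy would never take, so `target` carries a negative weight on one atom even though its total mass stays at 1. Under the same assumption, the mean of `target` coincides with the scalar Q(λ) return computed from the means of `eta`. How a categorical agent such as C51 handles targets of this kind is not described in the summary above.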

Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling
