Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney
2024-02-09
Abstract: We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distinguish distributional Q($\lambda$) from other existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($\lambda$) and validate theoretical insights with tabular experiments. We show how distributional Q($\lambda$)-C51, a combination of Q($\lambda$) with the C51 agent, exhibits promising results on deep RL benchmarks.
Machine Learning
### What problem does this paper attempt to address?

This paper addresses a problem in reinforcement learning (RL): **developing an off-policy distributional Q(λ) algorithm that does not use importance sampling (IS)**. Specifically, its goals are:

1. **Avoid the limitations of importance sampling**:
   - Importance sampling is commonly used in off-policy learning, but it can introduce high variance and cannot be applied when the action probabilities of the data-collection policy are unavailable.
   - The paper proposes a multi-step distributional RL algorithm, off-policy distributional Q(λ), that does not use importance sampling.
2. **Offer an alternative to existing distributional RL algorithms**:
   - Existing distributional RL algorithms such as distributional Retrace rely on importance sampling to correct for the difference between the data-collection policy and the target policy.
   - Off-policy distributional Q(λ) drops this correction. As a consequence, its updates interact with signed measures, which sets it apart from those alternatives.
3. **Theoretical analysis and experimental validation**:
   - The paper analyzes the algorithmic properties of off-policy distributional Q(λ), including its fixed-point and contraction properties, and validates these insights with tabular experiments.
   - Combined with the C51 agent, distributional Q(λ)-C51 shows promising results on deep RL benchmarks.
4. **Explore the role of signed measures**:
   - A distinctive feature of distributional Q(λ) is that its intermediate iterates can be signed measures rather than probability distributions. The algorithm can therefore pass through a broader class of objects during learning while still converging to the target distribution.

### Summary

The main contribution of this paper is a new off-policy distributional Q(λ) algorithm. The algorithm performs off-policy learning without importance sampling, its properties are characterized theoretically, and its combination with C51 shows promising experimental results. Its distinctive properties, such as the appearance of signed measures, offer new perspectives and tools for research on distributional RL.
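
The signed measures mentioned above can be illustrated with a toy calculation. The sketch below is not the paper's operator. It only shows the general mechanism: adding a weighted, zero-mass difference of two probability vectors to a third one keeps the total mass at 1 but can push individual entries below zero. The function name `signed_mixture`, the three vectors, and the value of `lam` are hypothetical choices made for illustration.

```python
def signed_mixture(base, target, behavior, lam):
    """Return base + lam * (target - behavior), entry by entry.

    All three inputs are probability vectors over the same return atoms.
    The correction term (target - behavior) has zero total mass, so the
    result always sums to 1, but individual entries may be negative.
    """
    return [b + lam * (t - m) for b, t, m in zip(base, target, behavior)]


# Hypothetical categorical distributions over three return atoms.
base = [0.1, 0.3, 0.6]
target = [0.0, 0.2, 0.8]
behavior = [0.9, 0.1, 0.0]

result = signed_mixture(base, target, behavior, lam=0.5)
total_mass = sum(result)                          # stays at 1 up to rounding
has_negative_mass = any(w < 0 for w in result)    # True: first atom is negative
```

Here the first atom ends up with negative weight, so `result` is a signed measure with unit total mass rather than a probability distribution. With `lam=0` the correction vanishes and `result` equals `base`.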