Abstract: Off-policy prediction (learning the value function for one policy from data generated while following another policy) is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms, with emphasis on their sensitivity to parameters, learning speed, and asymptotic error, and 2) based on the empirical results, it proposes two step-size adaptation methods, called Step-size Ratchet and Soft Step-size Ratchet, that help the algorithm with the lowest error in the experimental study learn faster. Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the Rooms task, and the High Variance Rooms task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The Rooms and High Variance Rooms tasks are designed so that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as \(2^{14}\). To control the high variance caused by this product, the step size must be set small, which, in turn, slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as \(2^{14} \times 2^{5}\). The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting.
We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as the other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than the other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
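
The two ideas in the abstract, large importance sampling ratio products and a step size that is lowered only to avoid overshoot, can be illustrated with a minimal Python sketch. This is not the article's Ratchet algorithms, whose exact update rules are not given here. It assumes off-policy TD(0) with linear features, and a hypothetical rule that lowers the step size only when a single update would flip the sign of the TD error on the same sample. The function names `ratio_product` and `ratchet_td_step`, and the choice of 14 consecutive ratios of 2, are illustrative assumptions.

```python
# Illustrative sketch only: the ratchet rule below is an assumption made for
# exposition, not the Ratchet algorithms evaluated in the article.

def ratio_product(ratios):
    """Product of per-step importance sampling ratios pi(a|s) / b(a|s)."""
    product = 1.0
    for rho in ratios:
        product *= rho
    return product


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def ratchet_td_step(w, alpha, x, reward, x_next, rho, gamma):
    """One off-policy TD(0) update with a hypothetical ratchet on alpha.

    Returns the updated weights and the (possibly reduced) step size.
    """
    delta = reward + gamma * dot(w, x_next) - dot(w, x)
    # For a linear estimate, the TD error on this same sample after the update
    # equals delta * (1 - alpha * curvature). It changes sign (overshoots)
    # exactly when alpha * curvature > 1.
    curvature = rho * dot(x, [xi - gamma * xn for xi, xn in zip(x, x_next)])
    if curvature > 0.0 and alpha * curvature > 1.0:
        alpha = 1.0 / curvature  # ratchet down; this sketch never raises it
    w = [wi + alpha * rho * delta * xi for wi, xi in zip(w, x)]
    return w, alpha


if __name__ == "__main__":
    # Assumed example: 14 consecutive steps, each with ratio 2.
    print(ratio_product([2.0] * 14))
    # A large ratio forces the step size down; a small one leaves it alone.
    print(ratchet_td_step([0.0, 0.0], 1.0, [1.0, 0.0], 1.0, [0.0, 1.0], 4.0, 0.9))
    print(ratchet_td_step([0.0, 0.0], 1.0, [1.0, 0.0], 1.0, [0.0, 1.0], 0.5, 0.9))
```

In this sketch the step size stays at its initial value until a sample with a large ratio would cause overshoot, which mirrors the stated intuition of keeping the step size as large as possible. A rule that never raises the step size again is one simple design choice; other variants are possible.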

Off-Policy Prediction Learning: An Empirical Study of Online Algorithms