Abstract: Off-policy prediction (learning the value function for one policy from data generated while following another policy) is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms, with emphasis on their sensitivity to parameters, learning speed, and asymptotic error; and 2) based on the empirical results, it proposes two step-size adaptation methods that help the algorithm with the lowest error in the experimental study learn faster.

Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the Rooms task, and a third task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The other two tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$. To control the high variance caused by this product, the step size must be set small, which in turn slows down learning. The third task is more extreme in that the product of the ratios can become as large as $2^{14} \times 2^{5}$. The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting. We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
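The idea of keeping the step size large and lowering it only to avoid overshoot can be illustrated with a minimal sketch. This is not the Ratchet algorithms as defined in the article. It assumes a one-step Off-policy TD(0) update with linear function approximation, and a hypothetical rule that lowers the step size only when the update would flip the sign of the TD error on the current sample. All function and variable names are illustrative.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td_error(w: Vector, x: Vector, x_next: Vector,
             reward: float, gamma: float) -> float:
    """TD error: delta = r + gamma * w.x' - w.x."""
    return reward + gamma * dot(w, x_next) - dot(w, x)


def ratcheted_step_size(alpha: float, x: Vector, x_next: Vector,
                        rho: float, gamma: float) -> float:
    """Return alpha, lowered only if the update would overshoot.

    After w <- w + alpha * rho * delta * x, the TD error on the same
    sample becomes delta * (1 - alpha * rho * x.(x - gamma * x')).
    The sign flips (overshoot) when alpha * rho * x.(x - gamma * x') > 1.
    Assumption: this rule is illustrative, not the article's method.
    """
    curvature = rho * (dot(x, x) - gamma * dot(x, x_next))
    if curvature > 0.0 and alpha * curvature > 1.0:
        return 1.0 / curvature  # largest step size that does not overshoot
    return alpha  # never raised, only ratcheted down


def off_policy_td0_update(w: Vector, x: Vector, x_next: Vector,
                          reward: float, rho: float,
                          alpha: float, gamma: float) -> Vector:
    """One Off-policy TD(0) update, weighted by importance ratio rho."""
    delta = td_error(w, x, x_next, reward, gamma)
    return [wi + alpha * rho * delta * xi for wi, xi in zip(w, x)]


# Usage: a large importance sampling ratio forces the step size down.
w = [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]
alpha = ratcheted_step_size(0.5, x, x_next, rho=4.0, gamma=0.9)
w = off_policy_td0_update(w, x, x_next, reward=1.0, rho=4.0,
                          alpha=alpha, gamma=0.9)
```

In this toy setting, a ratio of 4 makes the initial step size of 0.5 overshoot, so it is reduced. A step size that is already small enough is returned unchanged, which mirrors the "as large as possible" principle stated in the abstract.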

Off-Policy Prediction Learning: An Empirical Study of Online Algorithms
