Abstract: Off-policy prediction (learning the value function for one policy from data generated while following another policy) is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms with emphasis on their sensitivity to parameters, learning speed, and asymptotic error and 2) based on the empirical results, it proposes two step-size adaptation methods, called Step-size Ratchet and Soft Step-size Ratchet, that help the algorithm with the lowest error from the experimental study learn faster. Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the Rooms task, and the High Variance Rooms task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The Rooms and High Variance Rooms tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$. To control the high variance caused by the product of the importance sampling ratios, the step size should be set small, which, in turn, slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as $2^{14} \times 2^{5}$. The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting. We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios.
Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
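
To see how a product of importance sampling ratios can grow as large as the abstract describes, the following minimal Python sketch multiplies per-step ratios along one trajectory. The per-step probabilities are assumptions chosen for illustration (a target policy that takes the observed action with probability 1 and a behavior policy that takes it with probability 0.5); the abstract does not state how the tasks produce these ratios, and `ratio_product` is a hypothetical helper, not part of the article.

```python
import math


def ratio_product(target_probs, behavior_probs):
    """Product of per-step importance sampling ratios pi(a|s) / b(a|s)."""
    return math.prod(p / b for p, b in zip(target_probs, behavior_probs))


# Assumed trajectory (illustrative only): on each of 14 steps the target
# policy picks the observed action with probability 1.0 and the behavior
# policy picks it with probability 0.5, so every per-step ratio equals 2.
steps = 14
product = ratio_product([1.0] * steps, [0.5] * steps)
print(product)  # 16384.0, i.e. 2**14
```

Under these assumed probabilities, five more such steps multiply the product by a further factor of 2**5. An update scaled by a factor this large can overshoot unless the step size is small, which is the trade-off between variance and learning speed that the abstract describes.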

Scaling life-long off-policy learning

Generalize Robot Learning from Demonstration to Variant Scenarios with Evolutionary Policy Gradient

Multi-timescale nexting in a reinforcement learning robot

Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret

Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

Latent Plans for Task-Agnostic Offline Reinforcement Learning

Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting

Large-scale Kernel Methods and Applications to Lifelong Robot Learning

Towards model-free RL algorithms that scale well with unstructured data

LoopSR: Looping Sim-and-Real for Lifelong Policy Adaptation of Legged Robots

Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation

Off-Policy Prediction Learning: An Empirical Study of Online Algorithms

Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Data Scaling Laws in Imitation Learning for Robotic Manipulation

Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding

Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots

Robot Learning with Super-Linear Scaling

Lifelong Reinforcement Learning with Modulating Masks

Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning

Experience Recommendation for Long Term Safe Learning-based Model Predictive Control in Changing Operating Conditions