Abstract: Off-policy learning, where the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method that relies on the followon trace. The gradient emphasis learning (GEM) algorithm has recently been proposed to fix, from the perspective of stochastic approximation, the problems of unbounded variance and large emphasis approximation error introduced by the followon trace. This approach, however, is limited to a single gradient-TD2-style update and does not consider the update rules of other GTD algorithms. Overall, how to better learn the emphasis for off-policy learning remains an open question. In this paper, we rethink GEM and introduce a novel two-time-scale algorithm called TD emphasis learning with gradient correction (TDEC) to learn the true emphasis. Further, we regularize the update to the secondary learning process of TDEC and obtain our final TD emphasis learning with regularized correction (TDERC) algorithm. We then apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, respectively, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as challenging MuJoCo robot simulation tasks. Taken together, we hope that our work can provide new insights into the development of a better alternative in the family of off-policy emphatic algorithms.
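
For readers unfamiliar with the followon trace mentioned in the abstract, the following is a minimal sketch of the classical emphatic TD(0) update, not of the proposed TDEC or TDERC algorithms. It assumes linear function approximation and a constant interest of 1, neither of which the abstract specifies; the function name `emphatic_td0` and all parameter names are hypothetical and chosen for illustration only.

```python
import numpy as np


def emphatic_td0(features, rewards, rhos, gamma=0.9, alpha=0.01, interest=1.0):
    """Sketch of emphatic TD(0) over one trajectory (illustrative assumptions).

    features: array of shape (T + 1, d), feature vector of each visited state
    rewards:  array of shape (T,), reward received after each transition
    rhos:     array of shape (T,), importance ratios pi(a|s) / mu(a|s)
    Returns the learned weights and the followon trace at every step.
    """
    features = np.asarray(features, dtype=float)
    w = np.zeros(features.shape[1])
    followon = 0.0   # trace before the first step
    prev_rho = 1.0   # irrelevant at t = 0 because followon starts at 0
    followons = []

    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]

        # Followon trace: discounted, importance-weighted accumulation of interest.
        followon = gamma * prev_rho * followon + interest
        followons.append(followon)

        # Ordinary TD error under the current linear value estimate.
        delta = rewards[t] + gamma * (w @ x_next) - (w @ x)

        # Emphatic update: the TD step is scaled by the followon trace.
        w = w + alpha * followon * rhos[t] * delta * x

        prev_rho = rhos[t]

    return w, np.array(followons)
```

Because the trace multiplies successive importance ratios, it can grow without bound when those ratios are large, which is the variance problem that learned-emphasis methods such as GEM aim to avoid.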

Revisiting a Design Choice in Gradient Temporal Difference Learning

Gradient Descent Temporal Difference-Difference Learning

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

New Versions of Gradient Temporal Difference Learning

Modified Retrace for Off-Policy Temporal Difference Learning

Why Target Networks Stabilise Temporal Difference Methods

Demystifying the Recency Heuristic in Temporal-Difference Learning

Temporal Difference Learning as Gradient Splitting

Temporal-difference Emphasis Learning with Regularized Correction for Off-Policy Evaluation and Control

Investigating practical linear temporal difference learning

Reanalysis of Variance Reduced Temporal Difference Learning

Target-Based Temporal Difference Learning

Gradient Temporal Difference with Momentum: Stability and Convergence

A Temporal-Difference Approach to Policy Gradient Estimation

Per-decision Multi-step Temporal Difference Learning with Control Variates

Accelerated Gradient Temporal Difference Learning

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Temporal Difference Learning with Experience Replay

Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

Simplifying Deep Temporal Difference Learning