Abstract: Actor-critic algorithms learn an explicit policy (the actor) and an accompanying value function (the critic). The actor performs actions in the environment, while the critic evaluates the actor's current policy. However, despite their stability and promising convergence properties, current actor-critic algorithms do not outperform critic-only ones in practice. We believe this is because the critic learns Q^pi instead of the optimal Q-function Q*, which prevents state-of-the-art robust and sample-efficient off-policy learning algorithms from being used. In this paper, we propose an elegant solution, the Actor-Advisor architecture, in which a Policy Gradient actor learns from unbiased Monte-Carlo returns while being shaped (or advised) by the Softmax policy arising from an off-policy critic. The critic can be learned independently from the actor, using any state-of-the-art algorithm. Being advised by a high-quality critic, the actor quickly and robustly learns the task, while its use of the Monte-Carlo return helps overcome any bias the critic may have. Beyond providing a new Actor-Critic formulation, the Actor-Advisor is a general method that allows an external advisory policy to shape a Policy Gradient actor, and can therefore be applied to many other domains. By varying the source of advice, we demonstrate the wide applicability of the Actor-Advisor to three other important subfields of RL: safe RL with backup policies, efficient leverage of domain knowledge, and transfer learning in RL. Our experimental results demonstrate the benefits of the Actor-Advisor compared to state-of-the-art actor-critic methods, illustrate its applicability to the three other application scenarios listed above, and show that many important challenges of RL can now be solved using a single elegant solution.
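The advising step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the advice is combined with the actor's output, so the element-wise product followed by renormalization used here is an assumption, as are the function names `softmax`, `advised_policy`, and `monte_carlo_returns` and the `temperature` and `gamma` parameters. It is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, temperature=1.0):
    """Numerically stable softmax over a 1-D array of scores."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def advised_policy(actor_logits, q_values, temperature=1.0):
    """Shape the actor's distribution with Softmax advice from a critic.

    ASSUMPTION: advice is mixed in by an element-wise product followed by
    renormalization; the abstract does not specify the mixing rule.
    """
    actor_probs = softmax(actor_logits)
    advice = softmax(q_values, temperature)  # Softmax policy from the critic
    mixed = actor_probs * advice
    return mixed / mixed.sum()


def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted Monte-Carlo return G_t for each step of one episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

In a Policy Gradient update, the actor would then be trained on the log-probability of the executed action under `advised_policy`, weighted by the corresponding entry of `monte_carlo_returns`. The critic's Q-values only shape which actions are sampled, so the return signal itself stays free of critic bias.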

Episode-Experience Replay Based Tree-Backup Method for Off-Policy Actor-Critic Algorithm

Frugal Actor-Critic: Sample Efficient Off-Policy Deep Reinforcement Learning Using Unique Experiences

Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning

Leveraging Efficiency Through Hybrid Prioritized Experience Replay in Door Environment

Actor Prioritized Experience Replay

Natural Gradient Based Reinforcement Learning Algorithm Using Active Stimulating

The Actor-Advisor: Policy Gradient With Off-Policy Advice

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

A priority experience replay actor-critic algorithm using self-attention mechanism for strategy optimization of discrete problems

Re-attentive experience replay in off-policy reinforcement learning

Ddper - Decentralized Distributed Prioritized Experience Replay

On-Policy Trust Region Policy Optimisation with Replay Buffers

Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Actor-Critic Reinforcement Learning with Phased Actor

ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms

Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms via Batch Prioritized Experience Replay

Improvements on Hindsight Learning

Recursive Least Squares Advantage Actor-Critic Algorithms