Abstract: Deep reinforcement learning (DRL) algorithms have successfully been demonstrated on a range of challenging decision making and control tasks. One dominant component of recent deep reinforcement learning algorithms is the target network which mitigates the divergence when learning the Q function. However, target networks can slow down the learning process due to delayed function updates. Our main contribution in this work is a self-regularized TD-learning method to address divergence without requiring a target network. Additionally, we propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization to search for actions associated with higher Q-values in a broad neighborhood. This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network. Taken together, these components define GRAC, a novel self-guided and self-regularized actor critic algorithm. We evaluate GRAC on the suite of OpenAI gym tasks, achieving or outperforming state of the art in every environment tested.

What problem does this paper attempt to address?

This paper addresses two core challenges in deep reinforcement learning (DRL):

1. **Learning delay caused by the target network**: Existing DRL algorithms usually rely on a target network to mitigate divergence when learning the Q-function. The target network provides a stable update target by copying the current Q-function and holding it fixed for a period of time. Because the target network's parameters are updated with a delay, this slows down learning.
2. **Instability of Q-value estimation**: Q-learning with a single Q-function is prone to over-estimation, that is, assigning unrealistically high values to some state-action pairs. This over-estimation degrades the greedy policy derived from the Q-values. Existing double Q-learning methods (such as Clipped Double Q-Learning) reduce over-estimation, but they can lead to under-estimation instead, and the gap between the two Q-functions may grow significantly, which harms the stability of learning.

To solve these problems, the paper proposes GRAC (Guided and Regularized Actor-Critic), a self-guided and self-regularized actor-critic algorithm. Its specific contributions are:

- **Self-regularized TD learning**: A regularization term limits how much the Q-function changes while the TD error is minimized, so learning stays stable and converges quickly without a target network.
- **Self-guided policy improvement**: The policy gradient is combined with zero-order optimization (such as the cross-entropy method, CEM). Starting from an initial action, the method searches the neighborhood for actions with higher Q-values, which accelerates learning and makes the policy more robust.
- **Max-min double Q-learning**: A new double Q-learning method, Max-min Double Q-Learning, applies max-min operations across the two Q-functions to balance the difference between them and better approximate the Bellman optimality operator.

With these components, GRAC achieves performance comparable to or better than existing state-of-the-art methods on multiple OpenAI Gym continuous control tasks.
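The zero-order search behind the self-guided policy improvement can be illustrated with a minimal cross-entropy method (CEM) sketch. This is not the paper's implementation: `toy_q`, `cem_search`, the one-dimensional action space, and all hyperparameters (population size, elite count, initial standard deviation) are assumptions chosen for illustration. It shows only the search step, in which candidates are sampled around the action proposed by the policy and the sampling distribution is refit to the candidates with the highest Q-values. How GRAC combines this with the policy gradient to update the actor is not reproduced here.

```python
import math
import random
import statistics


def toy_q(action: float) -> float:
    # Hypothetical Q(s, a) for one fixed state: a smooth peak near a = 0.6
    # plus a small ripple standing in for local approximation noise.
    return -(action - 0.6) ** 2 + 0.01 * math.sin(40.0 * action)


def cem_search(q_fn, init_action, init_std=0.5, iterations=10,
               population=64, n_elite=8, low=-1.0, high=1.0, seed=0):
    """Search around init_action for an action with a higher q_fn value."""
    rng = random.Random(seed)
    mean, std = init_action, init_std
    best_action, best_q = init_action, q_fn(init_action)
    for _ in range(iterations):
        # Sample candidate actions around the current mean, clipped to bounds.
        candidates = [min(high, max(low, rng.gauss(mean, std)))
                      for _ in range(population)]
        # Keep the candidates with the highest Q-values (the elites).
        elites = sorted(candidates, key=q_fn, reverse=True)[:n_elite]
        if q_fn(elites[0]) > best_q:
            best_action, best_q = elites[0], q_fn(elites[0])
        # Refit the sampling distribution to the elites.
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6
    return best_action, best_q


policy_action = -0.2  # assumed output of the actor network for this state
improved_action, improved_q = cem_search(toy_q, policy_action)
print("Q at policy action:", toy_q(policy_action))
print("Q at CEM action:   ", improved_q, "at a =", improved_action)
```

Because each iteration samples a whole population over a broad neighborhood instead of following the local gradient of the Q-function, small ripples in the Q estimate have less influence on where the search ends up. This is the robustness property the summary attributes to the zero-order component.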

GRAC: Self-Guided and Self-Regularized Actor-Critic

Self-play Reinforcement Learning with Comprehensive Critic in Computer Games

Actor–Critic Learning Control with Regularization and Feature Selection in Policy Gradient Estimation

Actor-Critic Reinforcement Learning with Phased Actor

Actor-Critic Learning Control Based on $\ell_{2}$ -Regularized Temporal-Difference Prediction with Gradient Correction

Broad Critic Deep Actor Reinforcement Learning for Continuous Control

DSAC-T: Distributional Soft Actor-Critic with Three Refinements

The Actor-Dueling-Critic Method for Reinforcement Learning

Actor-Critic Algorithm Based on Incremental Least-Squares Temporal Difference with Eligibility Trace

Task-Oriented Deep Reinforcement Learning for Robotic Skill Acquisition and Control

Self-Guided Actor-Critic: Reinforcement Learning from Adaptive Expert Demonstrations

A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence

Explorer-Actor-Critic: Better Actors for Deep Reinforcement Learning

PRAG: Periodic Regularized Action Gradient for Efficient Continuous Control

An Advanced Actor-Critic Algorithm for Training Video Game AI

Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model

Dual Behavior Regularized Offline Deterministic Actor–Critic

CGAR: Critic Guided Action Redistribution in Reinforcement Leaning

Generalizing soft actor-critic algorithms to discrete action spaces

QVDDPG: QV Learning with Balanced Constraint in Actor-Critic Framework

Deep Reinforcement Learning on Autonomous Driving Policy With Auxiliary Critic Network