GRAC: Self-Guided and Self-Regularized Actor-Critic

Lin Shao, Yifan You, Mengyuan Yan, Qingyun Sun, Jeannette Bohg
DOI: https://doi.org/10.48550/arXiv.2009.08973
2020-11-11
Abstract: Deep reinforcement learning (DRL) algorithms have successfully been demonstrated on a range of challenging decision making and control tasks. One dominant component of recent deep reinforcement learning algorithms is the target network which mitigates the divergence when learning the Q function. However, target networks can slow down the learning process due to delayed function updates. Our main contribution in this work is a self-regularized TD-learning method to address divergence without requiring a target network. Additionally, we propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization to search for actions associated with higher Q-values in a broad neighborhood. This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network. Taken together, these components define GRAC, a novel self-guided and self-regularized actor critic algorithm. We evaluate GRAC on the suite of OpenAI gym tasks, achieving or outperforming state of the art in every environment tested.
Machine Learning, Artificial Intelligence, Robotics, Systems and Control
What problem does this paper attempt to address?
This paper addresses two core challenges in deep reinforcement learning (DRL):

1. **Learning delay caused by the target network**: Existing DRL algorithms usually rely on a target network to mitigate divergence when learning the Q-function. The target network provides a stable update target by copying the current Q-function and holding it fixed for a period of time. Because its parameters are updated with a delay, this also slows down learning.
2. **Instability of Q-value estimation**: Q-learning with a single Q-function is prone to over-estimation, that is, assigning unrealistically high values to some state-action pairs, which degrades the greedy policy derived from those Q-values. Existing double Q-learning methods (such as Clipped Double Q-Learning) reduce over-estimation but can introduce under-estimation, and the gap between the two Q-functions may grow significantly, which hurts learning stability.

To address these problems, the paper proposes GRAC (Guided and Regularized Actor-Critic), a self-guided and self-regularized actor-critic algorithm. Its contributions are:

- **Self-regularized TD learning**: A regularization term limits how much the Q-function changes while the TD error is minimized, so learning stays stable and converges quickly without a target network.
- **Self-guided policy improvement**: The policy gradient is combined with zero-order optimization (such as the cross-entropy method, CEM). Starting from an initial action, the method searches the neighborhood for actions with higher Q-values, which accelerates learning and makes the policy more robust.
- **Max-min double Q-learning**: A new double Q-learning variant, Max-min Double Q-Learning, applies max-min operations across the two Q-functions to balance the difference between them and to better approximate the Bellman optimality operator.

With these components, GRAC achieves performance comparable to or better than existing state-of-the-art methods on multiple OpenAI Gym continuous control tasks.
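The three components summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the hyperparameters (`sigma`, `pop`, `n_elite`, `iters`), the action bounds, and the exact form of the regularizer (a squared penalty on the change of Q at the next state-action pair) are assumptions made for illustration, and the Q-functions are plain Python callables rather than neural networks.

```python
import numpy as np


def self_regularized_td_loss(q_sa, td_target, q_next, q_next_before):
    """TD error plus a penalty on how far Q(s', a') moves during the update.

    q_next_before is Q(s', a') recorded before the gradient step and held
    constant (assumed form of the regularizer; no target network is used).
    """
    return (q_sa - td_target) ** 2 + (q_next - q_next_before) ** 2


def max_min_target(reward, gamma, q1_fn, q2_fn, candidate_actions, done=False):
    """Min over the two critics per candidate action, then max over candidates."""
    values = [min(q1_fn(a), q2_fn(a)) for a in candidate_actions]
    return reward + (0.0 if done else gamma * max(values))


def cem_search(q_fn, init_action, sigma=0.5, pop=64, n_elite=8, iters=10,
               low=-1.0, high=1.0, seed=0):
    """Cross-entropy method: look near init_action for a higher-Q action."""
    rng = np.random.default_rng(seed)
    mean = np.atleast_1d(np.asarray(init_action, dtype=float))
    std = np.full_like(mean, sigma)
    best_a, best_q = mean.copy(), float(q_fn(mean))
    for _ in range(iters):
        # Sample a population around the current mean and clip to the bounds.
        samples = rng.normal(mean, std, size=(pop, mean.size))
        samples = np.clip(samples, low, high)
        scores = np.array([q_fn(a) for a in samples])
        # Refit the sampling distribution to the top-scoring samples.
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        # Keep the best action seen so far, including the initial one.
        i = int(scores.argmax())
        if scores[i] > best_q:
            best_a, best_q = samples[i].copy(), float(scores[i])
    return best_a, best_q


if __name__ == "__main__":
    # Toy Q-function with its maximum at a = (0.3, 0.3).
    toy_q = lambda a: -float(np.sum((a - 0.3) ** 2))
    action, value = cem_search(toy_q, np.zeros(2))
    print("action found by CEM:", action, "Q:", value)
```

In an actor-critic loop, the action returned by `cem_search` could serve both as one of the `candidate_actions` for `max_min_target` and as a guide for the actor update. How the paper wires these pieces together is not covered by this sketch.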