Abstract: We address the issue of estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new, twin TD-regularized actor-critic (TDR) method. It aims at reducing both over- and under-estimation errors. With TDR and by combining good DRL improvements, such as distributional learning and the long N-step surrogate stage reward (LNSS) method, we show that our new TDR-based actor-critic learning has enabled DRL methods to outperform their respective baselines in challenging environments in DeepMind Control Suite. Furthermore, they elevate TD3 and SAC respectively to a level of performance comparable to that of D4PG (the current SOTA), and they also improve the performance of D4PG to a new SOTA level measured by mean reward, convergence speed, learning success rate, and learning variance.

What problem does this paper attempt to address?

This paper addresses estimation bias, a common problem in deep reinforcement learning (DRL). Specifically, it focuses on the overestimation and underestimation errors that arise in actor-critic methods. These errors can lead to inaccurate policy updates during learning and so degrade the overall performance of the algorithm.

### Main contributions of the paper:

1. **TD-regularized double critic networks**: The target value with the minimum TD error is selected, which reduces both overestimation and underestimation.
2. **TD-regularized actor network**: The TD error is added as a regularization term in the actor update to further reduce the estimation bias of the critic.
3. **Combination with distributional RL and the long N-step surrogate stage (LNSS) reward method**: These are added to further improve the stability and performance of learning.

### Specific methods:

- **TD-regularized double critic networks**:
  - Traditional methods usually take the minimum of the two target network values to reduce overestimation error, but this can introduce underestimation error.
  - The new method instead selects the target value with the minimum TD error, which reduces overestimation and underestimation errors at the same time.
- **TD-regularized actor network**:
  - The TD error is added as a regularization term in the actor network update so that the actor is not misled by inaccurate critic estimates.
  - This further reduces the estimation bias of the critic and thereby improves overall performance.

### Experimental results:

- **Benchmark tests**: Experiments were carried out in multiple environments of the DeepMind Control Suite, including Cheetah Run, Finger Turn, and Quadruped Walk.
- **Performance improvement**: The TDR method improves the performance of the baseline algorithms (TD3, SAC, and D4PG) in most environments, especially in sparse-reward and noisy environments.
- **Robustness**: The TDR method is more robust than the baselines when dealing with sparse rewards and dense random rewards.

### Theoretical analysis:

- **Theorem 1**: Proves that the TD-regularized double critic networks reduce the estimation bias more effectively.
- **Theorem 2**: Explains how the TD-regularized actor network can still perform effective policy updates in the presence of estimation bias.
- **Theorem 3**: Analyzes the influence of suboptimal policy updates on critic estimates and proves that the TD-regularized actor network reduces this influence.

### Conclusion:

By introducing TD regularization into both the critic and the actor, the TDR method reduces estimation bias in deep reinforcement learning and improves the performance and robustness of the algorithm on complex tasks.
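The target-selection rule and the actor regularizer described under "Specific methods" can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the choice to measure each candidate's TD error against the same critic's estimate at the current state-action pair, the absolute-value penalty, and the weight `rho` are all hypothetical choices made here. A TD3-style clipped minimum is included for comparison.

```python
import numpy as np


def tdr_target(reward, q_sa_1, q_sa_2, q_next_1, q_next_2, gamma=0.99):
    """Pick, per sample, the candidate target with the smaller |TD error|.

    Assumption: each candidate's TD error is measured against the same
    critic's estimate at the current (state, action) pair.
    """
    # One candidate bootstrapped target per critic.
    y1 = reward + gamma * q_next_1
    y2 = reward + gamma * q_next_2
    # Absolute TD error of each candidate.
    td1 = np.abs(y1 - q_sa_1)
    td2 = np.abs(y2 - q_sa_2)
    # Keep the candidate whose TD error is smaller (ties go to critic 1).
    return np.where(td1 <= td2, y1, y2)


def clipped_double_q_target(reward, q_next_1, q_next_2, gamma=0.99):
    """TD3-style baseline: always bootstrap from the smaller next-state value."""
    return reward + gamma * np.minimum(q_next_1, q_next_2)


def tdr_actor_objective(q_pi, td_error, rho=0.1):
    """Objective for the actor to maximize: critic value minus a TD-error penalty.

    Assumption: the penalty is rho * |TD error|; rho is a hypothetical weight.
    """
    return np.mean(q_pi - rho * np.abs(td_error))
```

Unlike the clipped minimum, which always returns the lower of the two candidates, the TD-error rule can return either one, so it does not push the target downward by construction.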

Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning
