Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning

Junmin Zhong, Ruofan Wu, Jennie Si
2023-11-07
Abstract: We address the issue of estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new, twin TD-regularized actor-critic (TDR) method. It aims at reducing both over- and under-estimation errors. With TDR and by combining good DRL improvements, such as distributional learning and the long N-step surrogate stage reward (LNSS) method, we show that our new TDR-based actor-critic learning has enabled DRL methods to outperform their respective baselines in challenging environments in DeepMind Control Suite. Furthermore, they elevate TD3 and SAC respectively to a level of performance comparable to that of D4PG (the current SOTA), and they also improve the performance of D4PG to a new SOTA level measured by mean reward, convergence speed, learning success rate, and learning variance.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses estimation bias, a common problem in Deep Reinforcement Learning (DRL). Specifically, it focuses on the overestimation and underestimation errors that arise in actor-critic methods. These errors can lead to inaccurate policy updates during learning and thereby degrade the overall performance of the algorithm.

### Main contributions of the paper:

1. **TD-regularized double critic networks**: The target value with the minimum TD error is selected, which reduces both overestimation and underestimation.
2. **TD-regularized actor network**: The TD error is added as a regularization term in the actor update to further reduce the estimation bias of the critic.
3. **Combination with Distributional RL and the Long N-step Surrogate Stage (LNSS) method**: These are combined with TDR to further improve the stability and performance of learning.

### Specific methods:

- **TD-regularized double critic networks**:
  - Traditional methods usually take the minimum of the values produced by the two target networks to reduce overestimation error, but this may introduce underestimation error.
  - The new method reduces overestimation and underestimation errors at the same time by selecting the target value with the minimum TD error.
- **TD-regularized actor network**:
  - The TD error is added as a regularization term in the actor network update so that the actor is not misled by inaccurate critic estimates.
  - This helps to further reduce the estimation bias of the critic, thereby improving overall performance.

### Experimental results:

- **Benchmark tests**: Experiments were carried out in multiple environments of the DeepMind Control Suite, including Cheetah Run, Finger Turn, and Quadruped Walk, among others.
- **Performance improvement**: The TDR method significantly improves the performance of the baseline algorithms (TD3, SAC, and D4PG) in most environments, especially in sparse-reward and noisy environments.
- **Robustness**: The TDR method is more robust when dealing with sparse rewards and dense random rewards.

### Theoretical analysis:

- **Theorem 1**: Proves that the TD-regularized double critic networks reduce the estimation bias more effectively.
- **Theorem 2**: Explains how the TD-regularized actor network can still perform effective policy updates in the presence of estimation bias.
- **Theorem 3**: Analyzes the influence of sub-optimal policy updates on critic estimates and proves that the TD-regularized actor network reduces this influence.

### Conclusion:

By introducing the TD-regularization mechanism, the TDR method proposed in this paper makes significant progress in reducing estimation bias in deep reinforcement learning, improving the performance and robustness of the algorithm in complex tasks.
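
### Illustrative sketch:

The two TD-regularization mechanisms summarized above can be illustrated with a small scalar sketch. This is not the authors' implementation. It assumes that each critic's TD error is its own bootstrapped target minus its current estimate, that "minimum TD error" means smallest absolute value, and that the actor loss adds the absolute TD error with a weight. The function names and the coefficient `rho` are hypothetical and introduced only for illustration.

```python
def clipped_double_q_target(reward, gamma, q1_next, q2_next):
    """Baseline for comparison: bootstrap from the smaller of the two target-critic values."""
    return reward + gamma * min(q1_next, q2_next)


def td_selected_target(reward, gamma, q1_next, q2_next, q1_curr, q2_curr):
    """Assumed form of the TD-regularized selection.

    Each critic proposes its own bootstrapped target; the target whose
    critic has the smaller absolute TD error is returned.
    """
    y1 = reward + gamma * q1_next
    y2 = reward + gamma * q2_next
    delta1 = y1 - q1_curr  # TD error of critic 1 (assumed definition)
    delta2 = y2 - q2_curr  # TD error of critic 2 (assumed definition)
    return y1 if abs(delta1) <= abs(delta2) else y2


def td_regularized_actor_loss(q_value, td_error, rho=0.5):
    """Assumed actor objective: maximize Q while penalizing the critic's TD error.

    `rho` is a hypothetical regularization weight, not a value from the paper.
    """
    return -q_value + rho * abs(td_error)


# Toy numbers chosen so the arithmetic is exact in floating point.
baseline = clipped_double_q_target(1.0, 0.5, q1_next=4.0, q2_next=2.0)
selected = td_selected_target(1.0, 0.5, q1_next=4.0, q2_next=2.0,
                              q1_curr=3.0, q2_curr=1.0)
actor_loss = td_regularized_actor_loss(q_value=3.0, td_error=1.0, rho=0.5)
print(baseline, selected, actor_loss)
```

With these toy numbers, critic 1 has zero TD error, so the TD-based rule keeps its target of 3.0, whereas taking the minimum would bootstrap from critic 2 and yield 2.0. This is the kind of case in which the minimum rule can underestimate.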