Double Actor-Critic with TD Error-Driven Regularization in Reinforcement Learning

Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang
2024-09-28
Abstract: To obtain better value estimation in reinforcement learning, we propose a novel algorithm based on the double actor-critic framework with temporal difference error-driven regularization, abbreviated as TDDR. TDDR employs double actors, with each actor paired with a critic, thereby fully leveraging the advantages of double critics. Additionally, TDDR introduces an innovative critic regularization architecture. Compared to classical deterministic policy gradient-based algorithms that lack a double actor-critic structure, TDDR provides superior estimation. Moreover, unlike existing algorithms with double actor-critic frameworks, TDDR does not introduce any additional hyperparameters, significantly simplifying the design and implementation process. Experiments demonstrate that TDDR exhibits strong competitiveness compared to benchmark algorithms in challenging continuous control tasks.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to obtain better value estimation in reinforcement learning. Specifically, it proposes a Temporal Difference Error-Driven Regularization (TDDR) algorithm built on the double actor-critic framework, aiming to overcome the overestimation bias of existing algorithms and to improve performance in continuous control tasks without introducing additional hyperparameters.

### Main Contributions:
1. **Performance Improvement**: TDDR is strongly competitive with the benchmark algorithms in challenging continuous control tasks, without introducing additional hyperparameters.
2. **Innovative Regularization Method**: TDDR introduces a temporal difference error-driven regularization method into the double actor-critic framework and uses the double target actors for action evaluation, forming Double-Actor-based Clipped Double Q-Learning (DA-CDQ).
3. **Convergence Proof**: The paper provides convergence proofs for TDDR under both stochastic and simultaneous update modes, and verifies its performance through numerical benchmark comparisons.

### Key Technical Features:
- **Double Actor-Critic Structure**: TDDR uses two actors and two critics, with each actor paired with a critic, fully exploiting the advantages of double critics.
- **Temporal Difference Error-Driven Regularization**: TDDR uses the temporal difference error of the target networks to guide which value estimate drives the critic updates, thereby mitigating overestimation.
- **No Additional Hyperparameters**: Unlike existing double actor-critic algorithms, TDDR does not introduce additional hyperparameters, simplifying the design and implementation process.

### Comparison with Other Algorithms:
- **DDPG**: Uses only one actor and one critic and is prone to overestimation.
- **TD3**: Although it introduces double critics, it still relies on a single actor for policy updates and action selection.
- **DARC**, **SD3**, **GD3**: Although these algorithms also use a double actor-critic structure, they introduce additional hyperparameters, which increases the tuning burden and can make performance less stable.

### Conclusion:
By introducing temporal difference error-driven regularization, TDDR improves value-estimation accuracy and performance in continuous control tasks without introducing additional hyperparameters. This makes TDDR simpler and more reliable in practical applications.
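
To make the DA-CDQ and TD error-driven selection ideas more concrete, here is a minimal NumPy sketch of how a critic target might be formed from two target actors and two target critics. This summary does not give the paper's exact equations, so the form of the TD errors, the selection rule (keep the candidate whose target-network TD error is smaller in magnitude), and every name below (`tddr_style_target`, `q1_next_a1`, and so on) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def tddr_style_target(r, done, gamma,
                      q1_next_a1, q2_next_a1,
                      q1_next_a2, q2_next_a2,
                      q1_targ_sa, q2_targ_sa):
    """Hypothetical critic target combining DA-CDQ with a TD-error-based choice.

    q{i}_next_a{j}: target critic i evaluated at (s', a'_j), where a'_j is the
                    action proposed by target actor j (assumed inputs).
    q{i}_targ_sa:   target critic i evaluated at the sampled (s, a).
    All arguments are 1-D arrays over a batch, except the scalar gamma.
    """
    not_done = 1.0 - done

    # DA-CDQ step: one clipped (min over both critics) target per target actor.
    psi1 = r + gamma * not_done * np.minimum(q1_next_a1, q2_next_a1)
    psi2 = r + gamma * not_done * np.minimum(q1_next_a2, q2_next_a2)

    # TD errors of the target networks (assumed form): each target critic is
    # paired with the action of its own target actor.
    delta1 = r + gamma * not_done * q1_next_a1 - q1_targ_sa
    delta2 = r + gamma * not_done * q2_next_a2 - q2_targ_sa

    # Assumed selection rule: per sample, keep the candidate target whose
    # target-network TD error has the smaller magnitude.
    use_first = np.abs(delta1) <= np.abs(delta2)
    return np.where(use_first, psi1, psi2)


if __name__ == "__main__":
    y = tddr_style_target(
        r=np.array([1.0, 0.0]), done=np.array([0.0, 1.0]), gamma=0.9,
        q1_next_a1=np.array([10.0, 5.0]), q2_next_a1=np.array([8.0, 6.0]),
        q1_next_a2=np.array([7.0, 4.0]), q2_next_a2=np.array([9.0, 3.0]),
        q1_targ_sa=np.array([9.5, 0.5]), q2_targ_sa=np.array([5.0, 0.1]),
    )
    print(y)
```

Note that the selection in this sketch compares two quantities computed from the networks themselves, so it needs no tunable weight or threshold, which is consistent with the summary's claim that TDDR adds no hyperparameters. In a full agent, both critics would then be regressed toward the returned target.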