Abstract: To obtain better value estimation in reinforcement learning, we propose a novel algorithm based on the double actor-critic framework with temporal difference error-driven regularization, abbreviated as TDDR. TDDR employs double actors, with each actor paired with a critic, thereby fully leveraging the advantages of double critics. Additionally, TDDR introduces an innovative critic regularization architecture. Compared to classical deterministic policy gradient-based algorithms that lack a double actor-critic structure, TDDR provides superior estimation. Moreover, unlike existing algorithms with double actor-critic frameworks, TDDR does not introduce any additional hyperparameters, significantly simplifying the design and implementation process. Experiments demonstrate that TDDR exhibits strong competitiveness compared to benchmark algorithms in challenging continuous control tasks.

What problem does this paper attempt to address?

The problem this paper addresses is how to obtain better value estimation in reinforcement learning. Specifically, it proposes a Temporal-Difference Error-Driven Regularization (TDDR) algorithm based on the double actor-critic framework, aiming to overcome the overestimation bias of existing algorithms and to improve performance on continuous-control tasks without introducing additional hyperparameters.

### Main Contributions:

1. **Performance Improvement**: TDDR is strongly competitive with the benchmark algorithms on challenging continuous-control tasks, without introducing additional hyperparameters.
2. **Innovative Regularization Method**: TDDR introduces a temporal-difference error-driven regularization method in the double actor-critic framework and further uses both actors for action evaluation, forming Double-Actor-based Clipped Double Q-Learning (DA-CDQ).
3. **Convergence Proof**: The paper provides convergence proofs for TDDR under stochastic-update and simultaneous-update modes, and verifies its performance through numerical benchmark comparisons.

### Key Technical Features:

- **Double Actor-Critic Structure**: TDDR uses two actors and two critics, with each actor paired with a critic, fully exploiting the advantages of double critics.
- **Temporal-Difference Error-Driven Regularization**: TDDR uses the temporal-difference error of the target networks to guide which critic update is applied, thereby mitigating overestimation.
- **No Additional Hyperparameters**: Unlike existing double actor-critic algorithms, TDDR does not introduce additional hyperparameters, which simplifies design and implementation.

### Comparison with Other Algorithms:

- **DDPG**: Uses only one actor and one critic, and is prone to overestimation.
- **TD3**: Introduces double critics, but still relies on a single actor for action selection and updates that actor through only one of the critics.
- **DARC**, **SD3**, **GD3**: These algorithms also use the double actor-critic structure, but they introduce additional hyperparameters, which makes tuning harder and performance less stable.

### Conclusion:

By introducing the temporal-difference error-driven regularization method, TDDR improves value-estimation accuracy and performance on continuous-control tasks without adding hyperparameters. This makes TDDR simpler and more reliable in practical applications.
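
To make the difference between a TD3-style target and a double-actor, TD-error-driven target more concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the selection rule (pick the target actor whose paired target critic has the smaller absolute TD error, then apply the clipped double-Q minimum to that action) is an assumed reading of the summary above, and every name in it (`clipped_double_q_target`, `td_error_selected_target`, the toy critics and actors) is hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical toy types: scalar states and actions keep the example readable.
Critic = Callable[[float, float], float]   # Q(s, a)
Actor = Callable[[float], float]           # pi(s)


def clipped_double_q_target(reward: float, gamma: float, done: bool,
                            next_state: float, next_action: float,
                            target_critics: Sequence[Critic]) -> float:
    """TD3-style target: bootstrap from the smallest target-critic value."""
    not_done = 0.0 if done else 1.0
    q_min = min(q(next_state, next_action) for q in target_critics)
    return reward + gamma * not_done * q_min


def td_error_selected_target(state: float, action: float, reward: float,
                             next_state: float, done: bool, gamma: float,
                             target_actors: Sequence[Actor],
                             target_critics: Sequence[Critic]) -> float:
    """Assumed double-actor variant (illustrative, not the paper's verified rule).

    Each target actor is paired with one target critic. The pair with the
    smallest absolute TD error supplies the next action, and the clipped
    double-Q minimum is then evaluated at that action.
    """
    not_done = 0.0 if done else 1.0
    best_action, best_abs_td = None, float("inf")
    for actor, critic in zip(target_actors, target_critics):
        a_next = actor(next_state)
        td = reward + gamma * not_done * critic(next_state, a_next) - critic(state, action)
        if abs(td) < best_abs_td:
            best_action, best_abs_td = a_next, abs(td)
    return clipped_double_q_target(reward, gamma, done, next_state,
                                   best_action, target_critics)


# Toy critics and actors, chosen only so the arithmetic is easy to follow.
q1: Critic = lambda s, a: s + a
q2: Critic = lambda s, a: s + 2.0 * a
pi1: Actor = lambda s: 1.0
pi2: Actor = lambda s: -1.0

# Single-actor (TD3-style) target using pi1 only.
print(clipped_double_q_target(1.0, 0.5, False, 2.0, pi1(2.0), [q1, q2]))  # 2.5
# Double-actor target: the (pi2, q2) pair has the smaller |TD error|.
print(td_error_selected_target(1.0, 0.0, 1.0, 2.0, False, 0.5,
                               [pi1, pi2], [q1, q2]))                      # 1.0
```

Note that the selection step compares quantities already computed from the target networks, so the sketch adds no tunable coefficient, in line with the summary's claim that no extra hyperparameters are needed. In a real implementation the critics and actors would be neural networks evaluated on mini-batches rather than scalar lambdas.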

Double Actor-Critic with TD Error-Driven Regularization in Reinforcement Learning
