Abstract: Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence between the current and new policies (the TR constraint) to avoid overly aggressive update steps and improve learning performance. However, most existing TR-based MARL algorithms are on-policy: they require new data sampled by the current policies for training and cannot use off-policy (historical) data, which leads to low sample efficiency. This study aims to improve the sample efficiency of TR-based learning methods. To this end, an approximation of the original objective function is designed, and it is proven that, as long as the policy update size (measured by the KL divergence) is restricted, optimizing the designed objective with historical data guarantees monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). In addition, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging cooperative continuous multi-agent control tasks. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
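
The two ingredients named in the abstract, a KL-based trust-region check and an entropy-augmented reward, can be illustrated with a minimal sketch. It assumes each agent uses a diagonal Gaussian policy, which the abstract does not state. The function names, the threshold `delta`, and the entropy weight `alpha` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between two diagonal Gaussian policies (assumed policy class)."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        # Closed-form KL divergence for one univariate Gaussian dimension.
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl


def gaussian_entropy(std):
    """Differential entropy of a diagonal Gaussian with the given standard deviations."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in std)


def within_trust_region(mu_old, std_old, mu_new, std_new, delta=0.01):
    """True if the proposed update keeps the KL divergence at or below delta.

    delta is a hypothetical threshold, not a value from the paper.
    """
    return gaussian_kl(mu_old, std_old, mu_new, std_new) <= delta


def entropy_augmented_reward(reward, std, alpha=0.01):
    """Environment reward plus an entropy bonus weighted by alpha (hypothetical weight)."""
    return reward + alpha * gaussian_entropy(std)
```

For example, `within_trust_region([0.0], [1.0], [0.05], [1.0])` accepts a small shift of the mean, whereas a shift of the mean from 0.0 to 1.0 gives a KL divergence of 0.5 and is rejected under the default threshold.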

PRAG: Periodic Regularized Action Gradient for Efficient Continuous Control

Self-play Reinforcement Learning with Comprehensive Critic in Computer Games

CGAR: Critic Guided Action Redistribution in Reinforcement Leaning

Actor-Critic Reinforcement Learning with Phased Actor

GRAC: Self-Guided and Self-Regularized Actor-Critic

Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Stochastic Cubic-Regularized Policy Gradient Method

Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Efficient Continuous Control with Double Actors and Regularized Critics

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Continuous control with deep reinforcement learning

Compatible Gradient Approximations for Actor-Critic Algorithms

Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning

Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms

Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games