Abstract: Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence between the current and new policies (the TR constraint) to avoid aggressive update steps and improve learning performance. However, most existing TR-based MARL algorithms are on-policy: they require new data sampled by the current policies for training and cannot use off-policy (historical) data, which leads to low sample efficiency. This study aims to improve the data efficiency of TR-based learning methods. To this end, an approximation of the original objective function is designed. It is then proven that, as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data guarantees monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms existing algorithms by a significant margin.
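The abstract names two quantities, the KL divergence that bounds a policy update and the entropy term added to the reward. The sketch below illustrates both for a single agent with a diagonal Gaussian policy. It is not the paper's algorithm. The Gaussian policy class, the KL direction (old to new), the threshold `delta`, the coefficient `alpha`, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between two diagonal Gaussians, summed over action dims.

    Assumption: the KL direction is old -> new; the abstract does not specify it.
    """
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl


def gaussian_entropy(std):
    """Differential entropy of a diagonal Gaussian with the given std devs."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in std)


def entropy_augmented_reward(reward, std, alpha=0.2):
    """Reward plus an entropy bonus; alpha is a hypothetical temperature."""
    return reward + alpha * gaussian_entropy(std)


def within_trust_region(kl, delta=0.01):
    """True if the update size stays under the hypothetical KL threshold delta."""
    return kl <= delta


if __name__ == "__main__":
    old = ([0.0, 0.0], [1.0, 1.0])   # (means, std devs) of the current policy
    new = ([0.05, 0.0], [1.0, 1.0])  # candidate updated policy
    kl = gaussian_kl(old[0], old[1], new[0], new[1])
    print("KL:", kl, "accepted:", within_trust_region(kl))
    print("augmented reward:", entropy_augmented_reward(1.0, new[1]))
```

In a multi-agent setting, a check of this kind would presumably be applied to each agent's policy or to the joint policy; the abstract does not say which.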

Natural Policy Gradient and Actor Critic Methods for Constrained Multi-Task Reinforcement Learning

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

A Decentralized Policy Gradient Approach to Multi-task Reinforcement Learning

Leveraging the Efficiency of Multi-Task Robot Manipulation Via Task-Evoked Planner and Reinforcement Learning

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Efficient Multi-Task Reinforcement Learning via Task-Specific Action Correction

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Multi-Task Policy Search

AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers

DiGrad: Multi-Task Reinforcement Learning with Shared Actions

Latent-Conditioned Policy Gradient for Multi-Objective Deep Reinforcement Learning

Actor-Critic Reinforcement Learning with Phased Actor

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Modelling the Dynamic Joint Policy of Teammates with Attention Multi-agent DDPG

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

Counterfactual Multi-Agent Policy Gradients

Multi-agent cooperation through learning-aware policy gradients

Reducing Overestimation Bias in Multi-Agent Domains Using Double Centralized Critics

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control