Abstract: Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence between the current and new policies (the TR constraint) to avoid aggressive update steps and improve learning performance. However, most existing TR-based MARL algorithms are on-policy: they require new data sampled by the current policies for training and cannot use off-policy (historical) data, which leads to low sample efficiency. This study aims to improve the sample efficiency of TR-based learning methods. To this end, an approximation of the original objective function is designed, and it is proven that, as long as the policy update size (measured by the KL divergence) is restricted, optimizing the designed objective with historical data guarantees monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the centralized training with decentralized execution (CTDE) framework. Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging cooperative continuous multi-agent control tasks. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
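
The two ingredients named in the abstract, a KL-divergence trust-region constraint and an entropy term added to the reward, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a single agent with a diagonal Gaussian policy, and the function names, the threshold `delta`, and the entropy weight `alpha` are hypothetical choices made only for illustration.

```python
import math


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between two diagonal Gaussian policies, summed over action dimensions."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        total += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return total


def gaussian_entropy(std):
    """Differential entropy of a diagonal Gaussian with the given standard deviations."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in std)


def entropy_augmented_reward(reward, std, alpha):
    """Environment reward plus an entropy bonus weighted by alpha (hypothetical weight)."""
    return reward + alpha * gaussian_entropy(std)


def accept_update(mu_old, std_old, mu_new, std_new, delta):
    """Return True if the candidate policy stays inside the trust region KL <= delta."""
    return gaussian_kl(mu_old, std_old, mu_new, std_new) <= delta


if __name__ == "__main__":
    mu_old, std_old = [0.0, 0.0], [1.0, 1.0]
    small_step = [0.1, -0.1]   # small change of the policy mean
    large_step = [1.0, 1.0]    # aggressive change of the policy mean
    delta = 0.05               # assumed trust-region radius

    print(accept_update(mu_old, std_old, small_step, std_old, delta))
    print(accept_update(mu_old, std_old, large_step, std_old, delta))
    print(entropy_augmented_reward(1.0, std_old, alpha=0.1))
```

With these assumed values, the small step stays inside the trust region while the large step violates it. A practical method would enforce the constraint during optimization on replayed data rather than by accepting or rejecting whole updates as done here.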

MAPPG: Multi-agent Phasic Policy Gradient

MAPPO method based on attention behavior network

Off-Policy Multi-Agent Decomposed Policy Gradients

Optimistic Multi-Agent Policy Gradient

Multi-Agent Path Finding Method Based on Evolutionary Reinforcement Learning

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Settling the Variance of Multi-Agent Policy Gradients

A Policy Gradient Algorithm to Alleviate the Multi-Agent Value Overestimation Problem in Complex Environments

TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient

Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

GAILPG: Multi-Agent Policy Gradient with Generative Adversarial Imitation Learning

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO

Path Planning in Complex Environments Using Attention-Based Deep Deterministic Policy Gradient

Learning Explicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning Via Polarization Policy Gradient

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

Unlocking the Potential of MAPPO with Asynchronous Optimization

Improving Learnt Local MAPF Policies with Heuristic Search

Preference-based experience sharing scheme for multi-agent reinforcement learning in multi-target environments