Abstract: Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence between the current and new policies (the TR constraint) to avoid aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy: they require new data sampled by the current policies for training and cannot utilize off-policy (historical) data, which leads to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. It is then proven that, as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data guarantees monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
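The abstract names three ingredients: a surrogate objective evaluated on historical data, a KL restriction on the policy update, and an entropy term added to the reward. The sketch below illustrates how these pieces could fit together for a single agent with a diagonal Gaussian policy. It is not the paper's objective or algorithm. The importance-ratio form, the truncation at `max_ratio`, the penalty weight `beta`, the entropy weight `alpha`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_log_prob(actions, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((actions - mean) ** 2 / var
                      + 2.0 * log_std + np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1)

def gaussian_kl(mean_p, log_std_p, mean_q, log_std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    var_p = np.exp(2.0 * log_std_p)
    var_q = np.exp(2.0 * log_std_q)
    per_dim = (log_std_q - log_std_p
               + (var_p + (mean_p - mean_q) ** 2) / (2.0 * var_q) - 0.5)
    return per_dim.sum(axis=-1)

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian."""
    per_dim = log_std + 0.5 * np.log(2.0 * np.pi * np.e)
    return per_dim.sum(axis=-1)

def entropy_augmented_reward(reward, log_std, alpha=0.01):
    """Reward plus an entropy bonus (alpha is a hypothetical weight)."""
    return reward + alpha * gaussian_entropy(log_std)

def kl_penalized_surrogate(actions, advantages, behavior, current, new,
                           beta=1.0, max_ratio=10.0):
    """Importance-weighted surrogate on replayed data with a KL penalty.

    Each policy argument is a (mean, log_std) pair. `behavior` generated the
    stored actions, `current` is the policy before the update, and `new` is
    the candidate policy. The KL term is a soft stand-in for a TR constraint.
    """
    logp_new = gaussian_log_prob(actions, *new)
    logp_beh = gaussian_log_prob(actions, *behavior)
    # Truncate the ratio to limit variance from stale samples (assumption).
    ratio = np.minimum(np.exp(logp_new - logp_beh), max_ratio)
    kl = gaussian_kl(*current, *new)
    return float(np.mean(ratio * advantages) - beta * np.mean(kl))
```

When the behavior, current, and new policies coincide, the ratio is 1 and the KL term is 0, so the surrogate reduces to the mean advantage. In a CTDE setting the advantages would typically come from a centralized critic, which this sketch leaves out.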

Health-Informed Policy Gradients for Multi-Agent Reinforcement Learning

Multi-agent cooperation through learning-aware policy gradients

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

A Policy Gradient Algorithm to Alleviate the Multi-Agent Value Overestimation Problem in Complex Environments

Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

Counterfactual Multi-Agent Policy Gradients

Optimistic Multi-Agent Policy Gradient

Difference Advantage Estimation for Multi-Agent Policy Gradients

Data-Based Optimal Consensus Control for Multiagent Systems With Policy Gradient Reinforcement Learning

Model-free Policy Learning with Reward Gradients

SocialGFs: Learning Social Gradient Fields for Multi-Agent Reinforcement Learning

Multi-Agent Reinforcement Learning and Genetic Policy Sharing

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

Multi-Agent Reinforcement Learning for Problems with Combined Individual and Team Reward

Cooperative Multi-Agent Policy Gradients with Sub-optimal Demonstration

Balancing Profit, Risk, and Sustainability for Portfolio Management

Multiagent Soft Q-Learning