Abstract: Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence (i.e., the TR constraint) between the current and new policies to avoid aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy, meaning that they require new data sampled by the current policies for training and cannot utilize off-policy (or historical) data, leading to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. In addition, it is proven that as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data can guarantee the monotonic improvement of the original target. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
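
The abstract does not give the designed objective itself, so the following is only a minimal sketch of the general idea it describes: an importance-weighted surrogate evaluated on historical data, with a penalty on the KL divergence between the current and new policies and an entropy bonus. Everything here is an assumption made for illustration, not the paper's algorithm: a single agent with a state-independent 1-D Gaussian policy, a KL penalty in place of a hard constraint, an entropy term added to the objective rather than to the reward, and the hypothetical names `surrogate_objective`, `kl_coef`, and `entropy_coef`.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy evaluated at `action`."""
    var = std * std
    return -0.5 * ((action - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between two 1-D Gaussians (closed form)."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)


def surrogate_objective(batch, current_policy, new_policy,
                        kl_coef=1.0, entropy_coef=0.01):
    """Illustrative off-policy surrogate (an assumption, not the paper's).

    batch: list of (action, advantage, behavior_mean, behavior_std) tuples,
           where the behavior parameters describe the older policy that
           generated the historical sample.
    current_policy, new_policy: (mean, std) pairs.
    """
    cur_mean, cur_std = current_policy
    new_mean, new_std = new_policy

    # Importance-weighted advantage: reweight each historical sample by
    # new_policy(a) / behavior_policy(a).
    weighted = 0.0
    for action, advantage, b_mean, b_std in batch:
        log_ratio = (gaussian_log_prob(action, new_mean, new_std)
                     - gaussian_log_prob(action, b_mean, b_std))
        weighted += math.exp(log_ratio) * advantage
    weighted /= len(batch)

    # Penalize the update size, measured as KL(current || new).
    kl = gaussian_kl(cur_mean, cur_std, new_mean, new_std)

    # Entropy bonus to encourage exploration.
    entropy = gaussian_entropy(new_std)

    return weighted - kl_coef * kl + entropy_coef * entropy


# Hypothetical historical samples drawn under an older N(0, 1) policy.
batch = [(0.3, 1.0, 0.0, 1.0), (-0.2, -0.5, 0.0, 1.0)]
value = surrogate_objective(batch, current_policy=(0.0, 1.0),
                            new_policy=(0.1, 0.9))
```

In this sketch, a candidate policy that moves far from the current one is penalized through the KL term even when its importance-weighted advantage looks attractive, which is the role the trust region plays in the text above.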

Decentralized Multi-Task Reinforcement Learning Policy Gradient Method with Momentum over Networks

A Distributed Adaptive Policy Gradient Method Based on Momentum for Multi-Agent Reinforcement Learning

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Classification Experience Replay

Fast Stochastic Policy Gradient: Negative Momentum for Reinforcement Learning

A Decentralized Policy Gradient Approach to Multi-task Reinforcement Learning

Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

ε-Maximum Critic Deep Deterministic Policy Gradient for Multi-agent Reinforcement Learning

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model

Improved Communication Efficiency in Federated Natural Policy Gradient via ADMM-based Gradient Updates

Twin Delayed Multi-Agent Deep Deterministic Policy Gradient

Multi-critic DDPG Method and Double Experience Replay

Off-Policy Multi-Agent Decomposed Policy Gradients

Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms

Multi-Agent Distributed Deep Deterministic Policy Gradient for Partially Observable Tracking

Long Short-Term Deterministic Policy Gradient for Joint Optimization of Computational Offloading and Resource Allocation in MEC

An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control

Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning