Abstract: MADDPG is an algorithm in multi-agent reinforcement learning (MARL) that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete action spaces, discrete gradient estimation must be performed. For MADDPG, the Gumbel-Softmax (GS) estimator is used -- a reparameterisation which relaxes a discrete distribution into a similar continuous one. This method, however, is statistically biased, and a recent MARL benchmarking paper suggests that this bias makes MADDPG perform poorly in grid-world situations, where the action space is discrete. Fortunately, many alternatives to the GS exist, boasting a wide range of properties. This paper explores several of these alternatives and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original GS in several tasks, achieving up to 55% higher returns, along with faster convergence.
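To make the relaxation mentioned in the abstract concrete, here is a minimal sketch of a single Gumbel-Softmax sample in plain Python. It is an illustration of the general technique, not the paper's implementation: the function name `gumbel_softmax_sample`, the example probabilities, and the two temperatures are all assumptions chosen for the demonstration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (continuous) sample from a categorical distribution.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform sampling.
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        # Perturb the logit, then scale by the temperature.
        noisy.append((logit + gumbel) / temperature)
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Example probabilities are assumptions, chosen only for the demonstration.
logits = [math.log(p) for p in (0.1, 0.6, 0.3)]

# Seeding both generators identically gives the same Gumbel noise,
# so the only difference between the two samples is the temperature.
warm = gumbel_softmax_sample(logits, temperature=1.0, rng=random.Random(0))
cold = gumbel_softmax_sample(logits, temperature=0.1, rng=random.Random(0))
print(warm)
print(cold)
```

With identical noise, the lower-temperature sample sits closer to a one-hot vector. That is the trade-off behind the estimator: a lower temperature gives a relaxed sample that is closer to a true discrete action, typically at the cost of higher-variance gradients.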

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper primarily explores how to improve the performance of the MADDPG algorithm in discrete action spaces within multi-agent reinforcement learning (MARL). Specifically:

1. **Background of the MADDPG Algorithm**:
   - The MADDPG algorithm is an extension of the single-agent DDPG algorithm to multi-agent environments, suitable for continuous action spaces.
   - For discrete action spaces, MADDPG uses the Gumbel-Softmax (GS) method for discrete gradient estimation, but this method has statistical bias.
2. **Research Motivation**:
   - A recent benchmark indicates that MADDPG performs poorly in discrete action spaces (such as grid world environments), possibly due to the statistical bias introduced by GS.
   - To overcome this issue, the paper explores several alternative discrete gradient estimation methods and integrates them into MADDPG to evaluate their impact on performance.
3. **Experimental Design**:
   - The researchers selected four alternative methods: two simple improvements to the existing GS method (lowering temperature and annealing temperature), and two new methods from the literature (Gumbel-Rao Monte Carlo and Gapped Straight-Through).
   - Tests were conducted on nine grid world tasks, and the impact of these methods on various performance metrics was analyzed.
4. **Main Findings**:
   - Experimental results show that the Gapped Straight-Through (GST) method significantly outperforms the original Gumbel-Softmax method, with a maximum reward increase of 55% and faster convergence.
   - This indicates that improving discrete gradient estimation methods can significantly enhance the performance of MADDPG in discrete action spaces.

### Summary

This paper aims to improve the performance of the MADDPG algorithm in discrete action spaces by enhancing discrete gradient estimation methods, particularly in grid world environments. The research results indicate that the Gapped Straight-Through method is one of the most effective improvements.
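The "annealing temperature" variant mentioned in the experimental design can be sketched as a schedule that starts with a high Gumbel-Softmax temperature and lowers it as training progresses. The summary does not state which schedule the paper uses, so the exponential form below, the function name `annealed_temperature`, and the default values (`start`, `end`, `decay_steps`) are all assumptions for illustration.

```python
def annealed_temperature(step, start=1.0, end=0.1, decay_steps=10_000):
    """Exponentially interpolate the temperature from `start` to `end`.

    Hypothetical schedule: the shape and default values are assumptions,
    not the settings reported in the paper.
    """
    # Fraction of the schedule completed, clamped to [0, 1] so the
    # temperature holds at `end` once `decay_steps` has passed.
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start * (end / start) ** fraction


# Print the temperature at a few points along the hypothetical schedule.
for step in (0, 2_500, 5_000, 10_000, 20_000):
    print(step, annealed_temperature(step))
```

The "lowering temperature" variant corresponds to using a single fixed value below the default throughout training, with no schedule at all.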

Revisiting the Gumbel-Softmax in MADDPG

Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Classification Experience Replay

A Dynamically Adaptive Approach to Reducing Strategic Interference for Multi-agent Systems

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Ε-Maximum Critic Deep Deterministic Policy Gradient for Multi-agent Reinforcement Learning

Research on Wargame Decision-Making Method Based on Multi-Agent Deep Deterministic Policy Gradient

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

Softmax Deep Double Deterministic Policy Gradients

Hindsight-aware Deep Reinforcement Learning Algorithm for Multi-Agent Systems

Research on multi-UAV task decision-making based on improved MADDPG algorithm and transfer learning

A Distributed Adaptive Policy Gradient Method Based on Momentum for Multi-Agent Reinforcement Learning

Settling the Variance of Multi-Agent Policy Gradients

MADDPGViz: a visual analytics approach to understand multi-agent deep reinforcement learning

Decomposed and Prioritized Experience Replay-based MADDPG Algorithm for Multi-UAV Confrontation

Friend-or-Foe Deep Deterministic Policy Gradient

A Policy Gradient Algorithm to Alleviate the Multi-Agent Value Overestimation Problem in Complex Environments

Multi-Agent Cooperation Decision-Making by Reinforcement Learning with Encirclement Rewards

Multi-Agent Reinforcement Learning for Problems with Combined Individual and Team Reward

R-MADDPG for Partially Observable Environments and Limited Communication

Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning