Revisiting the Gumbel-Softmax in MADDPG

Callum Rhys Tilbury, Filippos Christianos, Stefano V. Albrecht
2023-06-14
Abstract: MADDPG is an algorithm in multi-agent reinforcement learning (MARL) that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete action spaces, discrete gradient estimation must be performed. For MADDPG, the Gumbel-Softmax (GS) estimator is used -- a reparameterisation which relaxes a discrete distribution into a similar continuous one. This method, however, is statistically biased, and a recent MARL benchmarking paper suggests that this bias makes MADDPG perform poorly in grid-world situations, where the action space is discrete. Fortunately, many alternatives to the GS exist, boasting a wide range of properties. This paper explores several of these alternatives and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original GS in several tasks, achieving up to 55% higher returns, along with faster convergence.
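To make the relaxation described in the abstract concrete, the following is a minimal sketch of drawing one Gumbel-Softmax sample in plain Python. It is not the paper's implementation: the function names `sample_gumbel`, `gumbel_softmax_sample`, and `discretise` are hypothetical, and only the forward pass is shown. In an autodiff framework the relaxed sample would be differentiable with respect to the logits, which is what lets a gradient flow through a discrete action choice.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so that neither logarithm receives 0.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_sample(logits, temperature, rng):
    """Return a relaxed (continuous) sample on the probability simplex.

    Each logit is perturbed with Gumbel noise, divided by the temperature,
    and passed through a softmax.
    """
    perturbed = [(l + sample_gumbel(rng)) / temperature for l in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def discretise(relaxed):
    """Map a relaxed sample to the one-hot vector of its largest entry."""
    idx = max(range(len(relaxed)), key=relaxed.__getitem__)
    return [1.0 if i == idx else 0.0 for i in range(len(relaxed))]


rng = random.Random(0)  # assumed seed, for reproducibility only
relaxed = gumbel_softmax_sample([1.0, 2.0, 0.5], temperature=1.0, rng=rng)
action = discretise(relaxed)
```

Here `relaxed` is a vector of non-negative weights summing to one, while `action` is the discrete one-hot choice an environment with a discrete action space would need. The gap between the two is where the estimator's bias comes from.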
Machine Learning, Artificial Intelligence, Multiagent Systems
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how to improve the performance of the MADDPG algorithm in discrete action spaces within multi-agent reinforcement learning (MARL). Specifically:

1. **Background of the MADDPG Algorithm**:
   - MADDPG extends the single-agent DDPG algorithm to multi-agent environments. DDPG is designed for continuous action spaces.
   - For discrete action spaces, MADDPG uses the Gumbel-Softmax (GS) estimator for discrete gradient estimation, but this estimator is statistically biased.
2. **Research Motivation**:
   - A recent benchmark indicates that MADDPG performs poorly in discrete action spaces (such as grid-world environments), possibly because of the statistical bias introduced by the GS.
   - To address this, the paper explores several alternative discrete gradient estimators and integrates them into MADDPG to evaluate their impact on performance.
3. **Experimental Design**:
   - The researchers selected four alternatives: two simple modifications of the existing GS (using a lower temperature, and annealing the temperature), and two newer estimators from the literature (Gumbel-Rao Monte Carlo and Gapped Straight-Through).
   - Tests were conducted on nine grid-world tasks, and the impact of these estimators on various performance metrics was analyzed.
4. **Main Findings**:
   - Experimental results show that the Gapped Straight-Through (GST) estimator performs significantly better than the original Gumbel-Softmax in several tasks, with returns up to 55% higher and faster convergence.
   - This indicates that a better discrete gradient estimator can substantially improve the performance of MADDPG in discrete action spaces.

### Summary

This paper aims to improve the performance of the MADDPG algorithm in discrete action spaces, particularly in grid-world environments, by replacing its discrete gradient estimator. The results indicate that the Gapped Straight-Through method is one of the most effective improvements.
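One of the simple modifications mentioned above is annealing the Gumbel-Softmax temperature during training. The sketch below shows one possible schedule, an exponential interpolation between a start and an end temperature. The function name `annealed_temperature`, the schedule shape, and the default values `1.0` and `0.1` are all assumptions for illustration; the source does not state which schedule or values the paper uses.

```python
def annealed_temperature(step, total_steps, start=1.0, end=0.1):
    """Exponentially interpolate the temperature from `start` to `end`.

    `start`, `end`, and the exponential shape are illustrative assumptions,
    not values taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if start <= 0 or end <= 0:
        raise ValueError("temperatures must be positive")
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start * (end / start) ** frac


# Temperatures at a few points of a hypothetical 100-step run.
schedule = [annealed_temperature(s, 100) for s in (0, 25, 50, 75, 100)]
```

The usual motivation for such a schedule is that a high temperature gives smoother, lower-variance gradients early on, while a low temperature brings the relaxed sample closer to a one-hot vector later in training.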