Multi-Agent Reinforcement Learning for Problems with Combined Individual and Team Reward

Hassam Ullah Sheikh, Ladislau Bölöni
DOI: https://doi.org/10.48550/arXiv.2003.10598
2020-03-24
Abstract: Many cooperative multi-agent problems require agents to learn individual tasks while contributing to the collective success of the group. This is a challenging task for current state-of-the-art multi-agent reinforcement learning algorithms that are designed to either maximize the global reward of the team or the individual local rewards. The problem is exacerbated when either of the rewards is sparse, leading to unstable learning. To address this problem, we present Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG): a novel cooperative multi-agent reinforcement learning framework that simultaneously learns to maximize the global and local rewards. We evaluate our solution on the challenging defensive escort team problem and show that our solution achieves a significantly better and more stable performance than the direct adaptation of the MADDPG algorithm.
Multiagent Systems, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how agents in multi-agent reinforcement learning can learn their individual tasks while still contributing to the overall success of the team. Current state-of-the-art multi-agent reinforcement learning algorithms are designed to maximize either the team's global reward or the agents' individual local rewards. When either reward signal is sparse, learning becomes unstable. To solve this problem, the authors propose Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG), a novel cooperative multi-agent reinforcement learning framework that learns to maximize global and local rewards simultaneously. Specifically, the paper points out that in cooperative multi-agent problems each agent must maximize both its own gain (the local reward) and the collective success of the team (the global reward). For example, in a defensive escort team, each agent must maintain a specific distance from the goods it escorts so that it does not violate social norms, without sacrificing the safety of the goods. Although multi-agent reinforcement learning (MARL) has been successful in multi-player games, learning multi-agent cooperation while simultaneously maximizing local rewards remains an open challenge. In such problems, agents explicitly receive two reward signals: the team's global reward and the agent's individual local reward. To address these challenges, the paper proposes a dual-critic framework (DE-MADDPG). By training two critics, one to evaluate the global reward and one to evaluate the local reward, it avoids the need to hand-craft an entangled multi-objective reward function.
This method not only improves the stability of learning but also allows performance-enhancing techniques to be applied, such as Prioritized Experience Replay (PER) and Twin Delayed Deep Deterministic Policy Gradient (TD3), the latter of which reduces over-estimation bias in the Q-function. Experimental results show that, on the defensive escort team problem, DE-MADDPG performs significantly better and more stably than a direct adaptation of the MADDPG algorithm.
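The dual-critic idea and the TD3 target described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the "policy" is reduced to a free action vector, the critics are hand-written toy functions rather than trained networks, gradients are taken by finite differences instead of autograd, and every name (`dual_critic_action_gradient`, `toy_global_q`, `toy_local_q`, `train_agent`) is hypothetical. It only shows how an agent's update can sum the action gradient of a centralized global critic and of its own local critic, and how a clipped double-Q target takes the smaller of two critic estimates.

```python
from typing import Callable, List, Sequence

Action = List[float]


def grad_wrt_action(q: Callable[[Sequence[float]], float],
                    action: Sequence[float], eps: float = 1e-5) -> Action:
    """Central finite-difference gradient of a critic q with respect to an action."""
    grad = []
    for k in range(len(action)):
        up, down = list(action), list(action)
        up[k] += eps
        down[k] -= eps
        grad.append((q(up) - q(down)) / (2 * eps))
    return grad


def dual_critic_action_gradient(global_q: Callable[[List[Action]], float],
                                local_q: Callable[[Sequence[float]], float],
                                joint_actions: List[Action],
                                agent: int) -> Action:
    """Sum the gradients of both critics with respect to one agent's own action.

    Assumption: the global critic sees every agent's action, while the
    local critic sees only this agent's action.
    """
    def global_of_own(own: Sequence[float]) -> float:
        joint = [list(a) for a in joint_actions]
        joint[agent] = list(own)
        return global_q(joint)

    g_global = grad_wrt_action(global_of_own, joint_actions[agent])
    g_local = grad_wrt_action(local_q, joint_actions[agent])
    return [gg + gl for gg, gl in zip(g_global, g_local)]


def clipped_double_q_target(reward: float, gamma: float,
                            q1_next: float, q2_next: float,
                            done: bool = False) -> float:
    """TD3-style target: bootstrap from the smaller of two critic estimates."""
    return reward + (0.0 if done else gamma * min(q1_next, q2_next))


# Toy critics (hypothetical): the team prefers actions near 1.0,
# while each individual agent prefers its own action near 0.0.
def toy_global_q(joint_actions: List[Action]) -> float:
    return -sum((a[0] - 1.0) ** 2 for a in joint_actions)


def toy_local_q(action: Sequence[float]) -> float:
    return -(action[0] - 0.0) ** 2


def train_agent(agent: int = 0, steps: int = 100, lr: float = 0.1) -> Action:
    """Gradient ascent on one agent's action using the summed critic gradients."""
    joint = [[0.0], [0.0]]
    for _ in range(steps):
        g = dual_critic_action_gradient(toy_global_q, toy_local_q, joint, agent)
        joint[agent] = [a + lr * gk for a, gk in zip(joint[agent], g)]
    return joint[agent]
```

With these toy critics, `train_agent()` settles on an action between the team optimum (1.0) and the individual optimum (0.0), because both gradients contribute to every step and neither reward has to be folded into the other. Which critic receives the TD3-style target in a full system is a design choice this sketch leaves open.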