Towards Skilled Population Curriculum for Multi-Agent Reinforcement Learning

Rundong Wang, Longtao Zheng, Wei Qiu, Bowei He, Bo An, Zinovi Rabinovich, Yujing Hu, Yingfeng Chen, Tangjie Lv, Changjie Fan
DOI: https://doi.org/10.48550/arXiv.2302.03429
2023-02-07
Abstract: Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. As a remedy for ACL, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned by student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves the performance, scalability and sample efficiency in several MARL environments.
Artificial Intelligence, Machine Learning, Multiagent Systems
What problem does this paper attempt to address?
This paper addresses how to learn effective policies for large-scale multi-agent systems in sparse-reward environments in Multi-Agent Reinforcement Learning (MARL). Specifically, it points out that current MARL algorithms face two main challenges at scale:

1. **Scalability and Sparse-Reward Problem**: As the number of agents increases, the joint observation-action space grows exponentially, which makes it difficult to learn effective policies. In addition, sparse reward signals require a large number of training trajectories, which is an obstacle to applying existing MARL algorithms in complex environments.
2. **Limitations of Automatic Curriculum Learning (ACL)**: Although ACL helps agents learn by gradually increasing task difficulty, its applicability is limited in two ways:
   - There is no general student framework that handles both the varying number of agents across tasks and the sparse-reward problem.
   - The teacher's task is non-stationary because the student's policies are constantly changing.

To solve these problems, the paper proposes a new automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. The main contributions of SPC include:

- **Population-Invariant Communication**: The student is endowed with population-invariant communication, so it can handle the varying number of agents across tasks.
- **Hierarchical Skill Set**: The student also has a hierarchical skill set and can learn cooperation and behavior skills from distinct tasks.
- **Contextual Multi-Armed Bandit Teacher**: The teacher is modeled as a contextual multi-armed bandit conditioned on the student's policies, which lets a team of agents change its size while still retaining previously acquired skills.
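To make the idea of population-invariant communication concrete, here is a minimal sketch under stated assumptions. The summary does not say how SPC implements it, so mean pooling is used purely as a stand-in, and the function name `aggregate_messages` and the toy message vectors are hypothetical. The point it illustrates is only that the aggregated message has the same size and does not depend on agent order, whatever the team size.

```python
from typing import List


def aggregate_messages(messages: List[List[float]]) -> List[float]:
    """Mean-pool per-agent message vectors into one fixed-size vector.

    Assumption: mean pooling is an illustrative stand-in, not the
    mechanism used in the paper.
    """
    if not messages:
        raise ValueError("need at least one agent message")
    dim = len(messages[0])
    if any(len(m) != dim for m in messages):
        raise ValueError("all messages must have the same dimension")
    n = len(messages)
    # Averaging over agents makes the output size independent of n
    # and invariant to the order in which agents are listed.
    return [sum(m[i] for m in messages) / n for i in range(dim)]


# Teams of different sizes produce vectors of the same dimension.
two_agents = aggregate_messages([[1.0, 0.0], [3.0, 2.0]])
four_agents = aggregate_messages(
    [[1.0, 0.0], [3.0, 2.0], [1.0, 0.0], [3.0, 2.0]]
)
```

Because the downstream policy only ever sees a vector of fixed dimension, the same network weights could in principle be reused when the curriculum changes the number of agents.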
Through these designs, SPC aims to improve the performance, scalability, and sample efficiency of multi-agent systems, especially in sparse-reward environments. The paper reports empirical improvements on these measures in several MARL environments.
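The teacher's role as a bandit over tasks can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses a plain Exp3-style adversarial bandit and ignores the context (the student's policies) that the paper's teacher conditions on. The class name `Exp3Teacher`, the candidate team sizes, and the `student_feedback` function are all hypothetical; in practice the reward would come from evaluating the student.

```python
import math
import random
from typing import List


class Exp3Teacher:
    """Exp3-style bandit that picks a task (here: a team size) each round."""

    def __init__(self, tasks: List[int], gamma: float = 0.1, seed: int = 0) -> None:
        self.tasks = tasks
        self.gamma = gamma  # exploration rate in (0, 1]
        self.weights = [1.0] * len(tasks)
        self.rng = random.Random(seed)

    def probabilities(self) -> List[float]:
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        k = len(self.tasks)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def sample(self) -> int:
        probs = self.probabilities()
        return self.rng.choices(range(len(self.tasks)), weights=probs, k=1)[0]

    def update(self, arm: int, reward: float) -> None:
        # Importance-weighted reward estimate; reward is assumed to lie in [0, 1].
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.tasks))


def student_feedback(num_agents: int) -> float:
    # Hypothetical stand-in for the student's measured learning progress.
    return 1.0 if num_agents == 4 else 0.0


teacher = Exp3Teacher(tasks=[2, 4, 8])
for _ in range(200):
    arm = teacher.sample()
    teacher.update(arm, student_feedback(teacher.tasks[arm]))
final_probs = teacher.probabilities()
```

Exp3 is chosen here only because it makes no stationarity assumption about rewards, which matches the summary's remark that the teacher's task shifts as the student's policies change. A contextual variant would additionally take a representation of the student's policy as input when computing the task distribution.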