An Empirical Study on Google Research Football Multi-agent Scenarios

Yan Song, He Jiang, Zheng Tian, Haifeng Zhang, Yingping Zhang, Jiangcheng Zhu, Zonghong Dai, Weinan Zhang, Jun Wang
DOI: https://doi.org/10.1007/s11633-023-1426-8
2023-05-16
Abstract: Few multi-agent reinforcement learning (MARL) studies on Google Research Football (GRF) focus on the 11v11 multi-agent full-game scenario, and to the best of our knowledge, no open benchmark on this scenario has been released to the public. In this work, we fill the gap by providing a population-based MARL training pipeline and hyperparameter settings for the multi-agent football scenario that, trained from scratch, outperform the bot with difficulty 1.0 within 2 million steps. Our experiments serve as a reference for the expected performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art multi-agent reinforcement learning algorithm in which each agent optimizes its own policy independently, across various training configurations. Meanwhile, we open-source our training framework Light-MALib, which extends the MALib codebase with a distributed and asynchronous implementation and additional analytical tools for football games. Finally, we provide guidance for building strong football AI with population-based training and release diverse pretrained policies for benchmarking. The goal is to give the community a head start for anyone experimenting on GRF, together with a simple-to-use population-based training framework for further improving agents through self-play. The implementation is available at https://github.com/Shanghai-Digital-Brain-Laboratory/DB-Football.
Machine Learning, Multiagent Systems
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of research and benchmarks for multi-agent reinforcement learning (MARL) in the 11v11 multi-agent full-game scenario of Google Research Football (GRF). Specifically:

1. **Lack of public benchmarks**: To the authors' knowledge, no public benchmark previously existed for the 11v11 multi-agent full-game scenario.
2. **High training difficulty**: Sparse rewards, long games, highly stochastic state transitions, and role and credit assignment issues make training multi-agent systems in this scenario extremely challenging.

To address these problems, the authors make the following contributions:

- **A population-based MARL training pipeline**: With this pipeline, the authors train a model from scratch that outperforms the built-in AI (difficulty 1.0) within 2 million steps.
- **The open-source training framework Light-MALib**: The framework extends the MALib codebase, supports distributed and asynchronous training, and provides additional analysis tools for football games.
- **Diverse pretrained policies**: These policies can serve as initializations or baselines for future research.

In addition, the authors run extensive experiments comparing training configurations and give technical suggestions on how to further improve the AI through self-play. Overall, the work aims to give the community a good starting point for experiments on GRF and an easy-to-use population-based training framework for further strengthening agents.

### Formula Summary

The formulas in this paper mainly describe the Independent Proximal Policy Optimization (IPPO) algorithm and its loss functions:

1. **Objective function**:

   \[ \theta \leftarrow \arg\max_\theta \mathcal{J}(\theta) = \mathbb{E}_{a_t, s_t}\left[\sum_{t} \gamma^t R(s_t, a_t)\right] \]

2. **IPPO policy loss**, summed over the \(n\) agents:

   \[ \mathcal{L}(\theta) = \sum_{i = 1}^{n} \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( \frac{\pi_\theta(a_i|s)}{\pi_{\theta_{\text{old}}}(a_i|s)} \hat{A}_t, \text{clip}\left( \frac{\pi_\theta(a_i|s)}{\pi_{\theta_{\text{old}}}(a_i|s)}, 1-\epsilon, 1+\epsilon \right) \hat{A}_t \right) \right] \]

3. **Advantage estimation**:

   \[ \hat{A}_t = \sum_{l = 0}^{h} (\gamma \lambda)^l \delta_{t + l} \]

   where

   \[ \delta_t = r_t(s_t, a_t) + \gamma V_\phi(s_{t + 1}) - V_\phi(s_t) \]

4. **Value loss function**: \[ \mathcal{L}_i(\phi) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}} \left[ \min \left( (V_\phi(s_t) - \hat{V}_t)^2, (V_{\phi_{\text{old}}}(s_t) + \text{clip}(V_\phi(s_t) - V_{\phi_{
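
As an illustration of the advantage estimate and the clipped policy term summarized above, the following is a minimal pure-Python sketch. It is not taken from Light-MALib: the function names, the default values of `gamma`, `lam`, and `eps`, and the choice to run the sum to the end of the trajectory (so that the horizon is h = T - t - 1) are all assumptions made for this example. The clipped term is written as a quantity to maximize; a training loop would typically minimize its negative.

```python
import math
from typing import List, Sequence, Tuple


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """A_t = sum_l (gamma * lam)^l * delta_{t+l} for a single trajectory.

    `values` holds V(s_0), ..., V(s_T): one entry more than `rewards`,
    so the last entry bootstraps the final state (assumed layout).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each A_t reuses A_{t+1}: A_t = delta_t + gamma*lam*A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(logp_new: Sequence[float], logp_old: Sequence[float],
                      advantages: Sequence[float], eps: float = 0.2) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one agent."""
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_theta(a|s) / pi_theta_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


Batch = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def ippo_surrogate(per_agent_batches: Sequence[Batch], eps: float = 0.2) -> float:
    """Sum the clipped surrogate over agents; each agent uses only its own data."""
    return sum(clipped_surrogate(new, old, adv, eps)
               for new, old, adv in per_agent_batches)
```

For example, `gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)` reduces to undiscounted reward-to-go. In `clipped_surrogate`, a probability ratio of 2 with a positive advantage is capped at `1 + eps`, while the same ratio with a negative advantage is left unclipped, which is the pessimistic behaviour the `min` is meant to produce.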