Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning

Zida Wu, Mathieu Lauriere, Samuel Jia Cong Chua, Matthieu Geist, Olivier Pietquin, Ankur Mehta
2024-03-06
Abstract: Mean Field Games (MFGs) have the ability to handle large-scale multi-agent systems, but learning Nash equilibria in MFGs remains a challenging task. In this paper, we propose a deep reinforcement learning (DRL) algorithm that achieves population-dependent Nash equilibrium without the need for averaging or sampling from history, inspired by Munchausen RL and Online Mirror Descent. Through the design of an additional inner-loop replay buffer, the agents can effectively learn to achieve Nash equilibrium from any distribution, mitigating catastrophic forgetting. The resulting policy can be applied to various initial distributions. Numerical experiments on four canonical examples demonstrate our algorithm has better convergence properties than SOTA algorithms, in particular a DRL version of Fictitious Play for population-dependent policies.
Computer Science and Game Theory, Machine Learning, Multiagent Systems, Systems and Control
What problem does this paper attempt to address?
The paper addresses how to effectively learn Nash equilibrium policies that depend on the population distribution in Mean Field Games (MFGs). Existing methods face two main challenges when dealing with large-scale multi-agent systems:

1. **Computational complexity**: Many existing algorithms compute a best-response policy with respect to the current population distribution at every iteration, which is very time-consuming in complex problems.
2. **Learning rate decay**: Some methods based on Fictitious Play (FP) update the population distribution by uniformly averaging all past distributions, so the effective learning rate shrinks as the number of iterations increases.

To solve these problems, the authors propose a new algorithm based on Deep Reinforcement Learning (DRL) and Online Mirror Descent (OMD). The algorithm learns a "master policy" that can start from any initial distribution, so all players can act according to the Nash equilibrium without relearning a new policy. An additional inner-loop replay buffer alleviates catastrophic forgetting, which helps the policy adapt to different initial distributions.

### Specific problems and solutions

- **Problem**: Existing methods struggle to learn policies that depend on the population distribution and perform poorly when faced with multiple initial distributions.
- **Solutions**:
  - A new DRL algorithm is proposed that combines the ideas of Munchausen RL and OMD.
  - A dedicated replay buffer design lets the algorithm learn Nash equilibrium policies from any initial distribution.
  - Numerical experiments on four classic examples show that the algorithm outperforms existing methods, especially in convergence speed and stability.
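The learning rate decay noted above for Fictitious Play can be made concrete with a small sketch. It assumes the standard uniform-averaging update, in which the running average after `k` past distributions moves toward the newest distribution with step size `1 / (k + 1)`. The function names and the list-based representation of a distribution are illustrative assumptions, not taken from the paper.

```python
def fictitious_play_average(mu_bar, mu_new, k):
    """Uniformly average a new distribution into a running average.

    mu_bar is assumed to be the uniform average of k past distributions.
    The result is the uniform average of k + 1 distributions, which is
    the same as a step of size 1 / (k + 1) from mu_bar toward mu_new.
    """
    step = 1.0 / (k + 1)
    return [(1.0 - step) * old + step * new for old, new in zip(mu_bar, mu_new)]


def effective_step_sizes(num_iterations):
    """Step sizes implied by uniform averaging at iterations 1..num_iterations."""
    return [1.0 / (k + 1) for k in range(1, num_iterations + 1)]


if __name__ == "__main__":
    # The step size shrinks like 1/k, so late iterations barely move the average.
    print(effective_step_sizes(5))
```

Because the step size decays like `1/k`, each new distribution has less and less influence on the average, which is the slowdown that motivates avoiding averaging over history.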
### Mathematical formula representation

To make the core idea of the algorithm clearer, the key formulas are:

- **Q-function update**:

  \[ T_n = r_n^k + L_n^k + \gamma \sum_{a_{n+1}} \pi_k^{\theta'}(a_{n+1} | s_{n+1}^k) \left[ \tilde{Q}_k^{\theta'}(s_{n+1}^k, a_{n+1}) - L_{n+1}^k \right] \]

  where \( r_n^k = r(x_n, a_n, \mu_n^k) \), \( s_n^k = (n, x_n, \mu_n^k) \), and \( L_n^k = \tau \log \pi_{k-1}^\theta(a_n | s_n^k) \). Inside the sum, \( L_{n+1}^k \) follows the same definition, evaluated at \( (s_{n+1}^k, a_{n+1}) \).

- **Policy update**:

  \[ \pi_k(\cdot | n, x, \mu) = \text{softmax}\left( \frac{1}{\tau} \tilde{Q}_\theta(n, x, \mu, \cdot) \right) \]

These formulas show how the loss function is minimized through deep network training and how the policy is updated to approximate the Nash equilibrium.

In conclusion, the paper addresses the problem of learning Nash equilibrium policies that depend on the population distribution in MFGs and proposes a new DRL algorithm. The improved OMD scheme and the replay buffer design yield better convergence and adaptability.
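The Q-function target and the softmax policy update given above can be sketched in a few lines of Python. This is a minimal tabular illustration under stated assumptions, not the authors' implementation: the neural networks are replaced by plain lists of per-action values, the policy in the sum is assumed to be evaluated at the next state \( s_{n+1}^k \), and all function and parameter names are hypothetical.

```python
import math


def softmax_policy(q_values, tau):
    """Policy update: pi(a) = softmax(Q(a) / tau) over a list of action values."""
    scaled = [q / tau for q in q_values]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def munchausen_target(reward, log_pi_prev_taken, q_next, pi_next,
                      log_pi_prev_next, tau, gamma):
    """Target T_n = r + L_n + gamma * sum_a pi(a | s') * (Q(s', a) - L_{n+1}(a)).

    reward            -- r_n^k
    log_pi_prev_taken -- log pi_{k-1}(a_n | s_n^k) for the action actually taken
    q_next            -- target-network values Q(s_{n+1}^k, a) for every action a
    pi_next           -- policy probabilities pi(a | s_{n+1}^k) for every action a
    log_pi_prev_next  -- log pi_{k-1}(a | s_{n+1}^k) for every action a
    """
    l_n = tau * log_pi_prev_taken
    bootstrap = sum(
        p * (q - tau * log_p)
        for p, q, log_p in zip(pi_next, q_next, log_pi_prev_next)
    )
    return reward + l_n + gamma * bootstrap


if __name__ == "__main__":
    tau, gamma = 1.0, 0.99
    q_next = [0.2, -0.1, 0.4]
    pi_next = softmax_policy(q_next, tau)
    log_pi_prev_next = [math.log(p) for p in pi_next]  # toy choice: pi_{k-1} = pi_k
    print(munchausen_target(1.0, math.log(0.5), q_next, pi_next,
                            log_pi_prev_next, tau, gamma))
```

In a deep RL setting the lists would be outputs of the online and target networks, and the target would enter a regression loss on the Q-network. The sketch only shows how the terms of the formula combine.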