Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective

Muhammad Aneeq uz Zaman, Alec Koppel, Mathieu Laurière, Tamer Başar
2024-03-18
Abstract: We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTG). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $\mathcal{O}(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. Experiments illuminate the merits of this approach in practice.
Machine Learning, Artificial Intelligence, Computer Science and Game Theory, Multiagent Systems
What problem does this paper attempt to address?
This paper asks how to reach a Nash equilibrium (NE) in a multi-agent system whose agents are divided into teams, with cooperation within each team and general-sum (non-zero-sum) competition between teams. Specifically:

1. **Problem Background**:
   - Multi-Agent Reinforcement Learning (MARL) has become increasingly popular for sequential decision-making problems among agents.
   - For purely cooperative environments, many algorithms and performance guarantees have been developed. Environments where agent goals may be opposed (such as traffic congestion, financial markets, and market negotiations) have received relatively little study.
   - Finding Nash equilibrium strategies in general-sum stochastic games is usually an NP-hard problem.
2. **Research Objectives**:
   - The authors study the Cooperative-Competitive (CC) team setting and seek conditions under which a Nash equilibrium can be achieved in it.
   - Specifically, they look for a data-driven method that reaches a general-sum Nash equilibrium in CC games.
3. **Methodology**:
   - To make the problem tractable, the authors make two structural assumptions:
     - The agents' dynamics are linear and their costs are quadratic (the linear-quadratic, LQ, setting).
     - The number of agents in each team tends to infinity, so that the Mean-Field (MF) limit can be used as an approximation.
   - This setting results in a General-Sum LQ Mean-Field Type Game (GS-MFTG).
4. **Main Contributions**:
   - The authors formalize the CC game in the finite-agent LQ framework and derive its mean-field approximation as an MFTG. The NE of the MFTG is an $\mathcal{O}(1/M)$-NE of the finite-population game, where $M$ is the minimum number of agents in any team.
   - They develop the Multi-player Receding-horizon Natural Policy Gradient (MRPG) algorithm to learn the NE of the GS-MFTG.
   - By decomposing the problem into simpler per-time-step sub-problems, solved backward in time, the MRPG algorithm converges to the global NE at a linear rate under a time-independent diagonal dominance condition.

In summary, the paper addresses the Nash equilibrium problem in multi-agent environments where cooperation and competition coexist. It proposes a data-driven method based on mean-field theory and proves that the method converges.
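
The backward-in-time decomposition described in the summary can be illustrated with a small sketch. It is not the authors' MRPG implementation, and it rests on several simplifying assumptions: each team is collapsed to a single representative player, the mean-field term is dropped, there are only two players, and gradients are computed exactly from a known model rather than estimated from data. The matrices `A`, `B`, `Q`, `R`, the step size `eta`, and all function names are hypothetical choices made for illustration. At each time step, both players run independent natural-gradient updates on their own feedback gain while holding the future cost-to-go fixed; the result is compared with a direct solve of the stage-wise coupled linear NE conditions.

```python
import numpy as np


def npg_direction(K_i, K_j, A, B_i, B_j, R_i, P_i):
    """Natural-gradient direction for player i's one-step quadratic sub-problem.

    With u_i = -K_i x, the plain gradient is this expression times the state
    covariance; the natural gradient cancels that covariance factor.
    """
    return 2.0 * ((R_i + B_i.T @ P_i @ B_i) @ K_i - B_i.T @ P_i @ (A - B_j @ K_j))


def cost_to_go(K, A, B, Q, R, P):
    """Propagate each player's cost-to-go matrix one step backward in time."""
    A_cl = A - B[0] @ K[0] - B[1] @ K[1]  # closed-loop dynamics
    return [Q[i] + K[i].T @ R[i] @ K[i] + A_cl.T @ P[i] @ A_cl for i in range(2)]


def receding_horizon_npg(A, B, Q, R, T, eta=0.05, iters=300):
    """Solve stages t = T-1, ..., 0 with simultaneous independent NPG updates."""
    n = A.shape[0]
    P = [Q[0].copy(), Q[1].copy()]  # terminal cost-to-go (assumed equal to Q)
    gains = [None] * T
    for t in reversed(range(T)):
        K = [np.zeros((B[i].shape[1], n)) for i in range(2)]
        for _ in range(iters):
            # Each player uses only its own cost matrices and the other's current gain.
            g = [npg_direction(K[i], K[1 - i], A, B[i], B[1 - i], R[i], P[i])
                 for i in range(2)]
            K = [K[i] - eta * g[i] for i in range(2)]
        gains[t] = K
        P = cost_to_go(K, A, B, Q, R, P)
    return gains


def stagewise_nash(A, B, Q, R, T):
    """Reference solution: solve each stage's coupled linear NE conditions directly."""
    m0 = B[0].shape[1]
    P = [Q[0].copy(), Q[1].copy()]
    gains = [None] * T
    for t in reversed(range(T)):
        M = np.block([[R[0] + B[0].T @ P[0] @ B[0], B[0].T @ P[0] @ B[1]],
                      [B[1].T @ P[1] @ B[0], R[1] + B[1].T @ P[1] @ B[1]]])
        rhs = np.vstack([B[0].T @ P[0] @ A, B[1].T @ P[1] @ A])
        K_all = np.linalg.solve(M, rhs)  # requires M to be invertible
        K = [K_all[:m0], K_all[m0:]]
        gains[t] = K
        P = cost_to_go(K, A, B, Q, R, P)
    return gains


# Hypothetical toy problem: 2-dimensional state, one scalar control per player.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.diag([1.0, 0.5]), np.diag([0.5, 1.0])]  # different costs: general-sum
R = [np.array([[2.0]]), np.array([[2.0]])]
T = 5

gains_npg = receding_horizon_npg(A, B, Q, R, T)
gains_ref = stagewise_nash(A, B, Q, R, T)
max_gap = max(np.max(np.abs(gains_npg[t][i] - gains_ref[t][i]))
              for t in range(T) for i in range(2))
print("largest gap between NPG gains and direct solve:", max_gap)
```

In this toy problem the control penalties `R` dominate the cross-coupling between the two players, which is the kind of regime in which simultaneous independent updates can be expected to contract. With stronger coupling or a larger `eta`, the inner loop of this sketch may fail to converge.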