Abstract: We study the exploration-exploitation trade-off for large multiplayer coordination games where players strategise via Q-Learning, a common learning framework in multi-agent reinforcement learning. Q-Learning is known to have two shortcomings, namely non-convergence and potential equilibrium selection problems, when there are multiple fixed points, called Quantal Response Equilibria (QRE). Furthermore, whilst QRE have full support for finite games, it is not clear how Q-Learning behaves as the game becomes large. In this paper, we characterise the critical exploration rate that guarantees convergence to a unique fixed point, addressing the two shortcomings above. Using a generating-functional method, we show that this rate increases with the number of players and the alignment of their payoffs. For many-player coordination games with perfectly aligned payoffs, this exploration rate is roughly twice that of $p$-player zero-sum games. As for large games, we provide a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, a phenomenon we term asymptotic extinction, where a constant fraction of the actions are played with zero probability at a rate $o(1/N)$ for an $N$-action game.
What problem does this paper attempt to address?
This paper addresses the exploration-exploitation trade-off that arises when players choose strategies via Q-Learning in large multiplayer coordination games. It focuses on the following issues:
1. **Non-convergence and equilibrium selection**: Q-Learning may fail to converge, and when there are multiple fixed points (Quantal Response Equilibria, QRE) it faces a potential equilibrium selection problem. The paper aims to address both shortcomings.
2. **The relationship between the exploration rate and the game size**: How does the critical exploration rate required for Q-Learning to converge to a unique fixed point change with the number of players and the degree of payoff alignment? Using a generating-functional method, the authors show that this rate increases with both. For many-player coordination games with perfectly aligned payoffs, it is roughly twice that of \(p\)-player zero-sum games.
3. **Behavioral characteristics in large games**: How does Q-Learning behave as the game becomes large? The paper provides a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space. The authors call this phenomenon "asymptotic extinction": a constant fraction of the actions are played with zero probability at a rate \(o(1/N)\) for an \(N\)-action game.
4. **Determination of the exploration rate**: To ensure that Q-Learning converges to a unique fixed point, the paper identifies a critical exploration rate \(T_{\text{crit}}\) that depends on the number of players and on how strongly their payoffs are aligned.
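For readers unfamiliar with QRE, the fixed points referred to above can be written in the standard logit form. The notation here is an assumption for illustration and may differ from the paper's own: \(x_{ki}\) is the probability that player \(k\) plays action \(i\), \(r_{ki}(x_{-k})\) is the expected reward of that action against the other players' strategies, and \(T\) is the exploration rate.

\[
x_{ki} = \frac{\exp\left(r_{ki}(x_{-k})/T\right)}{\sum_{j} \exp\left(r_{kj}(x_{-k})/T\right)}
\]

Because every exponential is strictly positive, each action receives positive probability at any finite \(T > 0\), which is the full-support property mentioned in the abstract. As \(T\) grows, the strategy approaches the uniform distribution; as \(T\) shrinks, it concentrates on the highest-reward actions.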
### Specific problem description
The paper studies how to balance exploration and exploitation when players choose strategies via Q-Learning in large multiplayer coordination games. Q-Learning is a common framework in multi-agent reinforcement learning, but it has two main drawbacks: non-convergence and equilibrium selection. In particular, when there are multiple fixed points, it is unclear which one Q-Learning will select, and the dynamics may fail to converge at all.
To solve these problems, the paper explores the following points:
- **Critical exploration rate**: Determine a critical exploration rate \(T_{\text{crit}}\) that guarantees Q-Learning converges to a unique fixed point in large multiplayer coordination games.
- **Asymptotic extinction phenomenon**: As the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, so that a constant fraction of the actions are played with vanishing probability. The authors call this "asymptotic extinction".
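The role of the exploration rate can be illustrated with a toy example. The sketch below is not the paper's model or its generating-functional analysis: it assumes a hypothetical two-player, two-action coordination game with identity payoffs, and iterates the expected Q-value update with Boltzmann action selection. The function names, the step size `alpha`, and the values `T = 1.0` and `T = 0.1` are all chosen for illustration only.

```python
import math

def softmax(q, T):
    """Boltzmann (logit) choice probabilities at exploration rate T."""
    m = max(q)  # subtract the max for numerical stability
    w = [math.exp((v - m) / T) for v in q]
    s = sum(w)
    return [v / s for v in w]

def expected_rewards(payoff, opponent):
    """Expected reward of each own action against a mixed opponent strategy."""
    return [sum(a * p for a, p in zip(row, opponent)) for row in payoff]

def run_q_learning(A, q1, q2, T, alpha=0.1, steps=2000):
    """Iterate the expected Q-value update for two players sharing payoff matrix A."""
    A_t = [list(col) for col in zip(*A)]  # player 2 indexes A by its own action
    q1, q2 = list(q1), list(q2)
    for _ in range(steps):
        x, y = softmax(q1, T), softmax(q2, T)
        r1, r2 = expected_rewards(A, y), expected_rewards(A_t, x)
        q1 = [(1 - alpha) * q + alpha * r for q, r in zip(q1, r1)]
        q2 = [(1 - alpha) * q + alpha * r for q, r in zip(q2, r2)]
    return softmax(q1, T), softmax(q2, T)

# Hypothetical coordination game: both players earn 1 if they match, 0 otherwise.
A = [[1.0, 0.0], [0.0, 1.0]]
starts = {"near action 0": [1.0, 0.0], "near action 1": [0.0, 1.0]}
results = {}
for T in (1.0, 0.1):
    for name, q0 in starts.items():
        x, _ = run_q_learning(A, q0, q0, T)
        results[(T, name)] = x[0]
        print(f"T={T}: start {name} -> P(action 0) = {x[0]:.4f}")
```

In this toy game, both starting points end at the same uniform strategy when `T = 1.0`, whereas at `T = 0.1` each run stays close to the action it started near. This is the equilibrium selection problem in miniature: below some exploration rate the outcome depends on initial conditions, and above it the fixed point is unique.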
### Main contributions
The main contributions of the paper include:
- Using a generating-functional method to analyze the dynamics of Q-Learning in large multiplayer coordination games, addressing both the non-convergence and the equilibrium selection problems.
- Providing a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, and introducing the notion of "asymptotic extinction".
- Verifying the proposed model and conclusions through theoretical analysis and numerical simulation.
Overall, through an in-depth analysis of Q-Learning in large multiplayer coordination games, the paper offers a new perspective for understanding and tuning learning algorithms in multi-agent systems.