Generalized Individual Q-learning for Polymatrix Games with Partial Observations

Ahmed Said Donmez, Muhammed O. Sayin
2024-09-04
Abstract: This paper addresses the challenge of limited observations in non-cooperative multi-agent systems where agents can have partial access to other agents' actions. We present the generalized individual Q-learning dynamics that combine belief-based and payoff-based learning for the networked interconnections of more than two self-interested agents. This approach leverages access to opponents' actions whenever possible, demonstrably achieving a faster (guaranteed) convergence to quantal response equilibrium in multi-agent zero-sum and potential polymatrix games. Notably, the dynamics reduce to the well-studied smoothed fictitious play and individual Q-learning under full and no access to opponent actions, respectively. We further quantify the improvement in convergence rate due to observing opponents' actions through numerical simulations.
Computer Science and Game Theory, Systems and Control
What problem does this paper attempt to address?
This paper addresses the challenges that arise in non-cooperative multi-agent systems when agents can observe the actions of other agents only partially. When an agent cannot fully observe the others' actions, learning becomes harder and convergence slows down. To deal with this, the paper proposes the generalized individual Q-learning dynamics, which combine belief-based and payoff-based learning to accelerate convergence in multi-agent zero-sum and potential polymatrix games.

### Problem Background

In multi-agent systems, interactions between agents are usually realized through network connections, and each agent may only be able to observe the actions of the agents directly connected to it. This partial observability poses challenges for learning algorithms, especially in non-cooperative environments, where agents must optimize their strategies based on limited information.

### Solution

The generalized individual Q-learning dynamics proposed in the paper use whatever information about opponents' actions an agent can observe in order to speed up learning. Specifically:

1. **Combining Belief and Payoff**: The method combines belief-based learning (such as smoothed fictitious play) with payoff-based learning (such as individual Q-learning). When an agent can observe an opponent's actions, it uses belief-based learning; when it cannot, it relies on payoff-based learning.
2. **Accelerating Convergence**: By using the available information about opponents' actions, the method converges to the quantal response equilibrium (QRE) more quickly in multi-agent zero-sum and potential polymatrix games.
3. **Theoretical Guarantee**: The paper proves that in multi-agent zero-sum and potential polymatrix games the proposed dynamics converge almost surely to QRE, and that the Q-function estimates are asymptotically belief-based.

### Numerical Simulation

Through numerical simulations, the paper quantifies the improvement in convergence rate due to observing opponents' actions. The results show that convergence speeds up significantly as the connection probability between agents increases, and is fastest under full observability.

### Summary

By introducing the generalized individual Q-learning dynamics, the paper addresses the challenges that partial observability creates in multi-agent systems, offering faster convergence together with convergence guarantees. This provides a new perspective and tool for understanding and designing efficient multi-agent learning algorithms.

### Formula Summary

- **Q-function Estimation Update Formula**:
\[ q_i^{k + 1}(a_i) = q_i^k(a_i) + \alpha_i^k(a_i)\cdot\left(u_i^k - q_i^k(a_i)\right) \]
- **Smoothed Best Response**:
\[ \mathrm{br}_i(q) := \arg\max_{\mu\in\Delta_i}\left[\mu^T q + \tau H(\mu)\right] \]
where $\tau > 0$ is the temperature parameter, and $H(\mu) = -\sum_a \mu(a)\log\mu(a)$ is the entropy regularization term.
- **Definition of Quantal Response Equilibrium (QRE)**:
\[ \pi_i = \mathrm{br}_i(u_i(\cdot,\pi_{-i}))\quad\forall i\in I \]

These formulas show how agents update their Q-function estimates in a partially observable environment and how the equilibrium they converge to is defined.
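The three formulas in the summary can be made concrete with a minimal Python sketch. It relies on the standard fact that the entropy-regularized argmax over the simplex has the closed form $\mathrm{softmax}(q/\tau)$. Everything else is an assumption made for illustration and is not taken from the paper: the function names `smoothed_best_response`, `q_update`, and `qre_gap`, the constant step size, the reading of $u_i^k$ as the received payoff, and the two-player matching-pennies game used as a test case.

```python
import numpy as np


def smoothed_best_response(q, tau):
    """Closed form of argmax over the simplex of mu^T q + tau * H(mu): softmax(q / tau)."""
    q = np.asarray(q, dtype=float)
    z = (q - q.max()) / tau  # subtracting the max avoids overflow in exp
    w = np.exp(z)
    return w / w.sum()


def q_update(q, action, payoff, step):
    """Move only the played action's estimate toward the received payoff.

    Mirrors q^{k+1}(a) = q^k(a) + alpha * (u^k - q^k(a)); the step size
    schedule is left to the caller (an assumption of this sketch).
    """
    q_next = np.array(q, dtype=float)
    q_next[action] += step * (payoff - q_next[action])
    return q_next


def qre_gap(A, pi_1, pi_2, tau):
    """Largest deviation from the QRE fixed point in a two-player zero-sum matrix game.

    Player 1 receives a_1^T A a_2 and player 2 receives the negative of it.
    """
    br_1 = smoothed_best_response(A @ pi_2, tau)
    br_2 = smoothed_best_response(-A.T @ pi_1, tau)
    return max(np.abs(br_1 - pi_1).max(), np.abs(br_2 - pi_2).max())


# Hypothetical test case: matching pennies, where the uniform profile
# satisfies the QRE condition for any tau > 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
gap = qre_gap(A, uniform, uniform, tau=0.5)

# One payoff-based step: only the entry of the played action changes.
q_new = q_update(np.zeros(2), action=0, payoff=1.0, step=0.5)
```

Here `gap` measures how far a strategy profile is from satisfying $\pi_i = \mathrm{br}_i(u_i(\cdot,\pi_{-i}))$, so it can serve as a simple convergence diagnostic when simulating learning dynamics of this kind. The sketch covers only the payoff-based update; it does not attempt to reproduce the paper's belief-based component or its step-size rules.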