Abstract: This paper addresses the challenge of limited observations in non-cooperative multi-agent systems where agents can have partial access to other agents' actions. We present the generalized individual Q-learning dynamics that combine belief-based and payoff-based learning for the networked interconnections of more than two self-interested agents. This approach leverages access to opponents' actions whenever possible, demonstrably achieving a faster (guaranteed) convergence to quantal response equilibrium in multi-agent zero-sum and potential polymatrix games. Notably, the dynamics reduce to the well-studied smoothed fictitious play and individual Q-learning under full and no access to opponent actions, respectively. We further quantify the improvement in convergence rate due to observing opponents' actions through numerical simulations.

What problem does this paper attempt to address?

This paper addresses the challenges that arise in non-cooperative multi-agent systems when agents have only partial access to the actions of other agents. When an agent cannot fully observe the other agents' actions, learning becomes harder and convergence slows down. To deal with this problem, the paper proposes the generalized individual Q-learning dynamics, which combine belief-based and payoff-based learning to accelerate convergence in multi-agent zero-sum and potential polymatrix games.

### Problem Background

In multi-agent systems, the interactions between agents are usually realized through network connections, and each agent may only be able to observe the actions of agents directly connected to it. This partial observability poses challenges to learning algorithms, especially in non-cooperative environments, where agents need to optimize their strategies based on limited information.

### Solution

The generalized individual Q-learning dynamics proposed in the paper use whatever information about opponents' actions an agent can observe in order to speed up learning. Specifically:

1. **Combining Belief and Payoff**: The method combines belief-based learning (such as smoothed fictitious play) and payoff-based learning (such as individual Q-learning). When an agent can observe an opponent's actions, it uses belief-based learning; when it cannot, it relies on payoff-based learning.
2. **Accelerating Convergence**: By using the available information about opponents' actions, the method converges to the quantal response equilibrium (QRE) more quickly in multi-agent zero-sum and potential polymatrix games.
3. **Theoretical Guarantee**: The paper proves that in multi-agent zero-sum and potential polymatrix games, the proposed dynamics converge almost surely to QRE, and that the Q-function estimates are asymptotically updated based on beliefs.

### Numerical Simulation

Through numerical simulation, the paper quantifies the improvement in convergence rate due to observing opponents' actions. The results show that convergence speeds up as the connection probability between agents increases, and is fastest under full observability.

### Summary

By introducing the generalized individual Q-learning dynamics, the paper addresses the challenges that partial observability brings to multi-agent systems, offering faster convergence together with theoretical guarantees. This provides a new perspective and tool for understanding and designing efficient multi-agent learning algorithms.

### Formula Summary

- **Q-function Estimate Update**:
  \[ q_i^{k+1}(a_i) = q_i^k(a_i) + \alpha_i^k(a_i)\cdot\bigl(u_i^k - q_i^k(a_i)\bigr) \]
- **Smoothed Best Response**:
  \[ \mathrm{br}_i(q) := \arg\max_{\mu\in\Delta_i}\bigl[\mu^T q + \tau H(\mu)\bigr] \]
  where $\tau > 0$ is the temperature parameter and $H(\mu) = -\sum_a \mu(a)\log\mu(a)$ is the entropy regularization term.
- **Definition of Quantal Response Equilibrium (QRE)**:
  \[ \pi_i = \mathrm{br}_i\bigl(u_i(\cdot,\pi_{-i})\bigr)\quad\forall i\in I \]

These formulas describe how agents update their Q-function estimates in a partially observable environment and which equilibrium condition the dynamics converge to.
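The three formulas above can be made concrete with a small sketch. With entropy regularization, the smoothed best response has the closed form of a softmax of the Q-values divided by the temperature, which is what the first function computes. The rest is illustrative and not taken from the paper: the function names are hypothetical, the Q-update assumes the step size is nonzero only for the action actually played, the game is restricted to two players (matching pennies) rather than a general polymatrix game, and the damped fixed-point iteration is a simple stand-in used only to check the QRE condition, not the generalized individual Q-learning dynamics themselves.

```python
import numpy as np


def smoothed_best_response(q, tau):
    """Maximizer of mu^T q + tau * H(mu) over the simplex, i.e. softmax(q / tau)."""
    z = (np.asarray(q, dtype=float) - np.max(q)) / tau  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


def q_update(q, action, payoff, step):
    """One Q-estimate update for the played action (assumption: other entries unchanged)."""
    q_next = np.array(q, dtype=float)
    q_next[action] += step * (payoff - q_next[action])
    return q_next


def qre_gap(A, B, pi1, pi2, tau):
    """Largest deviation of either policy from its smoothed best response.

    A[a1, a2] is player 1's payoff and B[a1, a2] is player 2's payoff.
    A gap of zero means (pi1, pi2) satisfies the QRE condition.
    """
    br1 = smoothed_best_response(A @ pi2, tau)
    br2 = smoothed_best_response(B.T @ pi1, tau)
    return max(np.abs(pi1 - br1).max(), np.abs(pi2 - br2).max())


def damped_iteration(A, B, pi1, pi2, tau, eta=0.1, steps=500):
    """Move both policies a fraction eta toward their smoothed best responses."""
    for _ in range(steps):
        br1 = smoothed_best_response(A @ pi2, tau)
        br2 = smoothed_best_response(B.T @ pi1, tau)
        pi1 = (1 - eta) * pi1 + eta * br1
        pi2 = (1 - eta) * pi2 + eta * br2
    return pi1, pi2


# Matching pennies: a two-player zero-sum game, so B = -A.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A
pi1, pi2 = damped_iteration(A, B, np.array([0.9, 0.1]), np.array([0.2, 0.8]), tau=1.0)
print(pi1, pi2, qre_gap(A, B, pi1, pi2, tau=1.0))
```

In matching pennies the uniform policy makes every action's expected payoff zero, so the smoothed best response to it is again uniform and the QRE gap vanishes there; the iteration above should approach that point from the non-uniform start.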

Generalized Individual Q-learning for Polymatrix Games with Partial Observations

Asymptotic Convergence and Performance of Multi-Agent Q-Learning Dynamics

Beyond Strict Competition: Approximate Convergence of Multi Agent Q-Learning Dynamics

Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability

Near-Optimal Last-iterate Convergence of Policy Optimization in Zero-sum Polymatrix Markov Games

A Generalized Training Approach for Multiagent Learning

Polymatrix Competitive Gradient Descent

Multiagent Soft Q-Learning

Multi-agent Reinforcement Learning in Sequential Social Dilemmas

LOQA: Learning with Opponent Q-Learning Awareness

Reinforcement Learning In Two Player Zero Sum Simultaneous Action Games

On the Approximation of Nash Equilibria in Sparse Win-Lose Multi-player Games

Efficient off‐policy Q‐learning for multi‐agent systems by solving dual games

Learning Sparse Polymatrix Games in Polynomial Time and Sample Complexity

Learning in two-player games between transparent opponents

Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

Strategizing against Q-learners: A Control-theoretical Approach

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

FM3Q: Factorized Multi-Agent MiniMax Q-Learning for Two-Team Zero-Sum Markov Game

The Complexity of Two-Team Polymatrix Games with Independent Adversaries

Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective