Abstract: We study policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov games. Previous results achieve an $O(T^{-1/2})$ convergence rate to a correlated equilibrium and an accelerated $O(T^{-3/4})$ convergence rate to the weaker notion of coarse correlated equilibrium. In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium. Our algorithm combines two main elements: (i) smooth value updates and (ii) the optimistic follow-the-regularized-leader algorithm with the log-barrier regularizer.
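The abstract names optimistic follow-the-regularized-leader (OFTRL) with a log-barrier regularizer as one building block. The following is a minimal single-state sketch, not the authors' implementation. It assumes that swap regret (needed for a correlated equilibrium) is obtained via the Blum-Mansour reduction: one OFTRL instance per action, whose outputs form a row-stochastic matrix whose stationary distribution is played. It ignores states, horizons, and value updates, and takes the per-action utility vector as given. The names `log_barrier_oftrl_step`, `stationary`, `SwapRegretLearner`, and the step size `eta` are hypothetical.

```python
# Hypothetical sketch: OFTRL with a log-barrier regularizer, made into a
# swap-regret learner via the Blum-Mansour reduction (assumed for illustration).


def log_barrier_oftrl_step(cum_utility, prediction, eta):
    """Return argmax over the simplex of eta * <cum_utility + prediction, x> + sum_a log x_a."""
    g = [eta * (c + m) for c, m in zip(cum_utility, prediction)]
    g_max = max(g)
    n = len(g)
    # Stationarity gives x_a = 1 / (lam - g_a) with lam > g_max. The total mass
    # decreases in lam, is >= 1 at lam = g_max + 1 and <= 1 at lam = g_max + n,
    # so bisection on that interval finds the normalizing multiplier.
    lo, hi = g_max + 1.0, g_max + float(n)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(1.0 / (mid - ga) for ga in g) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    x = [1.0 / (lam - ga) for ga in g]
    total = sum(x)
    return [xa / total for xa in x]  # renormalize to absorb bisection error


def stationary(Q, iters=10_000, tol=1e-13):
    """Stationary distribution p = p Q of a row-stochastic matrix with positive entries."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(p[a] * Q[a][b] for a in range(n)) for b in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


class SwapRegretLearner:
    """One OFTRL instance per action a; instance a proposes row a of the matrix Q."""

    def __init__(self, n, eta):
        self.n, self.eta = n, eta
        self.cum = [[0.0] * n for _ in range(n)]   # cumulative utility fed to instance a
        self.last = [[0.0] * n for _ in range(n)]  # optimistic prediction: last utility
        self.policy = [1.0 / n] * n

    def act(self):
        Q = [log_barrier_oftrl_step(self.cum[a], self.last[a], self.eta) for a in range(self.n)]
        self.policy = stationary(Q)
        return self.policy

    def observe(self, utility):
        """utility[b]: expected utility of own action b this round (assumed given)."""
        for a in range(self.n):
            u_a = [self.policy[a] * ub for ub in utility]  # instance a receives p(a) * u
            self.cum[a] = [c + u for c, u in zip(self.cum[a], u_a)]
            self.last[a] = u_a
```

The log barrier keeps every entry of each row strictly positive, so the stationary distribution is unique and power iteration converges. For example, repeatedly calling `act()` and then `observe([1.0, 0.0, 0.5])` on a three-action learner shifts probability toward the first action.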

What problem does this paper attempt to address?

This paper addresses the design of policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov games. Specifically, the authors aim to improve the convergence rates of existing methods for reaching an approximate correlated equilibrium.

### Problem Background

In a multi-agent system, if each agent independently updates its policy according to its own utility, will the system converge to an equilibrium, and if so, how fast? These questions lie at the core of game theory, economics, and learning theory, and have inspired decades of research. For example, in a normal-form game, when every agent runs a standard online learning algorithm with low external regret (respectively, low swap regret), the empirical distribution of the joint play converges to a coarse correlated equilibrium (CCE) (respectively, a correlated equilibrium (CE)).

Achieving similar results in Markov games is much harder. Prior work has shown that attaining sublinear ($o(T)$) regret in Markov games is infeasible both statistically and computationally, so most existing algorithms instead aim to find approximate equilibria directly. Given the reward and transition functions of a Markov game, the previous state-of-the-art uncoupled learning dynamics converge to a CCE at a rate of $O(T^{-3/4})$ and to a CE at a rate of $O(T^{-1/2})$, both significantly slower than the $\tilde{O}(T^{-1})$ rate achievable in normal-form games.

### Research Objectives

The paper aims to close this gap with a new uncoupled policy optimization algorithm that reaches a CE (and hence also the weaker CCE) at a near-optimal $\tilde{O}(T^{-1})$ rate, a significant improvement over existing results.

### Main Contributions

1. **Improved Convergence Rate**: a new policy optimization algorithm that computes correlated equilibria in multi-player general-sum Markov games at a near-optimal $\tilde{O}(T^{-1})$ rate.
2. **Combination of Two Main Techniques**:
   - **Smooth Value Update**: similar to the method of Zhang et al. (2022), the value function is updated conservatively, which stabilizes the policy updates.
   - **Optimistic Follow-the-Regularized-Leader (OFTRL)**: run with the log-barrier regularizer, a technique adopted from recent work of Anagnostides et al. (2022b).

### Significance of the Results

This work improves the efficiency of computing correlated equilibria in Markov games, and it shows how recent online optimization and regularization techniques can be combined to design efficient multi-agent learning algorithms. This matters for understanding and optimizing complex multi-agent systems.

### Formula Summary

- Learning rate for the smooth value update: $\alpha_t=\frac{H+1}{H+t}$
- Weighted swap regret: $\text{reg}_t^{i,h}(s):=\max_{\phi_i}\sum_{j=1}^t\alpha_j^t\langle Q_j^{i,h}(s,\cdot),((\phi_i\lozenge\pi_j^{i,h})\odot\pi_j^{-i,h})(\cdot|s)-\pi_j^h(\cdot|s)\rangle$
- Final convergence bound: $\text{CEGap}(\hat{\pi}_T)\leq 8192H^{3.5}nA_{\max}^3\cdot\frac{(\log T)^2}{T}$

Together, these formulas and techniques form the core of this work and show how to compute correlated equilibria efficiently in multi-player general-sum Markov games.
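The formula summary uses a step size $\alpha_t$ and weights $\alpha_j^t$ without spelling out how they relate. The sketch below assumes the standard definition $\alpha_j^t = \alpha_j\prod_{k=j+1}^{t}(1-\alpha_k)$ and a smooth update of the form $V_t = (1-\alpha_t)V_{t-1} + \alpha_t v_t$. Under those assumptions it checks that the smoothed value equals the $\alpha_j^t$-weighted average of the per-iteration values. It then evaluates the weighted swap regret for one player at one $(h, s)$. The function names and the input format (player $i$'s Q-values already averaged over the other players' policies $\pi_j^{-i,h}$) are hypothetical.

```python
from math import prod


def alpha(t, H):
    """Smooth-update step size alpha_t = (H + 1) / (H + t), for t >= 1."""
    return (H + 1) / (H + t)


def alpha_weights(t, H):
    """Weights alpha_j * prod_{k=j+1}^{t} (1 - alpha_k) for j = 1..t.

    Assumed to be the alpha_j^t in the weighted swap regret. Because
    alpha_1 = 1, they sum to exactly 1 (up to floating-point rounding).
    """
    return [
        alpha(j, H) * prod(1 - alpha(k, H) for k in range(j + 1, t + 1))
        for j in range(1, t + 1)
    ]


def smooth_value(values, H):
    """Apply V_t = (1 - alpha_t) * V_{t-1} + alpha_t * v_t from V_0 = 0; return the last V_t."""
    V = 0.0
    for t, v in enumerate(values, start=1):
        a = alpha(t, H)
        V = (1 - a) * V + a * v
    return V


def weighted_swap_regret(q, pi, w):
    """Weighted swap regret of player i at a single (h, s).

    q[j][b]  : player i's expected Q-value for own action b at iteration j,
               averaged over the other players' policies (hypothetical input).
    pi[j][a] : player i's policy at iteration j.
    w[j]     : weight of iteration j, e.g. alpha_weights(t, H).
    The objective is linear in phi and separates across actions a, so the best
    swap sends each a to argmax_b M[a][b], where M[a][b] = sum_j w_j pi_j(a) q_j(b).
    """
    n = len(pi[0])
    M = [[sum(wj * pj[a] * qj[b] for wj, pj, qj in zip(w, pi, q)) for b in range(n)]
         for a in range(n)]
    return sum(max(row) for row in M) - sum(M[a][a] for a in range(n))
```

For instance, `alpha_weights(3, H=2)` gives weights of roughly 0.1, 0.3, and 0.6, so the most recent iterations carry the most weight while earlier ones are gradually down-weighted.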

Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games
