Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games

Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
2024-05-02
Abstract: We study policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov Games. Previous results achieve $O(T^{-1/2})$ convergence rate to a correlated equilibrium and an accelerated $O(T^{-3/4})$ convergence rate to the weaker notion of coarse correlated equilibrium. In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium. Our algorithm is constructed by combining two main elements: (i) smooth value updates and (ii) the optimistic-follow-the-regularized-leader algorithm with the log barrier regularizer.
Machine Learning, Computer Science and Game Theory, Optimization and Control
What problem does this paper attempt to address?
This paper studies policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov games. Specifically, the authors aim to improve on the existing convergence rates to an approximate correlated equilibrium.

### Problem Background

In a multi-agent system, when each agent independently updates its strategy according to its own utility, does the system converge to an equilibrium? If so, how fast? These questions are central to game theory, economics, and learning theory, and have inspired decades of research. For example, in a normal-form game, when each agent runs a standard online learning algorithm with low external regret or low swap regret, the empirical distribution of joint play converges to a coarse correlated equilibrium (CCE) or a correlated equilibrium (CE), respectively.

In the more general Markov game setting, similar results are much harder to obtain. Previous work has shown that achieving \(o(T)\) regret in Markov games is both statistically and computationally intractable, so most existing algorithms aim to find approximate equilibria directly. Given the reward and transition functions of a Markov game, the state-of-the-art uncoupled learning dynamics converge to a CCE at a rate of \(O(T^{-3/4})\) and to a CE at a rate of \(O(T^{-1/2})\), both significantly slower than the \(O(T^{-1})\) rate in normal-form games.

### Research Objectives

The goal of this paper is to close this gap with a new uncoupled policy optimization algorithm that reaches a CE (and thus also the weaker CCE) at a near-optimal \(\tilde{O}(T^{-1})\) convergence rate, improving on both earlier rates.

### Main Contributions

1. **Improved Convergence Speed**: The authors propose a new policy optimization algorithm that computes correlated equilibria in multi-player general-sum Markov games at a near-optimal \(\tilde{O}(T^{-1})\) convergence rate.
2. **Combination of Two Main Techniques**:
   - **Smooth Value Update**: Similar to the method of Zhang et al. (2022), the value function is updated conservatively, which stabilizes the policy updates.
   - **Optimistic Follow-the-Regularized-Leader (OFTRL) Algorithm**: OFTRL is run with the log barrier as the regularizer, a technique taken from the recent work of Anagnostides et al. (2022b).

### Significance of the Results

This research improves the efficiency of computing correlated equilibria in Markov games and shows how combining recent online optimization techniques with a suitable regularizer yields efficient multi-agent learning algorithms.

### Formula Summary

- Learning rate for the smooth value update: \(\alpha_t = \frac{H + 1}{H + t}\)
- Weighted swap regret: \(\text{reg}_t^{i,h}(s) := \max_{\phi_i} \sum_{j=1}^{t} \alpha_j^t \langle Q_j^{i,h}(s,\cdot), ((\phi_i \lozenge \pi_j^{i,h}) \odot \pi_j^{-i,h})(\cdot|s) - \pi_j^h(\cdot|s) \rangle\)
- Final convergence bound: \(\text{CEGap}(\hat{\pi}_T) \leq 8192 H^{3.5} n A_{\max}^3 \cdot \frac{(\log T)^2}{T}\)

Together, these formulas and methods form the core of the paper's approach to computing correlated equilibria efficiently in multi-player general-sum Markov games.
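
To make the formula summary concrete, the following Python sketch evaluates the learning rate, the regret weights, and the right-hand side of the convergence bound. The summary does not define the weights \(\alpha_j^t\); the code assumes the definition commonly paired with this step size, \(\alpha_j^t = \alpha_j \prod_{k=j+1}^{t} (1 - \alpha_k)\), which may differ from the paper's exact convention. The function names `alpha`, `smooth_weights`, and `ce_gap_bound` are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def alpha(t: int, H: int) -> float:
    """Step size alpha_t = (H + 1) / (H + t) from the formula summary."""
    return (H + 1) / (H + t)


def smooth_weights(t: int, H: int) -> List[float]:
    """Weights alpha_j^t for j = 1..t.

    ASSUMPTION (not stated in the summary):
    alpha_j^t = alpha_j * prod_{k=j+1}^{t} (1 - alpha_k).
    """
    weights = []
    for j in range(1, t + 1):
        w = alpha(j, H)
        for k in range(j + 1, t + 1):
            w *= 1.0 - alpha(k, H)
        weights.append(w)
    return weights


def ce_gap_bound(T: int, H: int, n: int, A_max: int) -> float:
    """Right-hand side of the stated bound:
    8192 * H^3.5 * n * A_max^3 * (log T)^2 / T.
    """
    return 8192 * H**3.5 * n * A_max**3 * math.log(T) ** 2 / T
```

Under the assumed definition, the weights sum to one because \(\alpha_1 = 1\), and they increase with \(j\), so recent iterates count more than early ones. Calling `ce_gap_bound` with growing `T` shows the \((\log T)^2 / T\) decay, although the large constant means the bound only becomes informative for very large `T`.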