Abstract: The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the numbers of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, thereby demonstrating for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The improvement in the dependence on $H$ comes from the popular variance reduction technique based on the reference-advantage decomposition, which was previously used only for single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. To extend the technique to Markov games, our algorithm features a key novel design: the reference value functions are updated to the pair of optimistic and pessimistic value functions whose difference is the smallest in the history, which yields the desired improvement in sample efficiency.
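The reference update described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, which is stage-based and maintains these quantities per step and state; it only shows the rule of keeping, for each state, the (optimistic, pessimistic) value pair with the smallest gap seen so far. The function name `update_reference`, the list-based layout, and the numeric values are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def update_reference(
    ref_upper: List[float],
    ref_lower: List[float],
    v_upper: List[float],
    v_lower: List[float],
) -> Tuple[List[float], List[float]]:
    """Keep, per state, the (optimistic, pessimistic) pair with the smallest gap so far.

    Hypothetical helper: the names and the data layout are assumptions,
    not taken from the paper.
    """
    new_upper, new_lower = [], []
    for ru, rl, vu, vl in zip(ref_upper, ref_lower, v_upper, v_lower):
        if vu - vl < ru - rl:
            # The new pair is tighter than the stored reference: adopt it.
            new_upper.append(vu)
            new_lower.append(vl)
        else:
            # Otherwise keep the old reference, even though the new
            # estimates need not be monotone relative to the old ones.
            new_upper.append(ru)
            new_lower.append(rl)
    return new_upper, new_lower


# Illustrative run with horizon H = 3 and two states (made-up numbers).
H = 3.0
ref_up, ref_lo = [H, H], [0.0, 0.0]  # trivial initial bounds: gap H everywhere
history = [
    ([2.5, 2.0], [0.5, 1.0]),  # gaps 2.0 and 1.0
    ([2.8, 1.8], [0.2, 1.2]),  # gap widens in state 0, shrinks in state 1
]
for v_up, v_lo in history:
    ref_up, ref_lo = update_reference(ref_up, ref_lo, v_up, v_lo)
```

In this run, state 0 keeps the first pair because its gap later widens, while state 1 moves to the second pair. Selecting by gap rather than by the latest estimate is what lets the reference stay well defined without assuming that the value estimates change monotonically.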

Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games
