Abstract: Model-based algorithms -- algorithms that explore the environment by building and utilizing an estimated model -- are widely used in reinforcement learning practice and have been shown theoretically to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the best known sample complexity for model-based algorithms is suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games that outputs an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A$ and $B$ are the numbers of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$, and our algorithm outputs a single Markov policy, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, and to design the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
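
As a quick check on the bounds quoted in the abstract (a worked comparison derived only from the stated rates, ignoring the logarithmic factors hidden by $\tilde{\mathcal{O}}$), the improvement over the previous model-based guarantee is a factor of $HS$:

$$\frac{H^4S^2AB/\epsilon^2}{H^3SAB/\epsilon^2} = HS,$$

and the remaining gap to the lower bound is at most $\min\{A,B\}$:

$$\frac{H^3SAB/\epsilon^2}{H^3S(A+B)/\epsilon^2} = \frac{AB}{A+B}, \qquad \frac{\min\{A,B\}}{2} \le \frac{AB}{A+B} \le \min\{A,B\}.$$

The second pair of inequalities follows from writing $\frac{AB}{A+B} = \min\{A,B\}\cdot\frac{\max\{A,B\}}{A+B}$ and noting that $\frac{\max\{A,B\}}{A+B} \in [\tfrac{1}{2}, 1)$.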

Differentiable Arbitrating in Zero-sum Markov Games

Neural Auto-Curricula in Two-Player Zero-Sum Games

Deep Reinforcement Learning for Nash Equilibrium of Differential Games

Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value

A Risk-Averse Equilibrium for Multi-Agent Systems

Finding Mixed Strategy Nash Equilibrium for Continuous Games through Deep Learning

Learning to Control Unknown Strongly Monotone Games

Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

Neural Auto-Curricula

Differentiable Bilevel Programming for Stackelberg Congestion Games

Towards convergence to Nash equilibria in two-team zero-sum games

Tractable Equilibrium Computation in Markov Games through Risk Aversion

Soft-Bellman Equilibrium in Affine Markov Games: Forward Solutions and Inverse Learning

A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs

Adaptive Dynamic Programming for Solving Non-Zero-Sum Differential Games

An efficient model‐free adaptive optimal control of continuous‐time nonlinear non‐zero‐sum games based on integral reinforcement learning with exploration

Distributed Nash equilibrium seeking strategies via bilateral bounded gradient approach

Learning generalized Nash equilibria in monotone games: A hybrid adaptive extremum seeking control approach

A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

Inverse linear-quadratic nonzero-sum differential games