Abstract: We study the problem of learning a Nash equilibrium (NE) in Markov games, a cornerstone problem in multi-agent reinforcement learning (MARL). In particular, we focus on infinite-horizon adversarial team Markov games (ATMGs), in which agents that share a common reward function compete against a single opponent, the adversary. These games unify two-player zero-sum Markov games and Markov potential games, resulting in a setting that encompasses both collaboration and competition. Kalogiannis et al. (2023a) provided an efficient equilibrium computation algorithm for ATMGs, but it presumes knowledge of the reward and transition functions and has no sample complexity guarantees. We contribute a learning algorithm that uses MARL policy gradient methods, with iteration and sample complexity polynomial in $1/\epsilon$, where $\epsilon$ is the approximation error, and in the natural parameters of the ATMG, resolving the main caveats of the solution by Kalogiannis et al. (2023a). Previously, learning algorithms for NE were known to exist for two-player zero-sum Markov games and Markov potential games, but not for ATMGs. Seen through the lens of min-max optimization, computing a NE in these games constitutes a nonconvex-nonconcave saddle-point problem. Min-max optimization has been studied extensively. Nevertheless, the case of nonconvex-nonconcave landscapes remains elusive: in full generality, finding saddle points is computationally intractable (Daskalakis et al., 2021). We circumvent this intractability by developing techniques that exploit the hidden structure of the objective function via a nonconvex-concave reformulation. However, this reformulation introduces the challenge of a feasibility set with coupled constraints. We tackle these challenges by establishing novel techniques for optimizing weakly-smooth nonconvex functions, extending the framework of Devolder et al. (2014).

Bandit Learning in Convex Non-Strictly Monotone Games

Bandit learning in concave $N$-person games

Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

No-regret Learning for Repeated Non-Cooperative Games with Lossy Bandits

Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback

Online Bandit Learning for a Special Class of Non-Convex Losses

Bandit learning with regularized second-order mirror descent

Decentralized Nash Equilibria Learning for Online Game with Bandit Feedback

Near-Optimal Learning of Extensive-Form Games with Imperfect Information

Convergence Rate of Learning a Strongly Variationally Stable Equilibrium

Doubly Optimal No-Regret Learning in Monotone Games

Learning Nash Equilibria in Monotone Games

Risk-Averse No-Regret Learning in Online Convex Games

Asymmetric Feedback Learning in Online Convex Games

Convergence Rate of Payoff-based Generalized Nash Equilibrium Learning

Learning of Nash Equilibria in Risk-Averse Games

Learning Equilibria in Adversarial Team Markov Games: A Nonconvex-Hidden-Concave Min-Max Optimization Problem

Identify the Nash Equilibrium in Static Games with Random Payoffs

Online Monotone Games

Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

No-Regret Learning in Time-Varying Zero-Sum Games