Abstract: Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, in which each agent optimizes its policy by treating the other agents as part of the environment. Despite their empirical success, theoretical guarantees for SP-based methods are limited to two-player zero-sum games. For mixed cooperative-competitive games, where agents on the same team must cooperate with each other, we show a simple counter-example in which, with high probability, SP-based methods fail to converge to a global Nash equilibrium (NE). Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE in which, at each iteration, best responses to the previous policies are learned. PSRO can be directly extended to mixed cooperative-competitive settings by jointly learning team best responses, with all convergence properties unchanged. However, PSRO requires repeatedly training joint policies from scratch until convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits of both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best-response policies. The main policy is trained by fictitious self-play and by cross-play against the counter population, while the counter policies are trained as best responses to past versions of the main policy. We validate our method in matrix games and show that FXP converges to global NEs while SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitability than the baselines, and in a more challenging football game, where FXP defeats state-of-the-art (SOTA) models with a win rate of over 94%.
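To make the terms "fictitious" play and "exploitability" concrete, the following is a minimal sketch of classic fictitious play on a two-player zero-sum matrix game. It is not the FXP algorithm and not the authors' implementation: the rock-paper-scissors payoff matrix, the function names, and the choice of initial action counts are all assumptions made purely for illustration. Each player repeatedly best-responds to the empirical average of the opponent's past actions, and exploitability measures how much the two players could gain in total by deviating to a best response.

```python
"""Classic fictitious play on a zero-sum matrix game (illustrative sketch only)."""

# Hypothetical example game: row player's payoffs for rock-paper-scissors.
# The column player receives the negative of each entry.
RPS = [
    [0, -1, 1],   # rock     vs (rock, paper, scissors)
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def normalize(counts):
    """Turn action counts into an empirical mixed strategy."""
    total = sum(counts)
    return [c / total for c in counts]


def expected_payoffs(payoffs, opponent_mix):
    """Expected payoff of each own action against the opponent's mixed strategy."""
    return [sum(p * q for p, q in zip(row, opponent_mix)) for row in payoffs]


def best_response(payoffs, opponent_mix):
    """Index of an action with maximal expected payoff (first index on ties)."""
    values = expected_payoffs(payoffs, opponent_mix)
    return max(range(len(values)), key=values.__getitem__)


def column_payoffs(A):
    """Column player's payoffs, indexed [column action][row action]."""
    return [[-A[i][j] for i in range(len(A))] for j in range(len(A[0]))]


def fictitious_play(A, iterations):
    """Both players best-respond to the opponent's empirical average strategy."""
    B = column_payoffs(A)
    # Assumption: each player starts with one arbitrary play of action 0.
    row_counts = [1] + [0] * (len(A) - 1)
    col_counts = [1] + [0] * (len(A[0]) - 1)
    for _ in range(iterations):
        row_action = best_response(A, normalize(col_counts))
        col_action = best_response(B, normalize(row_counts))
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return normalize(row_counts), normalize(col_counts)


def exploitability(A, row_mix, col_mix):
    """Sum of both players' best-response values; zero exactly at an NE
    of a zero-sum game, positive otherwise."""
    row_gain = max(expected_payoffs(A, col_mix))
    col_gain = max(expected_payoffs(column_payoffs(A), row_mix))
    return row_gain + col_gain


if __name__ == "__main__":
    row_mix, col_mix = fictitious_play(RPS, 10000)
    print("row strategy:", row_mix)
    print("exploitability:", exploitability(RPS, row_mix, col_mix))
```

In this two-player zero-sum setting the empirical strategies approach the uniform equilibrium and the exploitability shrinks toward zero. The abstract's point is that such guarantees do not carry over directly to mixed cooperative-competitive games, which is the gap FXP is designed to address.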

Learn Adaptive Dynamic Policy under Mixed Multi-Agent Environment

Adaptive algorithm for multi-agent learning optimal cooperative pursuit strategy based on Markov game

A Dynamically Adaptive Approach to Reducing Strategic Interference for Multi-agent Systems

Twin Delayed Multi-Agent Deep Deterministic Policy Gradient

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Multi-agent Hierarchical Policy Gradient for Air Combat Tactics Emergence Via Self-Play

Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning

Decentralized Reinforcement Social Learning Based on Cooperative Policy Exploration in Multi-Agent Systems

A Cooperative Multi-Agent Reinforcement Learning Algorithm Based on Dynamic Self-Selection Parameters Sharing

A Policy Gradient Algorithm to Alleviate the Multi-Agent Value Overestimation Problem in Complex Environments

Conservative Offline Policy Adaptation in Multi-Agent Games

Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization

Multi-agent cooperation through learning-aware policy gradients

Efficient Competitive Self-Play Policy Optimization

Evolutionary Game Dynamics of Multi-Agent Cooperation Driven by Self-Learning

Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed Cooperative-Competitive Games

Special Agents Policy Gradient In Value Decomposition-based Approach

Efficient use of heuristics for accelerating XCS-based policy learning in Markov games

The Dynamics of Reinforcement Social Learning in Networked Cooperative Multiagent Systems

Policy Diversity for Cooperative Agents

MAPPG: Multi-agent Phasic Policy Gradient