Learning to Play No-Press Diplomacy with Best Response Policy Iteration

Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas C. Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard Everett, Roman Werpachowski, Satinder Singh, Thore Graepel, Yoram Bachrach
DOI: https://doi.org/10.48550/arXiv.2006.04635
2022-01-04
Abstract: Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and StarCraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
Machine Learning, Artificial Intelligence, Computer Science and Game Theory, Multiagent Systems
What problem does this paper attempt to address?
The paper addresses how to apply reinforcement learning (RL) to train agents for many-player, non-zero-sum games, specifically No-Press Diplomacy, so that they can cope with complex multi-player interactions, simultaneous moves, and a huge combinatorial action space. The authors target three challenges:

1. **Coping with complex multi-agent interactions**: Diplomacy emphasizes the tension between competition and cooperation, which makes it well suited to studying learning in mixed-motive settings. RL methods have mostly been applied to two-player zero-sum games, but in many-player games the relationships between players are more complex, mixing cooperation and competition.
2. **Handling simultaneous moves**: All players choose their actions at the same time without seeing the others' choices, so an agent must anticipate its opponents' strategies rather than react to them.
3. **Overcoming the huge combinatorial action space**: Diplomacy has an estimated game-tree size of \(10^{900}\), and the number of legal joint actions per turn ranges from \(10^{21}\) to \(10^{64}\). An action space of this scale poses a significant challenge to existing RL algorithms.

To address these problems, the authors propose a new Sampled Best Response (SBR) operator and introduce a family of policy iteration methods that approximate iterated best response and fictitious play. With these methods they successfully apply RL to No-Press Diplomacy, and their agents convincingly outperform the previous state of the art. In addition, a game-theoretic equilibrium analysis shows that the new training process yields consistent improvements.
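
To see why the joint action space explodes, note that a player's action in a turn is a combination of orders, one per unit, so its number of options is a product of per-unit counts, and the joint action multiplies again across all players. The short Python sketch below works in log10 to avoid huge integers. The helper name `log10_joint_actions` and the per-unit order counts in the example are hypothetical, chosen only to illustrate the arithmetic, not taken from the paper.

```python
import math
from typing import Iterable


def log10_joint_actions(orders_per_unit: Iterable[int]) -> float:
    """log10 of the number of joint actions when each unit picks one order.

    The count is the product of the per-unit option counts, so its log10 is the
    sum of the per-unit log10 values. This simplification assumes every
    combination of per-unit orders is legal (hypothetical, for illustration).
    """
    return sum(math.log10(n) for n in orders_per_unit)


if __name__ == "__main__":
    # Hypothetical example: 22 units on the board, each with 10 legal orders,
    # which gives 10**22 joint actions in a single turn.
    print(log10_joint_actions([10] * 22))
```

Even at a million evaluations per second, enumerating \(10^{22}\) options would take on the order of hundreds of millions of years, which is why a best response in this setting has to be approximated, for example by sampling.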
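
The combination of a sampled best response with a fictitious-play-style loop can be sketched on a toy game. This is a minimal sketch under stated assumptions, following our reading of SBR: sample a handful of candidate actions from the player's policy, sample a few base profiles of everyone's actions, substitute each candidate into every base profile, and keep the candidate with the best average value. That operator is then plugged into a loop where each player repeatedly best-responds to the others' average policy, as in fictitious play. The toy game (rock-paper-scissors), the function names `sampled_best_response`, `fictitious_play` and `rps_payoff`, the parameters `num_candidates` and `num_base_profiles`, and the use of exact one-shot payoffs in place of a learned value function are all illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

Policy = List[float]  # probabilities over one player's actions
Payoff = Callable[[Tuple[int, ...]], Sequence[float]]  # joint action -> per-player payoffs


def sampled_best_response(player: int, policies: Sequence[Policy], payoff: Payoff,
                          num_candidates: int, num_base_profiles: int,
                          rng: random.Random) -> int:
    """Approximate best response for `player` by sampling instead of enumerating."""
    actions = range(len(policies[player]))
    # Candidate actions for `player`, drawn from its own current policy.
    candidates = sorted(set(rng.choices(actions, weights=policies[player], k=num_candidates)))
    # Base profiles: joint actions for all players, drawn from their policies.
    base_profiles = [
        [rng.choices(range(len(p)), weights=p)[0] for p in policies]
        for _ in range(num_base_profiles)
    ]

    def estimated_value(action: int) -> float:
        # Substitute the candidate into each base profile and average the payoff.
        # Assumption: exact payoffs stand in for a learned value estimate.
        total = 0.0
        for profile in base_profiles:
            joint = list(profile)
            joint[player] = action
            total += payoff(tuple(joint))[player]
        return total / num_base_profiles

    return max(candidates, key=estimated_value)


def fictitious_play(payoff: Payoff, num_players: int, num_actions: int,
                    iterations: int, rng: random.Random,
                    num_candidates: int = 16,
                    num_base_profiles: int = 16) -> List[Policy]:
    """Fictitious-play-style loop using sampled best responses."""
    # Counts start at 1 (a uniform prior) so every action keeps nonzero
    # probability and can still be sampled as a candidate.
    counts = [[1.0] * num_actions for _ in range(num_players)]
    for _ in range(iterations):
        average = [[c / sum(row) for c in row] for row in counts]
        # Each player approximately best-responds to the others' average policies.
        responses = [
            sampled_best_response(i, average, payoff, num_candidates, num_base_profiles, rng)
            for i in range(num_players)
        ]
        for i, action in enumerate(responses):
            counts[i][action] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def rps_payoff(joint: Tuple[int, ...]) -> Tuple[float, float]:
    """Rock (0), paper (1), scissors (2); each action beats the one before it."""
    a, b = joint
    if a == b:
        return (0.0, 0.0)
    return (1.0, -1.0) if (a - b) % 3 == 1 else (-1.0, 1.0)


if __name__ == "__main__":
    avg = fictitious_play(rps_payoff, num_players=2, num_actions=3,
                          iterations=500, rng=random.Random(0))
    print([round(x, 3) for x in avg[0]])
```

Counting best responses makes the returned average policy the empirical mixture of past best responses (up to the uniform prior), which is what fictitious play prescribes; in rock-paper-scissors, exact fictitious play's average approaches the uniform equilibrium, and this sampled version only approximates that. In the paper's setting, as we understand it, a neural network policy is trained on the improved actions rather than a table of counts being kept.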
In summary, the core objective of this paper is to develop effective deep reinforcement learning methods for a complex many-player strategic game, as a step towards agents that can handle settings where several parties have partly shared and partly competing interests.