Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles

Jiesong Lian
2024-06-21
Abstract: A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracles (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate the Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies that better approximate the BR. Policy fusion selects high-quality base policies from the meta-NE and fuses them into a new policy through model averaging. This allows the initialized policy to incorporate multiple expert policies, making it easier to handle difficult opponents than inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, so it can be applied to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liar's Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
Computer Science and Game Theory, Artificial Intelligence, Machine Learning, Multiagent Systems
What problem does this paper attempt to address?
The paper addresses the inefficiency and insufficient exploration that arise when strategies for zero-sum games are trained with multi-agent reinforcement learning frameworks such as PSRO. Specifically:

1. **Inefficiency Issue**: Traditional PSRO methods train a new strategy from scratch in each iteration to approximate the best response (BR) to the opponent's mixed strategy, which is both time-consuming and costly.
2. **Insufficient Exploration Issue**: Some PSRO variants initialize new strategies by inheriting from past BR strategies, but this limits the exploration of new strategies, especially when facing challenging opponents.

To overcome these issues, the paper proposes the Fusion-PSRO framework, which initializes new strategies through policy fusion. It selects high-quality base strategies from the meta-Nash equilibrium (meta-NE) and fuses them into a new strategy through weighted model averaging. The fused initialization handles difficult opponents better, and because only the initialization phase changes, it adds no extra training overhead.

### Main Contributions

1. **Policy Fusion Method**: A policy fusion method based on the meta-Nash equilibrium is proposed, which generates a new initialization strategy as a weighted average of multiple high-quality base strategies.
2. **Performance Improvement**: Experimental results show that Fusion-PSRO lowers the exploitability of nearly all tested PSRO variants on non-transitive matrix games, Leduc Poker, and the more complex Liar's Dice.
3. **Theoretical Analysis**: The paper theoretically analyzes the improvements in utility and strategy exploration brought by Nash-weighted average initialization.

### Experimental Validation

The paper validates Fusion-PSRO on three benchmarks: non-transitive matrix games, Leduc Poker, and Liar's Dice. The results show that Fusion-PSRO not only achieves lower exploitability but also explores the strategy space more comprehensively in non-transitive matrix games, which supports the conclusions of the theoretical analysis.
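
### Illustrative Sketch

The policy fusion step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each policy is assumed to be a dictionary of NumPy parameter arrays with identical shapes, "high-quality" is assumed to mean the `top_k` policies with the largest meta-NE probability, and the names `select_base_policies` and `fuse_policies` are hypothetical.

```python
import numpy as np


def select_base_policies(meta_nash, top_k):
    """Return indices of the top_k policies by meta-NE probability.

    Assumption: "high-quality" means "largest meta-NE probability".
    """
    meta_nash = np.asarray(meta_nash, dtype=float)
    # Stable sort on negated probabilities gives descending order
    # and breaks ties in favour of the earlier policy.
    order = np.argsort(-meta_nash, kind="stable")
    return order[:top_k]


def fuse_policies(population, meta_nash, top_k=2):
    """Nash-weighted average of the parameters of the selected policies.

    population: list of dicts mapping parameter name -> np.ndarray
                (hypothetical representation; all shapes must match).
    meta_nash:  meta-NE probability of each policy in the population.
    """
    idx = select_base_policies(meta_nash, top_k)
    weights = np.asarray(meta_nash, dtype=float)[idx]
    total = weights.sum()
    if total <= 0:
        raise ValueError("selected policies have zero meta-NE mass")
    weights = weights / total  # renormalise over the selected subset

    fused = {}
    for name in population[idx[0]]:
        fused[name] = sum(w * population[i][name] for w, i in zip(weights, idx))
    return fused


# Toy population of three policies with a single parameter vector each.
population = [
    {"w": np.array([3.0, 0.0])},
    {"w": np.array([9.0, 9.0])},
    {"w": np.array([0.0, 3.0])},
]
meta_nash = [0.6, 0.1, 0.3]

# Policies 0 and 2 are selected, with renormalised weights 2/3 and 1/3.
fused = fuse_policies(population, meta_nash, top_k=2)
```

In a PSRO loop, `fused` would serve only as the starting point for the next best-response training run; the training procedure itself is unchanged, which is why the method adds no training overhead.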