Conflux-PSRO: Effectively Leveraging Collective Advantages in Policy Space Response Oracles

Yucong Huang, Jiesong Lian, Mingzhi Wang, Chengdong Ma, Ying Wen
2024-10-30
Abstract: Policy Space Response Oracle (PSRO) with policy population construction has been demonstrated as an effective method for approximating Nash Equilibrium (NE) in zero-sum games. Existing studies have attempted to improve diversity in policy space, primarily by incorporating diversity regularization into the Best Response (BR). However, these methods cause the BR to deviate from maximizing rewards, easily resulting in a population that favors diversity over performance, even when diversity is not always necessary. Consequently, exploitability is difficult to reduce until policies are fully explored, especially in complex games. In this paper, we propose Conflux-PSRO, which fully exploits the diversity of the population by adaptively selecting and training policies at the state level. Specifically, Conflux-PSRO identifies useful policies from the existing population and employs a routing policy to select the most appropriate policies at each decision point, while simultaneously training them to enhance their effectiveness. Compared to the single-policy BR of traditional PSRO and its diversity-improved variants, the BR generated by Conflux-PSRO not only leverages the specialized expertise of diverse policies but also synergistically enhances overall performance. Our experiments on various environments demonstrate that Conflux-PSRO significantly improves the utility of BRs and reduces exploitability compared to existing methods.
Computer Science and Game Theory
### What problem does this paper attempt to solve?

This paper addresses a weakness of existing Policy Space Response Oracles (PSRO) methods in zero-sum games: the regularization terms they introduce to increase policy diversity cause the Best Response (BR) to deviate from the goal of maximizing rewards. Specifically, these methods tend to make the BR favor diversity over performance, so exploitability is hard to reduce until the policy space has been fully explored, especially in complex games.

To address this, the authors propose a new method, Conflux-PSRO. It makes full use of the diversity already present in the population by adaptively selecting and training policies at the state level, which improves the quality of the BR and reduces exploitability. The core idea is to select the most appropriate policies at each decision point and to train those policies simultaneously to enhance their effectiveness. Compared with the traditional single-policy BR, Conflux-PSRO can not only draw on the expertise of diverse policies but also synergistically improve overall performance.

### Main contributions of Conflux-PSRO

1. **Fusion of policy advantages**: Conflux-PSRO generates a stronger BR by identifying useful historical policies and combining them at the state level.
2. **Fine-grained policy selection**: A routing policy selects the most suitable sub-policy at each state, so that decision-making draws on the strengths of the whole population.
3. **Improved exploration efficiency**: The fine-grained combination of policies speeds up exploration and produces stronger policies.
4. **Reduced exploitability**: The experimental results show that Conflux-PSRO significantly improves the utility of the BR and reduces exploitability, outperforming existing methods.
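The state-level selection described in contributions 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RoutedPolicy`, `toy_router`, `passive`, and `aggressive`, as well as the string state labels, are hypothetical, and the router here is a hand-written rule rather than a trained policy. The sketch also assumes the router outputs weights over sub-policies and that one sub-policy is sampled per decision; the joint training of router and sub-policies is omitted.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical types for illustration: a state is a string label, and a
# policy maps a state to a probability distribution over actions.
Policy = Callable[[str], List[float]]
Router = Callable[[str], List[float]]  # state -> weights over sub-policies


class RoutedPolicy:
    """Combines sub-policies through a state-dependent router (illustrative only)."""

    def __init__(self, sub_policies: Sequence[Policy], router: Router) -> None:
        self.sub_policies = list(sub_policies)
        self.router = router

    def action_probs(self, state: str) -> List[float]:
        """Marginal action distribution: router-weighted mix of the sub-policies."""
        weights = self.router(state)
        if len(weights) != len(self.sub_policies):
            raise ValueError("router must return one weight per sub-policy")
        total = sum(weights)
        if total <= 0:
            raise ValueError("router weights must have a positive sum")
        mixed: List[float] = []
        for weight, policy in zip(weights, self.sub_policies):
            probs = policy(state)
            if not mixed:
                mixed = [0.0] * len(probs)
            for action, p in enumerate(probs):
                mixed[action] += (weight / total) * p
        return mixed

    def act(self, state: str, rng: random.Random) -> int:
        """Sample a sub-policy from the router, then an action from that sub-policy."""
        weights = self.router(state)
        chosen = rng.choices(self.sub_policies, weights=weights, k=1)[0]
        probs = chosen(state)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def passive(state: str) -> List[float]:
    return [1.0, 0.0]  # toy sub-policy: always action 0


def aggressive(state: str) -> List[float]:
    return [0.0, 1.0]  # toy sub-policy: always action 1


def toy_router(state: str) -> List[float]:
    # Hand-written stand-in for a learned routing policy.
    if state == "strong_hand":
        return [0.0, 1.0]
    if state == "weak_hand":
        return [1.0, 0.0]
    return [0.5, 0.5]


routed = RoutedPolicy([passive, aggressive], toy_router)
```

The point of the sketch is that the combined policy can behave like a different specialist in each state, which a single policy chosen once per episode cannot do.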
### Application scenarios

The paper validates the method experimentally in multiple environments, including Leduc Poker, Goofspiel, and Liar’s Dice. The results show that Conflux-PSRO significantly reduces exploitability in these games and produces strong BRs.

### Conclusion

By adaptively selecting and training policies at the state level, Conflux-PSRO resolves the tension between diversity and performance in existing PSRO methods, providing a more effective way to approximate Nash Equilibrium.
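For readers unfamiliar with the exploitability metric used throughout, the sketch below computes it for a two-player zero-sum matrix game. It is a generic illustration, not code from the paper: the function name `exploitability` and the rock-paper-scissors payoff matrix are assumptions chosen for brevity, and the games in the paper are extensive-form games where the computation is more involved. Conventions also differ on whether the result is averaged over players; this sketch returns the unaveraged sum of the two players' best-response gains.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def exploitability(payoff: Matrix,
                   row_strategy: Sequence[float],
                   col_strategy: Sequence[float]) -> float:
    """Total gain available to both players from best-responding.

    payoff[i][j] is the row player's payoff; the column player receives
    the negation (zero-sum assumption).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Value of each pure row action against the column player's mixed strategy.
    row_values: List[float] = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Value (to the column player) of each pure column action against the row mix.
    col_values: List[float] = [
        -sum(payoff[i][j] * row_strategy[i] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # In a zero-sum game the players' on-profile values cancel, so the sum of
    # best-response values equals the total improvement from deviating.
    return max(row_values) + max(col_values)


# Rock-paper-scissors payoffs for the row player (illustrative example).
RPS = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

at_equilibrium = exploitability(RPS, uniform, uniform)
row_plays_rock = exploitability(RPS, always_rock, uniform)
```

At a Nash Equilibrium neither player can gain by deviating, so the value is zero; the further a strategy profile is from equilibrium, the more an opponent's best response can gain, which is why lower exploitability is the target in PSRO-style methods.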