Extensive Game Decision Based on the PPO-CFR Algorithm under Incomplete Information

Lei HUANG, Jin ZHU, Fuqing DUAN
DOI: https://doi.org/10.1360/ssi-2022-0216
2022-01-01
Scientia Sinica Informationis
Abstract: Human-computer gaming under incomplete information is usually modeled as a two-player zero-sum game. Counterfactual regret minimization (CFR) is a popular algorithm for solving two-player zero-sum games with incomplete information. However, existing CFR algorithms and their variants each apply a fixed type of regret calculation and strategy update throughout the iterations. Each type has its own strengths and weaknesses in incomplete-information extensive games, so no single variant generalizes well. To address this, this paper combines proximal policy optimization (PPO), a reinforcement learning algorithm, with CFR. A rational agent is trained to adaptively select the regret calculation and strategy update type at each CFR iteration, which improves the generalization of current CFR algorithms and optimizes policies for incomplete-information extensive games. The proposed algorithm is evaluated through experiments on common poker games, and a stepwise reward function is designed to train the agent's action policy. Experimental results show that, compared with existing state-of-the-art methods, PPO-CFR generalizes better and achieves lower exploitability, and the policy it produces over the iterations is closer to the Nash equilibrium policy.
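The core idea, that the regret-update rule can change from one CFR iteration to the next, can be illustrated with a small sketch. This is a minimal example under strong assumptions and is not the paper's implementation. It uses a one-shot two-player zero-sum matrix game, where CFR reduces to regret matching at a single decision point. It offers only two regret rules, vanilla CFR summation and CFR+-style clipping. The trained PPO policy is replaced by a placeholder callback `select_rule(t)`. The game `PAYOFF`, the functions `update_regret` and `run`, and the rule labels are hypothetical names introduced for illustration. The paper's actual action set, state features, and stepwise reward are not reproduced here.

```python
import random
from typing import Callable, List, Sequence

Matrix = List[List[float]]

# Hypothetical example game: row player's payoffs in a skew-symmetric zero-sum
# game (a weighted rock-paper-scissors) with equilibrium (1/4, 1/2, 1/4).
PAYOFF: Matrix = [
    [0.0, -1.0, 2.0],
    [1.0, 0.0, -1.0],
    [-2.0, 1.0, 0.0],
]


def regret_matching(cum_regret: Sequence[float]) -> List[float]:
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in cum_regret]
    total = sum(pos)
    n = len(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n


def update_regret(cum: Sequence[float], inst: Sequence[float], rule: str) -> List[float]:
    """Accumulate instantaneous regret using the rule chosen for this iteration."""
    if rule == "cfr":       # vanilla CFR: plain running sum
        return [c + r for c, r in zip(cum, inst)]
    if rule == "cfr_plus":  # CFR+-style: clip negative cumulative regret to zero
        return [max(c + r, 0.0) for c, r in zip(cum, inst)]
    raise ValueError(f"unknown regret rule: {rule}")


def exploitability(a: Matrix, x: Sequence[float], y: Sequence[float]) -> float:
    """NashConv: total gain both players could get by best-responding."""
    row_best = max(sum(a[i][j] * y[j] for j in range(len(y))) for i in range(len(x)))
    col_best = min(sum(x[i] * a[i][j] for i in range(len(x))) for j in range(len(y)))
    return row_best - col_best


def run(a: Matrix, iterations: int, select_rule: Callable[[int], str]):
    """Self-play regret matching; select_rule(t) stands in for the agent's action."""
    n, m = len(a), len(a[0])
    reg_x, reg_y = [0.0] * n, [0.0] * m
    sum_x, sum_y = [0.0] * n, [0.0] * m
    for t in range(1, iterations + 1):
        rule = select_rule(t)
        x, y = regret_matching(reg_x), regret_matching(reg_y)
        # Expected payoff of each pure action against the opponent's current strategy.
        u_x = [sum(a[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_y = [-sum(x[i] * a[i][j] for i in range(n)) for j in range(m)]
        v_x = sum(p * u for p, u in zip(x, u_x))
        v_y = sum(q * u for q, u in zip(y, u_y))
        reg_x = update_regret(reg_x, [u - v_x for u in u_x], rule)
        reg_y = update_regret(reg_y, [u - v_y for u in u_y], rule)
        # Uniformly weighted running sum of strategies; its average converges.
        sum_x = [s + p for s, p in zip(sum_x, x)]
        sum_y = [s + q for s, q in zip(sum_y, y)]
    return [s / iterations for s in sum_x], [s / iterations for s in sum_y]


if __name__ == "__main__":
    rng = random.Random(0)
    schedules = {
        "cfr only": lambda t: "cfr",
        "cfr_plus only": lambda t: "cfr_plus",
        "random mix": lambda t: rng.choice(["cfr", "cfr_plus"]),
    }
    for name, schedule in schedules.items():
        x_avg, y_avg = run(PAYOFF, 20000, schedule)
        print(f"{name}: exploitability = {exploitability(PAYOFF, x_avg, y_avg):.4f}")
```

The sketch uses the uniformly weighted average for every rule, and this is a deliberate choice. Under any per-iteration mix of plain and clipped accumulation, the stored regrets remain an upper bound on the true cumulative regret. The usual O(1/√T) regret-matching guarantee therefore still holds. For this game, that bound keeps exploitability below 0.1 after 20,000 iterations. Other variants carry their own weighting assumptions and would need checking before being mixed this way. Examples include the linear averaging used with CFR+ and the discounting used in DCFR. In a PPO-CFR-style setup, `select_rule` is where a learned policy would act.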