A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs

Junyue Zhang, Yifen Mu
2024-05-16
Abstract: Despite the significant potential for various applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly concerning the development of learning algorithms for them due to the challenges of mathematical analysis. In this paper, we study the stochastic games with long-run average payoffs and present an equivalent formulation for individual payoff gradients by defining advantage functions which will be proved to be bounded. This discovery allows us to demonstrate that the individual payoff gradient function is Lipschitz continuous with respect to the policy profile and that the value function of the games exhibits the gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and integrate it with the Regularized Robbins-Monro method from stochastic approximation theory to construct a bandit learning algorithm suited for stochastic games with long-run average payoffs. Additionally, we prove that if all players adopt our algorithm, the policy profile employed will asymptotically converge to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. This condition represents a wide class of games, including monotone games.
Computer Science and Game Theory
What problem does this paper attempt to address?
This paper addresses how to design effective learning algorithms for stochastic games with long-run average payoffs. Existing research on this setting is limited, largely because the mathematical analysis is difficult, and learning algorithms for it have received little attention. The authors propose a policy gradient method built on advantage functions, which they prove to be bounded and well defined, laying the foundation for the subsequent analysis. They prove that the individual payoff gradient is Lipschitz continuous with respect to the policy profile and that the value function has the gradient dominance property. Based on these findings, they design a payoff-based gradient estimation method and integrate it with the Regularized Robbins-Monro method and the mirror descent algorithm from stochastic approximation theory, constructing a bandit learning algorithm for stochastic games with long-run average payoffs. The paper also proves that if all players adopt this algorithm, the policy profile converges asymptotically to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. The main contributions of the paper are:
1. Extending the concept of the advantage function from reinforcement learning to stochastic games and proving that it is bounded and well defined.
2. Proving the Lipschitz continuity of the individual payoff gradient in stochastic games with long-run average payoffs.
3. Designing a payoff-based gradient estimation method and constructing a learning algorithm from it.
4. Proving that the algorithm converges to a Nash equilibrium under the stability conditions above.
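The quantities named in the summary can be written out as follows. This is a sketch in the usual notation of the learning-in-games literature, not the paper's own statement: the symbols r_i, s_t, a_t, v, and N are assumptions introduced here, and the paper's precise definitions (for example, its use of lim versus liminf, or the exact form of the stability conditions) may differ.

```latex
% Long-run average payoff of player i under policy profile \pi
% (r_i, s_t, a_t: assumed symbols for reward, state, and joint action)
V_i(\pi) = \liminf_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T-1} r_i(s_t, a_t) \right]

% Individual payoff gradients, stacked over the player set \mathcal{N}
v(\pi) = \bigl( \nabla_{\pi_i} V_i(\pi) \bigr)_{i \in \mathcal{N}}

% \pi^{*} is globally neutrally stable if
\langle v(\pi),\, \pi - \pi^{*} \rangle \le 0 \quad \text{for all } \pi

% \pi^{*} is globally variationally stable if
\langle v(\pi),\, \pi - \pi^{*} \rangle < 0 \quad \text{for all } \pi \ne \pi^{*}

% The game is monotone if
\langle v(\pi) - v(\pi'),\, \pi - \pi' \rangle \le 0 \quad \text{for all } \pi, \pi'
```

Under these assumed definitions, every Nash equilibrium of a monotone game is neutrally stable, which fits the abstract's remark that the condition covers monotone games.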
The paper is well-structured: it introduces the problem background and preliminaries, analyzes the properties of the value function, proposes the gradient estimation method, discusses stability concepts and the convergence of the algorithm, and closes with a discussion.
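To make the idea of combining a payoff-based gradient estimate with a Regularized Robbins-Monro update more concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes a stateless two-player game rather than a stochastic game with long-run average payoffs, swaps in a standard importance-weighted bandit estimator for the paper's estimator, and assumes an entropic regularizer so that the mirror map is a softmax. The function names and the step-size and exploration schedules are hypothetical.

```python
import numpy as np


def softmax(y):
    """Entropic mirror map: score vector -> point in the simplex."""
    z = y - np.max(y)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def ipw_estimate(sampling_policy, action, payoff):
    """One-sample importance-weighted estimate of a player's payoff vector.

    Only the played action gets a nonzero entry; dividing by its sampling
    probability makes the estimate unbiased over the draw of the action.
    """
    g = np.zeros_like(sampling_policy)
    g[action] = payoff / sampling_policy[action]
    return g


def rrm_bandit(payoffs, steps=2000, seed=0):
    """Run the sketch on a two-player game given as two payoff matrices.

    payoffs[i][a0, a1] is player i's payoff when actions (a0, a1) are played.
    Returns the final mixed strategies of both players.
    """
    rng = np.random.default_rng(seed)
    n_actions = payoffs[0].shape
    scores = [np.zeros(n_actions[i]) for i in range(2)]
    for n in range(1, steps + 1):
        step = 1.0 / n ** 0.75     # hypothetical step-size schedule
        explore = 0.5 / n ** 0.25  # hypothetical exploration schedule
        sampling = []
        for i in range(2):
            x = softmax(scores[i])
            # Mix with the uniform policy so every action keeps positive mass.
            sampling.append((1 - explore) * x + explore / n_actions[i])
        actions = [rng.choice(n_actions[i], p=sampling[i]) for i in range(2)]
        for i in range(2):
            # Each player observes only its own realized payoff.
            u = payoffs[i][actions[0], actions[1]]
            scores[i] += step * ipw_estimate(sampling[i], actions[i], u)
    return [softmax(s) for s in scores]
```

A call such as `rrm_bandit((A, -A))` with a NumPy matrix `A` runs both players on a zero-sum game. Each player uses only its own realized payoff, which is what makes the feedback payoff-based. The sketch carries no convergence guarantee of its own.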