Abstract: Despite their significant potential for applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly regarding the development of learning algorithms, because of the challenges of mathematical analysis. In this paper, we study stochastic games with long-run average payoffs and present an equivalent formulation of the individual payoff gradients by defining advantage functions, which we prove to be bounded. This result allows us to show that the individual payoff gradient function is Lipschitz continuous with respect to the policy profile and that the value function of the game exhibits the gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and integrate it with the Regularized Robbins-Monro method from stochastic approximation theory to construct a bandit learning algorithm suited to stochastic games with long-run average payoffs. Additionally, we prove that if all players adopt our algorithm, the policy profile employed converges asymptotically to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. This condition covers a wide class of games, including monotone games.
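For readers unfamiliar with the terms used in the abstract, the following is a sketch of how these objects are commonly written in the literature on learning in games. The notation (V_i, r_i, v, L, \pi^*) is an assumption made here for illustration and is not taken from the paper, whose exact definitions may differ.

```latex
% Long-run average payoff of player i under policy profile \pi
% (s_t: state, a_t: joint action, r_i: stage payoff of player i)
V_i(\pi) = \liminf_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T-1} r_i(s_t, a_t) \right]

% Profile of individual payoff gradients
v(\pi) = \bigl( \nabla_{\pi_i} V_i(\pi) \bigr)_{i}

% Lipschitz continuity with respect to the policy profile
\| v(\pi) - v(\pi') \| \le L \, \| \pi - \pi' \|
  \quad \text{for all } \pi, \pi'

% \pi^* is globally variationally stable if
\langle v(\pi), \pi - \pi^* \rangle < 0 \quad \text{for all } \pi \ne \pi^*

% \pi^* is globally neutrally stable if
\langle v(\pi), \pi - \pi^* \rangle \le 0 \quad \text{for all } \pi

% A game is monotone if
\langle v(\pi) - v(\pi'), \pi - \pi' \rangle \le 0 \quad \text{for all } \pi, \pi'
```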

What problem does this paper attempt to address?

This paper addresses how to design effective learning algorithms for stochastic games with long-run average payoffs. Existing research on this class of games is limited, largely because the mathematical analysis is difficult and little attention has been paid to learning algorithms for it.

The authors reformulate the individual payoff gradients using advantage functions, which they prove to be bounded and well defined; this lays the foundation for the subsequent analysis. They then prove that, in stochastic games with long-run average payoffs, the individual payoff gradient is Lipschitz continuous with respect to the policy profile and the value function has the gradient dominance property. Based on these findings, they design a payoff-based gradient estimation method and integrate it with the Regularized Robbins-Monro method and the mirror descent algorithm from stochastic approximation theory, yielding a bandit learning algorithm for stochastic games with long-run average payoffs. The paper also proves that if all players adopt this algorithm, the policy profile converges to a Nash equilibrium with probability one, assuming that all Nash equilibria are globally neutrally stable and that a globally variationally stable Nash equilibrium exists.

The main contributions of the paper are:
1. Extending the concept of the advantage function from reinforcement learning to stochastic games and proving that it is bounded and well defined.
2. Proving the Lipschitz continuity of the individual payoff gradient in stochastic games with long-run average payoffs.
3. Designing a payoff-based gradient estimation method and constructing a learning algorithm from it.
4. Proving that the algorithm converges to a Nash equilibrium under the stability conditions above.

The paper proceeds in order: it introduces the problem background and preliminaries, analyzes the properties of the value function, proposes the gradient estimation method, discusses the stability concepts and the convergence of the algorithm, and closes with a discussion.
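The combination of payoff-based gradient estimation with a Regularized Robbins-Monro update can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a single-state game (so the long-run average payoff reduces to the expected stage payoff), an entropic regularizer (softmax choice map), an importance-weighted bandit estimator with uniform exploration, and a hypothetical prisoner's dilemma payoff table. All names (`PAYOFF`, `explore`, `estimate_gradient`, `run`) and parameter values are illustrative assumptions.

```python
import math
import random

# Hypothetical 2x2 payoff table (a prisoner's dilemma) with payoffs in [0, 1].
# PAYOFF[(a1, a2)] = (payoff to player 1, payoff to player 2).
# Action 0 = cooperate, action 1 = defect (defect is strictly dominant).
PAYOFF = {
    (0, 0): (0.6, 0.6),
    (0, 1): (0.0, 1.0),
    (1, 0): (1.0, 0.0),
    (1, 1): (0.2, 0.2),
}


def softmax(scores):
    """Choice map of the entropic regularizer: scores -> mixed policy."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def explore(policy, eps):
    """Mix the policy with the uniform distribution to keep probabilities >= eps / k."""
    k = len(policy)
    return [(1.0 - eps) * p + eps / k for p in policy]


def estimate_gradient(action, payoff, sampling_probs):
    """Importance-weighted estimate built only from the realized payoff (bandit feedback)."""
    grad = [0.0] * len(sampling_probs)
    grad[action] = payoff / sampling_probs[action]
    return grad


def run(steps=50000, eps=0.2, seed=0):
    """Both players run the same score update: y <- y + gamma_n * estimate, x = softmax(y)."""
    rng = random.Random(seed)
    scores = [[0.0, 0.0], [0.0, 0.0]]
    for n in range(1, steps + 1):
        gamma = 1.0 / n ** 0.6  # assumed step-size schedule
        probs = [explore(softmax(y), eps) for y in scores]
        actions = [rng.choices([0, 1], weights=p)[0] for p in probs]
        payoffs = PAYOFF[(actions[0], actions[1])]
        for i in range(2):
            g = estimate_gradient(actions[i], payoffs[i], probs[i])
            scores[i] = [y + gamma * gi for y, gi in zip(scores[i], g)]
    return [softmax(y) for y in scores]


if __name__ == "__main__":
    for i, policy in enumerate(run(), start=1):
        print(f"player {i} policy (cooperate, defect): {policy}")
```

Each player sees only its own realized payoff, never the opponent's action or the payoff table, which is what "payoff-based" means here. In this toy game the only Nash equilibrium is mutual defection, so both returned policies should place most of their weight on action 1; a genuine stochastic game would additionally require handling state transitions and average-payoff estimation, which this sketch omits.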

A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs
