Abstract: We consider decentralized learning for zero-sum games, where players only see their payoff information and are agnostic to actions and payoffs of the opponent. Previous works demonstrated convergence to a Nash equilibrium in this setting using double time-scale algorithms under strong reachability assumptions. We address the open problem of achieving an approximate Nash equilibrium efficiently with an uncoupled and single time-scale algorithm under weaker conditions. Our contribution is a rational and convergent algorithm, utilizing Tsallis-entropy regularization in a value-iteration-based approach. The algorithm learns an approximate Nash equilibrium in polynomial time, requiring only the existence of a policy pair that induces an irreducible and aperiodic Markov chain, thus considerably weakening past assumptions. Our analysis leverages negative drift inequalities and introduces novel properties of Tsallis entropy that are of independent interest.
What problem does this paper attempt to address?
This paper addresses the problem of learning a Nash equilibrium in zero-sum Markov games and proposes a new algorithm that overcomes limitations of existing methods.
### Research Background and Objectives
- **Research Background**: Zero-sum Markov games are an important class of multi-agent decision problems in which the interests of two players are directly opposed. When the dynamics and rewards are known, a Nash equilibrium can be computed efficiently. In multi-agent reinforcement learning (MARL), where the dynamics or rewards are unknown and each player observes only its own payoffs, learning a Nash equilibrium is considerably harder.
- **Specific Objectives**: The goal of the paper is to design a decentralized, single-time-scale algorithm that efficiently learns an approximate Nash equilibrium under weaker assumptions. Specifically, the algorithm should reach an approximate Nash equilibrium in polynomial time without relying on strong reachability or mixing-time assumptions.
### Main Contributions
- **Algorithmic Contribution**: The paper proposes a new algorithm named Tsallis-Entropy Regularized Best-Response Dynamics with Value Iteration (TBRVI). This algorithm combines the principles of value iteration and best-response dynamics, and introduces Tsallis entropy regularization for policy updates, which helps ensure sufficient exploration and control mixing time.
- **Theoretical Contribution**: The paper proves that, under weaker assumptions (only the existence of a policy pair that induces an irreducible and aperiodic Markov chain), TBRVI converges to an approximate Nash equilibrium in polynomial time. The analysis uses negative drift inequalities and establishes new properties of Tsallis entropy that the convergence proof relies on.
- **Technical Contribution**: Tsallis entropy regularization is the mechanism that handles the reachability and mixing-time challenges of previous algorithms. It keeps the players' policies sufficiently exploratory, which in turn keeps the mixing time under control and makes the polynomial-time guarantee possible.
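
To illustrate how Tsallis-entropy regularization keeps a policy exploratory, here is a minimal Python sketch. It is not the paper's TBRVI update: it assumes the standard Tsallis entropy \(S_q(p) = (1 - \sum_i p_i^q)/(q - 1)\) with an entropic index \(q \in (0, 1)\), a regularization weight `tau`, and the hypothetical function names `tsallis_entropy` and `tsallis_best_response`. The summary above does not state the exact parameter or scaling the paper uses.

```python
import numpy as np


def tsallis_entropy(p, q=0.5):
    """Standard Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1), q != 1."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)


def tsallis_best_response(payoffs, tau=1.0, q=0.5, iters=100):
    """Maximize <p, payoffs> + tau * S_q(p) over the simplex, for 0 < q < 1.

    Stationarity gives p_i = (c / (lam - u_i)) ** (1 / (1 - q)) with
    c = tau * q / (1 - q); the multiplier lam is found by bisection so that
    the probabilities sum to one. (Illustrative assumption, not the paper's rule.)
    """
    u = np.asarray(payoffs, dtype=float)
    n = u.size
    c = tau * q / (1.0 - q)

    def policy(lam):
        return (c / (lam - u)) ** (1.0 / (1.0 - q))

    lo = u.max() + c                   # best action gets mass 1, so sum >= 1
    hi = u.max() + c * n ** (1.0 - q)  # every action gets mass <= 1/n, so sum <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if policy(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = policy(0.5 * (lo + hi))
    return p / p.sum()                 # remove residual bisection error


if __name__ == "__main__":
    p = tsallis_best_response([1.0, 0.0, -1.0], tau=0.5)
    print(p, tsallis_entropy(p))
```

In this sketch every action receives strictly positive probability, however poor its payoff, because the gradient of the regularizer grows without bound near the boundary of the simplex when \(q < 1\). This is the kind of built-in exploration the bullet above refers to.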
### Key Issues Addressed
- **Reachability Challenge**: Previous analyses relied on strong reachability assumptions, i.e., the existence of a positive integer \(L\) such that every state can be reached from every other state within \(L\) steps under any policy pair. Requiring this for all policy pairs is overly strict in practical applications.
- **Mixing Time Challenge**: Most existing algorithms adopt a two-time-scale approach, meaning players need to implicitly coordinate to stop updating their policies for a period. This approach is not only difficult to implement but may also be unnecessary in some cases.
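
The weaker condition used in the paper asks only that some policy pair induce an irreducible and aperiodic Markov chain. For a finite chain, that property can be checked numerically: it holds exactly when the transition matrix is primitive, meaning some power of it is entrywise positive, and by Wielandt's bound it suffices to test the power \((n-1)^2 + 1\) for \(n\) states. The Python sketch below is illustrative only; the function name `is_irreducible_aperiodic` and the example matrices are assumptions, not taken from the paper.

```python
import numpy as np


def is_irreducible_aperiodic(P):
    """Return True iff the finite chain with transition matrix P is
    irreducible and aperiodic, i.e. iff P is a primitive matrix."""
    A = (np.asarray(P, dtype=float) > 0).astype(int)  # support of P
    n = A.shape[0]
    k = (n - 1) ** 2 + 1  # Wielandt's bound on the exponent of primitivity
    R = np.eye(n, dtype=int)
    for _ in range(k):
        # Boolean matrix product: R[i, j] = 1 iff j is reachable from i
        # in exactly the number of steps taken so far.
        R = ((R @ A) > 0).astype(int)
    return bool(R.all())


if __name__ == "__main__":
    flip = [[0.0, 1.0], [1.0, 0.0]]  # deterministic swap: period 2
    lazy = [[0.5, 0.5], [1.0, 0.0]]  # self-loop at state 0 breaks periodicity
    print(is_irreducible_aperiodic(flip), is_irreducible_aperiodic(lazy))
```

In a Markov game, the matrix `P` would be the state-transition matrix obtained by fixing one policy pair. The strong reachability assumption of earlier work constrains every policy pair, whereas the condition here needs only one pair to pass such a check.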
### Conclusion
In summary, by introducing Tsallis entropy regularization and developing new theoretical tools, this paper addresses an open problem: how to efficiently learn an approximate Nash equilibrium in zero-sum Markov games with a decentralized, single-time-scale algorithm under weaker assumptions. The result removes the need for two-time-scale coordination between players and for strong reachability assumptions in this setting.