ReLExS: Reinforcement Learning Explanations for Stackelberg No-Regret Learners

Xiangge Huang, Jingyuan Li, Jiaqing Xie
2024-08-26
Abstract: With the constraint of a no-regret follower, will the players in a two-player Stackelberg game still reach the Stackelberg equilibrium? We first show that when the follower's strategy is either reward-average or transform-reward-average, the two players can always reach the Stackelberg equilibrium. We then extend this result to show that the players can achieve the Stackelberg equilibrium in the two-player game under the no-regret constraint. We also establish a strict upper bound on the difference in the follower's utility with and without the no-regret constraint. Moreover, in constant-sum two-player Stackelberg games with no-regret action sequences, we show that the total optimal utility of the game remains bounded.
Computer Science and Game Theory, Machine Learning
What problem does this paper attempt to address?
The paper asks whether the players in a two-player Stackelberg game can still reach the Stackelberg equilibrium when the follower is subject to a no-regret constraint. Specifically, it explores the following issues:

1. **Existence of Stackelberg equilibrium**: When the follower's strategy is reward-average or transform-reward-average, can the two players always reach the Stackelberg equilibrium?
2. **Influence of the no-regret constraint**: Under the no-regret constraint, can the players always reach the Stackelberg equilibrium in a two-player game? In addition, the paper establishes a strict upper bound on the difference in the follower's utility with and without the no-regret constraint.
3. **Optimal utility in constant-sum games**: In a constant-sum two-player Stackelberg game, if the follower's action sequence is no-regret, is the total optimal utility of the game still bounded?

### Main contributions

- **Theoretical proof**: The paper proves that, under certain mild conditions, the two players can always reach the Stackelberg equilibrium (see Theorem 9). It further shows that, under the no-regret constraint, the players can consistently achieve the Stackelberg equilibrium.
- **Upper bound on the utility difference**: The paper establishes a strict upper bound on the difference in the follower's utility with and without the no-regret constraint.
- **Utility guarantee in constant-sum games**: In a constant-sum two-player Stackelberg game, if the follower's action sequence is no-regret, then the total optimal utility of the game remains bounded.

### Experimental verification

Through theoretical analysis and experimental verification, the paper shows that, in a multi-agent reinforcement learning framework, the leader can use a reinforcement learning algorithm and the follower a no-regret algorithm so that the entire system reaches the Stackelberg equilibrium.
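To make the target of both learners concrete, the following is a minimal sketch of how a pure-strategy Stackelberg equilibrium could be found in a small matrix game by brute force. It is not the paper's method: the function name `pure_stackelberg`, the 2x2 payoff matrices, and the tie-breaking rule (the follower breaks ties in the leader's favor) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def pure_stackelberg(
    leader_payoff: List[List[float]],
    follower_payoff: List[List[float]],
) -> Tuple[int, int, float]:
    """Enumerate leader commitments; return (leader_action, follower_action, leader_utility)."""
    best: Optional[Tuple[int, int, float]] = None
    for i, (l_row, f_row) in enumerate(zip(leader_payoff, follower_payoff)):
        # The follower observes the leader's action i and best-responds to it.
        top = max(f_row)
        responses = [j for j, v in enumerate(f_row) if v == top]
        # Assumption: among equally good responses, pick the one best for the leader.
        j = max(responses, key=lambda k: l_row[k])
        # The leader keeps the commitment that yields the highest utility.
        if best is None or l_row[j] > best[2]:
            best = (i, j, l_row[j])
    assert best is not None, "payoff matrices must be non-empty"
    return best


# Hypothetical game: rows are leader actions, columns are follower actions.
LEADER_PAYOFF = [[2, 4], [1, 3]]
FOLLOWER_PAYOFF = [[1, 0], [0, 1]]

equilibrium = pure_stackelberg(LEADER_PAYOFF, FOLLOWER_PAYOFF)
print(equilibrium)
```

In this hypothetical game, row 0 is better for the leader against every fixed follower action, yet committing to row 1 earns more once the follower's best response is taken into account. That gap is what separates Stackelberg play from simultaneous play.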
The experimental results show that, in various matrix game environments, strategies under the no-regret constraint come close to or even reach the Stackelberg equilibrium.

### Formula representation

The key formulas involved in the paper are as follows:

- **Definition of regret value**:

\[ \text{Reg}_T(\vec{a}_F)=\max_{\vec{a}_F}\mathbb{E}_{d_F(s_0)}\left[\sum_{t = 0}^T R_t^F(s_t^F,a_t^F,\bar{s}_t^L,\bar{a}_t^L)\mid a_t^F\sim\pi_F(a\mid s_t^F,a_t^L),s_0^F\sim d_F(s_0),\bar{s}_t^L,\bar{a}_t^L\right]-\sum_{t = 0}^T\bar{R}_t^F \]

- **Best reward operator**:

\[ \mu^*_{\vec{a}_F}R_T^F=\max_{\vec{a}_F}\mathbb{E}_{d_F(s_0)}\left[\sum_{t = 0}^T R_t^F(s_t^F,a_t^F,\bar{s}_t^L,\bar{a}_t^L)\mid a_t^F\sim\pi_F(a\mid s_t^F,a_t^L),s_0^F\sim d_F(s_0),\bar{s}_t^L,\bar{a}_t^L\right] \]

- **Definition of no-regret property**:

\[ \mathbb{E}\left[\mu^*_{\vec{a}_F}R_T^F-\sum_{t = 0}^T\bar{R}_t^F\right]=o(T) \]

These formulas help to understand the no-regret property of the follower and its influence on the Stackelberg equilibrium.
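The regret definition and the o(T) condition above can be illustrated with a small numerical sketch. It rests on several assumptions that are not in the source: the game is a stateless matrix game, the follower runs Hedge (multiplicative weights) as a stand-in for "a no-regret algorithm", the leader's action sequence is drawn uniformly at random, and the names `hedge_regret` and `FOLLOWER_PAYOFF` are hypothetical. Regret is computed as the reward of the best fixed follower action in hindsight minus the expected reward actually earned.

```python
import math
import random
from typing import List


def hedge_regret(
    follower_payoff: List[List[float]],
    leader_actions: List[int],
    eta: float,
) -> float:
    """Run Hedge for the follower and return its expected regret.

    follower_payoff[i][j] is the follower's reward in [0, 1] when the
    leader plays i and the follower plays j (a stateless simplification).
    """
    k = len(follower_payoff[0])
    probs = [1.0 / k] * k      # follower's current mixed strategy
    cumulative = [0.0] * k     # reward each fixed action would have earned
    earned = 0.0               # expected reward collected by Hedge
    for a_l in leader_actions:
        rewards = follower_payoff[a_l]
        earned += sum(p * r for p, r in zip(probs, rewards))
        for j in range(k):
            cumulative[j] += rewards[j]
        # Multiplicative-weights update, renormalized to avoid overflow.
        weights = [p * math.exp(eta * r) for p, r in zip(probs, rewards)]
        total = sum(weights)
        probs = [w / total for w in weights]
    # Best fixed action in hindsight minus what was actually earned.
    return max(cumulative) - earned


# Hypothetical follower payoffs (rows: leader actions, columns: follower actions).
FOLLOWER_PAYOFF = [[1.0, 0.0], [0.0, 1.0]]
T = 4000
K = len(FOLLOWER_PAYOFF[0])

rng = random.Random(0)
leader_actions = [rng.randrange(len(FOLLOWER_PAYOFF)) for _ in range(T)]

eta = math.sqrt(8.0 * math.log(K) / T)   # standard tuning for a known horizon T
regret = hedge_regret(FOLLOWER_PAYOFF, leader_actions, eta)
bound = math.sqrt(T * math.log(K) / 2.0)  # classical Hedge regret bound

print(f"regret = {regret:.3f}, bound = {bound:.3f}, regret / T = {regret / T:.5f}")
```

With this step size, the classical analysis of Hedge bounds the regret by sqrt(T ln K / 2), so the average regret per round vanishes as T grows. This is the sublinear behavior that the o(T) condition expresses.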