Responding to Promises: No-regret learning against followers with memory

Vijeth Hebbar, Cédric Langbort
2024-10-10
Abstract: We consider a repeated Stackelberg game setup where the leader faces a sequence of followers of unknown types and must learn what commitments to make. While previous works have considered followers that best respond to the commitment announced by the leader in every round, we relax this setup in two ways. Motivated by natural scenarios where the leader's reputation factors into how the followers choose their response, we consider followers with memory. Specifically, we model followers that base their response on not just the leader's current commitment but on an aggregate of their past commitments. In developing learning strategies that the leader can employ against such followers, we make the second relaxation and assume boundedly rational followers. In particular, we focus on followers employing quantal responses. Interestingly, we observe that the smoothness property offered by the quantal response (QR) model helps in addressing the challenge posed by learning against followers with memory. Utilizing techniques from online learning, we develop algorithms that guarantee $O(\sqrt{T})$ regret for quantal responding memory-less followers and $O(\sqrt{BT})$ regret for followers with bounded memory of length $B$ with both scaling polynomially in game parameters.
Computer Science and Game Theory
What problem does this paper attempt to address?
This paper studies how a leader in a repeated Stackelberg game can learn which commitments to make when facing a sequence of followers of unknown types. It relaxes the standard setup in two ways:

1. **Followers with memory**: Previous work assumes that each follower best responds to the commitment the leader announces in the current round. Here, a follower's response depends on the leader's current commitment and also on an aggregate of the leader's past commitments, which captures the effect of the leader's reputation.
2. **Boundedly rational followers**: Followers are not assumed to always best respond. Instead they follow the quantal response (QR) model, which allows for a degree of irrationality or randomness in their decisions.

### Specific problem description

The paper poses two core problems:

- **Problem 1**: How does the leader learn its optimal strategy when facing a sequence of memoryless followers of unknown types?
- **Problem 2**: How does the leader learn its optimal strategy when facing followers of unknown types that have memory?

### Solutions

The authors develop two algorithms:

1. **Algorithm for memoryless followers**: Against quantal-responding memoryless followers, this algorithm guarantees \(O(\sqrt{T})\) regret for the leader, where \(T\) is the number of rounds of the game.
2. **Algorithm for followers with memory**: Against quantal-responding followers with bounded memory of length \(B\), this algorithm guarantees \(O(\sqrt{BT})\) regret for the leader.

Both bounds scale polynomially in the game parameters.
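To make the quantal response model concrete, here is a minimal sketch. The summary does not specify the functional form of QR, so this assumes the common logit form, in which the follower plays each action with probability proportional to the exponential of its expected payoff. The payoff matrix `follower_payoffs`, its layout, and the rationality parameter `lam` are hypothetical names chosen for illustration, not notation from the paper.

```python
import numpy as np

def quantal_response(follower_payoffs, x, lam=1.0):
    """Logit quantal response to a leader mixed strategy x (assumed form).

    follower_payoffs: (N, M) array; entry [i, j] is the follower's payoff
        when the leader plays action i and the follower plays action j
        (hypothetical layout).
    x: length-N probability vector, the leader's commitment.
    lam: rationality parameter; larger values approach a best response.
    """
    expected = follower_payoffs.T @ x      # expected payoff of each follower action
    logits = lam * expected
    logits = logits - logits.max()         # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()         # distribution over follower actions

# Toy 2x2 example: the follower is paid for matching the leader's action.
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y_uniform = quantal_response(V, np.array([0.5, 0.5]), lam=2.0)
y_tilted = quantal_response(V, np.array([0.9, 0.1]), lam=2.0)
```

Under this assumed form, the response varies continuously with the commitment `x`, whereas an exact best response can jump when two follower actions tie. This is one way to read the smoothness property that the abstract credits with making learning against followers with memory tractable.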
Through these algorithms, the authors show how online learning techniques let the leader achieve near-optimal long-run performance, even against followers with memory and bounded rationality.

### Mathematical formulas

The key formulas in the paper include the following. Here the horizon is written \(H\), corresponding to \(T\) in the regret bounds above.

- The leader's regret against memoryless followers:
  \[ \text{Regret}(H)=\max_{x\in\Delta_N}\left\langle Y(x)^T U^T x, G_H\right\rangle-\sum_{t = 1}^H\left\langle Y(x_t)^T U^T x_t, g_t\right\rangle \]
  where \(G_H=\sum_{t = 1}^H g_t\).
- The leader's regret against followers with memory:
  \[ \text{Regret}_M(H)=\max_{x\in\Delta_N}\left\langle Y(x)^T U^T x, G_H\right\rangle-\sum_{t = 1}^H\left\langle Y(z_t)^T U^T x_t, g_t\right\rangle \]
  where \(z_t=\frac{1}{b_t}\sum_{\tau = 1}^t a_{t-\tau}x_\tau\) is the time-averaged leader strategy. The only change from the memoryless case is that the follower's response \(Y\) is evaluated at the aggregate \(z_t\) rather than at the current commitment \(x_t\).

These formulas measure the leader's learning performance against each type of follower and provide the theoretical basis for designing effective learning algorithms.
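The aggregate \(z_t\) in the second regret definition can be sketched as follows. This assumes that \(b_t\) is the sum of the weights actually used at round \(t\), so that \(z_t\) remains a probability vector, and that bounded memory of length \(B\) corresponds to weights that vanish for lags of \(B\) or more. Neither assumption is stated in the summary, and the names `aggregate_commitment` and `lag_weights` are hypothetical.

```python
import numpy as np

def aggregate_commitment(history, lag_weights):
    """Compute z_t = (1 / b_t) * sum_{tau=1}^{t} a[t - tau] * x_tau.

    history: sequence of t leader strategies x_1, ..., x_t (each length N).
    lag_weights: weights a[0], a[1], ...; lags beyond the end get weight 0.
    Assumes b_t is the sum of the weights used and that this sum is positive.
    """
    history = np.asarray(history, dtype=float)          # shape (t, N)
    lag_weights = np.asarray(lag_weights, dtype=float)
    t = history.shape[0]
    lags = t - 1 - np.arange(t)                         # lag of x_tau is t - tau
    capped = np.minimum(lags, len(lag_weights) - 1)     # keep indices in range
    weights = np.where(lags < len(lag_weights), lag_weights[capped], 0.0)
    b_t = weights.sum()                                 # assumed normalizer
    return weights @ history / b_t

history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
z_window = aggregate_commitment(history, [1.0, 1.0])        # memory of length 2
z_full = aggregate_commitment(history, [1.0, 1.0, 1.0])     # memory of length 3
z_memoryless = aggregate_commitment(history, [1.0])         # only the current round
```

With a single weight at lag zero, \(z_t = x_t\), and \(\text{Regret}_M(H)\) reduces to the memoryless \(\text{Regret}(H)\).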