Player-optimal Stable Regret for Bandit Learning in Matching Markets

Fang Kong, Shuai Li
2023-07-20
Abstract: The problem of matching markets has been studied for a long time in the literature due to its wide range of applications. Finding a stable matching is a common equilibrium objective in this problem. Since market participants are usually uncertain of their preferences, a rich line of recent work studies the online setting where one-side participants (players) learn their unknown preferences from iterative interactions with the other side (arms). Most previous works in this line are only able to derive theoretical guarantees for player-pessimal stable regret, which is defined relative to the players' least-preferred stable matching. However, under the pessimal stable matching, players only obtain the least reward among all stable matchings. To maximize players' profits, the player-optimal stable matching would be the most desirable. Though Basu et al. [2021] successfully derive an upper bound for player-optimal stable regret, their result can be exponentially large if the players' preference gap is small. Whether a polynomial guarantee for this regret exists is a significant but still open problem. In this work, we provide a new algorithm named explore-then-Gale-Shapley (ETGS) and show that the optimal stable regret of each player can be upper bounded by $O(K\log T/\Delta^2)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap among the top $N+1$ ranked arms. This result significantly improves on previous works, which either have a weaker player-pessimal stable matching objective or apply only to markets with special assumptions. When the preferences of participants satisfy some special conditions, our regret upper bound also matches the previously derived lower bound.
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to reach the player-optimal stable matching through bandit learning in a two-sided matching market with uncertain preferences, while minimizing the player-optimal stable regret. Specifically:

1. **Background and Motivation**:
   - In a two-sided matching market, finding a stable matching is a common equilibrium objective.
   - Since market participants are usually uncertain about their preferences, recent research has focused on the online setting, where one-side participants (players) learn their unknown preferences through iterative interactions with the other side (arms).
   - Most previous work can only provide theoretical guarantees for the player-pessimal stable regret, which is defined relative to the players' least-preferred stable matching. Under the pessimal stable matching, however, players obtain only the lowest reward among all stable matchings.
2. **Research Questions**:
   - To maximize the players' profits, the player-optimal stable matching is the most desirable target.
   - Although Basu et al. [2021] provided an upper bound for the player-optimal stable regret, this bound can be exponentially large when the players' preference gap is small.
   - Whether a polynomial upper bound for the player-optimal stable regret exists remains an important open question.
3. **Solutions**:
   - The paper proposes a new algorithm called explore-then-Gale-Shapley (ETGS).
   - The algorithm bounds each player's optimal stable regret by \( O\left(\frac{K \log T}{\Delta^2}\right) \), where \( K \) is the number of arms, \( T \) is the time horizon, and \( \Delta \) is the players' minimum preference gap among their top \( N + 1 \) ranked arms.
4. **Contributions**:
   - This is the first polynomial upper bound for the player-optimal stable regret in general decentralized two-sided matching markets.
   - Compared with previous algorithms, ETGS not only applies to more general decentralized settings but also does not need to know the values of \( T \) and \( \Delta \) in advance.
   - When the participants' preferences satisfy certain special conditions, the regret upper bound of the algorithm also matches the previously derived lower bound.

### Formula Summary

- **Upper Bound of Player-Optimal Stable Regret**:
  \[ \text{Reg}_i(T)=O\left(\frac{K \log T}{\Delta^2}\right) \]
  where \( K \) is the number of arms, \( T \) is the time horizon, and \( \Delta \) is the players' minimum preference gap among their top \( N + 1 \) ranked arms.
- **Complete Formula**:
  \[ \text{Reg}_i(T)\leq\left( N+\frac{192K \log T}{\Delta^2}+\log\left(\frac{192K \log T}{\Delta^2}\right)+N^2 + 2NK\right)\cdot\Delta_{i,\max} \]

With these results, the paper shows how to reach the player-optimal stable matching efficiently under uncertain preferences and provides stronger theoretical guarantees than previous work.
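
To illustrate why a Gale-Shapley step targets the player-optimal stable matching, the following is a minimal sketch of the classical player-proposing Gale-Shapley procedure, assuming every preference ranking is already known. It is not the paper's ETGS algorithm: the exploration phase, the bandit feedback, and the decentralized coordination are all omitted. The function names `gale_shapley` and `is_stable` and the toy market are hypothetical and chosen only for illustration.

```python
from collections import deque


def gale_shapley(player_prefs, arm_prefs):
    """Player-proposing Gale-Shapley with fully known preferences.

    player_prefs[p] lists arms from most to least preferred;
    arm_prefs[a] lists players from most to least preferred.
    Returns a dict mapping each matched player to its arm.
    """
    # arm_rank[a][p] = position of player p in arm a's ranking (lower is better)
    arm_rank = {a: {p: r for r, p in enumerate(prefs)}
                for a, prefs in arm_prefs.items()}
    next_choice = {p: 0 for p in player_prefs}  # index of next arm to propose to
    held = {}                                   # arm -> player it currently holds
    free = deque(player_prefs)
    while free:
        p = free.popleft()
        if next_choice[p] >= len(player_prefs[p]):
            continue  # p has been rejected by every arm and stays unmatched
        arm = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(arm)
        if current is None:
            held[arm] = p                       # free arm accepts
        elif arm_rank[arm][p] < arm_rank[arm][current]:
            held[arm] = p                       # arm trades up
            free.append(current)
        else:
            free.append(p)                      # proposal rejected
    return {p: a for a, p in held.items()}


def is_stable(matching, player_prefs, arm_prefs):
    """Return True if no player-arm pair blocks the given matching."""
    arm_rank = {a: {p: r for r, p in enumerate(prefs)}
                for a, prefs in arm_prefs.items()}
    holder = {a: p for p, a in matching.items()}
    for p, prefs in player_prefs.items():
        for arm in prefs:
            if matching.get(p) == arm:
                break  # remaining arms are less preferred than p's match
            q = holder.get(arm)
            if q is None or arm_rank[arm][p] < arm_rank[arm][q]:
                return False  # (p, arm) is a blocking pair
    return True


# Toy market (hypothetical) with two stable matchings.
player_prefs = {"p1": ["a1", "a2"], "p2": ["a2", "a1"]}
arm_prefs = {"a1": ["p2", "p1"], "a2": ["p1", "p2"]}

optimal = gale_shapley(player_prefs, arm_prefs)
pessimal = {"p1": "a2", "p2": "a1"}
print(optimal)
print(is_stable(optimal, player_prefs, arm_prefs),
      is_stable(pessimal, player_prefs, arm_prefs))
```

In this toy market both matchings are stable, but the player-proposing run gives every player its first choice, whereas the other stable matching gives every player its last choice. The gap between these two outcomes is what separates player-optimal from player-pessimal stable regret.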