Competing Bandits in Decentralized Large Contextual Matching Markets

Satush Parikh, Soumya Basu, Avishek Ghosh, Abishek Sankararaman
2024-11-19
Abstract: Sequential learning in a multi-agent resource constrained matching market has received significant interest in the past few years. We study decentralized learning in two-sided matching markets where the demand side (aka players or agents) competes for a 'large' supply side (aka arms) with potentially time-varying preferences, to obtain a stable match. Despite a long line of work in the recent past, existing learning algorithms such as Explore-Then-Commit or Upper-Confidence-Bound remain inefficient for this problem. In particular, the per-agent regret achieved by these algorithms scales linearly with the number of arms, $K$. Motivated by the linear contextual bandit framework, we assume that for each agent an arm-mean can be represented by a linear function of a known feature vector and an unknown (agent-specific) parameter. Moreover, our setup captures the essence of a dynamic (non-stationary) matching market where the preferences over arms change over time. Our proposed algorithms achieve instance-dependent logarithmic regret, scaling independently of the number of arms, $K$.
Machine Learning
### What problem does this paper attempt to address?

This paper studies how multiple agents can learn, in a decentralized way, to reach a stable matching in **large-scale dynamic matching markets**. Specifically:

1. **Large-scale, resource-constrained matching markets**
   - The demand side (players or agents) is matched with a large supply side (arms, such as items or tasks).
   - The number of arms \( K \) is much larger than the number of agents \( N \), that is, \( K \gg N \).
   - The agents' preferences over arms are not fixed; they can change over time.

2. **Deficiencies of existing algorithms**
   - Existing learning algorithms such as Explore-Then-Commit and Upper-Confidence-Bound (UCB) remain inefficient in this setting.
   - The per-agent regret of these algorithms grows linearly with the number of arms \( K \), which makes them impractical in large markets.

3. **Linear contextual bandit model**
   - The paper assumes that, for each agent, the mean reward of an arm is a linear function of a known feature vector and an unknown agent-specific parameter, that is, \( \langle x_{ij}(t), \theta_i \rangle \) for agent \( i \) and arm \( j \) at time \( t \).
   - This assumption lets the model capture the essence of dynamic (non-stationary) matching markets, where preferences change over time.

4. **Objectives**
   - Propose new algorithms whose per-agent regret is instance-dependent and logarithmic, and does not depend on the number of arms \( K \).
   - Reduce the difficulty caused by a large number of arms through a structured exploration strategy.

5. **Contributions**
   - The paper proposes solutions for two settings:
     - **Contextual matching markets in a fixed environment**: Assuming that the preference ranking remains unchanged throughout the learning process, agents gradually learn their preferences through round-robin exploration.
     - **Contextual matching markets in multiple environments**: There are several different environments, each with its own preference ranking, and agents need to identify the current environment and find a stable match.

6. **Results**
   - The proposed algorithms achieve logarithmic regret for each agent, and the regret depends on the dimension \( d \) of the feature vector rather than on the number of arms \( K \).
   - In practice, when \( K \) is large, a carefully designed feature vector can therefore substantially reduce the cost of learning.

### Summary

This paper addresses how multiple agents can reach a stable matching through decentralized learning in large-scale, dynamic, resource-constrained matching markets. By adopting the linear contextual bandit model, the proposed algorithms achieve logarithmic regret that does not depend on the number of arms, which improves learning efficiency in large markets.
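
To make the role of the linear reward model \( \langle x_{ij}(t), \theta_i \rangle \) concrete, the following is a minimal single-agent sketch, not the paper's algorithm. It assumes a standard ridge-regression estimator (a common choice in linear contextual bandits, assumed here rather than taken from the source) and uses hypothetical names such as `ridge_estimate`, along with made-up values for `theta`, the noise level, and the number of pulls. The point it illustrates is that an agent who estimates a \( d \)-dimensional parameter can score all \( K \) arms after far fewer than \( K \) pulls.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_estimate(features, rewards, lam=1.0):
    """Ridge estimate: solve (lam * I + sum x x^T) theta = sum r x."""
    d = len(features[0])
    V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, r in zip(features, rewards):
        for i in range(d):
            b[i] += r * x[i]
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return solve(V, b)


rng = random.Random(0)
d, K = 3, 1000                    # feature dimension and number of arms (K >> d)
theta = [0.6, -0.3, 0.8]          # hypothetical unknown agent parameter
arms = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]

# The agent pulls only 50 of the 1000 arms and observes noisy linear rewards.
pulled = [rng.randrange(K) for _ in range(50)]
feats = [arms[j] for j in pulled]
rewards = [dot(arms[j], theta) + rng.gauss(0, 0.05) for j in pulled]

theta_hat = ridge_estimate(feats, rewards, lam=0.01)

# With theta_hat in hand, every arm can be scored, including unpulled ones.
est_best = max(range(K), key=lambda j: dot(arms[j], theta_hat))
max_err = max(abs(a - b) for a, b in zip(theta, theta_hat))
```

Here `theta_hat` is learned from 50 observations yet ranks all 1000 arms, which is the intuition behind regret that depends on \( d \) rather than \( K \). The sketch ignores the multi-agent aspects (collisions, decentralized coordination, and stability of the match) that the paper's algorithms handle.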