Abstract: Sequential learning in multi-agent, resource-constrained matching markets has received significant interest in the past few years. We study decentralized learning in two-sided matching markets where the demand side (aka players or agents) competes for a "large" supply side (aka arms) with potentially time-varying preferences, to obtain a stable match. Despite a long line of recent work, existing learning algorithms such as Explore-Then-Commit or Upper-Confidence-Bound remain inefficient for this problem. In particular, the per-agent regret achieved by these algorithms scales linearly with the number of arms, $K$. Motivated by the linear contextual bandit framework, we assume that for each agent an arm-mean can be represented by a linear function of a known feature vector and an unknown (agent-specific) parameter. Moreover, our setup captures the essence of a dynamic (non-stationary) matching market where the preferences over arms change over time. Our proposed algorithms achieve instance-dependent logarithmic regret, scaling independently of the number of arms, $K$.

### What problem does this paper attempt to solve?

This paper studies how multiple agents can learn, in a decentralized way, to reach a stable matching in **large-scale dynamic matching markets**. Specifically:

1. **Large-scale resource-constrained matching markets**:
   - The demand side (players or agents) is matched with a large supply side (arms, for example items or tasks).
   - The number of arms $K$ is much larger than the number of agents $N$, that is, $K \gg N$.
   - The agents' preferences over arms are not fixed but change over time.

2. **Deficiencies of existing algorithms**:
   - Existing learning algorithms such as Explore-Then-Commit and Upper-Confidence-Bound (UCB) perform poorly in this setting.
   - The per-agent regret of these algorithms grows linearly with the number of arms $K$, which makes them inefficient in large markets.

3. **Introduction of the linear contextual bandit model**:
   - The paper assumes that, for each agent $i$, the mean reward of arm $j$ is a linear function of a known feature vector and an unknown agent-specific parameter, that is, $\langle x_{ij}(t), \theta_i \rangle$.
   - This setup captures the essence of dynamic (non-stationary) matching markets, where preferences change over time.

4. **Objectives**:
   - Propose new algorithms whose per-agent regret is instance-dependent and logarithmic, and does not depend on the number of arms $K$.
   - Reduce the difficulty caused by a large number of arms through a structured exploration strategy.

5. **Contributions**:
   - The paper proposes solutions for two main settings:
     - **Contextual matching markets in a fixed environment**: Assuming that the preference ranking remains unchanged throughout the learning process, agents gradually learn their preferences through round-robin exploration.
     - **Contextual matching markets in multiple environments**: There are several different environments, each with a different preference ranking, and agents need to identify the current environment and find a stable match.

6. **Results**:
   - The proposed algorithms achieve logarithmic regret for each agent, and the regret depends on the dimension $d$ of the feature vector rather than on the number of arms $K$.
   - In practical terms, when $K$ is large, a low-dimensional feature representation can significantly reduce the amount of exploration required.

### Summary

This paper addresses how multiple agents can reach a stable matching through decentralized learning in large-scale, dynamic, resource-constrained matching markets. By adopting the linear contextual bandit model, the proposed algorithms achieve logarithmic regret that does not depend on the number of arms, which improves learning efficiency in large markets.
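The role of the linear model can be illustrated with a small sketch. It is not the paper's algorithm: it uses plain ridge regression and a centralized, agent-proposing deferred-acceptance (Gale-Shapley) step, whereas the paper is about decentralized learning. All names (`estimate_theta`, `rank_arms`, `deferred_acceptance`), the noise level, and the arm-side preferences are assumptions made for illustration. The point it shows is that an agent can rank all $K$ arms after observing rewards from far fewer than $K$ pulls, because only a $d$-dimensional parameter has to be estimated.

```python
import numpy as np

def estimate_theta(X, y, reg=1e-6):
    """Ridge estimate of a d-dimensional parameter from (feature, reward) pairs."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def rank_arms(features, theta_hat):
    """Arm indices sorted from most to least preferred under theta_hat."""
    return np.argsort(-(features @ theta_hat)).tolist()

def deferred_acceptance(agent_prefs, arm_prefs):
    """Agent-proposing Gale-Shapley.

    agent_prefs[i] is agent i's ranked list of arms; arm_prefs[j] is arm j's
    ranked list of agents. Returns a dict mapping agent -> matched arm.
    """
    next_choice = [0] * len(agent_prefs)  # next position each agent proposes to
    holder = {}                           # arm -> agent currently held
    free = list(range(len(agent_prefs)))
    while free:
        i = free.pop()
        if next_choice[i] >= len(agent_prefs[i]):
            continue                      # list exhausted: agent stays unmatched
        j = agent_prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i
        elif arm_prefs[j].index(i) < arm_prefs[j].index(holder[j]):
            free.append(holder[j])        # arm j prefers i, releases old agent
            holder[j] = i
        else:
            free.append(i)                # rejected, will propose further down
    return {i: j for j, i in holder.items()}

# Hypothetical market: 2 agents, 200 arms, 3-dimensional features.
rng = np.random.default_rng(0)
n_agents, n_arms, dim, n_pulls = 2, 200, 3, 20
agent_prefs = []
for _ in range(n_agents):
    theta = rng.normal(size=dim)                 # unknown agent parameter
    feats = rng.normal(size=(n_arms, dim))       # known arm features
    pulled = rng.choice(n_arms, size=n_pulls, replace=False)
    rewards = feats[pulled] @ theta + 0.01 * rng.normal(size=n_pulls)
    theta_hat = estimate_theta(feats[pulled], rewards)
    agent_prefs.append(rank_arms(feats, theta_hat))

arm_prefs = [[0, 1]] * n_arms  # assumption: every arm prefers agent 0
print(deferred_acceptance(agent_prefs, arm_prefs))
```

Here each agent pulls only 20 of the 200 arms before ranking all of them. A method that estimates every arm mean separately would need to sample each arm at least once, which is where the linear dependence on $K$ comes from.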

Competing Bandits in Decentralized Large Contextual Matching Markets
