Improved Algorithms for Multi-period Multi-class Packing Problems with Bandit Feedback

Wonyoung Kim, Garud Iyengar, Assaf Zeevi
DOI: https://doi.org/10.48550/arXiv.2301.13791
2023-06-01
Abstract: We consider the linear contextual multi-class multi-period packing problem (LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon $T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed by Agrawal & Devanur (2016) and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
Machine Learning
What problem does this paper attempt to address?
This paper addresses decision optimization in the linear contextual multi-period multi-class packing problem (LMMP) under bandit feedback. The goal is to maximize the total reward by choosing actions over a given time horizon while ensuring that resource consumption does not exceed the budget. The problem setting has three components:

- **Multi-class items**: In each round, an item belonging to a certain class arrives. The decision-maker observes the item's context information and selects an action based on it.
- **Linear context dependence**: The reward and the resource consumption vector of each action are linear functions of the context, and this linear relationship depends on the class to which the item belongs.
- **Bandit feedback**: The decision-maker observes only the outcome of the selected action and receives no direct information about the unselected actions.

The paper's main contributions are a new estimator and an improved algorithm, which address the small-budget regime where existing methods do not provide effective solutions:

- **New estimator**: The paper proposes a new estimation strategy that uses the context information of all actions (including skipped rounds), thereby achieving a faster convergence rate.
- **Improved algorithm**: Based on the new estimator, the paper proposes an algorithm named "Allocate to the Maximum First" (AMF), which aims to maximize the total reward while keeping resource consumption within the budget.
- **Theoretical analysis**: The paper proves that, under the non-degenerate context assumption, the AMF algorithm has a regret bound of \(\tilde{O}\left(\frac{\text{OPT}}{B}\sqrt{JdT}\right)\), where \(B\) is the budget, \(J\) is the number of classes, \(d\) is the context dimension, \(T\) is the time horizon, and \(\text{OPT}\) is the total reward of the optimal policy.

With these contributions, the paper resolves an open problem posed by Agrawal & Devanur (2016) and extends the result to the more general multi-class LMMP setting, which is relevant to practical applications such as e-commerce, clinical trials, and dynamic pricing.
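
To make the problem setting concrete, the following is a minimal sketch of the interaction protocol described above: class arrivals, per-action contexts, class-dependent linear rewards and consumption, bandit feedback, and a hard budget. It is not the paper's AMF algorithm or its estimator; the policy here is a uniformly random placeholder. The function name `simulate_lmmp`, the parameter names (`K` actions, `m` resources), the uniform context distribution, the Gaussian noise level, and the rule of stopping at the first infeasible action are all assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def simulate_lmmp(T=200, J=3, K=4, d=5, m=2, budget=30.0, seed=0):
    """Run one episode of a toy LMMP with a placeholder random policy (NOT AMF)."""
    rng = random.Random(seed)

    # Hypothetical class-dependent parameters, unknown to the decision-maker:
    # theta[j] maps a context to expected reward, W[j][r] to expected use of resource r.
    theta = [[rng.random() / d for _ in range(d)] for _ in range(J)]
    W = [[[rng.random() / d for _ in range(d)] for _ in range(m)] for _ in range(J)]

    remaining = [budget] * m
    total_reward = 0.0
    history = []  # what the decision-maker gets to see, round by round

    for t in range(T):
        j = rng.randrange(J)  # class of the arriving item
        contexts = [[rng.random() for _ in range(d)] for _ in range(K)]

        # Placeholder policy: pick an action uniformly at random; -1 means "skip".
        a = rng.randrange(-1, K)
        if a == -1:
            continue  # skipping earns nothing and consumes nothing

        # Bandit feedback: only the chosen action's noisy outcome is revealed.
        x = contexts[a]
        reward = dot(theta[j], x) + rng.gauss(0.0, 0.01)
        consumption = [
            max(0.0, dot(W[j][r], x) + rng.gauss(0.0, 0.01)) for r in range(m)
        ]

        # Assumed stopping rule: end the episode once an action would exceed the budget.
        if any(c > rem for c, rem in zip(consumption, remaining)):
            break

        remaining = [rem - c for rem, c in zip(remaining, consumption)]
        total_reward += reward
        history.append((t, j, a, reward, consumption))

    used = [budget - rem for rem in remaining]
    return total_reward, used, history
```

Calling `simulate_lmmp(seed=0)` returns the collected reward, the per-resource consumption, and the observed history. A learning policy would replace the random choice of `a` with a decision based on parameter estimates built from `history` and the observed contexts.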