Abstract: We consider the linear contextual multi-class multi-period packing problem (LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon $T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed by Agrawal & Devanur (2016) and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.

What problem does this paper attempt to address?

This paper addresses decision optimization in the linear contextual multi-class multi-period packing problem (LMMP) under bandit feedback. The goal is to choose actions over a given time horizon so as to maximize the total reward while keeping total resource consumption within the budget. The problem setting has three features:

- **Multi-class items**: In each round, an item belonging to one of several classes arrives. The decision-maker observes the item's context and selects an action based on it.
- **Linear context dependence**: The reward and the resource consumption vector of each action are linear functions of the context, and the linear relationship depends on the class of the item.
- **Bandit feedback**: The decision-maker observes only the outcome of the selected action and learns nothing directly about the unselected actions.

The main contributions are a new estimator and an improved algorithm, which address the weakness of existing methods when the budget is small. Specifically:

- **New estimator**: The paper proposes an estimation strategy that uses the context information of all actions (including skipped rounds) and thereby achieves a faster convergence rate.
- **Improved algorithm**: Building on the new estimator, the paper proposes an algorithm named "Allocate to the Maximum First" (AMF), which aims to maximize the total reward while keeping resource consumption within the budget.
- **Theoretical analysis**: The paper proves that, under the non-degenerate context assumption, the regret of the AMF algorithm is $\tilde{O}\left(\frac{\text{OPT}}{B}\sqrt{JdT}\right)$, where $B$ is the budget, $J$ is the number of classes, $d$ is the context dimension, $T$ is the time horizon, and $\text{OPT}$ is the total reward of the optimal policy.

Through these contributions, the paper resolves an open problem posed by Agrawal & Devanur (2016) and extends the result to the more general multi-class LMMP, which is relevant to applications such as e-commerce, clinical trials, and dynamic pricing.
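To make the problem setting concrete, the following is a minimal simulation sketch of an LMMP-style environment with bandit feedback. It is not the paper's estimator or the AMF algorithm: the policy is a naive epsilon-greedy rule on ridge-regression reward estimates, used only as a stand-in. The function name `simulate_lmmp`, the parameter arrays `theta` and `W`, the uniform contexts, the noiseless consumption, and the rule of skipping a round when any resource has less than one unit left are all assumptions made for illustration.

```python
import numpy as np


def simulate_lmmp(T=500, J=2, K=3, d=4, m=2, budget=40.0, eps=0.1, seed=0):
    """Toy LMMP-style simulation with bandit feedback (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: for class j and action k, the expected reward
    # is theta[j, k] @ x and the consumption vector is W[j, k] @ x.
    theta = rng.uniform(0.0, 1.0, size=(J, K, d))
    # Entries in [0, 1/d] and contexts in [0, 1]^d keep each consumption <= 1.
    W = rng.uniform(0.0, 1.0 / d, size=(J, K, m, d))

    remaining = np.full(m, budget, dtype=float)

    # Ridge-regression statistics for the reward of each (class, action) pair.
    gram = np.tile(np.eye(d), (J, K, 1, 1))
    moment = np.zeros((J, K, d))

    total_reward, skipped = 0.0, 0
    for _ in range(T):
        j = rng.integers(J)                 # class of the arriving item
        x = rng.uniform(0.0, 1.0, size=d)   # observed context

        # Skip the round if some resource could not absorb a worst-case
        # consumption of 1, so the budget is never exceeded.
        if np.any(remaining < 1.0):
            skipped += 1
            continue

        # Epsilon-greedy choice on estimated rewards (naive stand-in policy).
        if rng.random() < eps:
            k = int(rng.integers(K))
        else:
            est = [np.linalg.solve(gram[j, a], moment[j, a]) @ x for a in range(K)]
            k = int(np.argmax(est))

        # Bandit feedback: only the chosen action's outcome is revealed.
        reward = theta[j, k] @ x + rng.normal(0.0, 0.1)
        consumption = W[j, k] @ x

        remaining -= consumption
        total_reward += reward
        gram[j, k] += np.outer(x, x)
        moment[j, k] += reward * x

    return total_reward, remaining, skipped
```

Calling `simulate_lmmp()` returns the accumulated reward, the leftover budget vector, and the number of skipped rounds. Note that this baseline updates its estimates only from the chosen action in non-skipped rounds, which is the kind of information loss the paper's estimator is described as avoiding.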

Improved Algorithms for Multi-period Multi-class Packing Problems with Bandit Feedback

An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression

High-dimensional Linear Bandits with Knapsacks

Multi-Objective Generalized Linear Bandits

Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

Per-Round Knapsack-Constrained Linear Submodular Bandits

A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback

Contextual Bandits with Arm Request Costs and Delays

Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Learning with Guarantee via Constrained Multi-armed Bandit: Theory and Network Applications

Batched Nonparametric Contextual Bandits

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization

Learning Context-Aware Probabilistic Maximum Coverage Bandits: A Variance-Adaptive Approach

Combinatorial Logistic Bandits

Multi-Armed Bandit with Budget Constraint and Variable Costs

A One-Size-Fits-All Solution to Conservative Bandit Problems

Federated Combinatorial Multi-Agent Multi-Armed Bandits

Combinatorial Multi-Armed Bandit: General Framework and Applications