Constrained Best Arm Identification in Grouped Bandits

Sahil Dharod, Malyala Preethi Sravani, Sakshi Heda, Sharayu Moharir
2024-12-11
Abstract: We study a grouped bandit setting where each arm comprises multiple independent sub-arms referred to as attributes. Each attribute of each arm has an independent stochastic reward. We impose the constraint that for an arm to be deemed feasible, the mean reward of all its attributes should exceed a specified threshold. The goal is to find the arm with the highest mean reward averaged across attributes among the set of feasible arms in the fixed confidence setting. We first characterize a fundamental limit on the performance of any policy. Following this, we propose a near-optimal confidence interval-based policy to solve this problem and provide analytical guarantees for the policy. We compare the performance of the proposed policy with that of two suitably modified versions of action elimination via simulations.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is identifying the feasible arm with the highest mean reward, averaged across its attributes, in the grouped bandit setting. Specifically:

1. **Problem Background**:
   - Each arm consists of multiple independent sub-arms, called attributes. Each attribute has an independent stochastic reward.
   - For an arm to be considered feasible, the mean reward of every one of its attributes must be at least a given threshold \(\mu_{TH}\).
2. **Objective**:
   - In the fixed-confidence setting, find the arm with the highest mean reward, averaged across attributes, among all feasible arms.
   - Fixed confidence means that the algorithm must identify the optimal feasible arm with probability at least \(1 - \delta\) while using as few samples as possible.
3. **Main Challenges**:
   - Identifying the optimal arm efficiently while also verifying the feasibility constraint.
   - Deciding which arms and attributes to sample next, so that enough information is gathered to both rule out infeasible arms and separate the best feasible arm from the rest, without wasting samples.
4. **Research Contributions**:
   - The authors first derive a fundamental lower bound on the performance of any online policy.
   - They propose a near-optimal policy based on confidence intervals and provide theoretical performance guarantees for it.
   - They compare the proposed policy with two suitably modified versions of the action elimination algorithm through simulations; according to this summary, the results favor the proposed policy.
5. **Formula Representation**:
   - Feasibility definition: arm \(i\) is feasible if the mean rewards of all its attributes are greater than or equal to the threshold \(\mu_{TH}\). With \(\mu_{ij}\) denoting the mean reward of attribute \(j\) of arm \(i\), and \(N\) the number of arms, the feasible set is
     \[ F := \{ i \in [N] : \min_j \mu_{ij} \geq \mu_{TH} \} \]
   - Definition of the optimal feasible arm: among the feasible arms \(F\), the arm \(i^*\) with the highest mean reward averaged over its \(M\) attributes:
     \[ i^* := \arg\max_{i \in F} \mu_i, \quad \text{where} \quad \mu_i := \frac{1}{M} \sum_{j = 1}^M \mu_{ij} \]
6. **Conclusion**:
   - This research provides a method for constrained best arm identification and supports it with theoretical analysis and simulation results.

In summary, this paper addresses how to efficiently find the optimal arm that satisfies a threshold constraint in the grouped multi-armed bandit setting, and proposes a new algorithm with performance guarantees for this purpose.
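The definitions of the feasible set \(F\) and the optimal feasible arm \(i^*\) can be illustrated with a minimal sketch that assumes the true attribute means are known. This is only the target quantity the policy is trying to learn from samples, not the paper's sampling policy; the function name `best_feasible_arm`, the example means, and the threshold value are hypothetical, and arms are indexed from 0.

```python
import numpy as np

def best_feasible_arm(mu, mu_th):
    """Return (feasible_set, best_arm) for an N x M matrix of attribute means.

    mu[i, j] is the (assumed known) mean reward of attribute j of arm i.
    An arm is feasible when its smallest attribute mean is >= mu_th.
    best_arm is None when no arm is feasible.
    """
    mu = np.asarray(mu, dtype=float)
    # F := { i : min_j mu[i, j] >= mu_th }
    feasible = np.flatnonzero(mu.min(axis=1) >= mu_th)
    if feasible.size == 0:
        return [], None
    # mu_i := (1 / M) * sum_j mu[i, j]
    arm_means = mu.mean(axis=1)
    # i* := argmax over the feasible arms only
    best = feasible[np.argmax(arm_means[feasible])]
    return feasible.tolist(), int(best)

# Hypothetical instance with N = 3 arms and M = 2 attributes.
example_mu = np.array([
    [0.9, 0.3],  # arm 0: highest average, but one attribute is below the threshold
    [0.6, 0.5],  # arm 1: feasible
    [0.5, 0.4],  # arm 2: feasible, lower average than arm 1
])
feasible_set, best_arm = best_feasible_arm(example_mu, mu_th=0.4)
print(feasible_set, best_arm)
```

In this instance arm 0 has the largest average but violates the threshold on one attribute, so it is excluded and the best feasible arm is arm 1. This shows why the constraint changes the answer relative to unconstrained best arm identification.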