Abstract: We study a grouped bandit setting where each arm comprises multiple independent sub-arms referred to as attributes. Each attribute of each arm has an independent stochastic reward. We impose the constraint that for an arm to be deemed feasible, the mean reward of all its attributes should exceed a specified threshold. The goal is to find the arm with the highest mean reward averaged across attributes among the set of feasible arms in the fixed confidence setting. We first characterize a fundamental limit on the performance of any policy. Following this, we propose a near-optimal confidence interval-based policy to solve this problem and provide analytical guarantees for the policy. We compare the performance of the proposed policy with that of two suitably modified versions of action elimination via simulations.

What problem does this paper attempt to address?

The problem this paper addresses is finding the feasible arm with the highest mean reward, averaged across attributes, in the grouped bandit setting. Specifically:

1. **Problem Background**:
   - Each arm consists of multiple independent sub-arms, called attributes. Each attribute has an independent stochastic reward.
   - For an arm to be considered feasible, the mean reward of every one of its attributes must be at least a given threshold \(\mu_{TH}\).
2. **Objective**:
   - In the fixed-confidence setting, find the arm with the highest mean reward averaged across attributes among all feasible arms.
   - The fixed-confidence setting means that the algorithm must identify the optimal arm with probability at least \(1 - \delta\) while using as few samples as possible.
3. **Main Challenges**:
   - How to identify the optimal arm efficiently while respecting the feasibility constraint.
   - The algorithm must choose adaptively which arms and attributes to sample next, based on the information gathered so far, so that samples are not spent on arms whose status is already clear.
4. **Research Contributions**:
   - The authors first derive a fundamental lower bound on the performance of any policy.
   - They propose a near-optimal policy based on confidence intervals and provide analytical performance guarantees for it.
   - They compare the proposed policy with two suitably modified versions of the action elimination algorithm through simulations; the results are reported to favor the proposed policy.
5. **Formula Representation**:
   - Feasibility Definition: Let \(\mu_{ij}\) denote the mean reward of attribute \(j\) of arm \(i\), with \(N\) arms and \(M\) attributes per arm. Arm \(i\) is feasible if the mean rewards of all its attributes are greater than or equal to the threshold \(\mu_{TH}\):
     \[ F := \{ i \in [N] : \min_j \mu_{ij} \geq \mu_{TH} \} \]
   - Definition of the Optimal Feasible Arm: Among the set of feasible arms \(F\), the optimal arm \(i^*\) is the one with the highest mean reward averaged across attributes:
     \[ i^* := \arg\max_{i \in F} \mu_i, \quad \text{where} \quad \mu_i := \frac{1}{M} \sum_{j = 1}^M \mu_{ij} \]
6. **Conclusion**:
   - This research provides a method for constrained best arm identification, supported by theoretical analysis and simulation experiments.

In summary, this paper addresses how to efficiently find the optimal arm that satisfies a threshold constraint in the grouped multi-armed bandit setting, and proposes a new algorithm with performance guarantees for this purpose.
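The definitions of \(F\) and \(i^*\) above can be made concrete with a small sketch. It assumes the true means \(\mu_{ij}\) are known, which is never the case for the actual learning problem, so it illustrates only the target the policy is trying to identify, not the paper's algorithm. The function name `best_feasible_arm` and the example numbers are hypothetical, and arms are indexed from 0 rather than 1.

```python
def best_feasible_arm(mu, mu_th):
    """Return (feasible_set, best_arm) for a list of per-arm attribute means.

    mu[i][j] is the (assumed known) mean reward of attribute j of arm i.
    An arm is feasible when its smallest attribute mean is >= mu_th.
    best_arm is None when no arm is feasible.
    """
    # F := { i : min_j mu_ij >= mu_TH }
    feasible = [i for i, row in enumerate(mu) if min(row) >= mu_th]
    if not feasible:
        return feasible, None
    # i* := argmax over F of the attribute-averaged mean mu_i
    best = max(feasible, key=lambda i: sum(mu[i]) / len(mu[i]))
    return feasible, best


# Hypothetical instance with N = 3 arms and M = 2 attributes.
mu = [
    [0.9, 0.2],  # arm 0: highest average (0.55) but one attribute is below 0.3
    [0.5, 0.5],  # arm 1: average 0.50, both attributes above 0.3
    [0.4, 0.5],  # arm 2: average 0.45, both attributes above 0.3
]
feasible, best = best_feasible_arm(mu, mu_th=0.3)
print(feasible, best)  # prints: [1, 2] 1
```

In this instance the arm with the highest overall average is infeasible, so the optimal feasible arm differs from the unconstrained best arm. This is the situation that makes the constrained problem distinct from standard best arm identification.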

Constrained Best Arm Identification in Grouped Bandits

Best Arm Identification in Bandits with Limited Precision Sampling

Pure Exploration in Bandits with Linear Constraints

Optimal Best Arm Identification with Fixed Confidence in Restless Bandits

Top Feasible-Arm Selections in Constrained Multi-Armed Bandit

Best Arm Identification with Minimal Regret

Best Arm Identification in Linear Bandits with Linear Dimension Dependency

Best Arm Identification in Stochastic Bandits: Beyond $\beta$-optimality

Best Arm Identification in Batched Multi-armed Bandit Problems

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

Best Arm Identification in Spectral Bandits

Robust Best-arm Identification in Linear Bandits

Exploring Best Arm With Top Reward-Cost Ratio In Stochastic Bandits

Combinatorial Multi-armed Bandits: Arm Selection via Group Testing

Max-Quantile Grouped Infinite-Arm Bandits

Adaptive Multiple-Arm Identification

Functional Bandits

Best arm identification in multi-armed bandits with delayed feedback

The Role of Contextual Information in Best Arm Identification

Optimal Best-Arm Identification in Bandits with Access to Offline Data

Multi-Agent Best Arm Identification in Stochastic Linear Bandits