Abstract: This brief studies a variant of the stochastic multiarmed bandit (MAB) problem in which the agent has a piece of prior knowledge called the near-optimal mean reward (NoMR). In the standard MAB problem, an agent tries to find the optimal arm without knowing the optimal mean reward. In many practical applications, however, the agent can obtain an estimate of the optimal mean reward, which we define as the NoMR. For instance, in an online Web advertising system based on MAB methods, a user's near-optimal average click rate (the NoMR) can be roughly estimated from the user's demographic characteristics. Exploiting the NoMR can therefore improve an algorithm's performance. First, we formalize the stochastic MAB problem with a known NoMR that lies between the suboptimal mean reward and the optimal mean reward. Second, we use the cumulative regret as the performance metric and show that the lower bound of the cumulative regret for this problem is Ω(1/Δ), where Δ is the difference between the optimal mean reward and the suboptimal mean reward. In contrast to the conventional MAB problem, whose regret lower bound grows logarithmically, our regret lower bound is uniform in the learning step, that is, it does not grow with the number of steps. Third, we propose a novel algorithm, NOMR-BANDIT, to solve this problem. In NOMR-BANDIT, the NoMR is used to design an efficient exploration strategy. We also analyze the regret upper bound of NOMR-BANDIT and show that it is likewise uniform, O(1/Δ), which is of the same order as the lower bound. Consequently, NOMR-BANDIT is an optimal algorithm for this problem. To generalize the method, we propose CASCADE-BANDIT, built on NOMR-BANDIT, for the case where the NoMR is less than the suboptimal mean reward. CASCADE-BANDIT has a regret upper bound of O(Δ log n), where n is the learning step, which is of the same order as that of conventional MAB methods. Finally, extensive experimental results demonstrate that NOMR-BANDIT is more efficient than the compared bandit solutions. After sufficient iterations, NOMR-BANDIT incurred 10%-80% less cumulative regret than the state of the art.
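
To make the setting concrete, the following is a minimal Python sketch of a bandit loop that is given a threshold lying between the suboptimal and the optimal mean reward, together with the cumulative (pseudo-)regret it accumulates. It is not NOMR-BANDIT, whose details the abstract does not state. The function name `threshold_bandit`, the Bernoulli reward model, and the rule "exploit any arm whose empirical mean is at least the NoMR, otherwise pull the least-sampled arm" are all assumptions made purely for illustration.

```python
import random


def threshold_bandit(means, nomr, horizon, seed=0):
    """Illustrative threshold rule (hypothetical, NOT the paper's NOMR-BANDIT).

    means   : true Bernoulli means of the arms (used only to simulate rewards
              and to measure regret; the rule itself never reads them)
    nomr    : assumed known value between the suboptimal and optimal means
    horizon : number of learning steps n
    Returns (pull counts per arm, cumulative pseudo-regret after each step).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    regret_curve = []

    for t in range(horizon):
        if t < k:
            arm = t  # initialization: pull every arm once
        else:
            emp = [sums[i] / pulls[i] for i in range(k)]
            above = [i for i in range(k) if emp[i] >= nomr]
            if above:
                # Exploit: some arm currently looks at least as good as the NoMR.
                arm = max(above, key=lambda i: emp[i])
            else:
                # Explore: no arm clears the threshold, so sample the least-pulled arm.
                arm = min(range(k), key=lambda i: pulls[i])

        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward

        # Pseudo-regret: gap between the optimal mean and the pulled arm's mean.
        regret += best - means[arm]
        regret_curve.append(regret)

    return pulls, regret_curve
```

Calling `threshold_bandit([0.3, 0.5, 0.7], nomr=0.6, horizon=1000)` returns the pull counts and the regret curve for a three-arm instance in which Δ = 0.2. Plotting the curve against the step index is one way to see whether the regret flattens out or keeps growing. This simple rule carries no guarantee; the O(1/Δ) bound in the abstract applies to the authors' algorithm, not to this sketch.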

DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

Beyond Reward: Offline Preference-guided Policy Optimization

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints

Online Learning with Diverse User Preferences

Online Bandit Learning with Offline Preference Data

A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application

Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

The Bandit Whisperer: Communication Learning for Restless Bandits

An Optimal Algorithm for the Stochastic Bandits While Knowing the Near-Optimal Mean Reward

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret

The Non-Bayesian Restless Multi-Armed Bandit: a Case of Near-Logarithmic Regret

Efficient Resource Allocation with Fairness Constraints in Restless Multi-Armed Bandits

RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback