Abstract: In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined as the total expected reward loss against the ideal case with known reward models. For heavy-tailed reward distributions, DSEE achieves O(T^(1/p)) regret when the moments of the reward distributions exist up to the pth order for 1 < p <= 2, and O(T^(1/(1+p/2))) regret for p > 2. With the knowledge of an upper bound on a finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing corresponding results for general reward distributions. Furthermore, with a clearly defined tunable parameter, the cardinality of the exploration sequence, the DSEE approach is easily extendable to variations of MAB, including MAB with various objectives, decentralized MAB with multiple players and incomplete reward observations under collisions, MAB with unknown Markov dynamics, and combinatorial MAB with dependent arms that often arise in network optimization problems such as the shortest path, the minimum spanning tree, and the dominating set problems under unknown random weights.
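
The idea of separating exploration from exploitation in time can be illustrated with a minimal Python sketch. It is not the paper's exact construction. It assumes an exploration sequence whose cardinality grows as n_arms * ceil(w * log t), round-robin play during exploration, and sample means computed from exploration rounds only. The names `dsee`, `pull`, and `w` are hypothetical and introduced only for illustration.

```python
import math


def dsee(pull, n_arms, horizon, w=2.0):
    """Sketch of a DSEE-style policy; returns the list of arms played.

    pull(arm) is an assumed callback that returns one observed reward.
    w is an assumed constant scaling the size of the exploration sequence.
    """
    sums = [0.0] * n_arms   # reward totals collected in exploration rounds
    counts = [0] * n_arms   # exploration pulls per arm
    n_explore = 0           # cardinality of the exploration sequence so far
    played = []
    for t in range(1, horizon + 1):
        # Target cardinality grows logarithmically in t; the max(1, ...)
        # guarantees that every arm is explored at least once.
        target = n_arms * max(1, math.ceil(w * math.log(t)))
        if n_explore < target:
            # Exploration round: cycle through the arms in fixed order.
            arm = n_explore % n_arms
            reward = pull(arm)
            sums[arm] += reward
            counts[arm] += 1
            n_explore += 1
        else:
            # Exploitation round: play the arm with the largest sample mean.
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a])
            pull(arm)
        played.append(arm)
    return played
```

For example, `dsee(lambda a: [0.1, 0.9, 0.5][a], n_arms=3, horizon=2000)` spends only a logarithmic number of rounds exploring and plays the second arm in nearly all remaining rounds. A larger `w` lengthens the exploration sequence, which is the tunable parameter the abstract refers to.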

Empirical Gittins Index Strategies with ε-Explorations for Multi-Armed Bandit Problems

Computing the Performance of a New Adaptive Sampling Algorithm Based on the Gittins Index in Experiments with Exponential Rewards

A General Theory of MultiArmed Bandit Processes with Constrained Arm Switches

A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions

GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits

Tabular and Deep Reinforcement Learning for Gittins Index

Open Bandit Processes with Uncountable States and Time-Backward Effects

Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits

Forced Exploration in Bandit Problems

Incentivized Exploration for Multi-Armed Bandits under Reward Drift

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards

Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems

Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards

Adaptive Exploration in Stochastic Multi-armed Bandit Problem

Disentangling Exploration from Exploitation

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits

Approximate information for efficient exploration-exploitation strategies

You Can Trade Your Experience in Distributed Multi-Agent Multi-Armed Bandits

Two-Armed Restless Bandits with Imperfect Information: Stochastic Control and Indexability