Abstract: In this paper we study a generalized version of the classical multi-armed bandit (MAB) problem that allows arbitrary constraints on the constituent bandits at each decision point. The motivation comes from the many situations that involve repeatedly making choices subject to arbitrary constraints in an uncertain environment: for instance, regularly deciding which advertisements to display online to achieve a high click-through rate without knowing user preferences, or which route to drive home each day under uncertain weather and traffic conditions. Assume that there are K unknown random variables (RVs), i.e., arms, each evolving as an i.i.d. stochastic process over time. At each decision epoch, we select a strategy, i.e., a subset of RVs, subject to arbitrary constraints on the constituent RVs. We then gain a reward that is a linear combination of the observations on the selected RVs. The performance of prior results for this problem depends heavily on the distribution of strategies generated by the corresponding learning policy. For example, if the reward difference between the best and second-best strategies approaches zero, prior results may lead to arbitrarily large regret. Meanwhile, when there is an exponential number of possible strategies at each decision point, a naive extension of a prior distribution-free policy would perform poorly in terms of regret, computation, and space complexity. To this end, we propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret, regardless of the probability distribution of the resultant strategies. Our learning policy has both O(K) time complexity and O(K) space complexity. In successive generations, we show that even if finding the optimal strategy at each decision point is NP-hard, our policy still allows for approximate solutions while retaining near zero-regret.
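The setting described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not the DFL policy, whose details the abstract does not give. It assumes Bernoulli arms, an explicitly enumerated feasible set, and a generic optimistic (UCB-style) index as a stand-in policy; the names `strategy_value` and `run_semi_bandit` and the placeholder value for unplayed arms are hypothetical.

```python
import math
import random


def strategy_value(strategy, weights, means):
    """Expected reward of a strategy: a linear combination of arm means."""
    return sum(weights[i] * means[i] for i in strategy)


def run_semi_bandit(means, weights, feasible, horizon, seed=0):
    """Simulate K Bernoulli arms with constrained subset selection.

    `feasible` lists the allowed strategies (tuples of arm indices).
    Enumerating it is only practical for small examples; the abstract
    notes that the number of strategies can be exponential.
    The per-arm optimistic index is a generic stand-in, NOT the DFL policy.
    """
    rng = random.Random(seed)
    num_arms = len(means)
    counts = [0] * num_arms        # times each arm was observed
    estimates = [0.0] * num_arms   # running sample mean per arm
    best = max(strategy_value(s, weights, means) for s in feasible)
    regret = 0.0

    for t in range(1, horizon + 1):
        # Optimistic index per arm; unplayed arms get a large placeholder
        # value (an assumption of this sketch) so they are tried early.
        index = [
            estimates[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
            if counts[i] > 0 else 1e6
            for i in range(num_arms)
        ]
        chosen = max(feasible,
                     key=lambda s: sum(weights[i] * index[i] for i in s))

        # Observe every selected arm and update its sample mean.
        for i in chosen:
            obs = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            estimates[i] += (obs - estimates[i]) / counts[i]

        # Accumulate expected regret against the best feasible strategy.
        regret += best - strategy_value(chosen, weights, means)

    return regret, counts
```

For example, `run_semi_bandit([0.2, 0.5, 0.8], [1.0, 1.0, 1.0], [(0, 1), (1, 2), (0, 2)], 500)` returns the cumulative expected regret and the per-arm observation counts for a three-arm instance in which every strategy selects exactly two arms. Dividing the regret by the horizon gives the time-averaged regret that a zero-regret policy drives toward zero.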

Distributed Multi-Armed Bandits: Regret Vs. Communication

Regret Vs. Communication: Distributed Stochastic Multi-Armed Bandits and Beyond

Distributed Bandit Learning: Near-Optimal Regret with Efficient Communication

Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits

A Decentralized Policy with Logarithmic Regret for a Class of Multi-Agent Multi-Armed Bandit Problems with Option Unavailability Constraints and Stochastic Communication Protocols

Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs

Achieve Near-Optimal Individual Regret & Low Communications in Multi-Agent Bandits

Constant or logarithmic regret in asynchronous multiplayer bandits

Byzantine-Resilient Decentralized Multi-Armed Bandits

Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Distributed Bandits with Heterogeneous Agents

Towards Distribution-Free Multi-Armed Bandits with Combinatorial Strategies

Individual Regret in Cooperative Stochastic Multi-Armed Bandits

On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits

Combinatorial Multi-Armed Bandit: General Framework and Applications

Distributed Stochastic Bandit Learning with Delayed Context Observation

Settling the Communication Complexity for Distributed Offline Reinforcement Learning

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback

Networked Bandits With Disjoint Linear Payoffs

Distributed Differential Privacy in Multi-Armed Bandits