Abstract: Cooperative multi-agent multi-armed bandits (CMAB) study how distributed agents cooperatively play the same multi-armed bandit game. Most existing CMAB work focuses on maximizing the group performance of all agents, that is, the sum of all agents' individual performance (i.e., individual reward). However, in many applications, the performance of the system is more sensitive to the "bad" agent, the agent with the worst individual performance. For example, in a drone swarm, a "bad" agent may crash into other drones and severely degrade the system performance. In that case, the key to the learning algorithm design is to coordinate computational and communication resources among agents so as to optimize the individual learning performance of the "bad" agent. In CMAB, maximizing the group performance is equivalent to minimizing the group regret of all agents, and optimizing the performance of the worst agent is equivalent to minimizing the maximum (worst) individual regret among agents. Minimizing the maximum individual regret has been largely ignored in prior literature, and there is currently little work on how to minimize this objective with low communication overhead. In this paper, we propose an algorithm that is near-optimal in both individual and group regret. We also propose a novel communication module for the algorithm, which needs only \(O(\log (\log T))\) communications, where \(T\) is the number of decision rounds. Finally, we conduct simulations to illustrate the advantage of our algorithm by comparing it with known baselines.
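
For readers who want the two objectives stated explicitly, one standard way to write them is sketched below. The notation is an assumption made for illustration and is not taken from the abstract: \(M\) agents, arm means \(\mu_k\) with \(\mu^* = \max_k \mu_k\), and \(a_t^{(m)}\) the arm pulled by agent \(m\) in round \(t\).

\[
R_T^{(m)} = T\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t^{(m)}}\Big],
\qquad
R_T^{\mathrm{group}} = \sum_{m=1}^{M} R_T^{(m)},
\qquad
R_T^{\max} = \max_{1 \le m \le M} R_T^{(m)}.
\]

Under these definitions \(R_T^{\max} \le R_T^{\mathrm{group}} \le M \cdot R_T^{\max}\), so a small group regret does not by itself rule out a single agent carrying most of it. That gap is what makes the maximum individual regret a separate objective.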

Exploration for Free: How Does Reward Heterogeneity Improve Regret in Cooperative Multi-agent Bandits?

Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

Bayesian Incentive-Compatible Bandit Exploration

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Distributed Bandits with Heterogeneous Agents

Fair Exploration via Axiomatic Bargaining

Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs

Achieve Near-Optimal Individual Regret & Low Communications in Multi-Agent Bandits

Satisficing Exploration in Bandit Optimization

Incentivized Exploration of Non-Stationary Stochastic Bandits

Diminishing Exploration: A Minimalist Approach to Piecewise Stationary Multi-Armed Bandits

Multi-Armed Bandits with Abstention

Bandits with concave rewards and convex knapsacks

Optimal Regret Bounds for Collaborative Learning in Bandits

Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption

Forced Exploration in Bandit Problems

Competing for Shareable Arms in Multi-Player Multi-Armed Bandits

Regret Vs. Communication: Distributed Stochastic Multi-Armed Bandits and Beyond

Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization

On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits