A Distributed Algorithm for Sequential Decision Making in Multi-Armed Bandit with Homogeneous Rewards

Jingxuan Zhu,Romeil Sandhu,Ji Liu
DOI: https://doi.org/10.1109/CDC42340.2020.9303836
2020-01-01
Abstract:This paper studies a distributed multi-armed bandit problem over a network of N agents, each of which can communicate only with its neighbors, where neighbor relationships are described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet it only has access to local samples of the reward for each action, which is a random variable. A distributed upper confidence bound (UCB) algorithm is proposed for the agents to cooperatively learn the best decision. It is shown that when all the agents share a homogeneous distribution of each arm reward, the algorithm achieves guaranteed logarithmic regret for all N agents at the order of O((1 + 2 rho(2))(2) log T/N) when T is large, where rho(2) denotes the second largest among the absolute values of all the eigenvalues of the Metropolis matrix of G. A sufficient condition under which the proposed distributed algorithm learns faster than the centralized (single-agent) counterpart is provided. Simulations suggest that the algorithm also works for the case when the agents have heterogeneous observations of each arm reward.
What problem does this paper attempt to address?