Distributed Multiarmed Bandits

Jingxuan Zhu, Ji Liu
DOI: https://doi.org/10.1109/tac.2023.3247982
IF: 6.549
2023-01-01
IEEE Transactions on Automatic Control
Abstract: This article studies a distributed multiarmed bandit problem with heterogeneous observations of rewards. The problem is solved cooperatively by $N$ agents, where each agent faces a common set of $M$ arms yet observes only local, biased rewards of the arms. The goal of each agent is to minimize the cumulative expected regret with respect to the true rewards of the arms, where the mean of each arm's true reward equals the average of the means of all agents' observed biased rewards. Each agent recursively updates its decision using information from its neighbors. Neighbor relationships are described by a time-dependent directed graph $\mathbb{G}(t)$ whose vertices correspond to agents and whose arcs depict neighbor relationships. A fully distributed bandit algorithm is proposed, which couples the classical distributed averaging algorithm with the celebrated upper confidence bound (UCB) bandit algorithm. It is shown that, for any uniformly strongly connected sequence of graphs $\mathbb{G}(t)$, the algorithm achieves a guaranteed regret of order $O(\log T)$ for each agent.
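As a rough illustration of how distributed averaging can be coupled with a UCB rule, the sketch below simulates a toy version of the setting. It is not the algorithm from the article. The function name `run_sketch`, the fixed doubly stochastic weight matrix `weights` (standing in for the time-varying directed graph $\mathbb{G}(t)$), the Gaussian reward noise, the averaging-with-tracking update, and the exploration bonus $\sqrt{2\log t / n}$ based on each agent's local pull count are all assumptions made for this example.

```python
import math
import random


def run_sketch(local_means, weights, horizon, noise=0.1, seed=0):
    """Toy consensus-plus-UCB loop (illustrative assumptions only).

    local_means[i][k]: mean of the biased reward agent i observes on arm k.
    weights[i][j]: assumed fixed doubly stochastic mixing weights.
    """
    rng = random.Random(seed)
    n_agents, n_arms = len(local_means), len(local_means[0])
    # True mean of each arm = average of the agents' local (biased) means.
    true_means = [sum(row[k] for row in local_means) / n_agents
                  for k in range(n_arms)]
    best = max(true_means)

    counts = [[0] * n_arms for _ in range(n_agents)]
    sample_mean = [[0.0] * n_arms for _ in range(n_agents)]  # local averages
    est = [[0.0] * n_arms for _ in range(n_agents)]  # consensus estimates
    regret = [0.0] * n_agents

    for t in range(1, horizon + 1):
        new_mean = [row[:] for row in sample_mean]
        for i in range(n_agents):
            if t <= n_arms:
                k = t - 1  # pull every arm once before using the index
            else:
                # UCB index: consensus estimate plus exploration bonus.
                k = max(range(n_arms),
                        key=lambda a: est[i][a]
                        + math.sqrt(2.0 * math.log(t) / counts[i][a]))
            reward = local_means[i][k] + rng.gauss(0.0, noise)
            counts[i][k] += 1
            new_mean[i][k] += (reward - new_mean[i][k]) / counts[i][k]
            regret[i] += best - true_means[k]  # regret w.r.t. true rewards

        # Averaging step with tracking: mix the neighbors' estimates, then
        # add the change in the agent's own local sample mean.
        est = [[sum(weights[i][j] * est[j][k] for j in range(n_agents))
                + new_mean[i][k] - sample_mean[i][k]
                for k in range(n_arms)]
               for i in range(n_agents)]
        sample_mean = new_mean

    return true_means, sample_mean, est, regret
```

With doubly stochastic weights, the network-wide average of the consensus estimates equals the network-wide average of the local sample means at every step. This is what lets each agent's estimate track a quantity related to the true reward even though its own observations are biased. The function can be called, for example, with four agents on a ring using weight 0.5 on the agent itself and 0.25 on each of its two neighbors.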