Second-order Multi-Armed Bandit Learning for Online Optimization in Communication and Networks

Zhiyong Du, Bin Jiang, Kun Xu, Shengyun Wei, Shengqing Wang, Huatao Zhu
DOI: https://doi.org/10.1145/3321408.3323078
2019-01-01
Abstract: Multi-armed bandit (MAB) based reinforcement learning, which can learn in dynamic and uncertain environments with analytic performance bounds, provides a robust framework for resource optimization and scheduling problems in communication and networks. The goal of an MAB problem is to learn the best arms, i.e., the arms that provide the largest mean reward when played. In actual communication systems, not only the mean (i.e., the first-order statistic) but also the second-order dynamics of the reward are important, since a larger dynamic range may result in more frequent system reconfiguration or adaptation and in degraded user quality of experience (QoE). However, traditional MAB models do not consider the second-order dynamics of the reward and therefore fail to provide a tailored characterization when applied to communications. Motivated by this issue, this paper first proposes a second-order MAB problem. Specifically, a new best-arm metric and an associated regret that explicitly take the second-order dynamics of the reward into account are defined. Then, a second-order learning algorithm is designed. We further prove that the proposed algorithm is order-optimal. Finally, simulation results are presented to validate the proposed algorithm. The second-order MAB model and algorithm enable a more fine-grained characterization of resource optimization and scheduling problems in communication and networks.
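The abstract does not state the paper's metric or algorithm, so the following is only an illustrative sketch of the general idea of ranking arms by more than their mean. It assumes a hypothetical score of the form mean minus `rho` times variance, Gaussian rewards, and a standard UCB-style exploration bonus; the names `second_order_score`, `run_bandit`, and `rho` are made up for this example and are not taken from the paper.

```python
import math
import random


def second_order_score(mean, variance, rho):
    """Hypothetical arm metric: reward mean penalized by reward variance."""
    return mean - rho * variance


def run_bandit(arms, horizon, rho, seed=0):
    """UCB-style loop on the hypothetical mean-variance score.

    arms: list of (mean, std) pairs for Gaussian rewards (an assumed model).
    Returns the number of times each arm was played.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            indices = []
            for i in range(k):
                mean = sums[i] / counts[i]
                # Empirical (biased) variance, clipped at zero for safety.
                var = max(sq_sums[i] / counts[i] - mean * mean, 0.0)
                bonus = math.sqrt(2.0 * math.log(t) / counts[i])
                indices.append(second_order_score(mean, var, rho) + bonus)
            arm = max(range(k), key=lambda i: indices[i])
        mu, sigma = arms[arm]
        reward = rng.gauss(mu, sigma)
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward
    return counts


# Arm 0 has a slightly higher mean but a much larger spread than arm 1.
arms = [(1.0, 2.0), (0.9, 0.1)]
counts = run_bandit(arms, horizon=2000, rho=1.0)
print(counts)
```

With `rho = 0` the score reduces to the usual mean-only criterion, under which arm 0 is the best arm; with a positive `rho` the steadier arm 1 becomes preferable, which mirrors the abstract's point that a large dynamic range can be costly even when the mean is high.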