A Thompson Sampling Algorithm with Logarithmic Regret for Unimodal Gaussian Bandit.
Long Yang, Zhao Li, Zehong Hu, Shasha Ruan, Gang Pan
DOI: https://doi.org/10.1109/tnnls.2023.3295360
IF: 14.255
2023-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract: In this article, we propose a Thompson sampling algorithm with a Gaussian prior for the unimodal bandit under a Gaussian reward setting, where the expected reward is unimodal over the partially ordered arms. To better exploit the unimodal structure, at each step the proposed algorithm does not explore the entire decision space; instead, it makes decisions according to the posterior distribution only within the neighborhood of the arm with the highest empirical mean estimate. We theoretically prove that the asymptotic regret of our algorithm reaches O(log T), i.e., it has the same regret order as asymptotically optimal algorithms and is comparable to a wide range of existing state-of-the-art unimodal multiarm bandit (U-MAB) algorithms. Finally, we use extensive experiments to demonstrate the effectiveness of the proposed algorithm on both synthetic datasets and real-world applications.
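The decision rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions that the abstract does not state: the arms lie on a line graph (so the neighborhood of arm i is {i-1, i, i+1}), rewards have unit variance, each arm is pulled once for initialization, and the posterior of each arm is approximated as N(empirical mean, 1/(n+1)). The function name `unimodal_gaussian_ts` and all parameters are hypothetical.

```python
import numpy as np


def unimodal_gaussian_ts(means, horizon, seed=0):
    """Sketch of neighborhood-restricted Gaussian Thompson sampling.

    Assumptions (not taken from the paper): line-graph neighborhoods,
    unit-variance Gaussian rewards, posterior N(emp_mean, 1/(n+1)).
    Returns the pull counts per arm and the cumulative pseudo-regret.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    best = max(means)
    regret = 0.0

    for t in range(horizon):
        if t < k:
            # Initialization: pull every arm once so empirical means exist.
            arm = t
        else:
            emp = sums / counts
            # Leader: the arm with the highest empirical mean estimate.
            leader = int(np.argmax(emp))
            # Neighborhood of the leader on a line graph (assumed structure).
            hood = [a for a in (leader - 1, leader, leader + 1) if 0 <= a < k]
            # Draw one posterior sample per neighborhood arm only.
            samples = [
                rng.normal(emp[a], 1.0 / np.sqrt(counts[a] + 1)) for a in hood
            ]
            arm = hood[int(np.argmax(samples))]

        # Observe a unit-variance Gaussian reward for the chosen arm.
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]

    return counts, regret


if __name__ == "__main__":
    # Hypothetical unimodal instance: means rise to a peak at arm 2, then fall.
    counts, regret = unimodal_gaussian_ts([0.1, 0.4, 0.9, 0.5, 0.2], horizon=2000)
    print("pull counts:", counts)
    print("cumulative pseudo-regret:", regret)
```

Restricting sampling to the leader's neighborhood means only a few posterior samples are drawn per step regardless of the total number of arms, which is how the sketch mirrors the abstract's idea of not exploring the entire decision space.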