On the Analysis of Two-Stage Stochastic Bandit
Yumou Liu, Haoming Li, Zhenzhe Zheng, Fan Wu, Guihai Chen
DOI: https://doi.org/10.1145/3641512.3686360
2024-01-01
Abstract: Two-stage bandit-based algorithms are widely used in modern online platforms because they balance cost against accuracy. The first stage coarsely filters a small candidate set of promising items from a large corpus; the second stage refines the selection and presents a single item to the user. In this work we present what is, to the best of our knowledge, the first theoretical analysis of the two-stage stochastic multi-armed bandit problem. Specifically, we model the problem as a two-stage online optimization. We show that although the optimization objective of the first stage may seem intuitive, it is in fact non-trivial. We devise a proxy optimization objective, emphasize the importance of a carefully designed exploration strategy, and provide a theoretical foundation for applying Upper Confidence Bound (UCB)-based algorithms in the first stage. Furthermore, we provide a regret analysis of the proposed two-stage bandit algorithm, demonstrating a gap-dependent upper bound of [EQUATION], where [EQUATION] is the largest reward gap, and a gap-independent lower bound of [EQUATION], where n represents the horizon.
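To make the two-stage structure described in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the authors' algorithm, and it does not implement the proxy objective or the exploration strategy analyzed in the paper. The function names (`ucb_index`, `two_stage_ucb`), the Bernoulli reward model, the shared reward statistics across stages, and the two exploration scales (`stage1_scale`, `stage2_scale`) are all assumptions introduced here for illustration only.

```python
import math
import random


def ucb_index(mean, pulls, t, scale):
    """UCB-style index: empirical mean plus a scaled exploration bonus."""
    if pulls == 0:
        return math.inf  # an unpulled arm is always tried first
    return mean + scale * math.sqrt(2.0 * math.log(t) / pulls)


def two_stage_ucb(true_means, candidate_size, horizon, seed=0,
                  stage1_scale=2.0, stage2_scale=1.0):
    """Simulate a toy two-stage bandit on Bernoulli arms (hypothetical setup).

    Returns (pull counts per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k        # number of times each arm was shown
    means = [0.0] * k      # running empirical mean reward per arm
    best_mean = max(true_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        # Stage 1 (coarse filter): shortlist the arms with the largest
        # index under the more optimistic stage-1 scale.
        stage1 = [ucb_index(means[a], pulls[a], t, stage1_scale)
                  for a in range(k)]
        candidates = sorted(range(k), key=lambda a: stage1[a],
                            reverse=True)[:candidate_size]

        # Stage 2 (refinement): pick a single arm from the shortlist
        # using the stage-2 scale.
        arm = max(candidates,
                  key=lambda a: ucb_index(means[a], pulls[a], t, stage2_scale))

        # Observe a Bernoulli reward and update the shared statistics.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]
        regret += best_mean - true_means[arm]

    return pulls, regret


if __name__ == "__main__":
    counts, total_regret = two_stage_ucb(
        [0.2, 0.5, 0.8, 0.4, 0.3], candidate_size=2, horizon=2000)
    print("pulls per arm:", counts)
    print("pseudo-regret:", total_regret)
```

In this sketch an arm can only be shown if it survives the first-stage filter, so the first stage's choice of candidate set constrains what the second stage can learn. This is one way to see why the abstract stresses the first-stage objective and exploration strategy.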