Abstract: In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists (Wang and Chen, NIPS 2017), our algorithm achieves a distribution-dependent regret bound in the switching case and a distribution-independent regret bound in the dynamic case. These bounds are stated in terms of the number of switchings, the sum of the total “distribution changes”, the total number of arms, and a gap variable that depends on the distributions of arm outcomes. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the number of switchings or the total distribution change in advance. We further show that, by employing another technique, our algorithm no longer needs to know these parameters, but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we apply a new technique to design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.

Combinatorial semi-bandit in the non-stationary environment