Neural Contextual Combinatorial Bandit under Non-stationary Environment

Jiaqi Zheng, Hedi Gao, Haipeng Dai, Zhenzhe Zheng, Fan Wu
DOI: https://doi.org/10.1109/icdm58522.2023.00097
2023-01-01
Abstract: Classic contextual combinatorial multi-armed bandit problems aim to maximize the expected cumulative joint reward in the long run, where in each round a learner plays a set of arms (i.e., a super arm) whose rewards are time-invariant linear functions of the context features. However, in many real-world applications the linear-reward assumption does not hold and the environment is generally non-stationary, so these bandit models perform poorly. Existing works do not handle non-linear rewards in a non-stationary environment, and the algorithmic challenge remains open. In this paper, we initiate the study of a non-stationary neural contextual combinatorial bandit problem, where the reward function of each individual arm can be estimated by a deep neural network under a boundedness assumption and a time-variant reward mapping function. Furthermore, we design an algorithm, NNCMAB, which dynamically partitions the context space into multiple subspaces and fits a reward mapping function for each subspace with a neural network, so that only the models of the affected subspaces are re-trained when a local environment change happens. NNCMAB provably achieves $\tilde{O}\left(T^{\frac{3}{4}}+\sqrt{T}N_{c}\right)$ regret, where $T$ is the number of rounds and $N_{c}$ is a parameter associated with the distribution change. Evaluation results on synthetic and real-world LastFM datasets show that NNCMAB significantly outperforms other state-of-the-art algorithms with both linear and non-linear individual rewards under non-stationary environments.
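The idea of re-training only the models of the affected subspaces can be illustrated with a deliberately simplified sketch. This is not NNCMAB: it assumes a scalar context in [0, 1), a fixed grid in place of dynamic partitioning, a running average in place of a neural network, and a hypothetical drift test that compares a recent window against the long-run mean. All names (`LocalResetBandit`, `n_bins`, `window`, `threshold`) are made up for illustration.

```python
from collections import defaultdict, deque


class LocalResetBandit:
    """Toy sketch: one reward estimator per context cell, refit locally on drift."""

    def __init__(self, n_bins=4, window=20, threshold=0.3):
        self.n_bins = n_bins          # assumed fixed grid over [0, 1)
        self.window = window          # size of the recent-reward window per cell
        self.threshold = threshold    # assumed drift threshold (hypothetical)
        self.total = defaultdict(float)
        self.count = defaultdict(int)
        self.recent = defaultdict(lambda: deque(maxlen=window))
        self.resets = defaultdict(int)

    def cell(self, context):
        # Map a scalar context to the index of its grid cell.
        return min(int(context * self.n_bins), self.n_bins - 1)

    def estimate(self, context):
        c = self.cell(context)
        # Unseen cells get an optimistic value of 1.0 to encourage exploration.
        return self.total[c] / self.count[c] if self.count[c] else 1.0

    def select(self, contexts, k):
        # Super arm: the k arms with the highest estimated individual reward.
        ranked = sorted(range(len(contexts)),
                        key=lambda i: self.estimate(contexts[i]),
                        reverse=True)
        return ranked[:k]

    def update(self, context, reward):
        c = self.cell(context)
        self.total[c] += reward
        self.count[c] += 1
        self.recent[c].append(reward)
        if len(self.recent[c]) == self.window:
            recent_mean = sum(self.recent[c]) / self.window
            long_mean = self.total[c] / self.count[c]
            if abs(recent_mean - long_mean) > self.threshold:
                # Local change detected: refit this cell from its recent
                # window only; every other cell keeps its estimator.
                self.total[c] = sum(self.recent[c])
                self.count[c] = self.window
                self.resets[c] += 1


if __name__ == "__main__":
    bandit = LocalResetBandit()
    for t in range(140):
        # The reward in the low-context cell drops after round 100;
        # the high-context cell is stationary.
        bandit.update(0.1, 1.0 if t < 100 else 0.0)
        bandit.update(0.9, 0.5)
    print(dict(bandit.resets), bandit.estimate(0.1), bandit.estimate(0.9))
    print(bandit.select([0.1, 0.9, 0.4], k=2))
```

In this toy run only the cell whose reward distribution shifted is refit, while the stationary cell keeps its full history. The abstract describes the analogous behavior at the level of neural networks attached to dynamically chosen subspaces.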