Counterfactual contextual bandit for recommendation under delayed feedback

DOI: https://doi.org/10.1007/s00521-024-09800-0
2024-05-10
Neural Computing and Applications
Abstract: Recommendation systems have great practical value because they ease the burden of choosing from a huge amount of information. Existing recommendation systems usually suffer from selection bias because they ignore samples with delayed feedback. To alleviate this problem, we model recommendation as a batch contextual bandit problem and propose a counterfactual reward estimation approach. First, we formalize the counterfactual question as "would the user be interested in the recommended item if the delayed time is before the collection time point?". This counterfactual reward is estimated in a survival analysis framework by fully exploring the causal generation process of user feedback on batch data. Second, based on the estimated counterfactual rewards, the batch contextual bandit policy is updated for online recommendation in the next episode. Third, new batch data are generated during online recommendation for further counterfactual reward estimation. These three steps are iterated until the optimal policy is learned. We also prove theoretically that the learned bandit policy has a sub-linear regret bound. In experiments on synthetic and Criteo datasets, our method achieved an improvement in average reward over the baseline methods, demonstrating the efficacy of our approach.
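The idea of estimating a reward for a sample whose feedback has not yet arrived can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a simple exponential delay model, and the names `counterfactual_reward`, `p_convert`, and `delay_rate` are hypothetical. For a sample with no feedback after `elapsed` time units, Bayes' rule gives the probability that positive feedback would still arrive eventually.

```python
import math


def counterfactual_reward(observed_positive, elapsed, p_convert, delay_rate):
    """Expected eventual reward for one logged sample.

    Assumptions (illustrative only, not taken from the paper):
    - p_convert is a prior probability that the user eventually responds;
    - the feedback delay of a responding user is exponential with rate delay_rate.
    """
    if observed_positive:
        # Feedback already arrived before the collection time point.
        return 1.0
    # Survival function: P(delay > elapsed | user eventually responds).
    survival = math.exp(-delay_rate * elapsed)
    # Joint probability of "will respond" and "has not responded yet".
    pending = p_convert * survival
    # Bayes' rule: P(will respond | no feedback observed within elapsed).
    return pending / ((1.0 - p_convert) + pending)


if __name__ == "__main__":
    for elapsed in (0.0, 1.0, 5.0):
        print(elapsed, counterfactual_reward(False, elapsed, 0.3, 0.5))
```

With `elapsed = 0` the estimate equals the prior `p_convert`, and it shrinks toward zero as the sample waits longer without feedback. In a batch bandit loop, such estimates would stand in for the missing rewards when the policy is updated between episodes.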
computer science, artificial intelligence