CIPPO: Contrastive Imitation Proximal Policy Optimization for Recommendation Based on Reinforcement Learning
Weilong Chen, Shaoliang Zhang, Ruobing Xie, Feng Xia, Leyu Lin, Xinran Zhang, Yan Wang, Yanru Zhang
DOI: https://doi.org/10.1109/tkde.2024.3402649
2024-01-01
Abstract: Recommendation systems, widely adopted in social networks, personalize user experiences through advanced technologies such as Reinforcement Learning (RL), known for producing high-performance, list-wise recommendations. However, RL-based recommendation methods exhibit two biases: 1) Online bias, which arises because the real-world online policy is a complex composite of various rules and models rather than a single policy; 2) Training bias, a distributional shift resulting from differences between the target policy and the behavior policy. To address these issues, we introduce a novel framework named Contrastive Imitation Proximal Policy Optimization (CIPPO) for recommendation based on RL. This approach leverages extensively labeled feedback data and incorporates a Masked Imitation Network (MIN) that closely emulates the online policy, thus reducing discrepancies between online and offline environments. Additionally, the clipping function in Proximal Policy Optimization, combined with a specially designed contrastive module, effectively reduces the distributional shift between the behavior and target policies. We conduct offline and online experiments to demonstrate the improvements of CIPPO, and report ablation tests and parameter analysis to validate its effectiveness and robustness. CIPPO gains 12.79% on ACN and in WeChat Top Stories, a large media platform with over 50 million users.
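The clipping function mentioned in the abstract can be illustrated with the clipped surrogate objective of standard Proximal Policy Optimization. The sketch below is not CIPPO itself: it omits the Masked Imitation Network and the contrastive module, and the function name `ppo_clipped_surrogate`, its parameters, and the default `clip_eps=0.2` are assumptions chosen for illustration. It shows only how clipping the probability ratio between the target and behavior policies limits how far a single update can move the target policy.

```python
import math


def ppo_clipped_surrogate(logp_target, logp_behavior, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    logp_target   -- log-probabilities of the taken actions under the target policy
    logp_behavior -- log-probabilities of the same actions under the behavior policy
    advantages    -- advantage estimates for those actions
    clip_eps      -- clipping range (hypothetical default, not from the paper)
    """
    total = 0.0
    for lp_t, lp_b, adv in zip(logp_target, logp_behavior, advantages):
        # Probability ratio between target and behavior policy.
        ratio = math.exp(lp_t - lp_b)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the target policy assigns a much higher probability than the behavior policy to an action with positive advantage, the clipped term caps the objective, so there is no incentive to push the ratio beyond 1 + clip_eps. This is the mechanism by which clipping keeps the two policies close.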