Generative Inverse Deep Reinforcement Learning for Online Recommendation

Xiaocong Chen, Lina Yao, Aixin Sun, Xianzhi Wang, Xiwei Xu, Liming Zhu
DOI: https://doi.org/10.48550/arXiv.2011.02248
2020-11-04
Abstract: Deep reinforcement learning enables an agent to capture user's interest through interactions with the environment dynamically. It has attracted great interest in the recommendation research. Deep reinforcement learning uses a reward function to learn user's interest and to control the learning process. However, most reward functions are manually designed; they are either unrealistic or imprecise to reflect the high variety, dimensionality, and non-linearity properties of the recommendation problem. That makes it difficult for the agent to learn an optimal policy to generate the most satisfactory recommendations. To address the above issue, we propose a novel generative inverse reinforcement learning approach, namely InvRec, which extracts the reward function from user's behaviors automatically, for online recommendation. We conduct experiments on an online platform, VirtualTB, and compare with several state-of-the-art methods to demonstrate the feasibility and effectiveness of our proposed approach.
Information Retrieval, Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to automatically infer the reward function from user behaviors in online recommendation systems, thereby overcoming the limitations of manually designed reward functions in existing deep reinforcement learning methods. Specifically, the paper proposes a generative inverse deep reinforcement learning method (InvRec) that learns the reward function from user behavior data and then optimizes the recommendation policy to generate the recommendations that best suit the user's interests. According to the paper, this improves not only the performance of the recommendation system but also its ability to generalize to complex online recommendation environments.

### Main Contributions

1. **Propose generative inverse deep reinforcement learning**: Automatically learn the reward function for online recommendation. The authors present this as the first work in which the reward function and the agent are decoupled and applied to online recommendation.
2. **Design a new Actor-Discriminator network module**: Use the discriminator as the critic network and a new Actor-Critic network as the actor network to implement the proposed framework. This module requires no model assumptions and generalizes easily to multiple scenarios.
3. **Experimental verification**: Conduct experiments on the virtual online platform VirtualTB to demonstrate the effectiveness and feasibility of the proposed method. The experimental results show that the method outperforms several state-of-the-art methods in terms of click-through rate.

### Method Overview

- **Inverse Reinforcement Learning**: Infer the reward function from user behavior data, avoiding the complexity and inaccuracy of defining it manually.
- **Generative Adversarial Network (GAN)**: Used to generate diverse recommendation strategies and improve the generalization ability of the system.
- **Actor-Critic Network**: Combine the policy gradient of the actor network with the Q-learning of the critic network to form an efficient learning framework.
- **Policy Optimization**: Use Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) to update the policy parameters so that each new policy improves on the old one.

### Experimental Results

- **Performance Comparison**: On the VirtualTB platform, the proposed method outperforms four other recent recommendation methods in terms of click-through rate and average reward per step.
- **Parameter Influence**: Study the influence of key parameters (such as λ of GAE and ϵ of PPO) on performance and identify the optimal settings.

### Discussion

This paper provides a new online recommendation method based on inverse reinforcement learning. It does not require a manually defined reward function and is suitable for a range of practical recommendation scenarios, especially those where the reward function is difficult to define or highly domain-dependent. The method infers an adaptive reward function from a small amount of user behavior data and then searches for the policy that generates the recommendations best suited to the user's interests. The experimental results show that the method is competitive in performance, and the authors expect it to accelerate the practical application of reinforcement learning in complex environments.
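To make the roles of λ (GAE) and ϵ (PPO) more concrete, the following is a minimal sketch of the standard textbook forms of Generalized Advantage Estimation and the PPO clipped surrogate objective, together with one common GAIL-style way of turning a discriminator output into a reward. It is not the InvRec implementation: the function names, default values, and the `-log(1 - D)` reward convention are assumptions chosen for illustration, and the paper's own formulation may differ.

```python
import math


def discriminator_reward(d_prob, tiny=1e-8):
    """Hypothetical GAIL-style reward: -log(1 - D(s, a)).

    d_prob is assumed to be the discriminator's probability that a
    (state, action) pair comes from real user behavior. The reward grows
    as the pair looks more like real behavior. This convention is an
    assumption, not necessarily the one used in the paper.
    """
    return -math.log(max(1.0 - d_prob, tiny))


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_T), i.e. one more entry than
             rewards (the last one bootstraps the final step)
    lam:     the GAE parameter lambda; 0 gives one-step TD errors,
             1 gives discounted Monte Carlo returns minus the baseline
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted TD errors backwards through the trajectory.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective (to be maximized).

    eps is the clipping parameter epsilon: the probability ratio between
    the new and old policy is clipped to [1 - eps, 1 + eps].
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Toy trajectory: discriminator outputs stand in for learned rewards.
    rewards = [discriminator_reward(p) for p in (0.6, 0.8, 0.5)]
    values = [0.0, 0.0, 0.0, 0.0]  # placeholder value estimates
    adv = gae_advantages(rewards, values)
    print(ppo_clip_objective([-0.9, -1.1, -1.0], [-1.0, -1.0, -1.0], adv))
```

In this sketch a smaller λ makes the advantage estimate rely more on the value function (lower variance, more bias), while a smaller ϵ restricts how far a single update can move the policy away from the old one. This is the trade-off that a parameter study over λ and ϵ would explore.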