A General Offline Reinforcement Learning Framework for Interactive Recommendation

Teng Xiao, Donglin Wang
2023-10-01
Abstract: This paper studies the problem of learning interactive recommender systems from logged feedback without any exploration in online environments. We address the problem by proposing a general offline reinforcement learning framework for recommendation, which enables maximizing cumulative user rewards without online exploration. Specifically, we first introduce a probabilistic generative model for interactive recommendation, and then propose an effective inference algorithm for discrete and stochastic policy learning based on logged feedback. In order to perform offline learning more effectively, we propose five approaches to minimize the distribution mismatch between the logging policy and recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints and reward extrapolation. We conduct extensive experiments on two public real-world datasets, demonstrating that the proposed methods can achieve superior performance over existing supervised learning and reinforcement learning methods for recommendation.
Machine Learning, Information Retrieval
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper studies how to learn an interactive recommender system from logged feedback without online exploration. Specifically, the authors propose a general offline reinforcement learning framework for recommendation that maximizes cumulative user rewards without online exploration. The paper addresses the following challenges:

1. **Discrete Stochastic Policy**: At test time, the recommender system must generate a list of multiple items from a discrete policy, rather than recommend a single item. Training a deterministic policy therefore violates a basic principle of machine learning: test and training conditions must match.
2. **Extrapolation Error**: Extrapolation error arises in off-policy value learning from the mismatch between the dataset and the true state-action visitation of the current policy. The problem is especially severe for recommendation tasks, which involve a large number of discrete actions.
3. **Unknown Logged Policy**: User feedback usually comes from a mixture of previous policies, which are unknown. Methods that depend on an estimate of the logging policy are limited by how accurately that policy can be estimated.

To address these challenges, the authors propose a general offline reinforcement learning framework with five approaches (support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation) that reduce the distribution mismatch between the logging policy and the recommendation policy. Extensive experiments on two public real-world datasets show that the proposed methods outperform existing supervised learning and reinforcement learning methods for recommendation.
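
To make the first two challenges more concrete, the following is a minimal sketch, not the paper's algorithm. It shows one simple way a discrete stochastic policy can produce a list of items while being restricted to actions that appear in the logged data, which is the general idea behind a support constraint. The function names `masked_softmax` and `sample_slate`, the toy logits, and the support mask are all assumptions made for illustration.

```python
import numpy as np


def masked_softmax(logits, support_mask):
    """Softmax over items that assigns zero probability outside the logged support.

    Hypothetical helper: `support_mask[i]` is True if item i was observed
    in the logged feedback for the current state.
    """
    if not support_mask.any():
        raise ValueError("at least one item must be inside the support")
    masked = np.where(support_mask, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    weights = np.exp(masked)        # exp(-inf) evaluates to 0.0
    return weights / weights.sum()


def sample_slate(probs, k, rng):
    """Sample up to k distinct items from the stochastic policy."""
    n_supported = int((probs > 0).sum())
    k = min(k, n_supported)  # cannot draw more distinct items than are supported
    return rng.choice(len(probs), size=k, replace=False, p=probs)


rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0, 3.0, -1.0])               # toy item scores
support_mask = np.array([True, True, False, False, True])   # items seen in the logs

probs = masked_softmax(logits, support_mask)
slate = sample_slate(probs, k=2, rng=rng)
print("policy:", np.round(probs, 3))
print("recommended slate:", slate)
```

Item 3 has the highest score in this toy example but is never recommended, because it lies outside the logged support and its value estimate would therefore rely on extrapolation. Sampling the list at training and test time from the same stochastic policy also keeps the two conditions matched.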