Abstract: This paper studies the problem of learning interactive recommender systems from logged feedback without any exploration in online environments. We address the problem by proposing a general offline reinforcement learning framework for recommendation, which enables maximizing cumulative user rewards without online exploration. Specifically, we first introduce a probabilistic generative model for interactive recommendation, and then propose an effective inference algorithm for discrete and stochastic policy learning based on logged feedback. In order to perform offline learning more effectively, we propose five approaches to minimize the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints and reward extrapolation. We conduct extensive experiments on two public real-world datasets, demonstrating that the proposed methods can achieve superior performance over existing supervised learning and reinforcement learning methods for recommendation.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper studies how to learn an interactive recommender system from logged feedback without online exploration. Specifically, the authors propose a general offline reinforcement learning framework for recommendation that maximizes cumulative user rewards without online exploration. The paper mainly addresses the following challenges:

1. **Discrete Stochastic Policy**:
   - At test time, the recommender system must generate a list of multiple items according to a discrete policy, rather than recommending a single item. Training a deterministic policy violates a basic principle of machine learning: test and training conditions must match.
2. **Extrapolation Error**:
   - Extrapolation error is a problem in off-policy value learning, caused by the mismatch between the dataset and the true state-action visitation of the current policy. It is especially serious for recommendation tasks, which involve a large number of discrete actions.
3. **Unknown Logging Policy**:
   - User feedback usually comes from a mixture of previous policies, which are unknown. Methods that rely on estimating the logging policy are limited by how accurately that unknown policy can be estimated.

To address these challenges, the authors propose a general offline reinforcement learning framework with five methods for reducing the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation. Extensive experiments on two public real-world datasets indicate that the proposed methods outperform existing supervised learning and reinforcement learning methods in recommendation performance.
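To make the first two challenges concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes a toy setting with five items, hand-picked Q-values, and a boolean mask marking which items appear in the logged data for the current state. The function names `supported_softmax_policy` and `sample_top_k` are hypothetical and exist only for illustration. The sketch shows how a stochastic policy restricted to logged actions can sample a recommendation list while ignoring an action whose high Q-value is purely an extrapolation.

```python
import numpy as np

def supported_softmax_policy(q_values, support_mask, temperature=1.0):
    """Softmax over Q-values, restricted to actions seen in the log.

    Actions outside the support get probability zero, so an
    over-estimated Q-value for an unseen action cannot be exploited.
    Assumes at least one action is supported.
    """
    scaled = np.where(support_mask, q_values / temperature, -np.inf)
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    weights = np.exp(scaled)         # exp(-inf) evaluates to 0.0
    return weights / weights.sum()

def sample_top_k(probs, k, rng):
    """Draw k distinct items without replacement according to probs."""
    return rng.choice(len(probs), size=k, replace=False, p=probs)

# Toy example (all numbers are illustrative assumptions).
# Item 4 has the largest Q-value but never appears in the logged data,
# so its estimate is an extrapolation and is masked out.
q_values = np.array([1.0, 5.0, 0.5, 2.0, 9.0])
support_mask = np.array([True, True, True, True, False])

probs = supported_softmax_policy(q_values, support_mask)
rng = np.random.default_rng(0)
rec_list = sample_top_k(probs, k=3, rng=rng)
```

Here `rec_list` is a list of three distinct items drawn from the stochastic policy, which matches the test-time requirement of producing a list rather than a single deterministic choice. The hard mask is only one simple way to express a support constraint; the summary above does not specify how the paper implements it.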

A General Offline Reinforcement Learning Framework for Interactive Recommendation

Beyond Reward: Offline Preference-guided Policy Optimization

On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems

Generative Inverse Deep Reinforcement Learning for Online Recommendation

Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation

Alleviating Matthew Effect of Offline Reinforcement Learning in Interactive Recommendation

Session-based Interactive Recommendation Via Deep Reinforcement Learning

A Deep Reinforcement Learning Real-Time Recommendation Model Based on Long and Short-Term Preference

ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems

Pseudo Dyna-Q

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System

Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling

M3Rec: A Context-aware Offline Meta-level Model-based Reinforcement Learning Approach for Cold-Start Recommendation

Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems

Offline Adaptive Policy Leaning in Real-World Sequential Recommendation Systems

A Reinforcement Learning Framework for Explainable Recommendation

Interactive Search Based on Deep Reinforcement Learning

A stable deep reinforcement learning framework for recommendation

Rethinking Reinforcement Learning for Recommendation: A Prompt Perspective