Abstract: Sequential recommendation, in which user preferences are dynamically inferred from sequential historical behaviors, is a critical task in recommender systems (RSs). To further optimize long-term user engagement, offline reinforcement-learning-based RSs have become a mainstream technique, as they avoid the global exploration that may harm online users' experience. However, previous studies mainly focus on discrete action and policy spaces, which may struggle to handle a rapidly growing number of items efficiently. To mitigate this issue, we design an algorithmic framework applicable to continuous policies. To facilitate control in the low-dimensional but dense user preference space, we propose an Efficient Continuous Control framework (ECoC). Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from the normalized user and item spaces. We then develop the corresponding policy evaluation and policy improvement procedures. In this process, strategic exploration and directional control over unified actions are carefully designed and are crucial to the final recommendation decisions. Moreover, thanks to the unified actions, conservatism regularization for policies and for value functions can be combined and is fully compatible with the continuous framework. The resulting dual regularization ensures the successful offline training of RL-based recommendation policies. Finally, we conduct extensive experiments to validate the effectiveness of our framework. The results show that, compared with discrete baselines, ECoC is trained far more efficiently, and the final policies outperform the baselines both in fitting the offline data and in gaining long-term rewards.

Efficient and Stable Information Directed Exploration for Continuous Reinforcement Learning

Information-Directed Exploration for Deep Reinforcement Learning

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Careful at Estimation and Bold at Exploration

Deep intrinsically motivated exploration in continuous control

Active Exploration Deep Reinforcement Learning for Continuous Action Space with Forward Prediction

Never Give Up: Learning Directed Exploration Strategies

Guided Exploration in Reinforcement Learning via Monte Carlo Critic Optimization

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Self-supervised Sequential Information Bottleneck for Robust Exploration in Deep Reinforcement Learning

The Exploration-Exploitation Dilemma Revisited: An Entropy Perspective

Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation

A Scalable Derivative-free Exploration Approach for Reinforcement Learning

Intrinsically Guided Exploration in Meta Reinforcement Learning

Efficient Exploration in Resource-Restricted Reinforcement Learning

Backtracking Exploration for Reinforcement Learning

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Satisficing Exploration for Deep Reinforcement Learning