An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

Shuang Feng, Grace Feng

2024-08-28

Abstract: Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning (RL) has gained popularity; a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO), as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates RL agents trained on generated trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments demonstrate that agents trained on generated trajectories exhibit task performance comparable to those trained on human trajectories. This demonstrates an extremely low-cost, data-efficient way of training reinforcement learning agents. Also, with limited training time (<2 hours) and without utilizing any images, a DPO agent achieved a 19% success rate after approximately 3000 steps, or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate.
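For readers unfamiliar with the preference-based objective mentioned in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper's code: the function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; the abstract does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are total log-probabilities of each response (or trajectory)
    under the trained policy and under a frozen reference policy.
    """
    # Implicit reward of each response: how far the policy has moved
    # away from the reference, in log-probability.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy equals the reference, the loss is log 2; it decreases as the policy raises the probability of the chosen response relative to the rejected one. No separate reward model appears anywhere in the computation, which is the property the abstract refers to.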

Machine Learning, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper addresses how to train reinforcement learning (RL) agents for e-commerce recommender systems, and in particular how to use large language models (LLMs) to optimize long-term customer satisfaction rather than short-term, myopic goals such as immediate clicks or purchases. It approaches the problem from three angles:

1. **Data Efficiency**: Investigate how to train effective RL agents with limited data and computational resources. The paper compares two training methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), and finds that DPO outperforms PPO in both data efficiency and task performance.
2. **Utilization of Generated Trajectories**: Explore training on automatically generated trajectories to reduce reliance on expensive human data. The study shows that DPO agents trained on generated trajectories achieve task performance comparable to DPO agents trained on human trajectories.
3. **Algorithm Comparison**: Implement and evaluate both PPO and DPO in the WebShop benchmark environment. The results indicate that DPO achieves a higher success rate given the same training time and computational resources.

Taken together, the results show that even with limited training time (under 2 hours), a DPO-based agent can outperform a PPO-based one: the abstract reports a 19% success rate for DPO after approximately 3000 steps, or about 30 minutes of training on T4 GPUs, against 15% for PPO. The paper also proposes training on generated data, which helps alleviate data scarcity in practical applications.
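To make the contrast between the two algorithms concrete, here is a minimal sketch of PPO's clipped surrogate objective for a single action. Unlike a preference-based loss, it requires an advantage estimate, which is typically derived from a reward signal and a learned value function. The function name, the scalar inputs, and `clip_eps=0.2` are assumptions for illustration, not the paper's implementation.

```python
import math


def ppo_clip_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed value; not stated in the source
) -> float:
    """Clipped surrogate objective (to be maximized) for one action.

    logp_new / logp_old are the action's log-probabilities under the
    current policy and the policy that collected the data.
    """
    # Probability ratio between the current and the data-collecting policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio within [1 - eps, 1 + eps] to limit the update size.
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`, so the policy gains nothing from moving further in a single update. With a negative advantage, the unclipped term is kept when it is worse, so large harmful moves are still penalized.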

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement

Deep Reinforcement Learning for Sequential Targeting

RLRF4Rec: Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking

RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

Rethinking Reinforcement Learning for Recommendation: A Prompt Perspective

Intrinsically Motivated Reinforcement Learning Based Recommendation with Counterfactual Data Augmentation

Fine-Grained Session Recommendations in E-commerce using Deep Reinforcement Learning

Toward Simulating Environments in Reinforcement Learning Based Recommendations

Learning to Generate Better Than Your LLM

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Deep Reinforcement Learning for List-wise Recommendations

Optimized Recommender Systems with Deep Reinforcement Learning

Generative Inverse Deep Reinforcement Learning for Online Recommendation

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

On Generative Agents in Recommendation

Deep Reinforcement Learning-Based Product Recommender for Online Advertising

Robust Reinforcement Learning Objectives for Sequential Recommender Systems

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System

Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs