Abstract: Recent advances in large language models (LLMs) have made it possible for models to understand webpage context, product details, and human instructions. Using LLMs as the foundational architecture for reward models or policies in reinforcement learning (RL) has gained popularity; a notable example is the success of InstructGPT. In industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases, RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO), as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates RL agents trained on generated trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments show that agents trained on generated trajectories achieved task performance comparable to that of agents trained on human trajectories, demonstrating an extremely low-cost, data-efficient way of training RL agents. In addition, with limited training time (<2 hours) and without using any images, a DPO agent achieved a 19% success rate after approximately 3000 steps, or 30 minutes, of training on T4 GPUs, compared with a PPO agent, which reached a 15% success rate.
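For readers unfamiliar with learning from preferences without a reward model, the following is a minimal sketch of the standard per-pair DPO loss, not the report's own implementation. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration; the inputs are assumed to be summed log-probabilities of a preferred and a rejected trajectory under the trained policy and under a frozen reference policy.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and beta are assumptions).

    Each argument is the summed log-probability of a full trajectory.
    """
    # How much more (in log space) the policy favors each trajectory
    # than the frozen reference policy does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected

    # Scaled preference logit: positive when the policy favors the
    # chosen trajectory more strongly than the reference does.
    z = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the loss equals log 2; it falls below that value as the policy shifts probability toward the preferred trajectory, which is why no separate reward model is needed.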

LADDER: A Human-Level Bidding Agent for Large-Scale Real-Time Online Auctions

Deep Reinforcement Learning for Strategic Bidding in Electricity Markets

Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising

Deep Reinforcement Learning for Sponsored Search Real-time Bidding

Offline Reinforcement Learning for Optimizing Production Bidding Policies

HiBid: A Cross-Channel Constrained Bidding System with Budget Allocation by Hierarchical Offline Deep Reinforcement Learning

Deep Reinforcement Learning for Sequential Combinatorial Auctions

Neural Auction: End-to-End Learning of Auction Mechanisms for E-Commerce Advertising

Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding

Infer Your Enemies and Know Yourself, Learning in Real-Time Bidding with Partially Observable Opponents

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Real-time bidding with multi-agent reinforcement learning in multi-channel display advertising

Autor3: Automated Real-time Ranking with Reinforcement Learning in E-commerce Sponsored Search Advertising

Sustainable Online Reinforcement Learning for Auto-bidding

Learning to Bid Long-Term: Multi-Agent Reinforcement Learning with Long-Term and Sparse Reward in Repeated Auction Games

Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising

Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena

Deep Landscape Forecasting for Real-time Bidding Advertising

A Multi-Agent Reinforcement Learning Method for Impression Allocation in Online Display Advertising

An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders