Abstract: Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.


### What problem does this paper attempt to solve?

This paper addresses the problem of **automatically generating reward labels for offline reinforcement learning (offline RL)**, particularly for real-world robot tasks. Specifically, the authors propose a new system, **Offline RL-VLM-F**, which automatically generates reward labels for unlabeled, sub-optimal offline datasets and uses these labels to learn effective control policies.

#### Main challenges

1. **Datasets without reward labels**: Most existing offline RL methods assume that the dataset already has reward labels. In practice, especially for complex or real-world tasks, obtaining these labels often requires extensive manual labeling, which is time-consuming and labor-intensive.
2. **Complexity of real-world tasks**: In the real world, especially in tasks involving high-dimensional environments (such as deformable object manipulation), low-level state information is difficult to obtain, which makes traditional reward function design based on low-level states infeasible.

#### Solutions

The authors propose a two-stage system:

1. **Reward labeling stage**: Use a vision-language model (VLM) to automatically generate reward labels from a task description and preference feedback on image pairs. The steps are:
   - **Sampling observations**: Randomly sample image pairs from the offline dataset.
   - **Querying the VLM**: Input the image pairs and the task description into the VLM to obtain preference labels.
   - **Preference-based reward learning**: Using the stored preference labels, learn the reward function with the Bradley-Terry model.
2. **Policy learning stage**: Use the learned reward function to label the entire offline dataset, then learn a control policy with an offline RL algorithm such as Implicit Q-Learning (IQL).
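The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the authors' implementation: observations are stand-in feature vectors rather than images, the reward model is linear rather than a neural network, and the preference labels come from a synthetic rule standing in for the VLM query. All names (`bt_loss_and_grad`, `fit_reward`, `expectile_loss`, `true_w`) and all hyperparameters are hypothetical. The `expectile_loss` function shows the asymmetric squared loss that IQL uses for its value function, with an illustrative `tau`.

```python
import numpy as np


def bt_loss_and_grad(w, feats_a, feats_b, labels):
    """Bradley-Terry negative log-likelihood for a linear reward r(x) = w @ x.

    labels[i] = 1 means observation b was preferred over a, 0 the reverse.
    """
    diff = feats_b - feats_a                 # shape (N, D)
    logits = diff @ w                        # r(b) - r(a)
    p_b = 1.0 / (1.0 + np.exp(-logits))      # P(b preferred over a)
    eps = 1e-12                              # avoids log(0)
    loss = -np.mean(labels * np.log(p_b + eps)
                    + (1.0 - labels) * np.log(1.0 - p_b + eps))
    grad = diff.T @ (p_b - labels) / len(labels)
    return loss, grad


def fit_reward(feats_a, feats_b, labels, lr=0.5, steps=500):
    """Fit the linear reward weights by plain gradient descent."""
    w = np.zeros(feats_a.shape[1])
    for _ in range(steps):
        _, grad = bt_loss_and_grad(w, feats_a, feats_b, labels)
        w -= lr * grad
    return w


def expectile_loss(q_minus_v, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2, averaged over the batch."""
    weight = np.where(q_minus_v < 0, 1.0 - tau, tau)
    return np.mean(weight * q_minus_v ** 2)


# Stage 1a: sample random observation pairs from the (synthetic) offline dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))         # assumption: stand-in for image observations
idx_a = rng.integers(0, 200, size=300)
idx_b = rng.integers(0, 200, size=300)

# Stage 1b: obtain preference labels. Assumption: a hidden linear score
# replaces the VLM, which would compare two images given the task text.
true_w = np.array([1.0, -2.0, 0.5, 0.0])
true_r = features @ true_w
labels = (true_r[idx_b] > true_r[idx_a]).astype(float)

# Stage 1c: learn the reward from the stored preferences.
w = fit_reward(features[idx_a], features[idx_b], labels)

# Stage 2 input: label every observation in the dataset with the learned reward.
reward_labels = features @ w
```

The resulting `reward_labels` array is what an offline RL algorithm would consume in the second stage. Because the Bradley-Terry model depends only on reward differences, the learned reward is determined only up to an additive constant, which is why the sketch checks ranking quality rather than absolute values.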
#### Experimental verification

The authors evaluated the method in multiple simulation environments (Cartpole, Open Drawer, Soccer, Straighten Rope) and on a complex real-world robot-assisted dressing task. The experimental results show that the method performs well on low-quality, unlabeled datasets and outperforms baselines such as behavior cloning and inverse RL.

### Summary

The main contribution of this paper is a new framework that automatically generates reward labels for sub-optimal offline datasets without manual labeling and applies them successfully to real-world robot tasks. This opens new possibilities for applying offline reinforcement learning to complex real-world tasks.

Real-World Offline Reinforcement Learning from Vision Language Model Feedback

Semi-supervised reward learning for offline reinforcement learning

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Reward-free Offline Reinforcement Learning

Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

Self-Supervised Imitation for Offline Reinforcement Learning with Hindsight Relabeling

End-to-End Robotic Reinforcement Learning without Reward Engineering

Deploying Offline Reinforcement Learning with Human Feedback

Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints

Effective Offline Robot Learning with Structured Task Graph

Optimal Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning

Reward-free Offline Reinforcement Learning: Optimizing Behavior Policy Via Action Exploration

Implicit Offline Reinforcement Learning via Supervised Learning

Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems

Efficient Online Reinforcement Learning with Offline Data

Leveraging Optimal Transport for Enhanced Offline Reinforcement Learning in Surgical Robotic Environments

Model-Based Offline Planning

SDV: Simple Double Validation Model-based Offline Reinforcement Learning

Offline Adaptive Policy Leaning in Real-World Sequential Recommendation Systems

Robotic Offline RL from Internet Videos via Value-Function Pre-Training