Real-World Offline Reinforcement Learning from Vision Language Model Feedback

Sreyas Venkataraman, Yufei Wang, Ziyu Wang, Zackory Erickson, David Held
2024-11-08
Abstract: Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.
Robotics, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the problem of **automatically generating reward labels for offline reinforcement learning (offline RL)**, particularly for real-world robot tasks. Specifically, the authors propose a new system, **Offline RL-VLM-F**, which automatically generates reward labels for unlabeled, sub-optimal offline datasets and uses these labels to learn effective control policies.

#### Main challenges

1. **Datasets without reward labels**: Most existing offline RL methods assume that the dataset already has reward labels. In practice, especially for complex or real-world tasks, obtaining these labels often requires extensive manual annotation, which is both time-consuming and labor-intensive.
2. **Complexity of real-world tasks**: In the real world, especially in tasks involving high-dimensional environments (such as deformable object manipulation), low-level state information is difficult to obtain, which makes traditional reward functions designed on low-level state infeasible.

#### Solutions

The authors propose a two-stage system:

1. **Reward labeling stage**: Use a vision-language model (VLM) to automatically generate reward labels from a task description and preference feedback on image pairs. The specific steps are:
   - **Sampling observations**: Randomly sample image pairs from the offline dataset.
   - **Querying the VLM**: Input each image pair and the task description into the VLM to obtain a preference label.
   - **Preference-based reward learning**: Using the stored preference labels, learn a reward function with the Bradley-Terry model.
2. **Policy learning stage**: Use the learned reward function to label the entire offline dataset, then learn a control policy with an offline RL algorithm such as Implicit Q-Learning (IQL).
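
The reward labeling stage above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each observation is already a low-dimensional feature vector (the paper works with images), it uses a linear reward model instead of a learned network, and it replaces the VLM query with a hidden scoring direction `true_w` that plays the role of the preference oracle. The names `sample_pairs`, `fit_bradley_terry`, and `true_w` are hypothetical. Under the Bradley-Terry model, the probability that observation `j` is preferred over `i` is `sigmoid(r(x_j) - r(x_i))`, and the reward is fit by minimising the negative log-likelihood of the stored preference labels.

```python
import numpy as np

def sample_pairs(n_obs, n_pairs, rng):
    """Randomly sample index pairs (i, j) with i != j from the dataset."""
    i = rng.integers(0, n_obs, size=n_pairs)
    j = rng.integers(0, n_obs - 1, size=n_pairs)
    j = np.where(j >= i, j + 1, j)  # shift past i so that j never equals i
    return i, j

def fit_bradley_terry(feats, i, j, prefs, lr=0.5, steps=500):
    """Fit a linear reward r(x) = w @ x by gradient descent on the
    Bradley-Terry negative log-likelihood.

    prefs[k] is 1.0 if observation j[k] is preferred over i[k], else 0.0.
    """
    w = np.zeros(feats.shape[1])
    diff = feats[j] - feats[i]  # r(x_j) - r(x_i) equals diff @ w for a linear reward
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))     # P(j preferred over i)
        grad = diff.T @ (p - prefs) / len(prefs)  # gradient of the mean NLL
        w -= lr * grad
    return w

rng = np.random.default_rng(0)

# ASSUMPTION: 200 observations with 4 features each stand in for image observations.
feats = rng.normal(size=(200, 4))

# ASSUMPTION: a hidden scoring direction stands in for the VLM's judgement
# of which image better achieves the task.
true_w = np.array([1.0, -2.0, 0.5, 0.0])

# Step 1: sample pairs. Step 2: "query" the stand-in oracle for preference labels.
i, j = sample_pairs(len(feats), 1000, rng)
prefs = (feats[j] @ true_w > feats[i] @ true_w).astype(float)

# Step 3: learn the reward from preferences, then label the whole dataset.
w = fit_bradley_terry(feats, i, j, prefs)
rewards = feats @ w
```

The resulting `rewards` array plays the role of the reward labels passed to the policy learning stage. Only the ordering of rewards is identified by preference data, so the learned scale is arbitrary; the sketch therefore compares learned and hidden scores by correlation rather than by value.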
#### Experimental verification

The authors evaluated the method in several simulation environments (Cartpole, Open Drawer, Soccer, Straighten Rope) and on a complex real-world robot-assisted dressing task. The results show that the method performs well on low-quality, unlabeled datasets and outperforms baselines such as behavior cloning and inverse reinforcement learning (inverse RL).

### Summary

The main contribution of this paper is a new framework that automatically generates reward labels for sub-optimal offline datasets without manual labeling and applies them successfully to real-world robot tasks. This opens new possibilities for applying offline reinforcement learning to complex real-world tasks.