Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

Ce Hao, Kelvin Lin, Siyuan Luo, Harold Soh
2024-09-23
Abstract: Diffusion policies have demonstrated robust performance in generative modeling, prompting their application in robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space. These keyframes guide the diffusion process via inpainting. However, naively forcing the diffusion process to adhere to the generated keyframes is problematic: the keyframes from the VLMs may be incorrect and lead to action sequences on which the diffusion model performs poorly. To address these challenges, we develop an inpainting optimization strategy that balances adherence to the keyframes against fidelity to the training data distribution. Experimental evaluations demonstrate that our approach surpasses the performance of traditional fine-tuned language-conditioned methods in both simulated and real-world settings.
Robotics
What problem does this paper attempt to address?
The paper addresses the problem of guiding robotic manipulation with natural language descriptions. Specifically, the authors propose DISCO (Diffusion Inpainting with Semantic Keyframes and Constrained Optimization), a zero-shot, open-vocabulary diffusion policy method. Using Vision-Language Models (VLMs), it transforms language instructions into keyframes in 3D space, which then guide the robot's action generation. The main objectives are:

1. **Task execution guided by language**: Allowing robots to perform specific tasks from natural language descriptions without retraining or fine-tuning for each new task.
2. **Handling open-vocabulary task descriptions**: Understanding and executing unseen language instructions, which is difficult for traditional fine-tuned methods.
3. **Improving trajectory generation**: Proposing an optimization method that compensates for inaccurate keyframes, so that reasonable action sequences are generated even for uncertain or novel task descriptions.

Experimental results show that DISCO performs comparably to existing methods on known tasks and significantly improves success rates on new tasks. The improvement is particularly notable in scenarios that require additional observational information or an understanding of commonsense language instructions. DISCO was also shown to be effective in real robotic grasping tasks.
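
The trade-off between following a keyframe and staying close to what the policy would produce on its own can be illustrated with a toy sketch. This is not the paper's algorithm: the function names, the `max_shift` cap, and the straight-line "model proposal" are all hypothetical, chosen only to contrast naive inpainting (overwrite the waypoint) with a constrained variant (move toward the keyframe, but only by a bounded amount). In an actual diffusion policy, a step like this would presumably be applied inside the denoising loop.

```python
import numpy as np


def hard_inpaint(traj, keyframes):
    """Naive inpainting: overwrite waypoints with the keyframe positions."""
    out = traj.copy()
    for t, target in keyframes.items():
        out[t] = target
    return out


def constrained_inpaint(traj, keyframes, max_shift):
    """Move each keyframed waypoint toward its target by at most max_shift.

    The cap is a hypothetical stand-in for "stay near the training
    distribution"; the paper's actual constraint may differ.
    """
    out = traj.copy()
    for t, target in keyframes.items():
        delta = np.asarray(target, dtype=float) - out[t]
        dist = np.linalg.norm(delta)
        if dist > max_shift:
            delta *= max_shift / dist  # shrink the step onto the allowed radius
        out[t] = out[t] + delta
    return out


# Hypothetical model proposal: 5 waypoints on a straight line in 3D.
traj = np.linspace([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 5)

# One plausible keyframe (close to the proposal) and one far-off keyframe,
# such as an incorrect VLM output might produce.
keyframes = {4: [1.0, 0.0, 0.1], 2: [0.5, 3.0, 0.0]}

hard = hard_inpaint(traj, keyframes)
soft = constrained_inpaint(traj, keyframes, max_shift=0.5)

print(hard[2])  # far-off keyframe is followed exactly
print(soft[2])  # waypoint moves only 0.5 toward the far-off keyframe
print(soft[4])  # nearby keyframe is reached exactly
```

With these assumed values, the nearby keyframe is honored in full while the implausible one only nudges the trajectory, which is the qualitative behavior the summary describes.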