Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
2024-07-30
Abstract: We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
Robotics, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use vision-language models (VLMs) to perform language-conditioned 3D object rearrangement without collecting a training dataset of example arrangements. Specifically, after receiving a natural-language instruction, the robot should be able to imagine new scene configurations and evaluate them, selecting the goal state that best satisfies the user's instruction. This involves two key questions: how the robot can imagine new scene configurations, and how the imagined configurations can be evaluated against the language command. The proposed method, Dream2Real, addresses both by constructing a 3D representation of the scene, generating multiple candidate arrangements virtually, rendering an image of each, and using a VLM to evaluate the renders, thereby achieving zero-shot 3D object rearrangement.
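The imagine-then-evaluate loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names Pose, sample_poses, select_goal_pose, toy_render, and toy_score are all hypothetical, the renderer is a placeholder that simply passes the pose through, and the scoring function is a toy stand-in for a VLM's image-text score.

```python
import math
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Pose:
    """Hypothetical planar pose of the object being moved (tabletop case)."""
    x: float
    y: float
    yaw: float


def sample_poses(n: int,
                 bounds: Tuple[Tuple[float, float], Tuple[float, float]],
                 seed: int = 0) -> List[Pose]:
    """Uniformly sample n candidate poses inside a rectangular workspace."""
    rng = random.Random(seed)
    (x_min, x_max), (y_min, y_max) = bounds
    return [
        Pose(rng.uniform(x_min, x_max),
             rng.uniform(y_min, y_max),
             rng.uniform(-math.pi, math.pi))
        for _ in range(n)
    ]


def select_goal_pose(candidates: Sequence[Pose],
                     render: Callable[[Pose], Any],
                     score: Callable[[Any, str], float],
                     instruction: str) -> Tuple[Pose, float]:
    """Render each imagined arrangement and keep the highest-scoring pose."""
    if not candidates:
        raise ValueError("at least one candidate pose is required")
    best_pose, best_score = None, -math.inf
    for pose in candidates:
        image = render(pose)            # imagine: render the virtual scene
        s = score(image, instruction)   # evaluate: score render vs. instruction
        if s > best_score:
            best_pose, best_score = pose, s
    return best_pose, best_score


def toy_render(pose: Pose) -> Pose:
    """Placeholder renderer (assumption): returns the pose instead of an image."""
    return pose


def toy_score(image: Pose, instruction: str) -> float:
    """Toy stand-in for a VLM score (assumption): prefers the workspace centre."""
    return -math.hypot(image.x - 0.5, image.y - 0.5)


candidates = sample_poses(50, ((0.0, 1.0), (0.0, 1.0)))
goal, goal_score = select_goal_pose(
    candidates, toy_render, toy_score, "put the object in the centre")
```

In a real system, render would produce an image from the robot's 3D scene representation and score would query a VLM with that image and the user instruction; the selected pose would then be passed to a pick-and-place routine.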