SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation

Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tomás Lozano-Pérez, Leslie Pack Kaelbling
2024-10-31
Abstract: Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid mistakenly colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. It provides a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, and pose-estimation) to obtain high-accuracy results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand.
Robotics
### What problem does this paper attempt to solve?

This paper addresses 3D scene completion for robot manipulation in open-world, complex real-world environments. It proposes a system named **SceneComplete**, which reconstructs a complete, segmented 3D scene model from a single RGB-D image. The key challenges are:

1. **Complex real-world environments**: Robots must manipulate objects carefully in everyday cluttered environments (such as homes and hospitals), which are full of occlusions and unknown objects.
2. **Limited input information**: Input is typically available from only one viewpoint (for example, a single RGB-D image), which makes reconstruction inherently under-determined.
3. **No class assumptions**: The system makes no assumptions about object class, arrangement, or camera viewpoint, and must handle objects of any class.

To address these challenges, SceneComplete composes multiple large pretrained vision models:

- **Vision-language model (VLM)**: recognizes the objects in the scene and generates a short description of each.
- **Text-based image segmentation model**: locates each described object in the image.
- **2D image inpainting model**: predicts the appearance of occluded parts.
- **Image-to-3D model**: generates a complete mesh for each object.
- **Pose estimation module**: helps place the predicted meshes into the final scene.

Working together, these modules let SceneComplete produce high-quality, fully completed, and accurately segmented object meshes from a single RGB-D image, which supports downstream manipulation tasks such as stable grasping and collision avoidance.

### Main contributions of the paper

1. **Complete scene reconstruction from a single real-world RGB-D input, achieved for the first time**: The system reconstructs cluttered and occluded scenes without making any assumptions about object classes.
2. **High-accuracy object reconstruction**: It generates high-quality 3D models even for partially occluded objects.
3. **Support for complex robot manipulation**: The generated 3D models can be used to generate robust grasp proposals, including for multi-fingered dexterous hands.

### Experimental verification

To verify the effectiveness of SceneComplete, the authors ran three kinds of experiments:

- **Quantitative evaluation**: reconstruction accuracy is measured on the GraspNet-1B dataset.
- **Qualitative evaluation**: reconstruction is evaluated on real objects in the laboratory.
- **Task-driven evaluation**: the reconstructed object models are tested for whether they are accurate enough to support grasping with a multi-fingered dexterous hand.

The results show that, in complex scenes, SceneComplete performs significantly better than a baseline that completes shapes using only their convex hulls, especially in reducing collisions during grasping.
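
The way the five pretrained modules listed under the problem statement chain together can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Stages` container, the `complete_scene` function, and every stage signature are hypothetical names chosen for illustration. The inputs passed to the pose-estimation stage (mesh, depth image, mask) are likewise an assumption. The toy stand-ins at the bottom only return strings so that the data flow is visible without loading any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stages:
    """Hypothetical stage interfaces; each would wrap one pretrained model."""
    describe: Callable[[Any], List[str]]      # VLM: RGB image -> object descriptions
    segment: Callable[[Any, str], Any]        # text-based segmentation: (RGB, text) -> mask
    inpaint: Callable[[Any, Any], Any]        # 2D inpainting: (RGB, mask) -> completed object image
    to_mesh: Callable[[Any], Any]             # image-to-3D: object image -> mesh
    register: Callable[[Any, Any, Any], Any]  # pose estimation (assumed inputs): (mesh, depth, mask)


def complete_scene(rgb: Any, depth: Any, stages: Stages) -> List[Any]:
    """Return one posed mesh per object that the VLM describes in the RGB image."""
    posed_meshes = []
    for text in stages.describe(rgb):
        mask = stages.segment(rgb, text)          # locate the object in the image
        object_image = stages.inpaint(rgb, mask)  # fill in occluded appearance
        mesh = stages.to_mesh(object_image)       # lift the object to a full 3D mesh
        posed_meshes.append(stages.register(mesh, depth, mask))  # place it in the scene
    return posed_meshes


# Toy stand-ins so the sketch runs end to end without any model.
toy = Stages(
    describe=lambda rgb: ["red mug", "cereal box"],
    segment=lambda rgb, text: f"mask({text})",
    inpaint=lambda rgb, mask: f"inpainted({mask})",
    to_mesh=lambda image: f"mesh({image})",
    register=lambda mesh, depth, mask: f"posed({mesh})",
)
scene = complete_scene("rgb", "depth", toy)
print(scene)
```

Passing the stages in as callables keeps the sketch independent of any particular model, which mirrors the summary's point that the system composes general-purpose pretrained modules rather than training a single scene-completion network.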