Abstract: Careful robot manipulation in every-day cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid mistakenly colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. It provides a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, and pose-estimation) to obtain high-accuracy results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses 3D scene completion for robot manipulation in complex, open-world, real-world environments. Specifically, it proposes a system named **SceneComplete**, which reconstructs a complete and well-segmented 3D scene model from a single RGB-D image.

The key challenges of this problem are:

1. **Complex real-world environments**: Robots need to manipulate objects carefully in everyday cluttered environments (such as homes and hospitals), which are full of occlusions and unknown objects.
2. **Limited input information**: Input data is usually available from only a single viewpoint (for example, a single RGB-D image), which makes the reconstruction task inherently under-determined.
3. **No class assumptions**: The system makes no assumptions about the class or arrangement of objects or about the camera viewpoint, so it must handle objects of any class.

To address these challenges, SceneComplete combines multiple large-scale pretrained vision models:

- **Vision-language model (VLM)**: Recognizes the objects in the scene and generates a short description of each.
- **Text-based image segmentation model**: Locates the objects in the image.
- **2D image inpainting model**: Predicts the appearance of occluded parts.
- **Image-to-3D model**: Generates complete object meshes.
- **Pose estimation module**: Helps combine the predicted meshes into the final scene.

Working together, these modules let SceneComplete generate high-quality, fully completed, and accurately segmented object meshes from a single RGB-D image, which supports downstream manipulation tasks such as stable grasping and collision avoidance.

### Main contributions of the paper

1. **First complete scene reconstruction from a single real-world RGB-D input**: The system reconstructs cluttered and occluded scenes without making any assumptions about object classes.
2. **High-accuracy object reconstruction**: It generates high-quality 3D models even for partially occluded objects.
3. **Support for complex robot manipulation**: The generated 3D models can be used to produce robust grasp proposals, including for multi-fingered dexterous hands.

### Experimental verification

To verify the effectiveness of SceneComplete, the authors ran three kinds of evaluation:

- **Quantitative evaluation**: Reconstruction accuracy is measured on the GraspNet-1B dataset.
- **Qualitative evaluation**: Real objects in the laboratory are reconstructed and inspected.
- **Task-driven evaluation**: The reconstructed object models are tested for whether they are accurate enough to support grasping with a multi-fingered dexterous hand.

The experimental results show that, in complex scenes, SceneComplete performs significantly better than a baseline that completes shapes using only their convex hulls, especially in reducing collisions during grasping.
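
The module composition described in the summary above can be sketched as a simple per-object loop. This is only an illustrative sketch, not the authors' implementation: the names `Modules`, `ObjectReconstruction`, and `complete_scene` are hypothetical, each stage is a caller-supplied function standing in for a pretrained model, and the exact data passed between stages (for example, that pose estimation uses the depth image and the mask) is an assumption, since the summary only lists the modules and their roles.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ObjectReconstruction:
    """One completed object (hypothetical container, not from the paper)."""
    description: str  # short text produced by the vision-language model
    mesh: Any         # completed object mesh from the image-to-3D model
    pose: Any         # pose that places the mesh in the scene


@dataclass
class Modules:
    """Caller-supplied stand-ins for the five pretrained modules."""
    describe: Callable[[Any], List[str]]           # RGB -> object descriptions
    segment: Callable[[Any, str], Any]             # RGB, description -> mask
    inpaint: Callable[[Any, Any], Any]             # RGB, mask -> de-occluded object image
    image_to_3d: Callable[[Any], Any]              # object image -> mesh
    estimate_pose: Callable[[Any, Any, Any], Any]  # mesh, depth, mask -> pose (assumed inputs)


def complete_scene(rgb: Any, depth: Any, m: Modules) -> List[ObjectReconstruction]:
    """Run the stages in the order the summary lists them, once per object."""
    results = []
    for description in m.describe(rgb):
        mask = m.segment(rgb, description)
        object_image = m.inpaint(rgb, mask)
        mesh = m.image_to_3d(object_image)
        pose = m.estimate_pose(mesh, depth, mask)
        results.append(ObjectReconstruction(description, mesh, pose))
    return results


# Usage with string-valued stubs, just to show the data flow.
stub = Modules(
    describe=lambda rgb: ["red mug", "cereal box"],
    segment=lambda rgb, text: f"mask({text})",
    inpaint=lambda rgb, mask: f"inpainted({mask})",
    image_to_3d=lambda image: f"mesh({image})",
    estimate_pose=lambda mesh, depth, mask: "identity",
)
scene = complete_scene("rgb", "depth", stub)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model or library; real modules would replace the stubs without changing the loop.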

SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation
