SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Junqiu Yu, Xinlin Ren, Yongchong Gu, Haitao Lin, Tianyu Wang, Yi Zhu, Hang Xu, Yu-Gang Jiang, Xiangyang Xue, Yanwei Fu
2024-12-03
Abstract: Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates quickly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency, we repurpose Principal Component Analysis (PCA) to compress features from 2D models. Additionally, we introduce a novel render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experimental results show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability, providing a robust solution for multi-turn grasping in changeable environments.
Robotics, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses slow scene reconstruction and slow scene updates in language-guided robotic grasping when only sparse camera views are available. Existing methods usually rely on images from dense viewpoints to reconstruct a scene and cannot update it quickly when the environment changes, which limits their effectiveness and adaptability in dynamic environments.

#### Main problems include:

1. **Dependence on Dense Viewpoints**: Existing methods such as F3RM and LERF-TOGO require a large number of multi-view images for scene reconstruction. This increases the time cost of data collection and makes it difficult for robots to adapt to environmental changes in real time.
2. **Difficulty in Quickly Updating the Scene**: When the environment changes (for example, when an object is moved), these methods must recapture a large number of images and rebuild the scene from scratch. The resulting processing is too slow for multi-turn grasping tasks.
3. **Insufficient Semantic Information Extraction**: Existing methods struggle to extract semantic information from sparse-view images and perform poorly in geometric accuracy and semantic alignment.
4. **Low Grasping Precision**: Because of the problems above, existing methods have a relatively low success rate when grasping specific objects from sparse views, especially in complex or dynamic environments.

### Solutions

To solve these problems, the authors propose the SparseGrasp system. Its main contributions are:

1. **Fast Scene Reconstruction and Update**: SparseGrasp achieves fast scene reconstruction (about 240 seconds) and scene update (about 200 milliseconds) using only sparse-view RGB images, removing the dependence on dense viewpoints.
2. **3D Semantic Gaussian Splatting**: SparseGrasp uses DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), extracts dense semantic features with MaskCLIP and SAM, and compresses the feature dimension with PCA. Together these steps make semantic feature extraction and fusion more efficient.
3. **Improved Grasp Generation**: SparseGrasp generates grasp poses directly from the 3DGS representation, avoiding the voxelization and depth back-projection steps required by earlier methods and thereby improving the speed and accuracy of grasp generation.
4. **Render-and-Compare Strategy**: When object positions in the scene change, SparseGrasp applies a render-and-compare strategy that updates the scene representation quickly without a complete scene reconstruction.

Through these innovations, SparseGrasp improves robotic grasping performance in both static and dynamic environments, achieving a higher grasping success rate and faster scene updates under sparse-view conditions.
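
The PCA compression of semantic features mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not state the feature dimensions, so the sizes below (64 dimensions reduced to 8) are assumptions, the helper names `fit_pca`, `compress`, and `decompress` are hypothetical, and synthetic low-rank data stands in for real MaskCLIP features.

```python
import numpy as np


def fit_pca(features: np.ndarray, n_components: int):
    """Fit PCA on an (N, D) feature matrix.

    Returns the per-dimension mean and a (D, n_components) orthonormal basis.
    (Hypothetical helper; not taken from the SparseGrasp code.)
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T


def compress(features: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project (N, D) features onto the basis, giving (N, n_components) codes."""
    return (features - mean) @ basis


def decompress(codes: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Map compressed codes back to the original D-dimensional space."""
    return codes @ basis.T + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for dense per-pixel features: 200 samples, 64 dims,
    # generated from 8 latent factors (all sizes are assumptions).
    latent = rng.normal(size=(200, 8))
    mixing = rng.normal(size=(8, 64))
    features = latent @ mixing

    mean, basis = fit_pca(features, n_components=8)
    codes = compress(features, mean, basis)
    error = np.abs(decompress(codes, mean, basis) - features).max()
    print("compressed shape:", codes.shape)
    print("max reconstruction error:", error)
```

Storing the short codes instead of the full feature vectors reduces the memory and compute attached to each stored feature. Real image features are not exactly low-rank, so some reconstruction error is expected in practice, and the number of components trades accuracy against speed.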