Abstract:Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates fastly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency, we repurpose Principal Component Analysis (PCA) to compress features from 2D models. Additionally, we introduce a novel render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experimental results show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability, providing a robust solution for multi-turn grasping in changeable environment.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve? This paper aims to solve the problems of scene reconstruction and update efficiency in language - guided robotic grasping under sparse viewpoints. Specifically, existing methods usually rely on images from dense viewpoints to reconstruct scenes, and it is difficult to quickly update the scene when the environment changes, which limits their effectiveness and adaptability in dynamic environments. #### Main problems include: 1. **Dependence on Dense Viewpoints**: Existing methods such as F3RM and LERF - TOGO require a large number of multi - viewpoint images for scene reconstruction, which not only increases the time cost of data collection but also makes it difficult for robots to adapt to environmental changes in real - time in practical applications. 2. **Difficulty in Quickly Updating the Scene**: When the environment changes (for example, when an object is moved), these methods need to recapture a large number of images and perform a complete scene reconstruction, resulting in slow processing speed and being unable to meet the requirements of multi - round grasping tasks. 3. **Insufficient Semantic Information Extraction**: Existing methods face challenges in extracting semantic information from sparse - viewpoint images, especially performing poorly in terms of geometric accuracy and semantic alignment. 4. **Low Grasping Precision**: Due to the above problems, the success rate of existing methods in grasping specific objects under sparse viewpoints is relatively low, especially in complex or dynamic environments. ### Solutions To solve these problems, the authors propose the SparseGrasp system, and its main contributions are as follows: 1. **Fast Scene Reconstruction and Update**: SparseGrasp can achieve fast scene reconstruction (about 240 seconds) and update (about 200 milliseconds) using only sparse - viewpoint RGB images, thus overcoming the dependence of existing methods on dense viewpoints. 2. **3D Semantic Gaussian Point Painting**: By combining DUSt3R to generate a dense point cloud initialization, and using MaskCLIP and SAM to efficiently extract dense semantic features, and then compressing the feature dimension through PCA, SparseGrasp improves the extraction and fusion efficiency of semantic information. 3. **Improved Grasp Generation**: SparseGrasp generates grasping postures directly from 3D Gaussian point painting (3DGS), avoiding the voxelization and depth back - projection steps required in traditional methods, thereby improving the speed and accuracy of grasping generation. 4. **Rendering and Comparison Strategy**: For the case of object position changes in the scene, SparseGrasp introduces a "rendering and comparison" strategy, which can quickly update the scene representation without performing a complete scene reconstruction. Through these innovations, SparseGrasp significantly improves the grasping performance of robots in static and dynamic environments, especially achieving a higher grasping success rate and faster scene update speed under sparse - viewpoint conditions.

SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

A Cascaded Deep Learning Framework for Real-time and Robust Grasp Planning

GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping

GraspSplats: Efficient Manipulation with 3D Feature Splatting

RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields

A Robotic Semantic Grasping Method for Pick-and-place Tasks

MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments

MonoGraspNet: 6-DoF Grasping with a Single RGB Image

ASGrasp: Generalizable Transparent Object Reconstruction and 6-Dof Grasp Detection from RGB-D Active Stereo Camera

S4G: Amodal Single-view Single-Shot SE(3) Grasp Detection in Cluttered Scenes

A System of Robotic Grasping with Experience Acquisition.

6D Pose Estimation with Combined Deep Learning and 3D Vision Techniques for a Fast and Accurate Object Grasping

Unseen Object Few-Shot Semantic Segmentation for Robotic Grasping

ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

SparseLGS: Sparse View Language Embedded Gaussian Splatting

SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation

ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera

Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy

Vision-Based Intelligent Robot Grasping Using Sparse Neural Network

A Vision-based Robot Grasping System