Abstract: Accurate grasping is the key to several robotic tasks including assembly and household robotics. Executing a successful grasp in a cluttered environment requires multiple levels of scene understanding: First, the robot needs to analyze the geometric properties of individual objects to find feasible grasps. These grasps need to be compliant with the local object geometry. Second, for each proposed grasp, the robot needs to reason about the interactions with other objects in the scene. Finally, the robot must compute a collision-free grasp trajectory while taking into account the geometry of the target object. Most grasp detection algorithms directly predict grasp poses in a monolithic fashion, which does not capture the composability of the environment. In this paper, we introduce an end-to-end architecture for object-centric grasping. The method uses pointcloud data from a single arbitrary viewing direction as an input and generates an instance-centric representation for each partially observed object in the scene. This representation is further used for object reconstruction and grasp detection in cluttered table-top scenes. We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets, indicating superior performance for grasping and reconstruction. Additionally, we demonstrate real-world applicability by decluttering scenes with varying numbers of objects.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses accurate, collision-free robotic grasping of objects in cluttered environments. Specifically, the authors propose a new method, ICGNet (Instance-Centric Grasping Network), for object-level grasping and shape prediction. The method primarily targets the following issues:

1. **Multi-level Scene Understanding**:
   - **Geometric Attribute Analysis**: The robot needs to analyze the geometric attributes of individual objects to find feasible grasps. These grasps need to match the local geometric structure of the objects.
   - **Interaction Reasoning**: For each proposed grasp, the robot needs to consider interactions with other objects in the scene.
   - **Collision-free Trajectory Planning**: The robot must compute a collision-free grasping path while considering the geometric structure of the target object.
2. **Limitations of Existing Methods**:
   - **Whole-Scene Processing**: Most existing grasp detection algorithms predict grasp poses directly from the whole scene in a monolithic fashion, which fails to capture the composability of the environment.
   - **Complexity and Suboptimal Predictions**: Many methods rely on given segmentation masks, object templates, etc., which add complexity and may lead to suboptimal predictions under severe occlusion.
3. **Goal-driven Grasping**:
   - **Instance-level Grasping**: Existing methods usually do not distinguish between individual object instances, which makes goal-driven grasping tasks (such as "grasp the first object" or "grasp the bottle") difficult.
   - **Collision Detection**: Existing methods may not ensure collision-free conditions between objects after grasping.

### Solution

The authors propose an end-to-end architecture that generates instance-level representations of observed objects from single-viewpoint point cloud data. This representation is then used for object reconstruction and grasp detection. The main contributions of ICGNet include:

1. **Instance-level Feature Extraction**:
   - Using sparse feature volumes that combine voxel and surface features to extract instance-level information for each object.
   - Refining instance queries through iterative masked cross-attention and self-attention, so that each query focuses on a specific instance.
2. **Contact Point Grasp Representation**:
   - Proposing a contact point-based grasp representation that predicts the success probability of different grasp directions, thereby generating more diverse grasp proposals.
3. **Multi-task Joint Prediction**:
   - Simultaneously predicting the semantic category, instance labels, occupancy values, and grasp success probabilities of objects, providing a unified approach to scene understanding, reconstruction, and grasp detection.

### Experimental Results

The authors validate ICGNet through extensive experiments on synthetic datasets. The results show that ICGNet outperforms existing methods in terms of Grasp Success Rate (GSR), Declutter Rate (DR), and reconstruction performance (Chamfer L1 distance and IoU). The authors also conduct real-world experiments, decluttering scenes with varying numbers of objects, to demonstrate practical applicability.

### Conclusion

ICGNet improves robotic grasping performance in cluttered environments through instance-level grasping and shape prediction, and it is particularly suited to goal-driven grasping tasks. The method outperforms the compared baselines on synthetic datasets and is also shown to work in real-world decluttering experiments.
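For readers unfamiliar with the two reconstruction metrics named under Experimental Results, the following is a minimal sketch of how Chamfer L1 distance and IoU are commonly computed. It assumes one widespread convention (Chamfer L1 as the average of the two directed mean nearest-neighbour Euclidean distances, and IoU over boolean occupancy samples); the paper's exact definitions, sampling scheme, and scaling may differ. The function names `chamfer_l1` and `occupancy_iou` are hypothetical and are not taken from the ICGNet code.

```python
import numpy as np


def chamfer_l1(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance with unsquared Euclidean distances.

    Assumed convention: mean of "accuracy" (pred -> gt) and
    "completeness" (gt -> pred). Inputs are (N, 3) and (M, 3) arrays.
    """
    # Pairwise distance matrix of shape (N, M); fine for small point sets.
    diff = pred_pts[:, None, :] - gt_pts[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    accuracy = dists.min(axis=1).mean()      # each predicted point to nearest GT point
    completeness = dists.min(axis=0).mean()  # each GT point to nearest predicted point
    return float(0.5 * (accuracy + completeness))


def occupancy_iou(pred_occ, gt_occ) -> float:
    """Intersection over union of two boolean occupancy arrays of equal shape."""
    pred = np.asarray(pred_occ, dtype=bool)
    gt = np.asarray(gt_occ, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both volumes are empty; treated here as a perfect match (an assumption).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


if __name__ == "__main__":
    pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    gt = np.array([[0.0, 0.0, 0.0]])
    print("Chamfer L1:", chamfer_l1(pred, gt))
    print("IoU:", occupancy_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

The dense pairwise matrix keeps the sketch short but grows as N × M; for large point sets a spatial index such as a k-d tree would be the usual substitute. Lower Chamfer L1 and higher IoU both indicate a better reconstruction.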

ICGNet: A Unified Approach for Instance-Centric Grasping

Efficient and Robust Robotic Grasping in Cluttered Scenes: A Point Cloud-Based Approach with Heuristic Evaluation.

MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments

GPR: Grasp Pose Refinement Network for Cluttered Scenes

CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-DoF Grasp Estimation

Using Geometry to Detect Grasps in 3D Point Clouds

Graspness Discovery in Clutters for Fast and Accurate Grasp Detection

A Real-Time Grasping Detection Network Architecture for Various Grasping Scenarios

GoalGrasp: Grasping Goals in Partially Occluded Scenarios without Grasp Training

S4G: Amodal Single-view Single-Shot SE(3) Grasp Detection in Cluttered Scenes

A Novel Geometry-based Algorithm for Robust Grasping in Extreme Clutter Environment

Robotic Continuous Grasping System by Shape Transformer-Guided Multi-Object Category-Level 6D Pose Estimation

You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects

A Vision-based Robot Grasping System

Two-stage Grasp Detection Method for Robotics Using Point Clouds and Deep Hierarchical Feature Learning Network

GR-ConvNet v2: A Real-Time Multi-Grasp Detection Network for Robotic Grasping

Generalized Grasping for Mechanical Grippers for Unknown Objects with Partial Point Cloud Representations

GraNet: A Multi-Level Graph Network for 6-DoF Grasp Pose Generation in Cluttered Scenes

Triplane Grasping: Efficient 6-DoF Grasping with Single RGB Images

CMG-Net: An End-to-End Contact-Based Multi-Finger Dexterous Grasping Network

Edge Grasp Network: A Graph-Based SE(3)-invariant Approach to Grasp Detection