Abstract: Reliable object grasping is one of the fundamental tasks in robotics. However, determining grasping pose based on single-image input has long been a challenge due to limited visual information and the complexity of real-world objects. In this paper, we propose Triplane Grasping, a fast grasping decision-making method that relies solely on a single RGB-only image as input. Triplane Grasping creates a hybrid Triplane-Gaussian 3D representation through a point decoder and a triplane decoder, which produce an efficient and high-quality reconstruction of the object to be grasped to meet real-time grasping requirements. We propose to use an end-to-end network to generate 6-DoF parallel-jaw grasp distributions directly from 3D points in the point cloud as potential grasp contacts and anchor the grasp pose in the observed data. Experiments demonstrate that our method achieves rapid modeling and grasping pose decision-making for daily objects, and exhibits a high grasping success rate in zero-shot scenarios.

What problem does this paper attempt to address?

The problem this paper addresses is how to efficiently generate reliable 6-degree-of-freedom (6-DoF) grasp poses from a single RGB image in robotic manipulation. Specifically, the paper focuses on recovering the 3D structure of an object from a single image and using that 3D information to determine the grasp position and orientation, so that grasping is both fast and accurate. The difficulty stems mainly from the limited visual information in a single image and from the complexity of object shapes and poses in the real world. To address these challenges, the authors propose a method named **Triplane Grasping**, which proceeds in the following steps:

1. **3D Reconstruction**: Use a hybrid Triplane-Gaussian representation to efficiently reconstruct the 3D point cloud of an object from a single RGB image. Image features are extracted with a pre-trained DINOv2 model and combined with camera information; a Transformer-based point cloud decoder generates an initial coarse point cloud, and a triplane decoder then generates a detailed 3D feature representation.
2. **Grasp Pose Generation**: Use the Contact-GraspNet method to generate 6-DoF grasp poses directly from the reconstructed 3D point cloud. The method treats points in the point cloud as potential grasp contact points. By classifying contact points and estimating the grasp rotation, it reduces the high-dimensional grasp learning problem to a lower-dimensional one, which improves learning efficiency and the accuracy of the grasp poses.
3. **Grasp Contact Filtering**: After the 6-DoF grasp poses are generated, filter the grasp contact points so that the remaining poses are associated with the target object. This avoids collisions or undesired interactions with other objects and thus improves the grasp success rate.
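Steps 2 and 3 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the axis convention (x = finger closing direction, z = approach direction), the `base_offset` parameter, and the distance and score thresholds are all assumptions chosen for illustration. The first function shows how a contact point, two predicted directions, and a grasp width can be turned into a full 6-DoF pose. The second shows one plausible form of contact filtering, keeping only contacts that lie close to the target object's points.

```python
import numpy as np


def grasp_pose_from_contact(contact, approach, baseline, width, base_offset):
    """Build a 4x4 gripper pose from one contact point.

    Hypothetical convention (an assumption, not taken from the paper):
    x-axis = baseline (finger closing) direction, z-axis = approach direction.
    """
    contact = np.asarray(contact, dtype=float)
    b = np.asarray(baseline, dtype=float)
    b = b / np.linalg.norm(b)
    # Remove any component of the approach along the baseline, then normalize,
    # so that the two axes are exactly orthogonal.
    a = np.asarray(approach, dtype=float)
    a = a - np.dot(a, b) * b
    a = a / np.linalg.norm(a)

    pose = np.eye(4)
    pose[:3, 0] = b               # x-axis: finger closing direction
    pose[:3, 1] = np.cross(a, b)  # y-axis: completes a right-handed frame
    pose[:3, 2] = a               # z-axis: approach direction
    # Assumed placement: the gripper base sits midway between the fingers,
    # pulled back from the contact by base_offset along the approach axis.
    pose[:3, 3] = contact + 0.5 * width * b - base_offset * a
    return pose


def filter_contacts_by_target(contacts, scores, target_points,
                              max_dist=0.005, min_score=0.5):
    """Return indices of contacts that are near the target object and confident.

    max_dist (metres) and min_score are illustrative thresholds.
    """
    contacts = np.asarray(contacts, dtype=float)
    scores = np.asarray(scores, dtype=float)
    target_points = np.asarray(target_points, dtype=float)
    # Pairwise distances between every contact (N) and every target point (M).
    diff = contacts[:, None, :] - target_points[None, :, :]
    nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
    keep = (nearest <= max_dist) & (scores >= min_score)
    return np.flatnonzero(keep)


if __name__ == "__main__":
    pose = grasp_pose_from_contact(
        contact=[0.0, 0.0, 0.0], approach=[0.0, 0.0, 1.0],
        baseline=[1.0, 0.0, 0.0], width=0.08, base_offset=0.1,
    )
    print(pose)
```

The pairwise distance computation is quadratic in memory, which is fine for a few thousand points. For larger clouds a k-d tree lookup would be the more usual choice.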
Experimental results show that Triplane Grasping makes reliable and efficient grasping decisions when handling common desktop objects, with an average success rate of 72.37% and an average decision-making time of 1.27 seconds. The method also provides high-quality 3D point cloud representations and robust grasp pose generation in zero-shot scenarios, which suggests broad practical potential. Future work includes extending the method to more complex and larger-scale real-world objects and exploring the use of large language models to further improve 3D reconstruction quality and grasp optimization.

Triplane Grasping: Efficient 6-DoF Grasping with Single RGB Images

MonoGraspNet: 6-DoF Grasping with a Single RGB Image

RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields

Single RGB Image 6D Object Grasping System Using Pixel-Wise Voting Network

6-DoF grasp estimation method that fuses RGB-D data based on external attention

Grasp Pose Detection from a Single RGB Image

Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection

ASGrasp: Generalizable Transparent Object Reconstruction and 6-Dof Grasp Detection from RGB-D Active Stereo Camera

SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera

S4G: Amodal Single-view Single-Shot SE(3) Grasp Detection in Cluttered Scenes

A System of Robotic Grasping with Experience Acquisition.

6D Pose Estimation with Combined Deep Learning and 3D Vision Techniques for a Fast and Accurate Object Grasping

GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF

CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-DoF Grasp Estimation

Robotic Grasping With Multi-View Image Acquisition and Model-Based Pose Estimation

Visual Robotic Object Grasping Through Combining RGB-D Data and 3D Meshes.

Modular Anti-noise Deep Learning Network for Robotic Grasp Detection Based on RGB Images

Single-Camera Multi-View 6DoF pose estimation for robotic grasping

ICGNet: A Unified Approach for Instance-Centric Grasping

Efficient Fully Convolutional Network and Optimization Approach for Robotic Grasping Detection Based on RGB-D Images