Triplane Grasping: Efficient 6-DoF Grasping with Single RGB Images

Yiming Li, Hanchi Ren, Jingjing Deng, Xianghua Xie
2024-10-21
Abstract: Reliable object grasping is one of the fundamental tasks in robotics. However, determining grasping pose based on single-image input has long been a challenge due to limited visual information and the complexity of real-world objects. In this paper, we propose Triplane Grasping, a fast grasping decision-making method that relies solely on a single RGB-only image as input. Triplane Grasping creates a hybrid Triplane-Gaussian 3D representation through a point decoder and a triplane decoder, which produce an efficient and high-quality reconstruction of the object to be grasped to meet real-time grasping requirements. We propose to use an end-to-end network to generate 6-DoF parallel-jaw grasp distributions directly from 3D points in the point cloud as potential grasp contacts and anchor the grasp pose in the observed data. Experiments demonstrate that our method achieves rapid modeling and grasping pose decision-making for daily objects, and exhibits a high grasping success rate in zero-shot scenarios.
Robotics
What problem does this paper attempt to address?
This paper addresses how to efficiently generate reliable 6-degree-of-freedom (6-DoF) grasp poses from a single RGB image for robotic manipulation. Specifically, it focuses on recovering an object's 3D structure from one image and using that 3D information to determine the optimal grasp position and orientation, so that grasps can be planned quickly and accurately. The difficulty stems mainly from the limited visual information in a single image and from the complexity of object shapes and poses in the real world. To address these challenges, the authors propose **Triplane Grasping**, which proceeds in three steps:
1. **3D Reconstruction**: A hybrid Triplane-Gaussian representation is used to efficiently reconstruct the object's 3D point cloud from a single RGB image. Image features are extracted with a pre-trained DINOv2 model and combined with camera information; a Transformer-based point cloud decoder generates an initial coarse point cloud, and a triplane decoder then produces a detailed 3D feature representation.
2. **Grasp Pose Generation**: Contact-GraspNet is used to generate 6-DoF grasp poses directly from the reconstructed point cloud. The method treats points in the point cloud as potential grasp contacts; by classifying contact points and estimating the grasp rotation, it reduces the high-dimensional grasp learning problem to a lower-dimensional one, which improves learning efficiency and grasp pose accuracy.
3. **Grasp Contact Filtering**: After the 6-DoF grasp poses are generated, the grasp contacts are filtered so that the retained poses are associated with the target object. This avoids collisions or undesired interactions with other objects and thereby improves the grasp success rate.
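
Steps 2 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a contact-based grasp parameterisation in the spirit of Contact-GraspNet (a contact point, a finger baseline direction, an approach direction, and a grasp width) and a boolean mask marking which reconstructed points belong to the target object. The function names, the axis convention, the `base_offset` value, and the `min_score` threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def grasp_pose_from_contact(contact, baseline, approach, width, base_offset=0.10):
    """Assemble a 4x4 gripper pose from one predicted grasp contact.

    Assumed convention (illustrative only): the gripper x-axis is the finger
    baseline direction and the z-axis is the approach direction.
    base_offset is a hypothetical distance (metres) from the finger contact
    line back to the gripper base.
    """
    contact = np.asarray(contact, dtype=float)
    b = np.asarray(baseline, dtype=float)
    b = b / np.linalg.norm(b)
    # Remove the part of the approach vector that lies along the baseline,
    # so the two axes are exactly orthogonal before building the rotation.
    a = np.asarray(approach, dtype=float)
    a = a - np.dot(a, b) * b
    a = a / np.linalg.norm(a)

    pose = np.eye(4)
    pose[:3, 0] = b               # x-axis: finger closing direction
    pose[:3, 1] = np.cross(a, b)  # y-axis: completes a right-handed frame
    pose[:3, 2] = a               # z-axis: approach direction
    # Centre the gripper between the fingers and move back to the base.
    pose[:3, 3] = contact + 0.5 * width * b - base_offset * a
    return pose


def filter_grasps_by_target(contact_indices, scores, target_mask, min_score=0.5):
    """Return indices of grasps whose contact lies on the target object.

    contact_indices[i] is the index of the point-cloud point used as the
    contact of grasp i; target_mask[j] is True when point j belongs to the
    target object. min_score is a hypothetical confidence threshold.
    """
    contact_indices = np.asarray(contact_indices)
    scores = np.asarray(scores, dtype=float)
    target_mask = np.asarray(target_mask, dtype=bool)
    on_target = target_mask[contact_indices]
    keep = on_target & (scores >= min_score)
    return np.flatnonzero(keep)
```

In use, a pose would be built for each predicted contact, and only the grasps returned by `filter_grasps_by_target` would be passed on to execution. Anchoring each grasp to an observed point is what reduces the pose search to per-point classification plus a rotation and width estimate.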
Experimental results show that the Triplane Grasping method exhibits reliable and efficient grasping decision-making capabilities when handling common desktop objects, with an average success rate of 72.37% and an average decision-making time of 1.27 seconds. Moreover, this method can also provide high-quality 3D point cloud representations and robust grasp pose generation in zero-shot scenarios, and has broad practical application potential. Future work directions include expanding this method to handle more complex and larger-scale real-world objects and exploring the use of large-scale language models to further improve 3D reconstruction quality and grasping optimization.