Abstract: Robotic manipulation requires accurate perception of the environment, which poses a significant challenge due to its inherent complexity and constantly changing nature. In this context, RGB image and point-cloud observations are two commonly used modalities in vision-based robotic manipulation, but each of these modalities has its own limitations. Commercial point-cloud observations often suffer from sparse sampling and noisy output due to the limits of the emission-reception imaging principle. RGB images, on the other hand, are rich in texture information but lack the depth and 3D information that is crucial for robotic manipulation. To mitigate these challenges, we propose an image-only robotic manipulation framework that leverages an eye-on-hand monocular camera installed on the robot's parallel gripper. By moving with the robot gripper, this camera can actively perceive objects from multiple perspectives during the manipulation process. This enables the estimation of 6D object poses, which can be utilized for manipulation. While obtaining images from more, and more diverse, viewpoints typically improves pose estimation, it also increases the manipulation time. To address this trade-off, we employ a reinforcement learning policy to synchronize the manipulation strategy with active perception, achieving a balance between 6D pose accuracy and manipulation efficiency. Our experimental results in both simulated and real-world environments demonstrate the state-of-the-art effectiveness of our approach. We believe that our method will inspire further research on real-world-oriented robotic manipulation.
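The trade-off between pose accuracy and manipulation time described in the abstract can be illustrated with a small sketch. This is not the paper's learned reinforcement learning policy. It is a hand-written greedy stopping rule, under the assumption that pose error can be summarized as a single scalar per view and that each extra view carries a fixed cost. The names `views_to_take`, `episode_return`, and `view_cost`, and all error values, are hypothetical.

```python
from typing import Sequence


def episode_return(final_error: float, num_views: int, view_cost: float) -> float:
    """Hypothetical objective: penalize the remaining pose error and every view taken."""
    return -final_error - view_cost * num_views


def views_to_take(errors: Sequence[float], view_cost: float) -> int:
    """Greedy stopping rule over a known error curve.

    errors[i] is the (assumed) scalar pose error after i + 1 views.
    Another view is taken only while the error reduction it brings
    exceeds the per-view cost.
    """
    k = 1  # at least one image is needed to estimate a pose at all
    while k < len(errors) and errors[k - 1] - errors[k] > view_cost:
        k += 1
    return k


# Illustrative error curve with diminishing returns (made-up numbers).
errors = [1.0, 0.6, 0.4, 0.3, 0.27, 0.26]
n = views_to_take(errors, view_cost=0.05)
print(n, episode_return(errors[n - 1], n, view_cost=0.05))
```

A higher `view_cost` stops the camera earlier and a lower one lets it keep collecting viewpoints. A learned policy would have to make this decision without access to the true error curve, which the sketch assumes is known in advance.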

RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images

MonoGraspNet: 6-DoF Grasping with a Single RGB Image

RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields

6-DoF grasp estimation method that fuses RGB-D data based on external attention

Modular Anti-noise Deep Learning Network for Robotic Grasp Detection Based on RGB Images

RGB-D Grasp Detection via Depth Guided Learning with Cross-modal Attention

Grasp Pose Detection from a Single RGB Image

Efficient Fully Convolutional Network and Optimization Approach for Robotic Grasping Detection Based on RGB-D Images

SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Triplane Grasping: Efficient 6-DoF Grasping with Single RGB Images

GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF

Lightweight Pixel-Wise Generative Robot Grasping Detection Based on RGB-D Dense Fusion

ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera

RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation

OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

ASGrasp: Generalizable Transparent Object Reconstruction and 6-Dof Grasp Detection from RGB-D Active Stereo Camera

Real-Time Pixel-Wise Grasp Detection Based on RGB-D Feature Dense Fusion

Deep learning for detecting robotic grasps

Single RGB Image 6D Object Grasping System Using Pixel-Wise Voting Network

A Novel Robotic Grasp Detection Framework Using Low-Cost RGB-D Camera for Industrial Bin Picking

Single-Camera Multi-View 6DoF pose estimation for robotic grasping