Abstract: Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.

What problem does this paper attempt to address?

This paper addresses how to jointly estimate the geometry and pose of an object grasped by a robot from a small number of RGB images. The authors aim to give robots a human-like ability: once an object is grasped, the robot should be able to estimate how its hand movements move the object, so that it can use the object to interact with the environment.

### Problem Background

In robot manipulation, generating motion trajectories usually relies on known kinematic relationships between the robot's joint angles and points of interest. These points usually lie on the robot itself, such as the end-effector pose or collision-checking points on the robot body. Once the robot has grasped an object, including points on that object in the costs and constraints requires an accurate estimate of the object's geometry and pose. This information is not readily available after the object has been grasped.

### Paper Objectives

The paper aims to jointly estimate the geometry and pose of a grasped object from a small number of RGB images. The authors propose a method that estimates the object's geometry and pose from images captured by a fixed external monocular RGB camera and transforms them into the robot's coordinate frame. The method does not require the camera's extrinsic parameters to be calibrated in advance. Instead, it uses a pre-trained 3D foundation model (such as DUSt3R) to produce an initial estimate, and then recovers the physically correct scale and the transformation to the robot frame by solving a coordinate-alignment problem.

### Main Contributions

1. **Unified Framework**: Use emerging 3D foundation models to jointly estimate the geometry and pose of a grasped object from RGB images.
2. **Coordinate Alignment Problem**: Formulate and solve a coordinate-alignment problem that makes it possible to construct a kinematic mapping from the robot's joint angles to specified points on the grasped object.
3. **Experimental Evaluation**: Empirically evaluate the framework on a range of everyday objects to assess its effectiveness and robustness.

### Key Steps of the Solution

- **Initial Estimation**: Use a pre-trained 3D foundation model (such as DUSt3R) to generate an initial estimate of the object's geometry and pose from RGB images. This estimate is expressed in the camera's frame and does not have a physically correct scale.
- **Coordinate Alignment**: Solve an optimization problem that transforms the initial estimate into the robot's coordinate frame and recovers the physically correct scale.
- **Kinematic Mapping**: Define a mapping from the robot's joint angles to specified points on the grasped object, so that robot motion can be planned with respect to coordinates on the object.

Through these steps, the paper provides a method that lets a robot estimate the geometry and pose of an object after grasping it and plan motion with respect to that object.
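
The coordinate alignment and kinematic mapping steps can be illustrated with a minimal sketch. It is not the authors' formulation, which this summary does not spell out. As a stand-in, it uses the closed-form Umeyama similarity alignment and assumes that a few point correspondences between the unscaled camera-frame reconstruction and the robot base frame are available. It also assumes that 4x4 end-effector poses come from the robot's own forward kinematics. All function names are hypothetical.

```python
import numpy as np


def align_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    Closed-form Umeyama solution (a swapped-in technique, not the paper's solver).
    src, dst: (N, 3) arrays of corresponding points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    cov = dst_c.T @ src_c / len(src)            # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # rule out a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (src_c ** 2).sum(axis=1).mean()
    t = mu_dst - s * R @ mu_src
    return s, R, t


def to_robot_frame(points_cam, s, R, t):
    """Map (N, 3) unscaled camera-frame points into the robot base frame."""
    return s * np.asarray(points_cam, dtype=float) @ R.T + t


def attach_to_end_effector(points_base, T_base_ee):
    """Express base-frame points in the end-effector frame.

    T_base_ee is an assumed 4x4 end-effector pose from forward kinematics.
    """
    R_ee, p_ee = T_base_ee[:3, :3], T_base_ee[:3, 3]
    return (points_base - p_ee) @ R_ee          # each row is R_ee.T @ (p - p_ee)


def object_points_at(points_ee, T_base_ee_new):
    """Predict base-frame object points for a new end-effector pose."""
    R_ee, p_ee = T_base_ee_new[:3, :3], T_base_ee_new[:3, 3]
    return points_ee @ R_ee.T + p_ee
```

In this sketch, `align_similarity` recovers the scale and rigid transform once, `attach_to_end_effector` fixes the object points relative to the gripper, and `object_points_at` then predicts where those points lie for any later end-effector pose, under the assumption that the object does not slip in the grasp.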

3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects
