3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects

Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson
2024-07-15
Abstract: Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
Robotics, Machine Learning, Systems and Control
What problem does this paper attempt to address?
This paper addresses the problem of jointly estimating the geometry and pose of an object grasped by a robot from a small number of RGB images. The authors aim to give robots a human-like ability: once an object is grasped, the robot should be able to estimate how its hand movements affect the object's movement, so that it can use the object to interact with its environment.

### Problem Background

In robot manipulation, generating motion trajectories usually depends on known kinematic relationships between the robot's joint angles and points of interest. These points normally lie on the robot itself, such as the end-effector pose or collision-check points on the robot body. Once the robot has grasped an object, however, points on that object can only be included in costs and constraints if the object's geometry and pose are accurately known. This information is not readily available after the grasp.

### Paper Objectives

The paper proposes a method that estimates the geometry and pose of a grasped object from images captured by a fixed external monocular RGB camera and transforms the result into the robot's coordinate frame. The method does not require the camera's extrinsic parameters to be calibrated in advance. Instead, it uses a pre-trained 3D foundation model (such as DUSt3R) to produce an initial estimate, which is in the camera's frame and lacks a physically correct scale, and then solves a coordinate-alignment problem to recover the correct scale and the transformation to the robot's frame.

### Main Contributions

1. **Unified Framework**: Uses emerging 3D foundation models to jointly estimate the geometry and pose of a grasped object from RGB images.
2. **Coordinate Alignment Problem**: Formulates and solves a coordinate-alignment problem, which makes it possible to construct a kinematic mapping from the robot's joint angles to specified points on the grasped object.
3. **Experimental Evaluation**: Empirically evaluates the framework on a range of everyday objects to verify its effectiveness and robustness.

### Key Steps of the Solution

- **Initial Estimation**: Use a pre-trained 3D foundation model (such as DUSt3R) to generate an initial estimate of the object's geometry and pose from RGB images.
- **Coordinate Alignment**: Solve an optimization problem that transforms the initial estimate into the robot's coordinate frame and recovers the physically correct scale.
- **Kinematic Mapping**: Establish a mapping from the robot's joint angles to specified points on the grasped object, so that robot motion can be planned with respect to coordinates on the object.

Through these steps, the paper provides a method that lets a robot estimate the geometry and pose of an object after grasping it and plan motion accordingly.
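
To make the coordinate-alignment and kinematic-mapping steps more concrete, here is a minimal sketch in Python. It is not the authors' formulation. It substitutes a standard closed-form similarity alignment (the Umeyama method) and assumes, purely for illustration, that a set of 3D points is known both in the unscaled camera frame and in the robot's base frame. The function names `umeyama_alignment`, `to_ee_frame`, and `to_base_frame`, and the synthetic data, are hypothetical.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform so that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    Returns scale s, rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0

    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def to_ee_frame(points_base, T_base_ee):
    """Express base-frame points in the end-effector frame (4x4 pose input)."""
    R, t = T_base_ee[:3, :3], T_base_ee[:3, 3]
    return (points_base - t) @ R  # row-wise R.T @ (p - t)


def to_base_frame(points_ee, T_base_ee):
    """Map end-effector-frame points to the base frame for a given pose."""
    R, t = T_base_ee[:3, :3], T_base_ee[:3, 3]
    return points_ee @ R.T + t


# Synthetic example (assumed data, not from the paper).
rng = np.random.default_rng(0)
pts_robot = rng.uniform(-0.5, 0.5, size=(20, 3))  # metric, robot base frame

angle = 0.7
true_R = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
true_s = 0.25
true_t = np.array([0.4, -0.1, 0.8])

# The same points as an unscaled camera-frame estimate would see them.
pts_cam = (pts_robot - true_t) @ true_R / true_s

s, R, t = umeyama_alignment(pts_cam, pts_robot)
aligned = s * pts_cam @ R.T + t  # camera-frame points moved to the robot frame

# With a rigid grasp, object points are constant in the end-effector frame,
# so they can be predicted at any other end-effector pose.
T_grasp = np.eye(4)
T_grasp[:3, 3] = [0.3, 0.0, 0.5]
T_new = np.eye(4)
T_new[:3, :3] = true_R
T_new[:3, 3] = [0.1, 0.2, 0.6]

obj_in_ee = to_ee_frame(aligned, T_grasp)
obj_at_new_pose = to_base_frame(obj_in_ee, T_new)
print("recovered scale:", s)
```

In a real system the end-effector poses `T_grasp` and `T_new` would come from the manipulator's forward kinematics at the corresponding joint angles. Here they are hand-written placeholders.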