Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho

What problem does this paper attempt to address?

This paper addresses the problem of reconstructing the 3D geometry of hand-held objects from monocular RGB images or videos. Specifically, it focuses on recovering hand-object interactions from Internet videos, in particular the geometric shape of the hand-held object. The problem matters for computer vision and robotics for two reasons:

1. **3D Reconstruction**: The hand occludes much of the object, and the object usually occupies only a small number of image pixels, which makes reconstruction from a single RGB image very challenging. For videos, the hand and object trajectories must also be temporally consistent.
2. **Data Requirements**: Robotic manipulation requires large-scale data. However, unlike in fields such as natural language processing and computer vision, large-scale robotic manipulation trajectory data is not readily available. Video data on the Internet is abundant, but extracting 3D hand and object trajectories from it is a significant challenge.

To address these challenges, the paper proposes a scalable method that uses object recognition and retrieval to guide the reconstruction of hand-held objects. The method consists of three stages:

1. **Learning-Based Hand and Object Reconstruction**: Use the MCC-Hand-Object (MCC-HO) model to jointly reconstruct hand and object geometry from a single RGB image and an inferred 3D hand.
2. **Object Model Retrieval**: Use GPT-4(V) to prompt a text-to-3D generative model (such as Genie) and retrieve a 3D model that matches the object in the image(s). This process is called Retrieval-Augmented Reconstruction (RAR).
3. **Temporally Consistent Rigid Alignment**: Rigidly align the retrieved 3D object model with the input image or video and with the geometry inferred by MCC-HO, so that the result is consistent over time.

Experimental results show that this method achieves state-of-the-art performance on both lab and Internet image/video datasets.
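
To make the rigid alignment stage more concrete, here is a minimal sketch of fitting a rotation and translation between two 3D point sets. It uses the standard Kabsch algorithm and assumes point correspondences are already known; this summary does not describe the paper's actual alignment procedure, so the function name `kabsch_rigid_align` and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def kabsch_rigid_align(source, target):
    """Return (R, t) minimizing sum_i ||R @ source[i] + t - target[i]||^2.

    source, target: (N, 3) arrays of corresponding 3D points.
    Hypothetical helper for illustration; correspondences are assumed given.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1),
    # not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Usage on synthetic data: points standing in for a retrieved object model,
# and the same points after a known rotation and translation, standing in
# for the per-frame observation.
rng = np.random.default_rng(0)
model_points = rng.normal(size=(20, 3))
theta = np.pi / 6
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.1, -0.2, 0.5])
observed_points = model_points @ R_true.T + t_true

R_est, t_est = kabsch_rigid_align(model_points, observed_points)
aligned = model_points @ R_est.T + t_est
print("max alignment error:", np.abs(aligned - observed_points).max())
```

In a video setting, one fit like this per frame would not by itself guarantee smooth motion; enforcing temporal consistency across frames is a separate step that this sketch leaves out.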

Reconstructing Hand-Held Objects in 3D from Images and Videos

In-Hand 3D Object Reconstruction from a Monocular RGB Video

Reconstructing Hand-Held Objects from Monocular Video

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Learning Hand-Held Object Reconstruction from In-The-Wild Videos

EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

HandO: a Hybrid 3D Hand–object Reconstruction Model for Unknown Objects

HandFormer: Hand Pose Reconstructing from a Single RGB Image

MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision

HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images

3D Reconstruction of Objects in Hands without Real World 3D Supervision

Learning Explicit Contact for Implicit Reconstruction of Hand-held Objects from Monocular Images

Stereo Hand-Object Reconstruction for Human-to-Robot Handover

SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction

HandOS: 3D Hand Reconstruction in One Stage

UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

Object Modelling with a Handheld RGB-D Camera

Contact-conditioned Hand-Held Object Reconstruction from Single-View Images

Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera