Reconstructing Hand-Held Objects in 3D from Images and Videos

Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik
2024-11-26
Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the reconstruction of the 3D geometry of hand-held objects from monocular RGB images or videos. Specifically, it focuses on recovering hand-object interactions from Internet videos, and in particular the shape of the object being held. The problem matters in computer vision and robotics for two reasons:

1. **3D Reconstruction**: The hand occludes much of the object, and the object usually occupies only a small number of image pixels, which makes reconstruction from a single RGB image very challenging. For videos, the hand and object trajectories must also be temporally consistent.
2. **Data Requirements**: Robotic manipulation demands large-scale data. However, unlike in fields such as natural language processing and computer vision, large-scale robotic manipulation trajectory data is not readily available. Video data on the Internet is abundant, but extracting 3D hand and object trajectories from it is a significant challenge.

To address these challenges, the paper proposes a scalable method that uses object recognition and retrieval to guide the reconstruction of hand-held objects. The method consists of three stages:

1. **Learning-Based Hand and Object Reconstruction**: Use the MCC-Hand-Object (MCC-HO) model to jointly reconstruct hand and object geometry from a single RGB image and an inferred 3D hand.
2. **Object Model Retrieval**: Retrieve a 3D model that matches the object in the image by using GPT-4(V) to prompt a text-to-3D generative model (such as Genie). This process is called Retrieval-Augmented Reconstruction (RAR).
3. **Temporally Consistent Rigid Alignment**: Rigidly align the retrieved 3D object model with the input image or video and with the geometry inferred by MCC-HO, so that the result is consistent over time.

Experimental results show that this method achieves state-of-the-art performance on both laboratory and Internet image/video datasets.
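
The rigid alignment stage described above can be illustrated with a small sketch. The summary does not specify the paper's actual alignment procedure, so this example swaps in the classical Kabsch algorithm, which finds the rotation and translation that best map one point set onto another when point correspondences are already known. That known-correspondence setup is a simplifying assumption, and the names `kabsch_rigid_align`, `model_points`, and `observed_points` are hypothetical and not taken from the paper's code.

```python
import numpy as np


def kabsch_rigid_align(source, target):
    """Least-squares rigid transform (R, t) such that R @ source_i + t ~ target_i.

    Assumes source and target are (N, 3) arrays with row-wise correspondences.
    This is an illustrative stand-in, not the alignment used in the paper.
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    src_centered = source - src_centroid
    tgt_centered = target - tgt_centroid

    # 3x3 cross-covariance between the centered point sets.
    H = src_centered.T @ tgt_centered
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Synthetic check: "retrieved model" points and the same points after a known pose.
rng = np.random.default_rng(0)
model_points = rng.normal(size=(50, 3))

angle = np.pi / 6
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.1, -0.2, 0.5])
observed_points = model_points @ R_true.T + t_true

R_est, t_est = kabsch_rigid_align(model_points, observed_points)
aligned = model_points @ R_est.T + t_est
max_error = float(np.abs(aligned - observed_points).max())
```

In a real pipeline the correspondences between the retrieved model and the per-frame reconstruction would not be given, so a step like this would typically sit inside an iterative scheme that re-estimates them, and a video setting would additionally need some constraint linking the poses of neighboring frames.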