SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

Dimitrije Antić, Sai Kumar Dwivedi, Shashank Tripathi, Theo Gevers, Dimitrios Tzionas
2024-09-24
Abstract: We focus on recovering 3D object pose and shape from single images. This is highly challenging due to strong (self-)occlusions, depth ambiguities, the enormous shape variance, and lack of 3D ground truth for natural images. Recent work relies mostly on learning from finite datasets, so it struggles to generalize, while it focuses mostly on the shape itself, largely ignoring the alignment with pixels. Moreover, it performs feed-forward inference, so it cannot refine estimates. We tackle these limitations with a novel framework, called SDFit. To this end, we make three key observations: (1) Learned signed-distance-function (SDF) models act as a strong morphable shape prior. (2) Foundational models embed 2D images and 3D shapes in a joint space, and (3) also infer rich features from images. SDFit exploits these as follows. First, it uses a category-level morphable SDF (mSDF) model, called DIT, to generate 3D shape hypotheses. This mSDF is initialized by querying OpenShape's latent space conditioned on the input image. Then, it computes 2D-to-3D correspondences, by extracting and matching features from the image and mSDF. Last, it fits the mSDF to the image in a render-and-compare fashion, to iteratively refine estimates. We evaluate SDFit on the Pix3D and Pascal3D+ datasets of real-world images. SDFit performs roughly on par with state-of-the-art learned methods, but, uniquely, requires no re-training. Thus, SDFit is promising for generalizing in the wild, paving the way for future research. Code will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of recovering 3D object pose and shape from a single image. It proposes a framework named SDFit, which uses a morphable signed-distance-function (mSDF) model to generate 3D shape hypotheses and fits them to the image in a render-and-compare fashion. The method aims to overcome the limitations of existing methods in generalization, pixel alignment, and joint optimization of shape and pose.

### Background of the Paper and Problem Definition

1. **Task Challenges**:
   - **Depth Ambiguity**: A single image does not provide sufficient depth information.
   - **Self-Occlusion**: Parts of an object may be hidden by the object itself or by other objects.
   - **Shape Variation**: Shapes differ enormously both across object categories and within a single category.
   - **Lack of 3D Ground Truth**: Natural images rarely come with 3D annotations.
2. **Limitations of Existing Methods**:
   - **Data-Driven Methods**: They learn from finite datasets and therefore struggle to generalize.
   - **Feed-Forward Inference**: They cannot refine estimates iteratively, so estimation errors go uncorrected.
   - **Pixel Alignment**: Most methods focus on the shape itself and largely ignore its alignment with image pixels.

### Core Contributions of the SDFit Framework

1. **Morphable Signed Distance Function (mSDF)**:
   - Use a pre-trained DIT model as a category-level shape prior.
   - Generate 3D shape hypotheses and refine them through render-and-compare.
2. **Shape Initialization**:
   - Use the OpenShape model to retrieve the most similar 3D shape in the joint latent space of 2D images and 3D shapes.
3. **Pose Initialization**:
   - Use a foundation model to extract rich image features and establish 2D-3D correspondences through multi-view rendering and back-projection.
   - Estimate the initial pose using the RANSAC and PnP algorithms.
4. **Optimization Process**:
   - Optimize the shape and pose by minimizing an energy function that combines multiple loss terms, such as masks, normal maps, and depth maps.

### Experimental Results

1. **Datasets**:
   - **Pix3D**: Contains real-world images and their corresponding CAD models.
   - **Pascal3D+**: Contains real-world images of objects from multiple categories.
2. **Evaluation Metrics**:
   - **Chamfer Distance (CD)**: Quantifies the similarity between two 3D point clouds.
   - **F-Score**: Reflects the accuracy of surface reconstruction within a given distance threshold.
   - **Intersection-over-Union (IoU)**: Measures how well the estimated 3D shape aligns with the image pixels.
3. **Experimental Results**:
   - On Pix3D and Pascal3D+, SDFit performs roughly on par with state-of-the-art learned methods, yet requires no retraining, which makes it promising for generalizing to unseen natural images.

### Conclusion

The SDFit framework tackles the recovery of 3D object pose and shape from a single image by combining a morphable signed-distance-function model with foundation models. Its main strengths are that it needs no retraining and that it jointly optimizes shape and pose, which points to a promising direction for future research.
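
The three evaluation metrics named above can be made concrete with a small sketch. This is not the paper's evaluation code: the function names are hypothetical, and the exact conventions are assumptions. Chamfer distance is taken here as the sum of the two mean (unsquared) nearest-neighbour distances, the F-score uses a strict distance threshold, and IoU is computed between two binary 2D masks, for example a rendered silhouette and a ground-truth object mask. Published benchmarks often differ in these details (squared distances, point sampling, normalization).

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions.

    Assumption: unsquared distances, averaged per direction, then summed.
    """
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    return sum(d_pred_to_gt) / len(pred) + sum(d_gt_to_pred) / len(gt)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a given distance threshold."""
    # Precision: fraction of predicted points close to the ground truth.
    precision = sum(d < threshold for d in nearest_distances(pred, gt)) / len(pred)
    # Recall: fraction of ground-truth points close to the prediction.
    recall = sum(d < threshold for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mask_iou(pred_mask, gt_mask):
    """Intersection-over-Union of two equally sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for row_pred, row_gt in zip(pred_mask, gt_mask):
        for p, g in zip(row_pred, row_gt):
            intersection += bool(p) and bool(g)
            union += bool(p) or bool(g)
    # Two empty masks are treated as a perfect match (a convention, not a standard).
    return intersection / union if union else 1.0


if __name__ == "__main__":
    pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
    print("CD:", chamfer_distance(pred_points, gt_points))
    print("F-score:", f_score(pred_points, gt_points, threshold=0.1))
    print("IoU:", mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for illustration. For dense point clouds a spatial index such as a k-d tree would be the usual choice.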