Abstract: We focus on recovering 3D object pose and shape from single images. This is highly challenging due to strong (self-)occlusions, depth ambiguities, the enormous shape variance, and lack of 3D ground truth for natural images. Recent work relies mostly on learning from finite datasets, so it struggles to generalize, while it focuses mostly on the shape itself, largely ignoring the alignment with pixels. Moreover, it performs feed-forward inference, so it cannot refine estimates. We tackle these limitations with a novel framework, called SDFit. To this end, we make three key observations: (1) Learned signed-distance-function (SDF) models act as a strong morphable shape prior. (2) Foundational models embed 2D images and 3D shapes in a joint space, and (3) also infer rich features from images. SDFit exploits these as follows. First, it uses a category-level morphable SDF (mSDF) model, called DIT, to generate 3D shape hypotheses. This mSDF is initialized by querying OpenShape's latent space conditioned on the input image. Then, it computes 2D-to-3D correspondences by extracting and matching features from the image and mSDF. Last, it fits the mSDF to the image in a render-and-compare fashion, to iteratively refine estimates. We evaluate SDFit on the Pix3D and Pascal3D+ datasets of real-world images. SDFit performs roughly on par with state-of-the-art learned methods, but, uniquely, requires no re-training. Thus, SDFit is promising for generalizing in the wild, paving the way for future research. Code will be released.
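
The shape-initialization step described in the abstract, querying a joint image-shape latent space, amounts to nearest-neighbour retrieval. The following is a minimal sketch, assuming embeddings are compared by cosine similarity. The function names, the toy three-dimensional embeddings, and the `shape_bank` dictionary are hypothetical illustrations and are not OpenShape's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def retrieve_nearest_shape(image_embedding, shape_embeddings):
    """Return (shape_id, similarity) of the shape closest to the image.

    shape_embeddings maps a shape identifier to its embedding vector.
    Both embeddings are assumed to live in the same joint latent space.
    """
    best_id, best_sim = None, -math.inf
    for shape_id, embedding in shape_embeddings.items():
        sim = cosine_similarity(image_embedding, embedding)
        if sim > best_sim:
            best_id, best_sim = shape_id, sim
    return best_id, best_sim


# Toy data (hypothetical): three shape embeddings and one image embedding.
shape_bank = {
    "chair_a": [1.0, 0.0, 0.0],
    "table_b": [0.0, 1.0, 0.0],
    "sofa_c": [0.6, 0.8, 0.0],
}
query = [0.9, 0.1, 0.0]
best_id, best_sim = retrieve_nearest_shape(query, shape_bank)
```

In this toy setup the retrieved shape would serve only as a starting hypothesis; the fitting stage is what refines it against the image.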

What problem does this paper attempt to address?

This paper addresses the problem of recovering 3D object pose and shape from a single image. It proposes a framework named SDFit, which uses a morphable signed-distance-function (mSDF) model to generate 3D shape hypotheses and fits them to the image in a render-and-compare fashion. The method aims to overcome the limitations of existing methods in generalization and in the joint optimization of shape and pose.

### Background and Problem Definition

1. **Task Challenges**:
   - **Depth Ambiguity**: A single image does not provide sufficient depth information.
   - **(Self-)Occlusion**: Parts of an object may be hidden by the object itself or by other objects.
   - **Shape Variation**: Shapes differ enormously both across categories and within a single category.
   - **Lack of 3D Ground Truth**: Natural images rarely come with 3D annotations.
2. **Limitations of Existing Methods**:
   - **Data-Driven Methods**: They rely on finite datasets and generalize poorly.
   - **Feed-Forward Inference**: They cannot iteratively refine estimates, so estimation errors go uncorrected.
   - **Pixel Alignment**: Most methods focus on the shape itself and largely ignore alignment with pixels.

### Core Contributions of the SDFit Framework

1. **Morphable Signed Distance Function (mSDF)**:
   - Uses a pre-trained, category-level DIT model as a shape prior.
   - Generates 3D shape hypotheses and refines them in a render-and-compare fashion.
2. **Shape Initialization**:
   - Uses the OpenShape model to retrieve the most similar 3D shape in the joint latent space of 2D images and 3D shapes.
3. **Pose Initialization**:
   - Uses a foundation model to extract rich image features and establishes 2D-to-3D correspondences through multi-view rendering and back-projection.
   - Estimates the initial pose using the RANSAC and PnP algorithms.
4. **Optimization Process**:
   - Optimizes shape and pose by minimizing an energy function that combines multiple loss terms, such as terms on masks, normal maps, and depth maps.

### Experimental Results

1. **Datasets**:
   - **Pix3D**: Contains real-world images and their corresponding CAD models.
   - **Pascal3D+**: Contains real-world images of objects from multiple categories.
2. **Evaluation Metrics**:
   - **Chamfer Distance (CD)**: Quantifies the similarity between two 3D point clouds.
   - **F-Score**: Reflects the accuracy of surface reconstruction within a given distance threshold.
   - **Intersection-over-Union (IoU)**: Measures how well the estimated 3D shape aligns with the image pixels.
3. **Findings**:
   - On Pix3D and Pascal3D+, SDFit performs roughly on par with state-of-the-art learned methods. Unlike them, it requires no retraining, which makes it promising for generalizing to unseen natural images.

### Conclusion

SDFit addresses the problem of recovering 3D object pose and shape from a single image by combining a morphable SDF model with foundation models. It performs roughly on par with learned methods without retraining and jointly refines shape and pose, pointing to a direction for future research.
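
The three evaluation metrics listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation code: Chamfer distance is taken here as the sum of the two mean nearest-neighbour Euclidean distances (conventions vary, for example squared distances), the F-score uses a single distance threshold, and IoU is computed on binary masks given as nested lists. All function names are hypothetical.

```python
import math


def _nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: sum of the mean nearest-neighbour
    distances in both directions (one common convention among several)."""
    d_pred = _nearest_distances(pred, gt)
    d_gt = _nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d_pred = _nearest_distances(pred, gt)
    d_gt = _nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in d_pred) / len(d_pred)
    recall = sum(d <= threshold for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mask_iou(mask_a, mask_b):
    """Intersection-over-Union of two binary masks of the same size."""
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    union = sum(a or b for a, b in zip(flat_a, flat_b))
    return intersection / union if union else 1.0


# Toy point clouds: the prediction matches two of three ground-truth points
# exactly and misses the third by 0.5 along z.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]

cd = chamfer_distance(pred_points, gt_points)
fs = f_score(pred_points, gt_points, threshold=0.1)
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a toy example; a spatial index such as a k-d tree would be the usual choice for dense point clouds.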

SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

PoseSDF: Simultaneous 3D Human Shape Reconstruction and Gait Pose Estimation Using Signed Distance Functions

DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation

SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images

ShapeICP: Iterative Category-level Object Pose and Shape Estimation from Depth

TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction using Vision-Based Tactile Sensing

Robust Shape Fitting for 3D Scene Abstraction

ϕ-SfT: Shape-from-Template with a Physics-Based Deformation Model

sSfS: Segmented Shape from Silhouette Reconstruction of the Human Body

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Structured 3D Features for Reconstructing Controllable Avatars

Mosaic-SDF for 3D Generative Models

Shape My Face: Registering 3D Face Scans by Surface-to-Surface Translation

DeepSSM: A Deep Learning Framework for Statistical Shape Modeling from Raw Images

3D Human Pose and Shape Estimation with Dense Correspondence from a Single Depth Image

Pixel2ISDF: Implicit Signed Distance Fields based Human Body Model from Multi-view and Multi-pose Images