Abstract: Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 1), leading to high-quality metric-scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.

Enhancing Zero-shot 3D Photography Via Mesh-represented Image Inpainting

Zero-1-to-3: Zero-shot One Image to 3D Object

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Learning 3D Photography Videos Via Self-supervised Diffusion on Single Images

PointDreamer: Zero-shot 3D Textured Mesh Reconstruction from Colored Point Cloud by 2D Inpainting

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

One Shot 3D Photography

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

Generating 3D-Consistent Videos from Unposed Internet Photos

Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting

Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views