Abstract: We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view. Code and models will be released on our project page.
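The affine-invariant representation mentioned in the abstract can be made concrete with a small sketch. The block below is an illustration only, not the paper's alignment solver: it swaps in an ordinary closed-form least-squares fit of one global scale `s` and shift `t`, and the names `align_scale_shift` and `aligned_error` are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def align_scale_shift(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, Point]:
    """Least-squares s, t minimizing sum ||s * pred_i + t - gt_i||^2.

    Assumption: a plain L2 fit stands in for the paper's robust solver.
    """
    n = len(pred)
    if n == 0 or n != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Centroids of both point sets.
    mean_p = [sum(p[k] for p in pred) / n for k in range(3)]
    mean_g = [sum(g[k] for g in gt) / n for k in range(3)]
    # Closed-form scale computed from the centered points.
    num = 0.0
    den = 0.0
    for p, g in zip(pred, gt):
        for k in range(3):
            pc = p[k] - mean_p[k]
            num += pc * (g[k] - mean_g[k])
            den += pc * pc
    if den == 0.0:
        raise ValueError("pred points are all identical; scale is undefined")
    s = num / den
    # The shift maps the scaled prediction centroid onto the target centroid.
    t = (mean_g[0] - s * mean_p[0], mean_g[1] - s * mean_p[1], mean_g[2] - s * mean_p[2])
    return s, t


def aligned_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean per-point L1 error measured after the scale-and-shift alignment."""
    s, t = align_scale_shift(pred, gt)
    total = 0.0
    for p, g in zip(pred, gt):
        total += sum(abs(s * p[k] + t[k] - g[k]) for k in range(3))
    return total / len(pred)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 1.0, 1.5)]
    # Ground truth that differs from the prediction only by scale and shift.
    gt = [(2.5 * x + 0.1, 2.5 * y - 0.2, 2.5 * z + 3.0) for x, y, z in pred]
    print(align_scale_shift(pred, gt))
    print(aligned_error(pred, gt))
```

Because the error is measured after alignment, multiplying the prediction by any nonzero constant or adding any offset leaves it unchanged. That is the sense in which such a representation is agnostic to global scale and shift.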

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the problem of recovering accurate 3D geometry from monocular open-domain images. Specifically, the authors propose MoGe, a method that directly predicts the 3D point map of a scene from a single image in an affine-invariant form (i.e., it is insensitive to global scale and translation). The method generates high-quality 3D shapes and generalizes well to open-domain images.

### Main Challenges

1. **High Uncertainty in Monocular Geometry Estimation**: Recovering 3D geometry from a single image is highly ambiguous, because a single view lacks the depth cues that stereo vision provides.
2. **Difficulty in Estimating Camera Intrinsics**: Inferring camera intrinsics (such as focal length) from a single image without strong geometric cues is very challenging, and errors here can lead to significant geometric distortions.
3. **Effectiveness of Training Supervision**: Well-designed training supervision signals are crucial for the robustness and accuracy of the model, but existing methods are insufficient in this regard.

### Solutions

1. **Affine-Invariant Point Map**: Unlike traditional scale-invariant point maps, the point map predicted by MoGe is insensitive to both global scale and translation. This helps eliminate the focal length-distance ambiguity and makes network training more effective.
2. **Robust, Optimal, and Efficient Global Alignment Solver (ROE)**: To compute the global alignment parameters, the authors propose a robust, optimal, and efficient solver, which significantly improves training effectiveness and final accuracy.
3. **Multi-Scale Local Geometric Loss**: To enhance local geometry learning, the authors introduce a multi-scale local geometric loss. It penalizes local differences in the 3D point cloud after aligning each local region with its own independent optimal affine alignment, which significantly improves the accuracy of local geometry predictions.

### Experimental Results

1. **Extensive Evaluation**: The authors conducted zero-shot evaluations on multiple unseen datasets, showing that MoGe significantly outperforms existing methods in all tasks, including monocular 3D point map, depth map, and camera field of view estimation.
2. **Performance Improvement**: Compared to the previous best methods, MoGe reduces the error by over 35% on point map estimation, and by 20% to 30% on depth estimation and camera field of view estimation.

### Summary

By introducing affine-invariant point maps and carefully designed training supervision signals, MoGe addresses the problem of 3D geometry estimation from monocular open-domain images, demonstrating superior performance and broad generalization across various tasks. The method is expected to serve as a powerful foundation model for monocular geometry problems, supporting applications such as 3D-aware image editing, image relighting, depth-to-image synthesis, novel view synthesis, and 3D scene understanding.
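The multi-scale local loss listed under Solutions can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation. Local regions are taken to be balls around chosen ground-truth anchor points, and the radii are arbitrary. Each region is aligned with an ordinary least-squares scale-and-shift fit rather than the paper's solver. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def fit_scale_shift(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, Point]:
    """Least-squares s, t for s * pred_i + t ~ gt_i (stand-in for the paper's solver)."""
    n = len(pred)
    mp = [sum(p[k] for p in pred) / n for k in range(3)]
    mg = [sum(g[k] for g in gt) / n for k in range(3)]
    num = sum((p[k] - mp[k]) * (g[k] - mg[k]) for p, g in zip(pred, gt) for k in range(3))
    den = sum((p[k] - mp[k]) ** 2 for p in pred for k in range(3))
    s = num / den if den > 0.0 else 0.0  # degenerate region: fall back to shift only
    return s, (mg[0] - s * mp[0], mg[1] - s * mp[1], mg[2] - s * mp[2])


def local_loss(pred: Sequence[Point], gt: Sequence[Point],
               anchors: Sequence[int], radius: float) -> float:
    """Mean L1 error over ball-shaped regions, each aligned independently."""
    region_errors: List[float] = []
    for a in anchors:
        # Region membership is decided on the ground truth (an assumption).
        idx = [i for i, g in enumerate(gt) if math.dist(g, gt[a]) <= radius]
        if len(idx) < 2:
            continue  # a single point is always matched exactly, so skip it
        p_loc = [pred[i] for i in idx]
        g_loc = [gt[i] for i in idx]
        s, t = fit_scale_shift(p_loc, g_loc)
        err = sum(abs(s * p[k] + t[k] - g[k]) for p, g in zip(p_loc, g_loc) for k in range(3))
        region_errors.append(err / len(idx))
    return sum(region_errors) / len(region_errors) if region_errors else 0.0


def multi_scale_local_loss(pred: Sequence[Point], gt: Sequence[Point],
                           anchors: Sequence[int], radii: Sequence[float]) -> float:
    """Sum of the local losses over several region sizes."""
    return sum(local_loss(pred, gt, anchors, r) for r in radii)
```

In this sketch, a prediction whose separate parts are each correct up to their own scale and shift scores zero at a small radius, but not at a radius that spans several parts. The small scales therefore measure local shape independently of errors in the global layout.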

MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

Geometry-Aware Network for Unsupervised Learning of Monocular Camera's Ego-Motion

Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Self-supervised Learning of Monocular 3D Geometry Understanding with Two- and Three-View Geometric Constraints

GeoMVSNet: Learning Multi-View Stereo with Geometry Perception

Monocular 3D Detection With Geometric Constraint Embedding and Semi-Supervised Training

GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering

Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation

Geometry-Guided Domain Generalization for Monocular 3D Object Detection

MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors

GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints

Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding

GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds

Towards Accurate Reconstruction of 3D Scene Shape From A Single Monocular Image

GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D Object Detection

GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation

MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction

Online Vectorized HD Map Construction using Geometry