MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, Jiaolong Yang
2024-10-25
Abstract: We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view. Code and models will be released on our project page.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the problem of recovering accurate 3D geometry from monocular open-domain images. The authors propose MoGe, a model that directly predicts the 3D point map of a scene from a single image. The predicted point map is affine-invariant, meaning it is agnostic to the true global scale and shift. The method produces high-quality 3D shapes and generalizes well to open-domain images.

### Main Challenges

1. **High Uncertainty in Monocular Geometry Estimation**: Recovering 3D geometry from a single image is highly ambiguous, because a single view lacks the depth cues that stereo vision provides.
2. **Difficulty in Estimating Camera Intrinsics**: Inferring camera intrinsics (such as focal length) from a single image without strong geometric cues is very challenging, and errors here can cause significant geometric distortion.
3. **Effectiveness of Training Supervision**: Well-designed training supervision is crucial for the robustness and accuracy of the model, and the authors find existing methods insufficient in this regard.

### Solutions

1. **Affine-Invariant Point Map**: Unlike traditional scale-invariant point maps, the point map predicted by MoGe is agnostic to both global scale and shift. This removes the focal length-distance ambiguity from the supervision and makes network training more effective.
2. **Robust, Optimal, and Efficient Global Alignment Solver (ROE)**: To compute the global alignment parameters (scale and shift) between the prediction and the ground truth, the authors propose a robust, optimal, and efficient solver, which significantly improves training effectiveness and final accuracy.
3. **Multi-Scale Local Geometric Loss**: To strengthen local geometry learning, the authors introduce a multi-scale local geometry loss. It penalizes discrepancies within local regions of the 3D point cloud, with each region aligned independently by its own optimal affine alignment, which significantly improves the accuracy of local geometry predictions.

### Experimental Results

1. **Extensive Evaluation**: The authors conducted zero-shot evaluations on multiple unseen datasets, showing that MoGe significantly outperforms existing methods on all tasks: monocular estimation of 3D point map, depth map, and camera field of view.
2. **Performance Improvement**: Compared to the previous best methods, MoGe reduces error by over 35% in point map estimation, and by 20% to 30% in depth estimation and camera field of view estimation.

### Summary

By introducing affine-invariant point maps and carefully designed training supervision, MoGe addresses the problem of 3D geometry estimation from monocular open-domain images, demonstrating superior performance and strong generalization across tasks. The method is expected to serve as a powerful foundation model for monocular geometry problems, supporting applications such as 3D-aware image editing, image relighting, depth-to-image synthesis, novel view synthesis, and 3D scene understanding.
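
### Illustrative Sketch: Scale-and-Shift Alignment

The two supervision ideas summarized above (a global alignment of the predicted point map to the ground truth, and a local loss that aligns each neighborhood independently) can be illustrated with a small NumPy sketch. This is not the paper's ROE solver. It swaps in an ordinary closed-form least-squares fit, which is neither robust nor necessarily the objective the authors use, and it assumes a full 3D shift vector. The function names `align_scale_shift`, `aligned_error`, and `local_alignment_loss`, and the `anchor_ids` and `radius` parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t approximates target.

    Assumption: plain squared-error objective, used here only as a stand-in
    for the robust solver described in the paper.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pred_mean = pred.mean(axis=0)
    target_mean = target.mean(axis=0)
    pred_c = pred - pred_mean
    target_c = target - target_mean
    denom = (pred_c ** 2).sum()
    # Guard against a degenerate prediction where all points coincide.
    s = (pred_c * target_c).sum() / denom if denom > 1e-12 else 1.0
    t = target_mean - s * pred_mean
    return s, t


def aligned_error(pred, target):
    """Mean absolute error after the optimal scale-and-shift alignment."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    s, t = align_scale_shift(pred, target)
    return float(np.abs(s * pred + t - target).mean())


def local_alignment_loss(pred, target, anchor_ids, radius):
    """Average aligned error over spherical neighborhoods of the target points.

    Each neighborhood gets its own independent alignment, so the loss measures
    local shape error rather than global placement. Calling this with several
    radii would give a multi-scale variant.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    losses = []
    for i in anchor_ids:
        mask = np.linalg.norm(target - target[i], axis=1) <= radius
        if mask.sum() < 2:
            continue  # too few points to fit a scale and a shift
        losses.append(aligned_error(pred[mask], target[mask]))
    return float(np.mean(losses)) if losses else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.normal(size=(200, 3))
    # Ground truth differs from the prediction only by a global scale and shift.
    target = 2.5 * pred + np.array([0.1, -0.2, 3.0])
    print("scale, shift:", align_scale_shift(pred, target))
    print("global error:", aligned_error(pred, target))
    print("local error:", local_alignment_loss(pred, target, [0, 1, 2], radius=2.0))
```

Because the error is computed only after the best scale and shift have been removed, a prediction that differs from the ground truth by a global scale and shift incurs essentially zero loss, which is the sense in which the representation is affine-invariant.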