MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, Yanwei Fu
2024-11-25
Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at https://github.com/ewrfcas/MVGenMaster/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key challenges in multi-view generation for Novel View Synthesis (NVS):

1. **Data Limitations**: Most existing works rely on large-scale synthetic datasets built mainly for object-centric 3D generation. Such datasets limit how well these methods transfer to complex scene-level NVS tasks.
2. **Lack of 3D Priors**: Many diffusion-based NVS methods rely heavily on 2D generation without integrating 3D priors. This limits their ability to maintain 3D consistency as they scale up, especially in out-of-domain (OOD) scenarios.
3. **Lack of Flexibility**: Existing NVS techniques usually cannot handle arbitrary reference and target views, so they fall back on cumbersome anchor-based iterative generation, dataset updates, and test-time optimization. No single method covers all downstream NVS requirements.

To solve these problems, the authors propose MVGenMaster, a diffusion-based framework that strengthens multi-view generation by introducing 3D priors. Its main contributions are:

- **Generalization**: By using a metric depth prior, MVGenMaster maintains multi-view consistency and generalizes robustly across different scenarios.
- **Flexibility**: MVGenMaster is a flexible multi-view diffusion model that handles various downstream NVS tasks with a variable number of target and reference views.
- **Scalability**: The authors collected MvD-1M, a large-scale multi-view dataset containing 1.6 million scenes, specifically for training MVGenMaster. All images come with metric depth to support geometric warping.

In addition, MVGenMaster introduces a training-free key-rescaling technique that addresses the attention dilution problem, enabling the model to generate multiple novel views in a single forward pass without iterative generation.
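To make the attention-dilution point concrete, here is a minimal sketch, not the authors' implementation. It assumes that key rescaling means multiplying the keys of reference-view tokens by a constant factor before the softmax; the summary does not specify the exact form, and the names `gamma` and `reference_attention_mass` are hypothetical. The sketch measures how much attention a single query places on reference-view keys as the number of target-view keys grows.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reference_attention_mass(query, ref_keys, tgt_keys, gamma=1.0):
    """Return the share of attention that `query` puts on reference keys.

    `gamma` is an assumed constant applied to reference keys only;
    gamma=1.0 corresponds to ordinary scaled dot-product attention.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [gamma * dot(query, k) * scale for k in ref_keys]
    logits += [dot(query, k) * scale for k in tgt_keys]
    weights = softmax(logits)
    return sum(weights[: len(ref_keys)])


# Toy setup: one reference key, and a growing number of target keys.
query = [1.0, 0.0]
ref_keys = [[2.0, 0.0]]
for n_targets in (10, 100):
    tgt_keys = [[1.0, 0.0]] * n_targets
    plain = reference_attention_mass(query, ref_keys, tgt_keys)
    rescaled = reference_attention_mass(query, ref_keys, tgt_keys, gamma=2.0)
    print(n_targets, round(plain, 4), round(rescaled, 4))
```

In this toy setup the reference share shrinks as target keys are added, and a factor above 1 recovers part of it without any retraining. Note that scaling a key only raises its logit when the query-key dot product is positive.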
These improvements allow MVGenMaster to perform strongly across benchmarks and achieve state-of-the-art NVS results. In summary, MVGenMaster improves the quality, consistency, and flexibility of multi-view generation by integrating 3D priors and training on large-scale data.
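The 3D priors are described as being warped using metric depth and camera poses. The sketch below shows the standard pinhole reprojection that such warping typically builds on; it is an illustration under stated assumptions, not the paper's pipeline. The function name `warp_pixel`, the `(fx, fy, cx, cy)` intrinsics layout, and the convention that `rotation` and `translation` map reference-camera coordinates to target-camera coordinates are all choices made here for the example.

```python
def warp_pixel(u, v, depth, intrinsics, rotation, translation):
    """Reproject one reference-view pixel into a target view.

    u, v        : pixel coordinates in the reference image
    depth       : metric depth at that pixel (same units as translation)
    intrinsics  : (fx, fy, cx, cy), assumed shared by both views
    rotation    : 3x3 nested list, reference-camera to target-camera
    translation : length-3 list, reference-camera to target-camera
    Returns (u', v', depth') in the target view, or None if the point
    falls behind the target camera.
    """
    fx, fy, cx, cy = intrinsics

    # 1. Unproject the pixel to a 3D point in the reference camera frame.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )

    # 2. Apply the relative rigid transform into the target camera frame.
    xt, yt, zt = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zt <= 0:
        return None

    # 3. Project back onto the target image plane.
    return (fx * xt / zt + cx, fy * yt / zt + cy, zt)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (100.0, 100.0, 50.0, 50.0)
print(warp_pixel(50.0, 50.0, 2.0, intrinsics, identity, [0.5, 0.0, 0.0]))
```

Because the depth is metric, the translation between cameras can be used directly without estimating an unknown scale factor. A full image warp would apply this per pixel and also resolve occlusions and holes, which the sketch omits.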