Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at https://github.com/ewrfcas/MVGenMaster/.
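
To make "warped using metric depth and camera poses" concrete, here is a minimal sketch of the standard pinhole reprojection of one pixel from a reference view into a target view. It is not taken from the MVGenMaster code: the function name `warp_pixel`, the intrinsics tuple `(fx, fy, cx, cy)`, the assumption that both views share the same intrinsics, and the row-major 4x4 reference-to-target transform are all illustrative choices.

```python
def warp_pixel(u, v, depth, K, T_ref_to_tgt):
    """Reproject pixel (u, v) with metric depth from a reference view into a target view.

    K is a hypothetical (fx, fy, cx, cy) tuple shared by both cameras (an assumption).
    T_ref_to_tgt is a row-major 4x4 rigid transform from the reference camera frame
    to the target camera frame.
    Returns (u', v', z') in the target view, or None if the point is behind the camera.
    """
    fx, fy, cx, cy = K
    # Unproject the pixel to a 3D point in the reference camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    p = (x, y, depth, 1.0)
    # Move the point into the target camera frame.
    X, Y, Z = (sum(T_ref_to_tgt[r][c] * p[c] for c in range(4)) for r in range(3))
    if Z <= 0:
        return None  # the point lies behind the target camera
    # Project back to pixel coordinates with the pinhole model.
    return (fx * X / Z + cx, fy * Y / Z + cy, Z)


K = (500.0, 500.0, 320.0, 240.0)
# Target camera sits 0.5 m to the right of the reference camera,
# so points shift by -0.5 m along x in the target frame.
T = [
    [1.0, 0.0, 0.0, -0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
warped = warp_pixel(320.0, 240.0, 2.0, K, T)
```

Applying this to every pixel of a reference image (with occlusion handling) yields a warped image in the target view. Metric depth matters here because the translation in the pose is in real units, so a depth map that is only correct up to scale would place the warped pixels incorrectly.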

What problem does this paper attempt to address?

The paper targets several key challenges in Novel View Synthesis (NVS):

1. **Data Limitations**: Most existing works rely on large-scale synthetic datasets built mainly for object-centric 3D generation. Such datasets limit how well these methods apply to complex scene-level NVS tasks.
2. **Lack of 3D Priors**: Many diffusion-based NVS methods rely heavily on 2D generation without integrating 3D priors. This limits their ability to maintain 3D consistency while scaling up, especially in out-of-domain (OOD) scenarios.
3. **Lack of Flexibility**: Existing NVS techniques usually cannot handle arbitrary reference and target views, and so fall back on cumbersome anchor-based iterative generation, dataset updates, and test-time optimization. No single one of these methods covers all downstream NVS requirements.

To address these problems, the authors propose MVGenMaster, a diffusion-based framework that strengthens multi-view generation by introducing 3D priors. Its main contributions are:

- **Generalization**: By using a metric depth prior, MVGenMaster improves multi-view consistency and generalizes robustly across different scenarios.
- **Flexibility**: MVGenMaster is a flexible multi-view diffusion model that handles various downstream NVS tasks with a variable number of target and reference views.
- **Scalability**: The authors collected a large-scale multi-view dataset, MvD-1M, containing 1.6 million scenes, specifically for training MVGenMaster. All images come with metric depth to support geometric transformations.

In addition, MVGenMaster introduces a training-free key-rescaling technique that counters the attention dilution problem, enabling the model to generate multiple novel views in a single forward pass without iterative generation.

With these improvements, MVGenMaster performs strongly across a range of benchmarks and achieves state-of-the-art NVS results. In summary, MVGenMaster improves the quality, consistency, and flexibility of multi-view generation by integrating 3D priors and large-scale datasets.
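
The attention dilution problem and the idea behind key rescaling can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration, not the paper's implementation: it assumes that key rescaling means multiplying the keys of reference-view tokens by a scalar `gamma` greater than one before the softmax, and the helper names `softmax` and `reference_attention_mass` are hypothetical. The toy uses keys that all align positively with the query, which is the case where scaling a key raises its attention weight.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def reference_attention_mass(query, ref_keys, tgt_keys, gamma=1.0):
    """Fraction of one query's attention that lands on reference-view keys.

    gamma is a hypothetical rescaling factor applied only to reference keys;
    gamma=1.0 reproduces ordinary scaled dot-product attention.
    """
    d = len(query)
    scaled_refs = [[gamma * k for k in key] for key in ref_keys]
    keys = scaled_refs + tgt_keys
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(logits)
    return sum(weights[: len(ref_keys)])


query = [1.0, 0.0]
ref_keys = [[1.0, 0.0]]  # one reference-view token

# Dilution: the same reference token receives less attention as target tokens are added.
few_targets = reference_attention_mass(query, ref_keys, [[1.0, 0.0]] * 1)
many_targets = reference_attention_mass(query, ref_keys, [[1.0, 0.0]] * 9)

# Rescaling the reference key recovers part of that attention without any retraining.
rescaled = reference_attention_mass(query, ref_keys, [[1.0, 0.0]] * 9, gamma=2.0)
```

In this toy setup the reference token's share of attention drops from one half to one tenth as the number of competing target tokens grows from 1 to 9, and rescaling the reference key raises it again. Because the adjustment touches only the keys at inference time, it fits the "training-free" description in the summary above.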

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

MV-Adapter: Multi-view Consistent Image Generation Made Easy

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

MVDream: Multi-view Diffusion for 3D Generation

MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis

MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

Multi-View Unsupervised Image Generation with Cross Attention Guidance

V3D: Video Diffusion Models are Effective 3D Generators

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Vivid-ZOO: Multi-View Video Generation with Diffusion Model