Abstract: Recent 3D generation models typically rely on limited-scale 3D "gold-labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data: You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d

What problem does this paper attempt to address?

The paper asks how to train a multi-view diffusion (MVD) model on large-scale Internet video, without explicit 3D geometry or camera pose annotations, so that it can generate open-world 3D content. Specifically, the authors want a model that acquires 3D knowledge purely by "watching" large amounts of video, hence the slogan "You See it, You Got it". To achieve this, they propose the following:

1. **A large-scale video dataset**: A data curation pipeline automatically filters a vast pool of Internet videos, keeping only static scenes with sufficient multi-view observations. The result is WebVi3D, a high-quality, diverse multi-view image dataset of approximately 16 million video clips (320M frames) with a total duration of 4.41 years.
2. **A visual condition**: To cope with the absence of explicit 3D geometry and camera pose annotations, the authors introduce a visual-condition, a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. This lets the model train without expensive pose annotations.
3. **A warping-based 3D generation framework**: The authors integrate See3D into a new warping-based pipeline for high-fidelity 3D generation. The framework first constructs the visual conditions that See3D consumes, then iteratively refines the geometry of the novel views. The generated images can then be used for 3D Gaussian Splatting reconstruction or converted into meshes.
### Formula Explanation

- **Time-dependent visual condition**:

  \[ C_t = \sqrt{\bar{\alpha}_t'}\,(1 - M)\,X_0 + \sqrt{1 - \bar{\alpha}_t'}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]

  where \(C_t\) is the "corrupted" video data after masking and noising; \(M\) is the mask matrix (entries equal to 1 mark masked-out content, since \(X_0\) is multiplied by \(1 - M\)); \(X_0\) is the original multi-view observation; \(\bar{\alpha}_t'\) is the noise-schedule coefficient that scales the signal, so the variance of the added noise is \(1 - \bar{\alpha}_t'\); and \(\epsilon\) is standard Gaussian noise.

  \[ V_t = [\,W_t \cdot C_t + (1 - W_t) \cdot X_t;\ M\,] \]

  where \(V_t\) is the final visual condition, formed by blending \(C_t\) with \(X_t\) and concatenating the mask \(M\); \(W_t\) is a weight that decreases monotonically with the time step.

Through these methods, the authors demonstrate notable zero-shot and open-world generation capabilities for See3D on single-view and sparse-view reconstruction benchmarks, where it markedly outperforms models trained on costly and limited 3D datasets. The model also naturally supports other image-conditioned 3D creation tasks, such as 3D editing, without further fine-tuning.
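The two visual-condition formulas above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the authors' implementation. The function names `blend_weight` and `visual_condition`, the tensor layout `(views, channels, height, width)`, the linear form of \(W_t\), and concatenating the mask along the channel axis are all assumptions made for the example. The summary states only that \(W_t\) decreases monotonically with the time step, and it does not say how \(\bar{\alpha}_t'\) is derived, so that value is passed in directly.

```python
import numpy as np


def blend_weight(t: int, num_steps: int) -> float:
    """Hypothetical W_t: decreases linearly from 1 (t = 0) to 0 (t = num_steps).

    The source only says W_t decreases monotonically with t; the linear
    form is an assumption for illustration.
    """
    return 1.0 - t / num_steps


def visual_condition(x0, xt, mask, alpha_bar_cond, w_t, rng):
    """Compute C_t and V_t as defined in the formulas above.

    x0, xt : arrays of shape (N, C, H, W), clean views and current noisy sample
    mask   : array of shape (N, 1, H, W); 1 marks masked-out content
    alpha_bar_cond : the coefficient written as alpha-bar'_t in the text
    w_t    : blending weight W_t in [0, 1]
    """
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)

    # C_t = sqrt(a) * (1 - M) * X_0 + sqrt(1 - a) * eps
    c_t = (np.sqrt(alpha_bar_cond) * (1.0 - mask) * x0
           + np.sqrt(1.0 - alpha_bar_cond) * eps)

    # Blend the corrupted views with the current sample X_t.
    mixed = w_t * c_t + (1.0 - w_t) * xt

    # V_t = [mixed; M]. Concatenating on the channel axis is an assumption.
    v_t = np.concatenate([mixed, mask], axis=1)
    return c_t, v_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 3, 8, 8))   # four toy "views"
    xt = rng.standard_normal((4, 3, 8, 8))
    mask = np.zeros((4, 1, 8, 8))
    mask[1:] = 1.0                           # only the first view is visible
    w = blend_weight(t=250, num_steps=1000)
    c_t, v_t = visual_condition(x0, xt, mask, alpha_bar_cond=0.9, w_t=w, rng=rng)
    print(c_t.shape, v_t.shape)
```

In this sketch the condition carries the clean (lightly noised) visible content when `w_t` is close to 1 and falls back to the current sample `xt` as `w_t` approaches 0, which matches the blending described in the text.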

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Generating 3D-Consistent Videos from Unposed Internet Photos

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

V3D: Video Diffusion Models are Effective 3D Generators

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

The More You See in 2D, the More You Perceive in 3D

Ivs-Net: Learning Human View Synthesis from Internet Videos

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

GenXD: Generating Any 3D and 4D Scenes

Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model