You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
2024-12-10
Abstract: Recent 3D generation models typically rely on limited-scale 3D "gold labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data: You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to train a multi-view diffusion (MVD) model on large-scale Internet video data, without explicit 3D geometry or camera pose annotations, in order to achieve open-world 3D content generation. Specifically, the authors aim to build a model that acquires 3D knowledge only by "watching" a large amount of video content, that is, "You See it, You Got it". To achieve this goal, they propose three components:

1. **Creation of a large-scale video dataset**: A data curation pipeline automatically filters a vast amount of Internet videos, discarding clips with multi-view inconsistencies or insufficient observations and keeping static scenes with sufficient multi-view coverage. The result is WebVi3D, a high-quality, diverse multi-view image dataset containing approximately 16 million video clips with a total duration of 4.41 years.
2. **Introduction of visual conditions**: To cope with the absence of explicit 3D geometry or camera pose annotations, the authors introduce a visual-condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. This lets the model be trained without relying on expensive pose annotations.
3. **Warping-based 3D generation framework**: The authors integrate See3D into a warping-based pipeline for high-fidelity 3D generation. The framework first constructs visual conditions for See3D, then iteratively refines the geometry of novel views; the generated images can finally be used for Gaussian splatting reconstruction or converted into meshes.
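As a rough sanity check on the dataset figures quoted above (320M frames, about 16M clips, 4.41 years of footage), the short calculation below derives the implied per-clip averages. It is only back-of-the-envelope arithmetic: the clip count is approximate, a 365-day year is assumed, and the variable names are illustrative rather than taken from the paper.

```python
# Back-of-the-envelope averages implied by the reported WebVi3D statistics.
# Assumptions: exactly 16M clips and a 365-day year (the source says "approximately").
NUM_CLIPS = 16_000_000
NUM_FRAMES = 320_000_000
TOTAL_YEARS = 4.41
SECONDS_PER_YEAR = 365 * 24 * 3600

frames_per_clip = NUM_FRAMES / NUM_CLIPS
seconds_per_clip = TOTAL_YEARS * SECONDS_PER_YEAR / NUM_CLIPS
frames_per_second = frames_per_clip / seconds_per_clip

print(f"frames per clip:   {frames_per_clip:.1f}")
print(f"seconds per clip:  {seconds_per_clip:.2f}")
print(f"frames per second: {frames_per_second:.2f}")
```

Under these assumptions each clip contributes 20 frames on average and lasts a little under nine seconds, which suggests the multi-view frames are sparsely sampled from the source video rather than taken at the native frame rate.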
### Formula Explanation

- **Time-dependent visual condition formula**:

  \[ C_t = \sqrt{\bar{\alpha}_t'}\,(1 - M)\,X_0 + \sqrt{1 - \bar{\alpha}_t'}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]

  where \(C_t\) is the "corrupted" video data after masking and noising; \(M\) is the mask matrix; \(X_0\) is the original multi-view observation; \(\bar{\alpha}_t'\) is the noise-schedule coefficient at time step \(t\) (it scales the signal, so the added noise has variance \(1 - \bar{\alpha}_t'\)); and \(\epsilon\) is standard Gaussian noise.

  \[ V_t = [\,W_t \cdot C_t + (1 - W_t) \cdot X_t;\ M\,] \]

  where \(V_t\) is the final visual condition, formed by mixing \(C_t\) with the noisy sample \(X_t\) and concatenating the mask \(M\); \(W_t\) is a weight that decreases monotonically with the time step.

With these components, the authors show that See3D achieves notable zero-shot and open-world generation capabilities on single-view and sparse-view reconstruction benchmarks, markedly outperforming models trained on expensive and limited 3D datasets. The model also naturally supports other image-conditioned 3D creation tasks, such as 3D editing, without further fine-tuning.
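The two visual-condition formulas in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the cosine form of \(\bar{\alpha}_t'\), the linear form of \(W_t\), the tensor layout `(views, channels, height, width)`, and the function names are all assumptions made for the example. The source only states that noise is time-dependent and that \(W_t\) decreases monotonically with the time step.

```python
import numpy as np


def alpha_bar(t, num_steps):
    """Hypothetical cosine noise-schedule coefficient: 1 at t=0, ~0 at t=num_steps."""
    return np.cos(0.5 * np.pi * t / num_steps) ** 2


def visual_condition(x0, xt, mask, t, num_steps, rng):
    """Sketch of V_t = [W_t * C_t + (1 - W_t) * X_t ; M].

    x0, xt: arrays of shape (N, C, H, W); mask: shape (N, 1, H, W),
    where 1 marks hidden pixels (they are zeroed by the factor 1 - M).
    """
    a = alpha_bar(t, num_steps)
    eps = rng.standard_normal(x0.shape)
    # C_t: masked clean views, scaled and corrupted with Gaussian noise.
    c_t = np.sqrt(a) * (1.0 - mask) * x0 + np.sqrt(1.0 - a) * eps
    # W_t: assumed linear weight, decreasing monotonically from 1 to 0 with t.
    w_t = 1.0 - t / num_steps
    mixed = w_t * c_t + (1.0 - w_t) * xt
    # Concatenate the mask along the channel axis.
    return np.concatenate([mixed, mask], axis=1)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 3, 4, 4))   # two clean views
xt = rng.standard_normal((2, 3, 4, 4))   # stand-in for the noisy sample X_t
mask = np.zeros((2, 1, 4, 4))
mask[1] = 1.0                            # second view is fully hidden

v_start = visual_condition(x0, xt, mask, t=0, num_steps=10, rng=rng)
v_end = visual_condition(x0, xt, mask, t=10, num_steps=10, rng=rng)
print(v_start.shape)
```

With these assumed schedules, the condition at `t=0` reduces to the masked clean views, while at the final step it is just \(X_t\) plus the mask channel. That matches the described behavior of a condition whose corrupted-video component carries less weight as \(W_t\) shrinks.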