Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text

Xinyang Li, Zhangyu Lai, Linning Xu, Yansong Qu, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji
2024-06-25
Abstract: Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of generating realistic 3D scenes, together with their camera trajectories, from text using real-world data. Existing 3D generation methods mainly rely on synthetic datasets with predefined 3D assets and camera setups. Real-world multi-view captures, by contrast, have complex, scene-specific camera trajectories and unbounded backgrounds, which makes generating realistic 3D scenes from them very challenging. To this end, the paper proposes a framework called Director3D, built from three key components:

1. **Traj-DiT (Trajectory Diffusion Transformer) as the Cinematographer**: Generates dense-view camera trajectories from text descriptions. Camera parameters (intrinsic and extrinsic) are treated as temporal tokens, and the Transformer performs conditional denoising on the camera trajectory.
2. **GM-LDM (Gaussian-driven Multi-view Latent Diffusion Model) as the Decorator**: Performs image sequence diffusion conditioned on sparse-view subsets of the camera trajectory, generating pixel-aligned, unbounded 3D Gaussians as an intermediate 3D representation. The model is fine-tuned from a 2D latent diffusion model, so it inherits strong image priors, and it is trained jointly on multi-view and single-view data to mitigate the diversity and scarcity of real-world captures, which improves generalization.
3. **SDS++ Loss as the Detailer**: Improves the visual quality of the 3D Gaussians by backpropagating the novel SDS++ loss from images rendered at cameras randomly interpolated within the trajectory.

With these components, Director3D outperforms existing methods at generating realistic 3D scenes from text.
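The camera sampling used in the refinement step can be illustrated with a small sketch. The summary only says that cameras are "randomly interpolated within the trajectory", so the scheme below is an assumption rather than the authors' method: rotations are given as unit quaternions in (w, x, y, z) order and interpolated with spherical linear interpolation, and camera positions are interpolated linearly between adjacent trajectory cameras. The function names `slerp` and `sample_interpolated_camera` are hypothetical and do not come from the paper or its code.

```python
import math
import random


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1 = [-b for b in q1]
        dot = -dot
    if dot > 0.9995:  # nearly identical rotations: normalized lerp is stable
        q = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        w1 = math.sin(t * theta) / math.sin(theta)
        q = [w0 * a + w1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def sample_interpolated_camera(rotations, positions, rng):
    """Sample a pose on a random segment between adjacent trajectory cameras.

    Assumption: uniform choice of segment and of the blend factor t.
    """
    i = rng.randrange(len(rotations) - 1)  # segment between camera i and i + 1
    t = rng.random()
    rotation = slerp(rotations[i], rotations[i + 1], t)
    position = [
        (1.0 - t) * a + t * b
        for a, b in zip(positions[i], positions[i + 1])
    ]
    return rotation, position
```

In a refinement loop of this kind, each iteration would call `sample_interpolated_camera` with a seeded `random.Random`, render the 3D Gaussians from the returned pose, and compute the loss on that rendering. Rendering and the SDS++ loss itself are outside the scope of this sketch.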