Abstract: Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.

What problem does this paper attempt to address?

The paper addresses the problem of generating realistic real-world 3D scenes, together with their camera trajectories, from text. Existing 3D generation methods rely mainly on synthetic datasets with ground-truth 3D assets and predefined cameras, whereas real-world multi-view captures have complex, scene-specific camera trajectories and unbounded backgrounds, which makes realistic 3D scene generation very challenging. To this end, the paper proposes a framework called Director3D, built from three key components:

1. **Traj-DiT (Trajectory Diffusion Transformer) as the Cinematographer**: Generates dense-view camera trajectories from text descriptions. Camera parameters (intrinsics and extrinsics) are treated as temporal tokens, and the Transformer performs conditional denoising on the camera trajectory.
2. **GM-LDM (Gaussian-driven Multi-view Latent Diffusion Model) as the Decorator**: Performs image sequence diffusion conditioned on the text and on the cameras of a sparse-view subset of the trajectory, generating pixel-aligned, unbounded 3D Gaussians as an intermediate 3D representation. The model is fine-tuned from a 2D latent diffusion model, so it inherits strong image priors, and it is trained jointly on multi-view and single-view data to offset the limited diversity and quantity of real-world captures, which improves generalization.
3. **SDS++ Loss as the Detailer**: Improves the visual quality of the 3D Gaussians by backpropagating the novel SDS++ loss through images rendered from cameras randomly interpolated within the trajectory.

With these components, the paper reports that Director3D outperforms existing methods in real-world 3D scene generation.
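The refinement step above renders images from cameras randomly interpolated within the generated trajectory. The summary does not say how that interpolation is done, so the following is only a minimal sketch under stated assumptions: camera rotations are stored as unit quaternions in `(w, x, y, z)` order, rotations are blended with spherical linear interpolation (slerp), and positions are blended linearly. The function names `slerp` and `sample_interpolated_camera` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to follow the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: normalized linear interpolation is stable here.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def sample_interpolated_camera(quats, positions, rng):
    """Pick a random pair of adjacent trajectory cameras and blend between them.

    Assumption: the trajectory is an ordered list of N >= 2 camera poses.
    Returns the blended quaternion, blended position, segment index, and blend weight.
    """
    i = int(rng.integers(0, len(quats) - 1))  # segment index in [0, N - 2]
    t = float(rng.uniform(0.0, 1.0))          # blend weight within the segment
    q = slerp(quats[i], quats[i + 1], t)
    p = (1.0 - t) * np.asarray(positions[i], dtype=float) \
        + t * np.asarray(positions[i + 1], dtype=float)
    return q, p, i, t
```

In a refinement loop of this kind, each iteration would presumably draw one such pose, render the 3D Gaussians from it, and backpropagate the loss through the rendered image; the renderer and the SDS++ loss itself are outside the scope of this sketch.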

Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

GenXD: Generating Any 3D and 4D Scenes

DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

Comp4D: LLM-Guided Compositional 4D Scene Generation

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

V3D: Video Diffusion Models are Effective 3D Generators

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion