Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE

Yiying Yang, Fukun Yin, Jiayuan Fan, Xin Chen, Wanzhang Li, Gang Yu

2024-08-20

Abstract: As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input remains challenging because it is difficult to keep the extrapolated views produced by generative models consistent with one another. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we first warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, so we use the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at https://yiyingyang12.github.io/Scene123.github.io/.
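The warping step in the abstract can be illustrated with a small sketch. The abstract does not say how the warp is computed, so everything below is an assumption made for illustration: a pinhole camera, a known per-pixel depth map, a pure sideways camera translation, and the hypothetical names `warp_to_adjacent_view`, `focal`, and `baseline`. The returned hole mask marks the pixels that no source pixel lands on, which is roughly the region an inpainting model such as the MAE would be asked to fill.

```python
import numpy as np

def warp_to_adjacent_view(image, depth, focal, baseline):
    """Forward-warp `image` to a camera moved `baseline` units to the right.

    image: (H, W, C) array; depth: (H, W) array of positive depths;
    focal: focal length in pixels. All names are illustrative assumptions.
    Returns (warped_image, hole_mask), where hole_mask is True for target
    pixels that received no source pixel.
    """
    h, w = depth.shape
    warped = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)  # nearest depth seen so far per target pixel
    for y in range(h):
        for x in range(w):
            z = depth[y, x]
            # Sideways translation shifts a point left by disparity f * b / z.
            x_new = int(round(x - focal * baseline / z))
            # Keep only the nearest surface when several pixels collide.
            if 0 <= x_new < w and z < zbuf[y, x_new]:
                zbuf[y, x_new] = z
                warped[y, x_new] = image[y, x]
    hole_mask = np.isinf(zbuf)
    return warped, hole_mask
```

With a constant depth map every pixel shifts by the same amount, so the holes form a band along one image edge; with varying depth, holes also open up behind foreground objects, which is where inpainted content from different views is most likely to disagree.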

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the challenges of generating 3D scenes from a single image or text description while keeping the generated scenes consistent and realistic across viewpoints. Specifically, it proposes a new framework called Scene123, which combines video generation models with a consistency-enhanced Masked Autoencoder (MAE) to ensure consistency in the geometric structure and texture details of the generated 3D scenes.

#### Main Contributions

1. **Innovative Framework**: Proposes a new framework for generating 3D scenes from a single prompt (image or text), combining MAE with video generation models for the first time to ensure consistency and realism of the generated scenes.
2. **Consistency-Enhanced MAE**: Designs a consistency-enhanced MAE module that fills in invisible areas from new viewpoints by injecting global semantic information, ensuring consistency in surface representation.
3. **Video-Assisted 3D Perception Generation Refinement Module**: Introduces a video-assisted 3D perception generation refinement module that improves the detail and texture fidelity of scene reconstruction by drawing on the diversity and realism of video generation models.
4. **Experimental Validation**: Extensive experiments validate the effectiveness of Scene123, which outperforms existing methods in surface reconstruction accuracy, realism of reconstructed viewpoints, and texture fidelity.

### Summary

The paper addresses the problem of generating high-quality, consistent 3D scenes from a single input (image or text). It does so by combining a consistency-enhanced MAE, a neural radiance field, and video-assisted refinement, advancing generative AI for 3D scene generation.
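The abstract mentions a GAN-based loss that compares generated views against images from the video generation model, but it does not state which adversarial objective is used. As a rough illustration only, the sketch below uses the standard non-saturating logistic GAN loss. The assignment of roles (video frames as "real" samples, rendered views as "fake" samples) is one reading of the abstract, and the function names are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def discriminator_loss(real_logits, fake_logits):
    """Logistic discriminator loss.

    real_logits: discriminator scores for reference images (assumed here to
    be frames from the video generation model).
    fake_logits: discriminator scores for generated views (assumed here to
    be renderings from the radiance field).
    """
    # -log(sigmoid(real)) - log(1 - sigmoid(fake))
    return softplus(-real_logits).mean() + softplus(fake_logits).mean()

def generator_loss(fake_logits):
    """Non-saturating generator loss: -log(sigmoid(fake))."""
    return softplus(-fake_logits).mean()
```

In a setup like this, the generator loss would be added to the reconstruction losses that train the scene representation, so that rendered views are pushed toward the texture statistics of the reference frames.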

VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text

MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

GenXD: Generating Any 3D and 4D Scenes

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Patch-enhanced Mask Encoder Prompt Image Generation

VideoBooth: Diffusion-based Video Generation with Image Prompts

Generating 3D-Consistent Videos from Unposed Internet Photos

Diffusion-based Generation, Optimization, and Planning in 3D Scenes

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance