Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.

What problem does this paper attempt to address?

The problem this paper attempts to solve is generating complex 3D scenes with accurate spatial relationships from a single image. Existing methods have the following deficiencies when dealing with this problem:

1. **Reconstruction or retrieval techniques**: Methods relying on these techniques often perform poorly in unseen scenes due to data scarcity.
2. **Multi-stage object-by-object generation**: Although these methods utilize the strong prior knowledge of pre-trained models, their complex multi-step processes are prone to error accumulation and lack global scene context, resulting in possible alignment problems between generated objects.

To solve these problems, the paper proposes MIDI (Multi-Instance Diffusion for Single Image to 3D Scene Generation), a new paradigm that extends the pre-trained image-to-3D object generation model so that it can generate multiple 3D instances with accurate spatial relationships simultaneously. The core innovation of MIDI is a novel multi-instance attention mechanism, which captures the interactions between objects and spatial consistency during the generation process without a complex multi-step process.

### Main contributions

1. **Establish a new paradigm**: Propose a multi-instance diffusion model, extending the pre-trained image-to-3D object generation model to generate 3D instances with spatial correlations.
2. **Introduce a multi-instance attention mechanism**: This mechanism models cross-instance interactions, ensuring coherence and accurate spatial relationships.
3. **Experimental verification**: Experiments on synthetic datasets, real-world scene data, and stylized images generated by text-to-image diffusion models demonstrate the superior performance of MIDI in generating high-quality 3D scenes.

### Method overview

MIDI works as follows:

- **Multi-instance diffusion model**: Expand the DiT module in the original 3D object generation model to denoise the latent representations of multiple 3D instances simultaneously, and introduce a multi-instance attention mechanism to learn cross-instance interactions.
- **Multi-instance attention mechanism**: By integrating the features of all instances into the attention calculation, each instance can attend to the information of all other instances in the scene, thereby better capturing the relationships and spatial dependencies between objects.
- **Training process**: By expanding the loss function of the base model from single-object to multi-instance, and combining a small amount of scene-level data and single-object data for training, the generalization ability of the pre-trained model is maintained.

Through these innovations, MIDI significantly improves the quality and accuracy of generating 3D scenes from a single image, especially when dealing with complex scenes and diverse inputs.
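
The idea behind multi-instance attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head formulation, the tensor layout `(instances, tokens, channels)`, and the names `per_instance_attention`, `multi_instance_attention`, `wq`, `wk`, and `wv` are all assumptions made for illustration. The only point it shows is the difference between each instance attending to its own tokens and each instance attending to the tokens of every instance in the scene.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def per_instance_attention(x, wq, wk, wv):
    # x: (N, T, D). Each instance attends only to its own T tokens,
    # as in a single-object model applied to N objects independently.
    return attention(x @ wq, x @ wk, x @ wv)


def multi_instance_attention(x, wq, wk, wv):
    # Hypothetical sketch: flatten the N instances into one sequence of
    # N*T tokens so every query can attend to the tokens of all instances.
    n, t, d = x.shape
    flat = x.reshape(1, n * t, d)
    out = attention(flat @ wq, flat @ wk, flat @ wv)
    return out.reshape(n, t, -1)


# Toy usage: 3 instances, 4 latent tokens each, 8 channels (assumed sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out_single = per_instance_attention(x, wq, wk, wv)
out_multi = multi_instance_attention(x, wq, wk, wv)
```

In this sketch, perturbing the tokens of one instance leaves the per-instance output of the other instances unchanged, while the multi-instance output of every instance shifts, which is the cross-instance coupling the summary describes.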

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
