MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
2024-12-05
Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating complex 3D scenes with accurate spatial relationships from a single image. Existing methods fall short in two ways:

1. **Reconstruction or retrieval techniques**: Methods that rely on these techniques often perform poorly on unseen scenes because of data scarcity.
2. **Multi-stage object-by-object generation**: Although these methods exploit the strong priors of pre-trained models, their complex multi-step processes are prone to error accumulation and lack global scene context, which can leave the generated objects misaligned with one another.

To address these problems, the paper proposes MIDI (Multi-Instance Diffusion for Single Image to 3D Scene Generation), a new paradigm that extends a pre-trained image-to-3D object generation model so that it generates multiple 3D instances with accurate spatial relationships simultaneously. The core innovation of MIDI is a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence during the generation process, without a complex multi-step process.

### Main contributions

1. **Establish a new paradigm**: Propose a multi-instance diffusion model that extends a pre-trained image-to-3D object generation model to generate spatially correlated 3D instances.
2. **Introduce a multi-instance attention mechanism**: This mechanism models cross-instance interactions, ensuring coherence and accurate spatial relationships.
3. **Experimental verification**: Experiments on synthetic datasets, real-world scene data, and stylized images generated by text-to-image diffusion models show that MIDI achieves state-of-the-art performance in image-to-scene generation.

### Method overview

MIDI works as follows:

- **Multi-instance diffusion model**: Extends the DiT modules of the original 3D object generation model so that they denoise the latent representations of multiple 3D instances simultaneously, and introduces a multi-instance attention mechanism to learn cross-instance interactions.
- **Multi-instance attention mechanism**: Integrates the features of all instances into the attention computation, so that each instance can attend to every other instance in the scene and better capture the relationships and spatial dependencies between objects.
- **Training process**: Extends the base model's loss function from single-object to multi-instance and trains on a combination of a small amount of scene-level data and single-object data, which preserves the generalization ability of the pre-trained model.

Through these innovations, MIDI improves the quality and accuracy of 3D scenes generated from a single image, especially for complex scenes and diverse inputs.
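
The multi-instance attention idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single-head layout, the weight matrices `w_q`, `w_k`, `w_v`, and the tensor shapes are all assumptions made for illustration. The only point it tries to show is the difference between ordinary per-instance self-attention, where each instance's tokens attend only to themselves, and an attention step in which the keys and values of all instances are pooled so that every instance can attend to every other one.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_instance_attention(tokens, w_q, w_k, w_v):
    # tokens: (n_instances, n_tokens, dim). Each instance attends only to
    # its own tokens, as in a single-object model (assumed baseline).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def multi_instance_attention(tokens, w_q, w_k, w_v):
    # Hypothetical sketch: queries stay per instance, but keys and values
    # from all instances are concatenated into one shared sequence.
    n, t, _ = tokens.shape
    q = tokens @ w_q                          # (n, t, d_head)
    k = (tokens @ w_k).reshape(n * t, -1)     # (n * t, d_head)
    v = (tokens @ w_v).reshape(n * t, -1)     # (n * t, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, t, n * t)
    return softmax(scores, axis=-1) @ v       # (n, t, d_head)


# Toy usage: 3 instances, 4 latent tokens each, feature width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = multi_instance_attention(x, wq, wk, wv)
print(out.shape)  # (3, 4, 8)
```

With a single instance the two functions give the same result, which mirrors the idea that the multi-instance model reduces to the base single-object model when no other instances are present. How MIDI actually arranges heads, conditioning, and layers is not specified in this summary.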