Abstract: Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address several key issues in multi-subject image synthesis:

1. **High Training Cost**: Existing multi-subject customization methods require separate training for each subject combination, leading to exponentially increasing training costs as the number of subjects increases.
2. **Low Success Rate**: Current algorithms have a low success rate when handling multiple subjects, especially as the number of subjects increases.
3. **Subject Interference**: Different subjects may interfere with each other, so that some subjects are missing from the synthesized image or attributes are confused (e.g., a cat taking on the features of a dog).

To tackle these issues, the paper proposes a new method called **Cones 2**, which achieves efficient and controllable multi-subject image synthesis through the following two main aspects:

1. **Efficient Subject Representation**: By learning a residual embedding to represent specific subjects, this method allows arbitrary combinations of different subjects without retraining the model.
2. **Spatial Layout Guidance**: Using layout as spatial guidance, it controls the positions of different subjects by adjusting the activation values in the cross-attention map, thereby reducing subject interference.

### Main Contributions

1. **Simple and Effective Subject Representation**: By learning a residual embedding to represent specific subjects, it supports arbitrary combinations without model fine-tuning.
2. **Spatial Layout Guidance**: Using layout as spatial guidance, it precisely controls the position of each subject and significantly reduces subject interference.
3. **High Performance**: Experimental results show that this method outperforms existing methods in various settings (including multi-subject customization), especially when generating images with six or more subjects.

### Method Overview

1. **Text-Conditional Diffusion Model**: Based on a pre-trained text-to-image diffusion model, it customizes specific subjects by fine-tuning the text encoder part.
2. **Residual Embedding for Subject Representation**: By calculating the difference between the outputs of the fine-tuned text encoder and the original text encoder, a residual embedding is obtained to represent specific subjects.
3. **Layout Guidance**: Using layout as spatial guidance, it controls the positions of different subjects by adjusting the activation values in the cross-attention map, reducing subject interference.

### Experimental Results

1. **Qualitative Comparison**: The generated images show that this method consistently produces high-quality images with two to four subjects, while other methods exhibit issues like subject omission and attribute confusion as the number of subjects increases.
2. **Quantitative Comparison**: Across multiple evaluation metrics (text alignment, visual similarity, storage space, and computational complexity), this method performs well in generating both single and multiple subjects.
3. **User Study**: User studies indicate that this method is the most preferred in multi-subject customization tasks, in terms of both image alignment and text alignment.
4. **Challenging Cases**: The method also shows advantages when generating a large number of subjects and subjects with high semantic similarity.

### Conclusion

Cones 2 addresses the issues of high training cost, low success rate, and subject interference in multi-subject image synthesis through efficient subject representation and spatial layout guidance, providing a new solution for multi-subject customized image generation.
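The residual embedding and layout guidance steps summarized above can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`residual_embedding`, `apply_residuals`, `rectify_attention`), the additive pre-softmax bias, and the `boost` and `suppress` parameters are all hypothetical, and the exact rectification rule and its schedule in the paper may differ.

```python
import numpy as np


def residual_embedding(tuned_outputs, base_outputs):
    """Mean difference between fine-tuned and original encoder outputs.

    Both inputs have shape (num_prompts, dim): the subject token's embedding
    under several prompts, from the tuned and the original text encoder.
    (Averaging over prompts is an assumption of this sketch.)
    """
    tuned = np.asarray(tuned_outputs, dtype=float)
    base = np.asarray(base_outputs, dtype=float)
    return (tuned - base).mean(axis=0)


def apply_residuals(prompt_embeddings, residuals):
    """Return a copy of (num_tokens, dim) embeddings with residuals added.

    `residuals` maps a token position to that subject's residual vector, so
    several subjects can be combined without touching the model weights.
    """
    shifted = np.array(prompt_embeddings, dtype=float)  # np.array makes a copy
    for token_index, residual in residuals.items():
        shifted[token_index] += residual
    return shifted


def rectify_attention(scores, layout, boost=1.0, suppress=1.0):
    """Bias pre-softmax cross-attention scores with a layout (hypothetical rule).

    scores: (num_pixels, num_tokens) raw attention scores.
    layout: maps a subject token position to a boolean (num_pixels,) mask.
    Inside its region a subject token is strengthened; outside it is weakened.
    """
    biased = np.array(scores, dtype=float)
    for token_index, mask in layout.items():
        mask = np.asarray(mask, dtype=bool)
        biased[mask, token_index] += boost
        biased[~mask, token_index] -= suppress
    # Numerically stable softmax over the token axis.
    biased -= biased.max(axis=1, keepdims=True)
    weights = np.exp(biased)
    return weights / weights.sum(axis=1, keepdims=True)
```

In this sketch, each subject contributes only one residual vector and one region mask, which is why subjects can be combined at inference time by adding more entries to `residuals` and `layout` rather than by retraining.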

Cones 2: Customizable Image Synthesis with Multiple Subjects

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

RealCustom++: Representing Images as Real-Word for Real-Time Customization

FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

Cones: Concept Neurons in Diffusion Models for Customized Generation

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Training-Free Consistent Text-to-Image Generation

Tuning-Free Image Customization with Image and Text Guidance

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

DreamTuner: Single Image is Enough for Subject-Driven Generation