EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
2024-06-13
Abstract: Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper introduces EMMA, an image generation model designed to balance multiple conditions when the prompt is multi-modal. Existing text-to-image (T2I) diffusion models tend to favor one modality over the others when given combined conditions such as text and a reference appearance. EMMA is built on the state-of-the-art T2I diffusion model ELLA and integrates text with supplementary modal information through a Multi-modal Feature Connector. It accepts multi-modal prompts without fine-tuning the original T2I diffusion model, whose parameters stay frozen, and it retains strong text control over the generated results. EMMA can also be combined with different existing diffusion models without additional training. The paper further proposes a strategy for assembling learned EMMA modules to generate images conditioned on multiple modalities simultaneously. Key contributions of EMMA include:
1. An integration mechanism for multi-modal prompts that brings various modal information into the image generation process, improving the model's flexibility and applicability.
2. Modular and efficient training: new conditions can be introduced quickly without retraining.
3. Compatibility and adaptability: as a plug-and-play module, EMMA can be applied directly to various existing and emerging models based on Stable Diffusion frameworks.
4. High fidelity and detail in generated images under different control signals, with an architecture that handles multiple conditions and applications.
Through experiments, the paper demonstrates that EMMA performs robustly across various modal conditions and shows its potential for high-quality, detail-rich multi-modal image generation guided by text and visual features.
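The summary above does not describe the internals of the Multi-modal Feature Connector or of the module-assembly strategy, so the following is only a toy sketch of the general idea, not the paper's method. It assumes a gated cross-attention adapter in which text tokens attend to tokens from one extra modality, and it assumes that several trained adapters can be combined by summing their residuals onto the text features. The names `ToyConnector`, `fuse`, and `gate`, along with all shapes and values, are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyConnector:
    """Hypothetical adapter (an assumption, not EMMA's actual design):
    text tokens attend to the tokens of one extra modality."""

    def __init__(self, dim, rng, gate):
        scale = 1.0 / math.sqrt(dim)
        self.w_q = random_matrix(dim, dim, rng, scale)
        self.w_k = random_matrix(dim, dim, rng, scale)
        self.w_v = random_matrix(dim, dim, rng, scale)
        self.gate = gate  # assumed scalar gate; 0.0 leaves text features unchanged

    def residual(self, text_tokens, modal_tokens):
        q = matmul(text_tokens, self.w_q)
        k = matmul(modal_tokens, self.w_k)
        v = matmul(modal_tokens, self.w_v)
        k_t = [list(col) for col in zip(*k)]  # transpose keys
        scale = math.sqrt(len(q[0]))
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        mixed = matmul(attn, v)
        return [[self.gate * x for x in row] for row in mixed]


def fuse(text_tokens, conditions):
    """Add the residual of every (connector, modal_tokens) pair to the text tokens.
    Summing residuals is an assumed way to combine separately trained adapters."""
    out = [row[:] for row in text_tokens]
    for connector, modal_tokens in conditions:
        res = connector.residual(text_tokens, modal_tokens)
        out = [[o + r for o, r in zip(o_row, r_row)] for o_row, r_row in zip(out, res)]
    return out


rng = random.Random(0)
dim = 8
text = random_matrix(4, dim, rng)   # stand-in for text-encoder output
face = random_matrix(3, dim, rng)   # stand-in for a reference-appearance encoding
style = random_matrix(5, dim, rng)  # stand-in for a second modality

face_conn = ToyConnector(dim, rng, gate=0.0)   # closed gate: text passes through
style_conn = ToyConnector(dim, rng, gate=0.5)  # open gate: modality is mixed in

text_only = fuse(text, [(face_conn, face)])
combined = fuse(text, [(face_conn, face), (style_conn, style)])
```

In this sketch the backbone never appears, because its weights are never modified: only the adapter matrices would be trained, which mirrors the frozen-backbone setup the abstract describes. A closed gate returns the text features unchanged, one plausible way to keep text control, while the fused output keeps the shape of the text tokens so the downstream model needs no changes.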