Abstract: Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.

What problem does this paper attempt to address?

This paper introduces EMMA, an image generation model that targets a weakness of existing methods under multimodal conditions: when text is combined with another condition such as a reference appearance, existing text-to-image (T2I) diffusion models fail to balance the conditions and tend to favor one modality. EMMA is built on top of the state-of-the-art T2I diffusion model ELLA and integrates text with supplementary modal information through a Multi-modal Feature Connector that uses a special attention mechanism. All parameters of the original T2I diffusion model stay frozen and only some additional layers are trained, so EMMA accepts multimodal prompts without fine-tuning the base model while maintaining strong text control over the generated results. EMMA can also be combined with different existing diffusion models without additional training. Finally, the paper proposes a strategy for assembling learned EMMA modules to generate images conditioned on multiple modalities simultaneously, without training on mixed multimodal prompts.

Key contributions of EMMA include:

1. An integration mechanism for multimodal prompts that incorporates information from several modalities into the image generation process, making the model more flexible and more widely applicable.
2. Modular and efficient training: new conditions can be introduced quickly without retraining the base model.
3. Compatibility and adaptability: as a plug-and-play module, EMMA can be applied directly to existing and emerging models based on Stable Diffusion frameworks.
4. High fidelity and detail in the generated images under different control signals, with an architecture that handles multiple conditions and applications.

Through experiments, the paper demonstrates that EMMA performs robustly across modal conditions, showing its potential for advanced multimodal image generation from text and visual features with high quality and rich detail.
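The summary describes two ideas that a small example may make more concrete: injecting an extra modality into the text condition through attention while the base model stays untouched, and assembling separately trained modules at inference time. The sketch below is a toy illustration in plain Python, not the paper's implementation. The function names (`cross_attend`, `gated_connector`, `assemble`), the tanh gate, the absence of learned projections, and the summation used to combine modules are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query attends over the context vectors.

    Assumption: no learned query/key/value projections, to keep the toy small.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)])
    return out


def gated_connector(text_tokens, extra_tokens, gate):
    """Add attended extra-modality features to the text tokens as a gated residual.

    Hypothetical design: `gate` stands in for the trainable added layers.
    With gate = 0 the output equals the text-only condition, so a frozen
    base model would see exactly what it saw before.
    """
    attended = cross_attend(text_tokens, extra_tokens)
    g = math.tanh(gate)
    return [[t + g * a for t, a in zip(tok, att)]
            for tok, att in zip(text_tokens, attended)]


def assemble(text_tokens, modules):
    """Combine independently trained connectors by summing their residuals.

    Assumption: simple summation; `modules` is a list of (extra_tokens, gate).
    """
    result = [list(tok) for tok in text_tokens]
    for extra_tokens, gate in modules:
        injected = gated_connector(text_tokens, extra_tokens, gate)
        for r, tok, inj in zip(result, text_tokens, injected):
            for i in range(len(r)):
                r[i] += inj[i] - tok[i]
    return result


# Toy features: two text tokens, one "face" token, two "style" tokens.
text = [[1.0, 0.0], [0.0, 1.0]]
face = [[0.5, 0.5]]
style = [[-1.0, 2.0], [3.0, 0.0]]

conditioned = assemble(text, [(face, 1.0), (style, 0.5)])
```

The resulting `conditioned` tokens would take the place of the text-only condition passed to the frozen diffusion model. Because each module contributes only a residual, setting a gate to zero removes that modality without affecting the others, which is one way to read the plug-and-play property described above.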

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Diversified text-to-image generation via deep mutual information estimation

Emage: Non-Autoregressive Text-to-Image Generation

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Conditional Text Image Generation with Diffusion Models

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

On the Multi-modal Vulnerability of Diffusion Models

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement