Abstract: In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.

What problem does this paper attempt to address?

The paper proposes LLMGA, a generation assistant built on a Multimodal Large Language Model (MLLM), to help users generate and edit images. LLMGA uses the knowledge and reasoning capabilities of large language models to refine prompts during image generation, which gives it precise control over the Stable Diffusion (SD) model. Its key contributions and technical features are:

1. **Detailed Language Generation Prompts**: Unlike previous methods that use fixed-size embeddings to control SD, LLMGA produces detailed natural language prompts. This enhances the LLM's understanding of context and reduces noise in the generation prompts, which yields more refined and accurate image content and makes the network more interpretable.
2. **Comprehensive Dataset Construction**: To train LLMGA, the authors constructed a dataset covering four tasks: prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing.
3. **Two-Stage Training Scheme**: In the first stage, the MLLM is trained to grasp the properties of image generation and editing so that it can produce detailed prompts. In the second stage, SD is optimized to align with those prompts, which helps SD follow the complex instructions the MLLM provides.
4. **Reference Restoration Network**: During inpainting and outpainting, newly generated regions can differ from preserved regions in texture, brightness, and contrast. To reduce these disparities, the paper proposes a Diffusion-based Reference Restoration Network (DiffRIR).

Together, these components let LLMGA generate high-quality images and offer an interactive generation and editing experience in which users can design images flexibly and conveniently. LLMGA can also be integrated with external plugins to extend its range of applications.
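The control flow described in this summary, where a language model first expands a short request into a detailed text prompt and the image generator then consumes that text, can be sketched as follows. This is only an illustrative sketch, not LLMGA's actual interface: `assist`, `toy_refine`, `toy_generate`, and `GenerationResult` are hypothetical names, and the two toy functions are stubs that stand in for the MLLM and for Stable Diffusion so the example runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: neither type reflects LLMGA's real interface.
PromptRefiner = Callable[[str], str]      # plays the role of the MLLM
ImageGenerator = Callable[[str], bytes]   # plays the role of Stable Diffusion


@dataclass
class GenerationResult:
    user_request: str
    detailed_prompt: str   # plain text, so it can be read or edited by the user
    image: bytes


def assist(request: str, refine: PromptRefiner, generate: ImageGenerator) -> GenerationResult:
    """Refine a short request into a detailed text prompt, then generate from it."""
    detailed_prompt = refine(request)
    image = generate(detailed_prompt)
    return GenerationResult(request, detailed_prompt, image)


# Toy stubs (assumptions for illustration only).
def toy_refine(request: str) -> str:
    return f"{request}, highly detailed, soft natural lighting, sharp focus"


def toy_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # placeholder for real image bytes


result = assist("a cat on a sofa", toy_refine, toy_generate)
print(result.detailed_prompt)
```

Because the intermediate value is ordinary text rather than a fixed-size embedding, it can be logged, inspected, or edited before generation, which is the interpretability benefit the summary attributes to the approach.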

LLMGA: Multimodal Large Language Model based Generation Assistant
