Abstract: Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. The project page is https://zhenyuw16.github.io/GenArtist_page.

What problem does this paper attempt to address?

The paper targets three main problems:

1. **Diverse and variable image generation and editing requirements**: Existing generation and editing methods fall short of users' diverse requirements, such as different objects and backgrounds or specific operations requested in text prompts. Different models have different strengths: general models may be weaker than fine-tuned models in some respects, but they perform better on unseen data.
2. **Handling complex problems**: Current models still struggle with complex tasks, such as long, intricate sentences in text-to-image generation or multi-step instructions in editing. Scaling up the model or fine-tuning can alleviate this, but because text is highly variable and flexible, there will always be complex problems that even well-trained models handle poorly.
3. **Model reliability**: Even carefully designed models inevitably fail in some cases, and generated images do not always match the content of the user's prompt. Existing models cannot independently evaluate whether a generated image is correct, let alone correct it themselves, which makes the generated images unreliable.

To address these challenges, the authors propose **GenArtist**, a unified image generation and editing system. Its key idea is to use a multimodal large language model (MLLM) as an AI agent. The agent analyzes the requirements in the user's instructions, decomposes complex problems into simpler sub-problems, and formulates a concrete solution by constructing a planning tree.

The system also supports position-aware tool execution: it automatically fills in missing position-related inputs and uses position information to select the most appropriate tool for each sub-problem. Through these mechanisms, GenArtist improves the reliability of execution, gives user instructions finer control over the resulting images, and handles image generation and editing tasks within one unified system.
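The planning, verification, and position-filling steps described above can be illustrated with a minimal Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the names `Tool`, `Node`, `solve`, `propose_box`, and `verify` are hypothetical, the tools are stand-in lambdas rather than real generation models, and in the real system the MLLM agent would play the roles of planner, position proposer, and verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Assumed layout format (x, y, width, height); the paper's format may differ.
Box = Tuple[int, int, int, int]


@dataclass
class Tool:
    name: str
    needs_box: bool                           # does the tool require a position input?
    run: Callable[[str, Optional[Box]], str]  # stand-in: returns a text description


@dataclass
class Node:
    task: str
    box: Optional[Box] = None
    children: List["Node"] = field(default_factory=list)


def solve(node: Node, tools: List[Tool], propose_box, verify, log: List[str]) -> bool:
    """Depth-first walk over the planning tree; leaves are the sub-problems."""
    if node.children:
        # Build a list first so every child runs even if an earlier one fails.
        return all([solve(c, tools, propose_box, verify, log) for c in node.children])
    for tool in tools:                        # try tools in order of preference
        box = node.box
        if tool.needs_box and box is None:
            box = propose_box(node.task)      # fill in the missing position input
        result = tool.run(node.task, box)
        if verify(node.task, result):         # step-by-step verification
            log.append(f"{node.task}: {tool.name} ok")
            return True
        # Self-correction here simply means falling back to the next tool.
        log.append(f"{node.task}: {tool.name} failed, trying next tool")
    return False


def demo() -> List[str]:
    flaky = Tool("text2img", needs_box=False, run=lambda task, box: "blurry")
    layout = Tool("layout2img", needs_box=True, run=lambda task, box: f"{task} at {box}")
    tree = Node(
        "a cat left of a dog",
        children=[Node("draw cat"), Node("draw dog", box=(60, 40, 30, 30))],
    )
    log: List[str] = []
    ok = solve(
        tree,
        [flaky, layout],
        propose_box=lambda task: (10, 40, 30, 30),
        verify=lambda task, result: task in result,
        log=log,
    )
    return log + [f"success={ok}"]


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In this toy run the first tool fails verification for both sub-problems, so the sketch falls back to the position-aware tool, proposing a box for the cat because none was supplied. The fallback-by-order policy is a deliberate simplification; the paper describes the agent choosing among tools and editing operations within the tree.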

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

LLMGA: Multimodal Large Language Model based Generation Assistant

LLMs Meet Multimodal Generation and Editing: A Survey

RealtimeGen: an Intervenable AI Image Generation System for Commercial Digital Art Asset Creators

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

ToolGen: Unified Tool Retrieval and Calling via Generation

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

CCA: Collaborative Competitive Agents for Image Editing

SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

Large-scale Text-to-Image Generation Models for Visual Artists' Creative Works

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Guiding Instruction-based Image Editing via Multimodal Large Language Models

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs