GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu
2024-10-28
Abstract: Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. Project page: https://zhenyuw16.github.io/GenArtist_page
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets three main problems:

1. **High Diversity and Variability of Image Generation and Editing Requirements**: Existing image generation and editing methods fall short of users' diverse requirements, such as requirements for different objects and backgrounds, or for specific operations described in text prompts. Different models have different strengths and focuses. General models may not be as powerful as fine-tuned models in some aspects, but they perform better on unseen data.
2. **Ability to Handle Complex Problems**: Current models still have difficulty with complex tasks, such as long and complex sentences in text-to-image tasks, or multi-step instructions in editing tasks. Scaling up the model or fine-tuning can alleviate this, but because text is highly variable and flexible, there will always be complex problems that even well-trained models cannot handle effectively.
3. **Model Reliability**: However carefully they are designed, models will inevitably fail in some cases. Generated images sometimes do not accurately match the content of the user's prompt. Existing models cannot evaluate the correctness of their own outputs, let alone self-correct, which makes the generated images unreliable.

To address these challenges, the authors propose **GenArtist**, a unified image generation and editing system. Its key innovation is using a multimodal large language model (MLLM) as an AI agent. The agent analyzes requirements from user instructions, decomposes complex problems into simpler sub-problems, and formulates a concrete solution by constructing a planning tree.
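
The decompose, execute, verify, and correct loop described above can be sketched as a small tree traversal. This is a minimal illustration under stated assumptions, not the authors' implementation: `SubProblem`, `execute`, the tool names, and the stub `run_tool` and `verify` callables are all hypothetical, and a real system would call generation models and an MLLM verifier in their place.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubProblem:
    """A node in a toy planning tree: one sub-problem and its candidate tools."""
    description: str
    candidate_tools: List[str]
    children: List["SubProblem"] = field(default_factory=list)


def execute(node: SubProblem,
            run_tool: Callable[[str, str], str],
            verify: Callable[[str, str], bool],
            trace: List[str]) -> bool:
    """Solve a node, then its children, verifying after every tool call."""
    solved = False
    for tool in node.candidate_tools:
        result = run_tool(tool, node.description)
        if verify(node.description, result):
            trace.append(f"{node.description}: {tool} ok")
            solved = True
            break
        # Verification failed: fall back to the next candidate tool.
        trace.append(f"{node.description}: {tool} failed")
    if not solved:
        return False
    return all(execute(child, run_tool, verify, trace) for child in node.children)


def demo() -> List[str]:
    """Run the traversal on a two-step plan with stubbed tools."""
    plan = SubProblem(
        "generate a cat on a sofa", ["text_to_image"],
        children=[SubProblem("make the cat black", ["instruction_edit", "inpaint"])],
    )
    # Stubs (assumptions): pretend the "instruction_edit" output fails verification.
    run_tool = lambda tool, task: f"{tool}({task})"
    verify = lambda task, result: not result.startswith("instruction_edit")
    trace: List[str] = []
    execute(plan, run_tool, verify, trace)
    return trace
```

Calling `demo()` returns a trace in which the first editing tool fails verification and the second candidate is tried instead, which is the self-correction behavior in miniature.
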
In addition, the system supports position-aware tool execution: it automatically completes missing position-related inputs and uses that position information to select the most appropriate tool for each sub-problem. Through these mechanisms, GenArtist improves the reliability of the generated results, gives user instructions finer control over the resulting images, and handles image generation and editing tasks within one unified system.
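
One way to picture the completion of missing position-related inputs is a tool library that declares which position inputs each tool needs, with the agent filling in whatever the user did not supply. The sketch below is an assumption-laden illustration only: `TOOL_INPUTS`, `prepare_inputs`, the tool names, the `(x, y, width, height)` box convention, and `stub_propose_boxes` are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed convention

# Hypothetical tool library: tool name -> position-related inputs it requires.
TOOL_INPUTS: Dict[str, List[str]] = {
    "text_to_image": [],
    "layout_to_image": ["boxes"],
    "object_removal": ["boxes"],
}


def prepare_inputs(tool: str,
                   inputs: Dict[str, object],
                   propose_boxes: Callable[[str], List[Box]]) -> Dict[str, object]:
    """Return a copy of `inputs` with any missing position inputs filled in."""
    completed = dict(inputs)
    for name in TOOL_INPUTS[tool]:
        if name not in completed:
            # Stand-in for the agent (or a detector) producing the missing layout.
            completed[name] = propose_boxes(str(completed["prompt"]))
    return completed


def stub_propose_boxes(prompt: str) -> List[Box]:
    """Placeholder layout proposal; a real system would derive boxes from the prompt."""
    return [(0, 0, 64, 64)]


filled = prepare_inputs(
    "layout_to_image", {"prompt": "a dog to the left of a cat"}, stub_propose_boxes
)
```

Here `filled` gains a `boxes` entry because `layout_to_image` declares that it needs one, while a call for `text_to_image` would leave the inputs unchanged.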