MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

Sen Li, Ruochen Wang, Cho-Jui Hsieh, Minhao Cheng, Tianyi Zhou

2024-05-24

Abstract: Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. To efficiently address these challenges, we develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, can progressively generate multiple objects with intricate planning and feedback control. MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning, while the exact size and location of each object are determined in each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. The multi-step process also allows human users to monitor the generation process and make preferred changes at any intermediate step via text prompts, thereby improving the human-AI collaboration experience. We collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan in generating multiple objects over baselines and its creativity when collaborating with human users. The code is available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty that text-to-image models have in generating images that contain multiple objects, particularly in handling spatial positions, relative sizes, overlaps, and attribute bindings. Existing text-to-image models, such as Stable Diffusion, perform poorly on these aspects. To address this, the paper develops MuLan, a training-free multimodal-LLM agent that generates multiple objects progressively, with detailed planning and feedback control. MuLan uses a large language model (LLM) to decompose the prompt into a sequence of sub-tasks, each of which generates only one object with Stable Diffusion, conditioned on the previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning; the exact size and position of each object are determined within each sub-task by the LLM and attention guidance. In addition, MuLan uses a vision-language model (VLM) to give feedback on the image generated in each sub-task, and the diffusion model is directed to regenerate the image if it violates the original prompt. Because generation proceeds in steps, users can monitor the process and make changes at any intermediate step through text prompts, which improves the human-AI collaboration experience. The paper evaluates MuLan on 200 prompts, collected from different benchmarks, that contain multiple objects with spatial relationships and attribute bindings. The results show that MuLan outperforms the baselines in generating multiple objects and exhibits creativity when collaborating with human users.
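The plan, generate, check, and regenerate loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `Canvas`, `progressive_generate`, `toy_plan`, `make_flaky_generator`, `toy_accept`, and the `max_attempts` parameter are all hypothetical names, the "image" is just a tuple of object names, and the three callables stand in for the LLM planner, the diffusion step, and the VLM checker.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Canvas:
    """Hypothetical stand-in for an image: the names of objects drawn so far."""
    objects: Tuple[str, ...] = ()


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],            # stand-in for the LLM planner
    generate: Callable[[str, Canvas], Canvas],   # stand-in for one diffusion sub-task
    accept: Callable[[Canvas, str], bool],       # stand-in for the VLM feedback check
    max_attempts: int = 3,                       # assumed retry budget, not from the paper
) -> Tuple[Canvas, List[Tuple[str, int]]]:
    """Add one object per sub-task, retrying a sub-task when the check fails."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    canvas = Canvas()
    attempts_log: List[Tuple[str, int]] = []
    for obj in plan(prompt):
        for attempt in range(1, max_attempts + 1):
            # Each sub-task is conditioned on what has already been generated.
            candidate = generate(obj, canvas)
            if accept(candidate, prompt):
                break
        # Keep the last candidate even if every attempt was rejected.
        canvas = candidate
        attempts_log.append((obj, attempt))
    return canvas, attempts_log


def toy_plan(prompt: str) -> List[str]:
    """Toy planner: split the prompt on ' and ' into one object per sub-task."""
    return [part.strip() for part in prompt.split(" and ")]


def make_flaky_generator() -> Callable[[str, Canvas], Canvas]:
    """Toy generator whose very first call draws the wrong thing."""
    calls = {"n": 0}

    def generate(obj: str, canvas: Canvas) -> Canvas:
        calls["n"] += 1
        drawn = "blob" if calls["n"] == 1 else obj
        return Canvas(canvas.objects + (drawn,))

    return generate


def toy_accept(canvas: Canvas, prompt: str) -> bool:
    """Toy checker: every drawn object must be mentioned in the prompt."""
    return all(name in prompt for name in canvas.objects)


final, log = progressive_generate(
    "a red cube and a blue ball", toy_plan, make_flaky_generator(), toy_accept
)
print(final.objects)
print(log)
```

In this toy run the first sub-task is rejected once and retried, while the second is accepted on its first attempt. A human-in-the-loop variant could replace `accept` with a function that asks the user, which corresponds to the interactive use the summary mentions.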

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

LLMGA: Multimodal Large Language Model based Generation Assistant

MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Large Multimodal Agents: A Survey

Mixture-of-Agents Enhances Large Language Model Capabilities