Abstract: As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, has begun to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a Stable Diffusion (SD)-based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and that it also raises the state-of-the-art performance by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
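
The bookkeeping problem behind the subject manager (the same subject must be recognised when it reappears after the user has switched to something else) can be illustrated with a minimal sketch. This is not the paper's method: in AutoStudio the subject manager is an LLM agent, whereas `SubjectRegistry`, `resolve`, and the example subjects below are hypothetical names and data chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SubjectRegistry:
    """Hypothetical store mapping subject names to stable IDs across turns."""

    ids: Dict[str, int] = field(default_factory=dict)
    descriptions: Dict[int, str] = field(default_factory=dict)

    def resolve(self, name: str, description: str = "") -> int:
        """Return the ID of a known subject, or register a new one."""
        key = name.strip().lower()
        if key not in self.ids:
            self.ids[key] = len(self.ids)
        subject_id = self.ids[key]
        if description:
            # Keep the first description so later turns reuse the same appearance.
            self.descriptions.setdefault(subject_id, description)
        return subject_id


registry = SubjectRegistry()
# Turn 1 introduces two subjects.
turn_1 = [registry.resolve("Alice", "a girl in a red coat"),
          registry.resolve("dog", "a small white dog")]
# Turn 2: the user switches to a different subject.
turn_2 = [registry.resolve("cat", "a grey cat")]
# Turn 3: the earlier subjects return and must map to the same IDs.
turn_3 = [registry.resolve("alice"), registry.resolve("Dog")]
```

Because `turn_3` resolves to the same IDs as `turn_1`, a downstream generator could reuse the stored descriptions for those subjects instead of inventing a new appearance for them.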

### What problems does this paper attempt to solve?

This paper aims to solve the **subject consistency problem** in **multi-round interactive image generation**. Although current text-to-image (T2I) generation models perform well when generating a single high-quality image, they struggle to generate a coherent series of images over a multi-round interaction. Maintaining the consistency of multiple subjects becomes particularly difficult when the user frequently switches subjects.

#### Main problems:

1. **Multi-subject consistency**: Across multiple rounds, the different subjects (such as people and objects) in the generated image sequence must remain consistent, with no missing subjects and no incorrect fusion of subjects.
2. **Flexible multi-round editing**: Users should be able to perform flexible editing operations across rounds, such as adding new elements or modifying existing ones, while the overall image stays coherent and consistent.
3. **Cross-round reference**: The user may refer to previously generated images or earlier conversation content, and these references must be understood and processed accurately.

### Solutions:

To solve the above problems, the authors propose a training-free multi-agent framework named **AutoStudio**. This framework contains four core components:

1. **Subject Manager**: Analyzes the user conversation, and identifies and manages each subject and its context.
2. **Layout Generator**: Generates fine-grained bounding boxes to control the position of subjects.
3. **Supervisor**: Provides suggestions for refining the layout so that the generated layout is reasonable.
4. **Drawer**: Generates high-quality images based on the Stable Diffusion model, using Parallel-UNet (P-UNet) and the subject-initialized generation method to enhance multi-subject consistency.

Through the collaboration of these components, AutoStudio can generate coherent, high-quality image sequences in multi-round interactions, improving multi-subject consistency and outperforming existing methods on the CMIGBench benchmark.

### Experimental results:

- **Quantitative evaluation**: On the CMIGBench benchmark, AutoStudio outperforms existing methods on metrics such as the average Fréchet Inception Distance (aFID) and the average character-character similarity (aCCS). The abstract reports improvements over the state of the art of 13.65% in aFID and 2.83% in aCCS.
- **Qualitative evaluation**: The visualization results show that AutoStudio can understand the user's natural-language instructions and generate highly consistent images.
- **Ablation experiment**: The ablations verify the effectiveness of each component, especially the key role of the supervisor, P-UNet, and the parallel subject initialization generation method in improving performance.

In conclusion, AutoStudio provides a strong solution for multi-round interactive image generation and addresses the key challenge of multi-subject consistency.
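
To make the layout generator and supervisor roles more concrete, here is a minimal rule-based sketch of a layout review. It is not AutoStudio's supervisor, which is an LLM agent. The `Box` type, the `review_layout` function, the 512×512 canvas, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box for one subject: (x0, y0) top-left, (x1, y1) bottom-right."""

    subject: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate (negative-size) boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    inter = Box("", max(a.x0, b.x0), max(a.y0, b.y0),
                min(a.x1, b.x1), min(a.y1, b.y1)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def review_layout(boxes: List[Box], width: float = 512, height: float = 512,
                  max_iou: float = 0.5) -> List[str]:
    """Return textual refinement suggestions (hypothetical rules, not the paper's)."""
    suggestions = []
    for box in boxes:
        if box.x0 < 0 or box.y0 < 0 or box.x1 > width or box.y1 > height:
            suggestions.append(f"move '{box.subject}' inside the canvas")
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            # Heavy overlap is one plausible cause of subjects being fused.
            if iou(a, b) > max_iou:
                suggestions.append(f"separate '{a.subject}' and '{b.subject}'")
    return suggestions


layout = [
    Box("girl", 40, 100, 260, 500),
    Box("dog", 60, 120, 270, 510),   # overlaps the girl almost entirely
    Box("cat", 400, 300, 560, 480),  # extends past the right edge
]
suggestions = review_layout(layout)
```

In this sketch the suggestions are plain strings, loosely mirroring the idea that a supervisor returns feedback for the layout generator to act on rather than editing the boxes itself.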

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation
