AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang
2024-06-11
Abstract: As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, begins to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. Our AutoStudio hereby can generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple turns well, and it also raises the state-of-the-art performance by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the **subject consistency problem** in **multi-turn interactive image generation**. Current text-to-image (T2I) models generate high-quality single images, but they struggle to produce a coherent sequence of images over a multi-turn interaction. Keeping multiple subjects consistent is especially difficult when the user switches subjects frequently.

#### Main problems:

1. **Multi-subject consistency**: Ensuring that the subjects (people, objects, and so on) stay consistent across the generated image sequence, with no subjects missing or incorrectly fused.
2. **Flexible multi-turn editing**: Letting users edit flexibly across turns, for example by adding new elements or modifying existing ones, while the overall image sequence stays coherent and consistent.
3. **Cross-turn reference**: The user may refer to previously generated images or earlier conversation content, and the system must interpret these references accurately.

### Solutions:

To solve these problems, the authors propose a training-free multi-agent framework named **AutoStudio**. The framework has four core components:

1. **Subject Manager**: Analyzes the user conversation, then identifies and manages each subject and its context.
2. **Layout Generator**: Generates fine-grained bounding boxes to control subject positions.
3. **Supervisor**: Suggests layout refinements so that the generated layout is reasonable.
4. **Drawer**: Generates high-quality images with a Stable Diffusion based model, using Parallel-UNet and the subject-initialized generation method to improve multi-subject consistency.

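
The four components listed above can be read as a per-turn loop. The following is a minimal sketch of that control flow only, not the authors' implementation: every class and function name here is hypothetical, the LLM agents and the diffusion model are replaced by trivial stubs, and the supervisor's refinement is collapsed into a single pass that insets each box.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; a hypothetical box format.
Box = Tuple[float, float, float, float]


@dataclass
class SubjectManager:
    """Stub for the LLM agent that tracks each subject's description."""
    subjects: Dict[str, str] = field(default_factory=dict)

    def update(self, turn: Dict[str, str]) -> List[str]:
        # A non-empty description adds or overwrites a subject; an empty
        # one means "reuse what was said about this subject earlier".
        for name, description in turn.items():
            if description:
                self.subjects[name] = description
            else:
                self.subjects.setdefault(name, name)
        return list(turn)


def layout_generator(names: List[str]) -> Dict[str, Box]:
    """Stub layout agent: place subjects side by side in equal columns."""
    n = len(names)
    return {name: (i / n, 0.25, (i + 1) / n, 0.75) for i, name in enumerate(names)}


def supervisor(layout: Dict[str, Box], margin: float = 0.02) -> Dict[str, Box]:
    """Stub supervisor: inset each box horizontally so neighbours do not touch."""
    return {
        name: (x0 + margin, y0, x1 - margin, y1)
        for name, (x0, y0, x1, y1) in layout.items()
    }


def drawer(layout: Dict[str, Box], subjects: Dict[str, str]) -> List[str]:
    """Stub drawer: returns text descriptions where a real system renders pixels."""
    return [
        f"{subjects[name]} at {tuple(round(v, 2) for v in box)}"
        for name, box in layout.items()
    ]


def run_dialogue(turns: List[Dict[str, str]]) -> List[List[str]]:
    """Run every turn through manager -> layout -> supervisor -> drawer."""
    manager = SubjectManager()
    frames = []
    for turn in turns:
        names = manager.update(turn)
        layout = supervisor(layout_generator(names))
        frames.append(drawer(layout, manager.subjects))
    return frames


# Turn 2 mentions the cat again without re-describing it.
frames = run_dialogue([
    {"cat": "a gray tabby cat"},
    {"cat": "", "dog": "a small white dog"},
])
for frame in frames:
    print(frame)
```

Because the subject manager keeps state across turns, the second frame reuses the cat's description from the first turn, which is the behaviour the consistency requirement asks for. Keeping each agent as a plain callable means a stub could in principle be swapped for an LLM call or a diffusion pipeline without changing the loop.
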
Working together, these components let AutoStudio generate coherent, high-quality image sequences over multi-turn interactions, improving multi-subject consistency and outperforming existing methods in benchmark evaluations.

### Experimental results:

- **Quantitative evaluation**: On the CMIGBench benchmark, AutoStudio outperforms existing methods on metrics such as average Fréchet Inception Distance (aFID) and average character-character similarity (aCCS). The abstract reports improvements of 13.65% and 2.83%, respectively.
- **Qualitative evaluation**: The visualized results show that AutoStudio understands the user's natural language instructions and generates highly consistent images.
- **Ablation experiments**: These verify the effectiveness of each component, in particular the supervisor, Parallel-UNet (P-UNet), and the subject-initialized generation method.

In conclusion, AutoStudio offers a training-free approach to multi-turn interactive image generation that addresses the key challenge of multi-subject consistency.
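
The summary names a character-character similarity metric without defining it. As a rough illustration of what such a consistency score can measure, the sketch below averages the pairwise cosine similarity between feature vectors of the same character across turns. This is an assumption made for illustration, not the CMIGBench definition of aCCS: the function names are hypothetical, and the vectors stand in for image embeddings that a real evaluation would extract with a vision model.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def character_consistency(embeddings: Dict[str, List[Sequence[float]]]) -> float:
    """Mean pairwise cosine similarity of each character's per-turn embeddings.

    `embeddings` maps a character name to one vector per turn in which
    that character appears. Characters seen only once contribute no pairs.
    """
    scores = [
        cosine(u, v)
        for vectors in embeddings.values()
        for u, v in combinations(vectors, 2)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy vectors: the cat looks the same in both turns, the dog changes completely.
score = character_consistency({
    "cat": [[1.0, 0.0], [1.0, 0.0]],
    "dog": [[1.0, 0.0], [0.0, 1.0]],
})
print(score)
```

A score near 1 means each character's appearance barely changes between turns, while a lower score indicates drift.
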