DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu
2024-07-03
Abstract: Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses multi-turn text-to-image generation in a Multi-modal Interactive Dialogue System (MIDS). Although current Text-to-Image (T2I) generation models have made significant progress, ordinary users still find them hard to use: effective interaction requires specialized prompt engineering knowledge, and the models cannot perform multi-turn image generation, which limits a dynamic and iterative creative process. The paper introduces DialogGen, a pipeline that aligns off-the-shelf Multi-modal Large Language Models (MLLMs) with T2I models to build a MIDS that handles multi-turn, multi-modal tasks in response to users' natural language instructions, covering image generation, image editing, and chat. DialogGen consists of three components: drawing prompt alignment, careful training data curation, and error correction. The paper also proposes the Multi-modal Dialogue Benchmark (DialogBen), a bilingual benchmark for evaluating MIDS in terms of output modality correctness and multi-modal output coherence. DialogBen includes two evaluation metrics, one measuring the model's ability to switch modalities and one measuring the coherence of the output images. Experiments on DialogBen and a user study indicate that DialogGen outperforms other state-of-the-art models at producing the correct output modality and coherent multi-modal outputs. The authors hope that DialogBen will encourage the development of more powerful MIDS.
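The summary above does not define the two DialogBen metrics precisely. As a rough illustration of what an output-modality correctness score could look like, the sketch below assumes each dialogue turn is labeled with an expected output modality ("text" or "image") and scores the fraction of turns where the system chose the expected one. The function name `modality_switching_accuracy`, the label strings, and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def modality_switching_accuracy(
    expected: Sequence[str], predicted: Sequence[str]
) -> float:
    """Fraction of turns whose predicted output modality matches the expected one.

    This is an assumed, simplified scoring rule for illustration only,
    not the metric definition used by DialogBen.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one turn is required")
    # Count the turns where the system picked the expected modality.
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


# Hypothetical four-turn dialogue: the system should answer in text,
# draw an image, edit the image, then answer in text again.
expected = ["text", "image", "image", "text"]
predicted = ["text", "image", "text", "text"]  # third turn is a modality error

accuracy = modality_switching_accuracy(expected, predicted)
print(f"modality switching accuracy: {accuracy:.2f}")
```

In this made-up example the system replies with text when an edited image was expected, so three of the four turns are scored as correct. A coherence metric for the output images would need an image comparison model and is not sketched here.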