DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu

2024-07-03

Abstract: Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses multi-turn text-to-image generation within a Multi-modal Interactive Dialogue System (MIDS). Although Text-to-Image (T2I) generation models have made significant progress, ordinary users still find it hard to interact with them effectively: the models require specialized prompt engineering knowledge and cannot perform multi-turn image generation, which limits a dynamic and iterative creative process. The paper introduces DialogGen, a pipeline that aligns off-the-shelf Multi-modal Large Language Models (MLLMs) with T2I models to build a MIDS that handles multi-turn, multi-modal tasks in response to users' natural language instructions, covering image generation, image editing, and chat. DialogGen consists of three components: drawing prompt alignment, careful training data curation, and error correction. The paper also proposes DialogBen, a bilingual multi-modal dialogue benchmark for evaluating MIDS in terms of output modality correctness and multi-modal output coherence. DialogBen includes two evaluation metrics, one measuring the model's ability to switch modalities and one measuring the coherence of the output images. Experiments on DialogBen and a user study indicate that DialogGen produces correct output modalities and coherent multi-modal outputs more reliably than other state-of-the-art models. The authors hope that DialogBen can promote the construction of more powerful MIDS.
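To make the idea of output modality correctness concrete, here is a minimal sketch of how a modality-switching score could be computed over a multi-turn dialogue. It is an illustration under stated assumptions, not DialogBen's actual metric definition: the `Turn` class, the two-label modality set, and the `modality_switch_accuracy` function are all hypothetical names, and the score is simply the fraction of turns where the produced modality matches the expected one.

```python
from dataclasses import dataclass
from typing import List

# Assumption: each turn's output is either text (chat) or an image
# (generation or editing). The real benchmark may use other labels.
MODALITIES = {"text", "image"}


@dataclass
class Turn:
    expected: str   # modality the user's instruction calls for
    predicted: str  # modality the system actually produced


def modality_switch_accuracy(turns: List[Turn]) -> float:
    """Fraction of turns whose produced modality matches the expected one."""
    if not turns:
        raise ValueError("need at least one turn")
    for t in turns:
        if t.expected not in MODALITIES or t.predicted not in MODALITIES:
            raise ValueError(f"unknown modality in {t}")
    correct = sum(t.expected == t.predicted for t in turns)
    return correct / len(turns)


# Hypothetical four-turn dialogue: the system answers in text on the
# third turn although the user asked for an image edit.
dialogue = [
    Turn("image", "image"),
    Turn("text", "text"),
    Turn("image", "text"),
    Turn("image", "image"),
]
score = modality_switch_accuracy(dialogue)
print(score)
```

A per-turn accuracy like this only captures whether the right kind of output was produced; judging whether consecutive images stay coherent would need a separate measure, which this sketch does not attempt.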

MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Multimodal Dialogue Response Generation Based on Selective Attention and Gating Mechanisms

Engaging Live Video Comments Generation

What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Multi-modal Generation via Cross-Modal In-Context Learning

Modality-Balanced Models for Visual Dialogue

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Simple Dialogue System with AUDITED

DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset

A New Dialogue Response Generation Agent for Large Language Models by Asking Questions to Detect User's Intentions