Abstract: Text-to-image (T2I) generation models have advanced significantly in recent years. However, interacting with them effectively remains difficult for average users, because it requires specialized prompt engineering knowledge and the models cannot perform multi-turn image generation, which hinders a dynamic and iterative creation process. Recent work has paired Multi-modal Large Language Models (MLLMs) with T2I models to turn users' natural language instructions into images. This extends the output modality of MLLMs and improves the multi-turn generation quality of T2I models, thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works struggle to identify the correct output modality and to generate coherent images as the number of output modalities increases and conversations grow longer. We therefore propose DialogGen, an effective pipeline that aligns off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn text-to-image generation. It consists of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS grows, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this need, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It includes two evaluation metrics, one measuring the model's ability to switch modalities and one measuring the coherence of the output images. Our extensive experiments on DialogBen and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.

A multimodal dialogue system for improving user satisfaction via knowledge-enriched response and image recommendation

Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model

Read Key Points: Dialogue-Grounded Knowledge Points Generation with Multi-Level Salience-Aware Mixture

Engaging Live Video Comments Generation

Dual Semantic Knowledge Composed Multimodal Dialog Systems

Knowledge-aware Multimodal Dialogue Systems

A Survey on Multimodal Dialogue Systems: Recent Advances and New Frontiers

User Attention-guided Multimodal Dialog Systems

Multimodal Dialogue Response Generation Based on Selective Attention and Gating Mechanisms

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements

Forward Creation, Reverse Selection: Achieving Highly Pertinent Multimodal Responses in Dialogue Contexts

UniMF: A Unified Framework to Incorporate Multimodal Knowledge Bases Into End-to-End Task-Oriented Dialogue Systems

Multi-modal multi-hop interaction network for dialogue response generation

More to diverse: Generating diversified responses in a task oriented multimodal dialog system

Response Generation in Multi-Modal Dialogues with Split Pre-Generation and Cross-Modal Contrasting

Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements

I Know You Better: User Profile Aware Personalized Dialogue Generation

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

End-to-End Personalized Humorous Response Generation in Untrimmed Multi-Role Dialogue System

UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue Systems