Abstract: Text-to-image (T2I) generation models have advanced significantly in recent years. However, interacting with these models effectively remains challenging for average users: it requires specialized prompt engineering knowledge, and the models cannot perform multi-turn image generation, which hinders a dynamic and iterative creation process. Recent work has attempted to equip Multi-modal Large Language Models (MLLMs) with T2I models to turn users' natural language instructions into images. This extends the output modality of MLLMs and, thanks to their strong multi-modal comprehension ability, improves the multi-turn generation quality of T2I models. However, many of these works struggle to identify the correct output modality and to generate coherent images accordingly as the number of output modalities increases and conversations grow longer. We therefore propose DialogGen, an effective pipeline that aligns off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn text-to-image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this need, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics, one measuring the model's ability to switch modalities and one measuring the coherence of the output images. Extensive experiments on DialogBen and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.

Non-Autoregressive Neural Dialogue Generation

Hagan: Hierarchical Attentive Adversarial Learning for Task-Oriented Dialogue System

Deep Reinforcement Learning for Dialogue Generation

Adversarial Learning for Neural Dialogue Generation

A Simple, Fast Diverse Decoding Algorithm for Neural Generation

Thinking Clearly, Talking Fast: Concept-Guided Non-Autoregressive Generation for Open-Domain Dialogue Systems

Non-Autoregressive Dialog State Tracking

Adaptive Bridge Between Training and Inference for Dialogue Generation

A Diversity-Promoting Objective Function for Neural Conversation Models

Multi-turn Dialogue Generation Using Self-attention and Nonnegative Matrix Factorization

Multiresolution Recurrent Neural Networks: an Application to Dialogue Response Generation

Diversifying Dialog Generation via Adaptive Label Smoothing

Dynamic Stochastic Decoding Strategy for Open-Domain Dialogue Generation

Neural Response Generation with Meta-Words

A Static and Dynamic Attention Framework for Multi Turn Dialogue Generation

Tree-Structured Neural Machine for Linguistics-Aware Sentence Generation

Context-Controlled Topic-Aware Neural Response Generation for Open-Domain Dialog Systems

Neural Response Generation with Dynamic Vocabularies

Knowledge-aware Attentive Wasserstein Adversarial Dialogue Response Generation

Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue System

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation