ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Bo Zhang, Jian Wang, Hui Ma, Bo Xu, Hongfei Lin

DOI: https://doi.org/10.1145/3581783.3611810

2023-08-02

Abstract: Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.

Computation and Language, Multimedia

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper aims to address the challenges of image-grounded dialogue generation in zero-resource scenarios. Specifically, the researchers propose an innovative multimodal framework named **ZRIGF** (Zero-Resource Image-Grounded Dialogue Generation Framework) to overcome the shortcomings of existing methods in handling the modality differences between images and text.

#### Main Issues:

1. **Modality Gap between Images and Text**: Existing models struggle to effectively integrate image information into dialogue generation, especially in zero-resource scenarios.
2. **Scarcity of Training Data in Zero-Resource Scenarios**: The lack of large-scale dialogue datasets naturally associated with images makes it difficult for models to generalize to new domains.
3. **Insufficient Model Generalization Ability**: Current methods mainly focus on using retrieved images to enhance response quality but overlook the generalization ability in zero-resource scenarios.

#### Solution:

To address these challenges, the authors propose a two-stage learning strategy: contrastive pre-training and generative pre-training. Through this strategy, ZRIGF can effectively integrate multimodal information and demonstrate good generalization ability in new, unlabeled domains.

- **Contrastive Pre-training**: Includes the Text-Image Matching Module (TIM) and Text-Assisted Masked Image Modeling (TAMIM) module, used to align different modality vectors.
- **Generative Pre-training**: Includes the Multimodal Fusion Module (MF) and Information Transfer Module (IT), used to generate meaningful responses based on the aligned multimodal representations.

In this way, ZRIGF can maintain good performance and generalization ability even without annotated data.
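To make the idea of mapping images and texts into a unified vector space more concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss over paired text and image embeddings. It is an illustration only: the summary does not specify the exact objective used by ZRIGF's text-image matching module, and the function names, the `temperature` value, and the toy embeddings are all assumptions made here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_matching_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched (text, image) pairs.

    Hypothetical helper: pair i in text_embs is assumed to match pair i in
    image_embs; every other combination in the batch is treated as a negative.
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    # Cosine similarity between every text and every image, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    text_to_image = diagonal_cross_entropy(logits)
    image_to_text = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (text_to_image + image_to_text)


# Toy 2-D embeddings (made up for illustration).
aligned = contrastive_matching_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
misaligned = contrastive_matching_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < misaligned)
```

The loss is near zero when each text embedding points in the same direction as its paired image embedding and grows when the pairs are swapped, which is the behavior a contrastive alignment objective of this kind relies on.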

VGDIFFZERO: Text-To-Image Diffusion Models Can Be Zero-Shot Visual Grounders

Open Domain Dialogue Generation with Latent Images

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

Zero-Resource Knowledge-Grounded Dialogue Generation

ChatZero: Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Engaging Live Video Comments Generation

ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Knowledge enhanced zero-resource machine translation using image-pivoting

Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System

Zero-Resource Neural Machine Translation with Multi-Agent Communication Game

Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Towards Unified Interactive Visual Grounding in The Wild

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Forward Creation, Reverse Selection: Achieving Highly Pertinent Multimodal Responses in Dialogue Contexts