ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Bo Zhang, Jian Wang, Hui Ma, Bo Xu, Hongfei Lin
DOI: https://doi.org/10.1145/3581783.3611810
2023-08-02
Abstract: Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.
Computation and Language, Multimedia
What problem does this paper attempt to address?
### The Problem the Paper Aims to Solve

This paper aims to address the challenges of image-grounded dialogue generation in zero-resource scenarios. Specifically, the researchers propose an innovative multimodal framework named **ZRIGF** (Zero-Resource Image-Grounded Dialogue Generation Framework) to overcome the shortcomings of existing methods in handling the modality differences between images and text.

#### Main Issues:

1. **Modality Gap between Images and Text**: Existing models struggle to effectively integrate image information into dialogue generation, especially in zero-resource scenarios.
2. **Scarcity of Training Data in Zero-Resource Scenarios**: The lack of large-scale dialogue datasets naturally associated with images makes it difficult for models to generalize to new domains.
3. **Insufficient Model Generalization Ability**: Current methods mainly focus on using retrieved images to enhance response quality but overlook the generalization ability in zero-resource scenarios.

#### Solution:

To address these challenges, the authors propose a two-stage learning strategy: contrastive pre-training and generative pre-training. Through this strategy, ZRIGF can effectively integrate multimodal information and demonstrate good generalization ability in new, unlabeled domains.

- **Contrastive Pre-training**: Includes the Text-Image Matching Module (TIM) and Text-Assisted Masked Image Modeling (TAMIM) module, used to align different modality vectors.
- **Generative Pre-training**: Includes the Multimodal Fusion Module (MF) and Information Transfer Module (IT), used to generate meaningful responses based on the aligned multimodal representations.

In this way, ZRIGF can maintain good performance and generalization ability even without annotated data.
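
To make the idea of mapping texts and images into one shared vector space more concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) matching loss. It is an illustration only: the summary above does not give ZRIGF's exact objective, so the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions, and the real framework would obtain embeddings from learned text and image encoders.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def contrastive_matching_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired text/image embeddings.

    Hypothetical helper: pair i of text_embs and image_embs is treated as the
    positive match, and every other combination in the batch as a negative.
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    n = len(texts)

    # sim[i][j] = scaled cosine similarity between text i and image j
    sim = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]
    sim_t = [list(col) for col in zip(*sim)]  # image-to-text view

    text_to_image = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    image_to_text = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return (text_to_image + image_to_text) / 2


# Toy batch: correctly paired embeddings versus the same embeddings mispaired.
aligned = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned loss:  {aligned:.6f}")
print(f"shuffled loss: {shuffled:.6f}")
```

In this toy batch the correctly paired embeddings yield a much lower loss than the mispaired ones, which is the pressure that pulls matching texts and images together in the shared space. A masked image modeling term and the generative stage would be separate components and are not sketched here.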