Distilled Dual-Encoder Model for Vision-Language Understanding

Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei

DOI: https://doi.org/10.48550/arXiv.2112.08723

2022-10-18

Abstract: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a trade-off in vision-language understanding (VLU) tasks: fusion-encoder models perform well but are inefficient at inference, while dual-encoder models are efficient but handle complex cross-modal interactions poorly. The authors propose DIDE, a knowledge distillation framework that strengthens the cross-modal interaction ability of a dual-encoder model (the student) by distilling knowledge from a fusion-encoder model (the teacher). The goal is to reach performance comparable to the fusion encoder while keeping the dual encoder's efficiency. The main contributions of the paper are:

1. **Proposing DIDE**: A knowledge distillation framework that lets a dual-encoder model learn more complex cross-modal interactions from a fusion-encoder model.
2. **Plug-in method**: The method can be applied to different vision-language tasks and is suitable for different model architectures.
3. **Experimental results**: DIDE performs well on various VLU tasks, retaining 96.9% to 99.9% of the teacher model's performance while running inference 4 times faster.
4. **Further analysis**: The analysis shows that cross-modal attention distillation is the key factor for success. Compared with distillation that uses only soft labels or other latent features, it brings a significant performance improvement.

Through these contributions, the paper shows how a dual-encoder model can perform better on complex vision-language understanding tasks while remaining efficient.
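The core idea, matching a student's cross-modal attention distributions to a teacher's, can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the token vectors serve directly as both queries and keys (no learned projections), and the use of KL divergence summed over the text-to-image and image-to-text directions is an assumption based on the description of the attention distributions above.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """One attention distribution over `keys` for each vector in `queries`,
    using scaled dot-product scores (assumption: no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_distillation_loss(teacher_attn, student_attn):
    """Mean KL(teacher || student) over all query rows."""
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)


def cross_modal_distillation_loss(t_text, t_image, s_text, s_image):
    """Sum of the text-to-image and image-to-text distillation terms.
    t_* are teacher token vectors, s_* are student token vectors."""
    text_to_image = attention_distillation_loss(
        cross_attention(t_text, t_image), cross_attention(s_text, s_image)
    )
    image_to_text = attention_distillation_loss(
        cross_attention(t_image, t_text), cross_attention(s_image, s_text)
    )
    return text_to_image + image_to_text


# Toy inputs: 2 text tokens and 3 image tokens, each a 2-dimensional vector.
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
teacher_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_text = [[0.0, 1.0], [1.0, 0.0]]
student_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

loss = cross_modal_distillation_loss(
    teacher_text, teacher_image, student_text, student_image
)
```

In this sketch the loss is zero when the student reproduces the teacher's cross-modal attention exactly and grows as the two sets of distributions diverge. In training it would be minimized with respect to the student's parameters, possibly alongside a task loss.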

VideoDistill: Language-aware Vision Distillation for Video Question Answering

Language-aware Visual Semantic Distillation for Video Question Answering

Towards Better Entity Linking with Multi-View Enhanced Distillation

ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder Via Self On-the-fly Distillation for Dense Passage Retrieval

Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

EVLM: An Efficient Vision-Language Model for Visual Understanding

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Dual-feature collaborative relation-attention networks for visual question answering

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

Modular dual-stream visual fusion network for visual question answering

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Towards More Unified In-context Visual Understanding

Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation

A multimodal attention fusion network with a dynamic vocabulary for TextVQA