Distilled Dual-Encoder Model for Vision-Language Understanding

Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
DOI: https://doi.org/10.48550/arXiv.2112.08723
2022-10-18
Abstract: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses an efficiency-accuracy trade-off in vision-language understanding (VLU) tasks: fusion-encoder models perform well but are slow at inference, while dual-encoder models are efficient but handle complex cross-modal interactions poorly. The authors propose DIDE, a knowledge distillation framework that strengthens the cross-modal interaction ability of a dual-encoder model (the student) by distilling knowledge from a fusion-encoder model (the teacher), so that the student approaches the teacher's performance while remaining efficient. The main contributions of the paper are:

1. **Proposing DIDE**: A knowledge distillation framework for dual-encoder models, used to learn more complex cross-modal interactions from a fusion-encoder model.
2. **Plug-in method**: The method can be applied to different vision-language tasks and is suitable for different model architectures.
3. **Experimental results**: DIDE performs well on various VLU tasks, retaining 96.9% to 99.9% of the teacher model's performance while increasing inference speed by 4 times.
4. **Further analysis**: The proposed cross-modal attention distillation is the key factor for success. It brings a significant performance improvement over distillation methods that use only soft labels or other latent features.

Together, these contributions show how a dual-encoder model can perform better on complex vision-language understanding tasks while remaining efficient.
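The idea of cross-modal attention distillation can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the student's text-to-image and image-to-text attention distributions are computed as a scaled dot-product softmax over the other modality's tokens, and that they are matched to the teacher's with a KL divergence. The toy feature vectors and all function names (`cross_modal_attention`, `distillation_loss`, `total_loss`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys):
    """Each query token (one modality) attends over all key tokens (the other modality).

    Assumption: scaled dot-product attention, one head, no learned projections.
    """
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_attn, student_attn):
    """Mean KL(teacher || student) over the query rows."""
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)


def total_loss(teacher_text, teacher_image, student_text, student_image):
    """Sum of the text-to-image and image-to-text distillation terms."""
    t2i = distillation_loss(
        cross_modal_attention(teacher_text, teacher_image),
        cross_modal_attention(student_text, student_image),
    )
    i2t = distillation_loss(
        cross_modal_attention(teacher_image, teacher_text),
        cross_modal_attention(student_image, student_text),
    )
    return t2i + i2t


# Toy features (hypothetical): 2 text tokens and 3 image patches, dimension 2.
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
teacher_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_text = [[0.5, 0.5], [0.2, 0.8]]
student_image = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

loss = total_loss(teacher_text, teacher_image, student_text, student_image)
print(f"cross-modal attention distillation loss: {loss:.4f}")
```

In this sketch the loss is zero when the student's cross-modal attention matches the teacher's and grows as the two diverge. The dual-encoder student would need these attention distributions only as a training signal, so its image and text encodings could still be pre-computed separately at inference time.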