A Lightweight and Effective Multi-View Knowledge Distillation Framework for Text-Image Retrieval

Yuxiang Song, Yuxuan Zheng, Shangqing Zhao, Shu Liu, Xinlin Zhuang, Zhaoguang Long, Changzhi Sun, Aimin Zhou, Man Lan
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650723
2024-01-01
Abstract: Large-scale dual-stream Vision-Language Pre-training (VLP) models provide an efficient solution for text-image retrieval tasks. Even so, their performance often falls short of the latest single-stream models, primarily because of limited fine-grained text-image interaction. Recent work tends to combine the two types of networks. Some methods adopt a retrieve-and-rerank strategy, but their performance improvements hinge largely on the single-stream encoder during inference. Other approaches use knowledge distillation to strengthen either the single-stream encoder or the dual-stream encoder beyond its previous capabilities. However, existing distillation techniques typically focus on a single knowledge type, neglecting the richer information available in the teacher model. To bridge this gap, we introduce a Lightweight and Effective Multi-View Knowledge Distillation approach, named LEMKD, for text-image retrieval. The method uses response-based, feature-based, and relation-based knowledge, transferring it from the single-stream encoder to the dual-stream encoder. We evaluate our approach on the widely used MS-COCO and Flickr30K datasets. Results show that LEMKD not only matches the performance of the most advanced single-stream models but also excels in dual-stream encoder performance among recent methods that integrate single-stream and dual-stream models.
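The three knowledge types named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give LEMKD's actual loss functions, so everything below is an assumption based on the generic meaning of the terms: the response view is taken as a KL divergence between teacher and student text-image score distributions, the feature view as a cosine distance between paired embeddings (assuming both encoders emit vectors of the same size), and the relation view as a match between in-batch similarity matrices. All function names, the temperature `tau`, and the weights are hypothetical.

```python
import math

def softmax(row, tau=1.0):
    """Temperature-scaled softmax over one row of scores."""
    m = max(row)
    exps = [math.exp((v - m) / tau) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def response_loss(student_scores, teacher_scores, tau=1.0):
    """Response view (assumed form): mean KL(teacher || student) over the
    rows of two text-image score matrices."""
    total = 0.0
    for s_row, t_row in zip(student_scores, teacher_scores):
        p_s, p_t = softmax(s_row, tau), softmax(t_row, tau)
        total += sum(t * math.log(t / s) for s, t in zip(p_s, p_t) if t > 0)
    return total / len(teacher_scores)

def feature_loss(student_feats, teacher_feats):
    """Feature view (assumed form): mean cosine distance between each
    student embedding and the teacher embedding of the same item."""
    dists = [1.0 - cosine(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(dists) / len(dists)

def relation_loss(student_feats, teacher_feats):
    """Relation view (assumed form): mean squared difference between the
    in-batch pairwise similarity matrices of student and teacher."""
    n = len(student_feats)
    total = 0.0
    for i in range(n):
        for j in range(n):
            s_sim = cosine(student_feats[i], student_feats[j])
            t_sim = cosine(teacher_feats[i], teacher_feats[j])
            total += (s_sim - t_sim) ** 2
    return total / (n * n)

def multi_view_kd_loss(student_scores, teacher_scores,
                       student_feats, teacher_feats,
                       weights=(1.0, 1.0, 1.0), tau=1.0):
    """Weighted sum of the three views; the weights are hypothetical."""
    w_resp, w_feat, w_rel = weights
    return (w_resp * response_loss(student_scores, teacher_scores, tau)
            + w_feat * feature_loss(student_feats, teacher_feats)
            + w_rel * relation_loss(student_feats, teacher_feats))
```

In this sketch the teacher inputs would come from the single-stream encoder and the student inputs from the dual-stream encoder. The combined loss is zero when the student reproduces the teacher's scores and features exactly, and grows as they diverge.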