ItrievalKD: an Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval

Zhen Liu, Yongxin Zhu, Zhujin Gao, Xin Sheng, Linli Xu
DOI: https://doi.org/10.1007/978-3-031-33380-4_20
2023-01-01
Abstract: Benefiting from pretraining on large-scale multi-modal data, current cross-modal pretrained models (such as CLIP) have shown excellent performance on text-to-image retrieval. However, current research mainly focuses on scenarios in which images and texts are strongly matched, a condition that does not always hold in practice. For example, in social media content or daily communication, the text is not always fully related to the image and may contain irrelevant content, which introduces non-negligible noise into text-to-image retrieval. This noisy multi-modal setting differs significantly from current cross-modal pretraining corpora, which may substantially degrade the retrieval performance of general image-text retrieval models. In this paper, we focus on the task of noisy text-to-image retrieval and propose an iterative retrieval framework that first retrieves the key-semantic information from the noisy text with knowledge distillation and then retrieves the relevant image from the image pool using the key-semantic clue. Experiments on the Noisy-MSCOCO and PhotoChat datasets confirm the superiority of the proposed iterative retrieval framework over general retrieval models on noisy text-to-image retrieval.
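The two-stage flow described in the abstract (first pick out the key-semantic text, then use it to query the image pool) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the names `retrieve_key_sentence`, `retrieve_image`, and `iterative_retrieve` are hypothetical, the bag-of-words `embed_text` stands in for a pretrained encoder such as CLIP, and the hand-written `relevance` scorer stands in for whatever model the paper trains with knowledge distillation (the distillation step itself is not modeled here).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]

# Toy vocabulary of "visual" words (assumption, for illustration only).
VOCAB = ["dog", "beach", "pizza"]


def embed_text(sentence: str) -> List[float]:
    """Hypothetical stand-in for a text encoder: counts of vocabulary words."""
    words = sentence.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def relevance(sentence: str) -> float:
    """Hypothetical stand-in for a learned key-semantic scorer."""
    return sum(embed_text(sentence))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_key_sentence(sentences: List[str],
                          scorer: Callable[[str], float]) -> str:
    """Stage 1: keep the sentence the scorer rates highest."""
    return max(sentences, key=scorer)


def retrieve_image(query: Vector, image_pool: Dict[str, Vector]) -> str:
    """Stage 2: return the pool image whose embedding best matches the query."""
    return max(image_pool, key=lambda name: cosine(query, image_pool[name]))


def iterative_retrieve(sentences: List[str],
                       scorer: Callable[[str], float],
                       encoder: Callable[[str], Vector],
                       image_pool: Dict[str, Vector]) -> Tuple[str, str]:
    """Run stage 1, then stage 2 on the selected key sentence."""
    key = retrieve_key_sentence(sentences, scorer)
    return key, retrieve_image(encoder(key), image_pool)


# A noisy, chat-like text: only the middle sentence describes the image.
noisy_text = [
    "how was your weekend",
    "we took the dog to the beach",
    "anyway see you monday",
]

# Toy image embeddings in the same space as embed_text (assumption).
image_pool = {
    "dog_beach.jpg": [1.0, 1.0, 0.0],
    "pizza.jpg": [0.0, 0.0, 1.0],
}

key_sentence, best_image = iterative_retrieve(
    noisy_text, relevance, embed_text, image_pool
)
```

In this toy setup the irrelevant sentences score zero in stage 1, so only the descriptive sentence reaches the image query in stage 2. In a real system both the scorer and the encoders would be learned models.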