SPSD: Similarity-preserving self-distillation for video–text retrieval

Jiachen Wang, Yan Hua, Yingyun Yang, Hongwei Kou
DOI: https://doi.org/10.1007/s13735-023-00298-1
2023-09-03
International Journal of Multimedia Information Retrieval
Abstract: Most existing methods address cross-modal video–text retrieval either through coarse-grained similarity computation based on global representations or through fine-grained cross-modal interaction. The former loses information, while the latter is inefficient at inference time. Furthermore, the hierarchical features of transformers have not been fully exploited in cross-modal contrastive learning. In this paper, we propose a similarity-preserving self-distillation method (SPSD) that aligns video and text in both cross-granularity and cross-layer ways. For cross-granularity self-distillation, the fine-grained cross-modal similarity based on token-wise interaction between video and text is transferred to the coarse-grained similarity based on global video and text representations. To utilize the hierarchical features of deep video and text transformer encoders, we propose cross-layer self-distillation, in which the cross-modal similarity based on semantic features acts as a teacher that provides soft labels for similarity learning based on low-level features. In addition, we construct a hierarchical contrastive loss and a cross-granularity self-distillation loss at both the feature and semantic levels for training the transformer-based video and text encoders. SPSD makes full use of fine-grained cross-modal interaction and hierarchical transformer features by generating the distillation signals from the network itself during training. At retrieval inference, cross-modal similarity between video and text is computed from semantic-level global embeddings. SPSD achieves strong performance for video–text retrieval on the MSRVTT, ActivityNet and LSMDC datasets. Our code is available at https://github.com/Macro-1998/SPSD/.
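The cross-granularity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the pooling, the token-wise interaction, the temperature, or the exact loss, so mean pooling, max-over-video-tokens matching, a temperature of 0.1, and a KL-divergence loss are all assumptions made for illustration, and every function name below is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize a vector; guard against a zero norm.
    n = math.sqrt(dot(v, v)) or 1.0
    return [a / n for a in v]

def mean_pool(tokens):
    # Assumption: the global representation is the mean of the token features.
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def coarse_similarity(video_tokens, text_tokens):
    # Coarse-grained: cosine similarity between global embeddings.
    return dot(normalize(mean_pool(video_tokens)),
               normalize(mean_pool(text_tokens)))

def fine_similarity(video_tokens, text_tokens):
    # Fine-grained (assumed form): each text token takes its best-matching
    # video token, and the per-token maxima are averaged.
    v = [normalize(t) for t in video_tokens]
    w = [normalize(t) for t in text_tokens]
    return sum(max(dot(wt, vt) for vt in v) for wt in w) / len(w)

def similarity_matrix(videos, texts, sim_fn):
    # Rows index texts, columns index videos.
    return [[sim_fn(v, t) for v in videos] for t in texts]

def softmax(row, temperature):
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_sim, student_sim, temperature=0.1):
    # Mean KL(teacher || student) over rows. The teacher distribution is
    # used as a fixed soft label for the student similarity.
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_sim)

# Toy batch: two videos and two captions, each a list of 2-d token features.
videos = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.9]]]
texts = [[[1.0, 0.1]], [[0.9, 1.0], [1.0, 1.0]]]

fine = similarity_matrix(videos, texts, fine_similarity)      # teacher
coarse = similarity_matrix(videos, texts, coarse_similarity)  # student
loss = distillation_loss(fine, coarse)
```

Under the same assumptions, the cross-layer variant would reuse `distillation_loss` with the semantic-level similarity matrix as the teacher and the low-level similarity matrix as the student. Only `coarse_similarity` on semantic-level embeddings would be needed at inference, which matches the efficiency argument in the abstract.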
computer science, artificial intelligence, software engineering