Hierarchical Bi-Directional Conceptual Interaction for Text-Video Retrieval

Wenpeng Han, Guanglin Niu, Mingliang Zhou, Xiaowei Zhang
DOI: https://doi.org/10.1007/s00530-024-01525-3
IF: 3.9
2024-01-01
Multimedia Systems
Abstract: Large pre-trained vision-language models (VLMs) used in text-video retrieval have demonstrated strong cross-modal image-text understanding. Existing works leverage VLMs to extract features and design fine-grained uni-directional interaction from text to video to enhance the model's visual understanding. However, the large cross-modal gap makes it difficult to fully match video-text mutual information through uni-directional cross-modal interaction alone. To this end, we propose a novel hierarchical bi-directional conceptual interaction (HBCI) method, which applies mutual attention over multi-granularity decoupled video-text features to enhance cross-modal alignment. First, we introduce text-guided attention to extract visual representations across hierarchical concepts, and decouple the multi-granularity features of video and text to find representation subspaces with maximal relevance to each other. Second, we construct an iterative bi-directional conceptual interaction (BCI) module to reason about semantic associations across the text and video modalities; it generates attention weights adaptively from the decoupled video-text concepts and projects them into the other modality to deepen cross-modal interaction. Finally, we apply cross-level similarity distillation to progressively propagate the knowledge-aware similarity. Extensive experiments show that the proposed HBCI consistently performs strongly across the MSR-VTT, DiDeMo and ActivityNet datasets.
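To make the idea of bi-directional interaction concrete, the following is a minimal sketch of iterated two-way cross-attention between a set of text concept vectors and a set of video concept vectors. It is not the authors' BCI module: the function names (`cross_attend`, `bidirectional_interaction`), the plain scaled dot-product attention, the residual update, and the `steps` parameter are all illustrative assumptions, and the learned projections, feature decoupling and similarity distillation described in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Scaled dot-product attention from `queries` to `context`.

    Each query vector is replaced by a weighted mix of the context vectors.
    Returns (attended vectors, attention weights per query).
    """
    dim = len(queries[0])
    attended, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(dim) for c in context])
        mixed = [sum(w * c[j] for w, c in zip(weights, context))
                 for j in range(dim)]
        attended.append(mixed)
        all_weights.append(weights)
    return attended, all_weights


def bidirectional_interaction(text, video, steps=2):
    """Hypothetical iterative two-way interaction (assumption, not the paper's BCI).

    At every step, text attends to video and video attends to text, both
    computed from the features of the previous step, then each side is
    updated with a residual connection.
    """
    for _ in range(steps):
        text_ctx, _ = cross_attend(text, video)    # text -> video direction
        video_ctx, _ = cross_attend(video, text)   # video -> text direction
        text = [[a + b for a, b in zip(t, c)] for t, c in zip(text, text_ctx)]
        video = [[a + b for a, b in zip(v, c)] for v, c in zip(video, video_ctx)]
    return text, video


if __name__ == "__main__":
    # Toy inputs: 2 text concepts and 3 video concepts, each of dimension 2.
    text = [[1.0, 0.0], [0.0, 1.0]]
    video = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
    new_text, new_video = bidirectional_interaction(text, video, steps=2)
    print(len(new_text), len(new_video))
```

A uni-directional variant would keep only the text-to-video call; the sketch differs in that both modalities are updated from each other at every step, which is the contrast the abstract draws.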