Self-expressive induced clustered attention for video-text retrieval
Jingxuan Zhu, Xiangjun Shen, Sumet Mehta, Timothy Apasiba Abeo, Yongzhao Zhan
DOI: https://doi.org/10.1007/s00530-024-01549-9
IF: 3.9
2024-12-01
Multimedia Systems
Abstract: Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy in videos caused by consecutive, similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, which leads to high computational and storage costs when they are employed for video-text retrieval. To solve these problems, we propose a new method, self-expressive induced clustered attention, for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on the video embedding to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that does not require a preset number of clusters; it captures well-defined fine-grained semantic features from the video embedding and reduces the redundancy of video-frame-level content. A self-expressive induced clustered attention model (SEICA) is then proposed, which enhances the quality of the video embedding while effectively reducing computational cost and saving storage resources. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet and DiDeMo, demonstrate that the retrieval performance of the proposed method is superior to that of related state-of-the-art methods, with lower consumption of computing and storage resources.
computer science, information systems, theory & methods
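
The abstract's core idea, attending over cluster-level video features rather than every frame, can be illustrated with a minimal sketch. This is not the paper's SEIC or SEICA: the greedy cosine-threshold grouping below is a simple stand-in for self-expressive clustering, and every function name, the threshold value, and the toy embeddings are assumptions chosen for illustration. It shows only why clustering redundant frames shrinks the frame-word attention computation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def group_consecutive_frames(frames, threshold=0.9):
    """Hypothetical stand-in for SEIC: merge consecutive frames while they
    stay similar to the running cluster centroid. The number of clusters
    is not fixed in advance; it falls out of the data and the threshold."""
    if not frames:
        return []
    clusters = [[frames[0]]]
    for frame in frames[1:]:
        if cosine(frame, mean(clusters[-1])) >= threshold:
            clusters[-1].append(frame)
        else:
            clusters.append([frame])
    return [mean(c) for c in clusters]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clustered_attention(words, centroids):
    """Scaled dot-product attention in which each word embedding (query)
    attends over cluster centroids (keys and values) instead of frames."""
    dim = len(centroids[0])
    scale = math.sqrt(dim)
    attended = []
    for w in words:
        scores = [sum(a * b for a, b in zip(w, c)) / scale for c in centroids]
        weights = softmax(scores)
        attended.append(
            [sum(wt * c[i] for wt, c in zip(weights, centroids)) for i in range(dim)]
        )
    return attended


# Toy data (assumed): five 2-D frame embeddings forming two near-duplicate runs.
frames = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]
words = [[1.0, 0.0], [0.0, 1.0]]

centroids = group_consecutive_frames(frames)
attended = clustered_attention(words, centroids)
print(len(frames), "frames ->", len(centroids), "centroids")
```

With this toy input the attention scores are computed over word-centroid pairs rather than word-frame pairs, so the cost scales with the number of clusters instead of the number of frames.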