Cross-Modality Knowledge Calibration Network for Video Corpus Moment Retrieval

Tongbao Chen, Wenmin Wang, Zhe Jiang, Ruochen Li, Bingshu Wang
DOI: https://doi.org/10.1109/tmm.2023.3316025
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract: Video corpus moment retrieval (VCMR) has recently become an active research topic. It aims to localize, within a video corpus, the video moment that is highly relevant to a given natural-language query. Existing methods struggle in two cases: when the visual and textual information in a video differ greatly from each other, and when redundant video content is semantically irrelevant to the query. Both cases make it hard for the model to identify the truly useful within- and cross-modality information. In this paper, we propose a novel Cross-Modality Knowledge Calibration Network (CKCN) to address these issues. Specifically, a dual calibration transformer module with improved multi-head attention simultaneously captures the within- and cross-modality features of the video's visual and textual modalities while automatically compressing redundant information. A query-dependent fusion module then uses the prior knowledge of the query to guide the fusion of the video's multi-modal information, further refining the more important modality features. Finally, a query-guided calibration transformer module with a well-designed learnable cell aligns the query and the video, forming a single joint representation for moment localization. In addition, we introduce transfer learning into VCMR for the first time to mitigate the shortage of labeled data. Extensive experiments on the widely used TVR and DiDeMo datasets show that CKCN achieves new state-of-the-art results, verifying its effectiveness.
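The abstract does not specify how the query-dependent fusion module is implemented. As a rough illustration of the general idea only (query-conditioned weighting of two modality streams), the sketch below scores each modality against the query and mixes per-clip features with the resulting softmax weights. The function name `query_dependent_fusion`, the mean pooling, and the scaled dot-product scoring are all assumptions made for this example, not details taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    """Average a list of feature vectors into one vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_dependent_fusion(visual, textual, query):
    """Hypothetical fusion: weight each modality by its relevance to the query.

    visual, textual: per-clip feature vectors (same number of clips, same dim).
    query: one pooled query vector of the same dimension.
    Returns the fused per-clip features and the (visual, textual) weights.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual must cover the same clips")
    scale = math.sqrt(len(query))
    # Assumed scoring rule: scaled dot product between the query and
    # the mean-pooled features of each modality.
    scores = [
        dot(query, mean_pool(visual)) / scale,
        dot(query, mean_pool(textual)) / scale,
    ]
    w_vis, w_txt = softmax(scores)
    fused = [
        [w_vis * v + w_txt * t for v, t in zip(v_row, t_row)]
        for v_row, t_row in zip(visual, textual)
    ]
    return fused, (w_vis, w_txt)


# Toy inputs: the query points along the visual feature direction.
visual = [[1.0, 0.0], [1.0, 0.0]]
textual = [[0.0, 1.0], [0.0, 1.0]]
query = [2.0, 0.0]
fused, (w_vis, w_txt) = query_dependent_fusion(visual, textual, query)
```

With these toy inputs the query aligns with the visual features, so the visual stream receives the larger weight. A learned model would replace the fixed scoring rule with trainable projections.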