Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization
Yupeng Hu, Liqiang Nie, Meng Liu, Kun Wang, Yinglong Wang, Xian-Sheng Hua
DOI: https://doi.org/10.1109/tip.2021.3090521
IF: 10.6
2021-01-01
IEEE Transactions on Image Processing
Abstract: Video moment localization, an important branch of video content analysis, has attracted extensive attention in recent years. It nevertheless remains in its infancy because of two challenges: cross-modal semantic alignment and localization efficiency. To address them, we present a cross-modal semantic alignment network. Specifically, we first design a video encoder that generates moment candidates, learns their representations, and models their semantic relevance. In parallel, we design a query encoder to understand diverse query intentions. We then introduce a multi-granularity interaction module to explore the semantic correlation between the two modalities in depth, so that the target moment can be localized on the basis of sufficient cross-modal semantic understanding. Moreover, we introduce a semantic pruning strategy that reduces cross-modal retrieval overhead and thereby improves localization efficiency. Experimental results on two benchmark datasets demonstrate the superiority of our model over several state-of-the-art competitors.
computer science, artificial intelligence; engineering, electrical & electronic
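
The coarse-to-fine idea summarized in the abstract (generate moment candidates, prune them with a cheap score, then rank the survivors with a finer-grained score) can be illustrated with a small sketch. This is not the authors' network. The function names, the mean pooling, the cosine scoring, the bidirectional word/clip matching, and the `max_len` and `keep` parameters are all assumptions chosen only to make the control flow concrete.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def generate_candidates(num_clips, max_len):
    """Enumerate (start, end) clip spans, end exclusive, up to max_len clips."""
    return [(s, e)
            for s in range(num_clips)
            for e in range(s + 1, min(s + max_len, num_clips) + 1)]

def fine_score(clips, word_vecs):
    """Hypothetical fine-grained score: average of word-to-clip and
    clip-to-word best matches, so irrelevant clips lower the score."""
    w2c = sum(max(cosine(c, w) for c in clips) for w in word_vecs) / len(word_vecs)
    c2w = sum(max(cosine(c, w) for w in word_vecs) for c in clips) / len(clips)
    return 0.5 * (w2c + c2w)

def localize(clip_feats, query_vec, word_vecs, max_len=4, keep=5):
    """Return the best (start, end) span using coarse pruning then fine ranking."""
    candidates = generate_candidates(len(clip_feats), max_len)
    # Coarse stage: one cosine per candidate against the sentence-level vector.
    coarse = sorted(
        candidates,
        key=lambda span: cosine(mean_pool(clip_feats[span[0]:span[1]]), query_vec),
        reverse=True,
    )
    # Pruning (assumed top-k rule): only `keep` candidates reach the costly stage.
    survivors = coarse[:keep]
    # Fine stage: word-level matching on the survivors only.
    return max(survivors, key=lambda span: fine_score(clip_feats[span[0]:span[1]], word_vecs))

# Toy usage: clips 2 and 3 each match one query word; all other clips are background.
clip_feats = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
word_vecs = [[0, 1, 0], [0, 0, 1]]
query_vec = [0, 1, 1]
best_span = localize(clip_feats, query_vec, word_vecs)
```

In this toy setting the fine stage is evaluated on at most `keep` spans rather than on every candidate, which is the kind of retrieval-overhead saving the pruning step is meant to provide.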