CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL

Mingliang Chen, Weimin Zhang, Yurui Ren, Ge Li
DOI: https://doi.org/10.1109/icip46576.2022.9897206
2022-01-01
Abstract: Video-text retrieval aims to retrieve the videos that correspond to a given text, and vice versa. Mainstream methods typically address this problem by learning a joint embedding space and then measuring the similarities between videos and texts. However, these methods lack the ability to represent detailed semantic information. We therefore first use three pre-trained models to construct video embeddings at different semantic levels, and then propose a Context-aware Hierarchical Transformer (CHT) model to encode the context information between these levels. More specifically, our model builds fine-grained hierarchical video embeddings at three semantic levels: global, objects, and actions. Attention-based contextual transformers establish the context interactions between the semantic levels. Experimental results on two benchmark video-text retrieval datasets demonstrate the superiority of our CHT model, and ablation studies confirm the effectiveness of the proposed model.
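As an illustration of the joint-embedding approach that the abstract attributes to mainstream methods, the following minimal sketch ranks videos against a text query by cosine similarity. It is not the CHT model: the function names `cosine` and `rank_videos`, the video ids, and the three-dimensional embeddings are all hypothetical and chosen only to show how similarity in a shared space yields a retrieval ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(text_emb, video_embs):
    """Return video ids sorted by descending similarity to the text embedding."""
    scores = {vid: cosine(text_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-written embeddings (assumed for illustration; not from the paper).
video_embs = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.8, 0.3],
    "video_c": [0.0, 0.2, 0.9],
}
text_emb = [0.2, 0.9, 0.2]

ranking = rank_videos(text_emb, video_embs)
print(ranking)
```

Video-to-text retrieval works the same way with the roles swapped: one video embedding is scored against a set of text embeddings in the same space.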