Spatial-temporal Graphs for Cross-modal Text2Video Retrieval
Xue Song, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
DOI: https://doi.org/10.1109/tmm.2021.3090595
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: Cross-modal text-to-video retrieval aims to find relevant videos given text queries, a capability that is crucial for various real-world applications. The key to addressing this task is to build a correspondence between video and text so that related samples from the two modalities can be aligned. Because a sentence contains both nouns and verbs, representing objects as well as their interactions, retrieving relevant videos requires a fine-grained understanding of video content: not only the semantic concepts (i.e., objects) but also the interactions between them. Nevertheless, current approaches mostly represent videos with aggregated frame-level features when learning the joint space and ignore object interactions, which usually results in suboptimal retrieval performance. To improve cross-modal video retrieval, this paper proposes a framework that models videos as spatial-temporal graphs in which nodes correspond to visual objects and edges correspond to the relations/interactions between objects. With these graphs, object interactions across frame sequences can be captured to enrich the video representations used for joint space learning. Specifically, a Graph Convolutional Network (GCN) is introduced to learn representations on the spatial-temporal graphs, encoding spatial-temporal interactions between objects, while BERT is introduced to encode the sentence dynamically according to its context for cross-modal retrieval. Extensive experiments verify the effectiveness of the proposed framework, which achieves promising performance on both the MSR-VTT and LSMDC datasets.
computer science, information systems, telecommunications, software engineering
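
To make the graph construction described in the abstract concrete, the following is a minimal sketch of a spatial-temporal graph over detected objects and a single graph-convolution step. It is not the authors' implementation. The edge rules (objects in the same frame fully connected, the same object slot linked across adjacent frames), the symmetric normalization commonly used with GCNs, the mean pooling into one video vector, and all function names are assumptions chosen for illustration.

```python
import math


def build_st_adjacency(num_frames, objs_per_frame):
    """Adjacency matrix (with self-loops) for a toy spatial-temporal graph.

    Assumed edge rules, not taken from the paper: objects within one frame
    are fully connected (spatial edges), and object slot k in frame t is
    linked to slot k in frame t + 1 (temporal edges).
    """
    n = num_frames * objs_per_frame
    adj = [[0.0] * n for _ in range(n)]
    for t in range(num_frames):
        base = t * objs_per_frame
        for i in range(objs_per_frame):
            for j in range(objs_per_frame):
                adj[base + i][base + j] = 1.0  # spatial edges and self-loops
            if t + 1 < num_frames:
                nxt = base + objs_per_frame + i
                adj[base + i][nxt] = adj[nxt][base + i] = 1.0  # temporal edge
    return adj


def normalize(adj):
    """Symmetric normalization: D^(-1/2) A D^(-1/2), with A including self-loops."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj_norm, feats, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(feats):
    """Average node features into a single video-level vector (an assumption)."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


# Toy example: 2 frames, 2 objects per frame, 2-dimensional object features.
adj = build_st_adjacency(num_frames=2, objs_per_frame=2)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability only
video_vec = mean_pool(gcn_layer(normalize(adj), feats, weight))
```

In a retrieval setting, a vector such as `video_vec` would be compared with a sentence embedding in the joint space; the text encoder and the training objective are outside the scope of this sketch.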