Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

Xue Song, Jingjing Chen, Yu-Gang Jiang
DOI: https://doi.org/10.1145/3581783.3611940
2023-01-01
Abstract: Cross-modal text-to-video retrieval aims to find videos that are semantically related to a text query. Because video and text are distinct modalities, the main challenge is to build a correspondence between them so that relevant samples can be matched. A text typically contains several relatively complete semantic units, each composed of three primary components: subject, predicate and object (an SVO triplet). Correctly retrieving videos for texts therefore requires modeling video content in a similar way, in terms of objects and their relations. To model fine-grained visual relations, this paper proposes a Multi-Granularity Matching (MGM) framework that considers both fine-grained relation triplet matching and coarse-grained global semantic matching for text-to-video retrieval. Specifically, we represent videos as SVO triplet tracklets by extracting frame-level relation triplets and then associating these relations temporally across frames. Moreover, we design a transformer-based Bi-directional Fusion Block (BFB) to express each SVO triplet with a highly unified representation. The constructed SVO triplet tracklets provide a reasonable way to model fine-grained video content, enabling better alignment between videos and texts. Extensive experiments on three benchmark datasets, i.e., MSR-VTT, LSMDC and MSVD, demonstrate the effectiveness of the proposed method.
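The two ideas named in the abstract, grouping frame-level relation triplets into tracklets and combining fine-grained with coarse-grained matching, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the association rule or how the two scores are combined, so the label-equality rule, the `max_gap` parameter, the weighted sum, and the names `Triplet`, `build_tracklets` and `fused_score` are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical frame-level detection: one relation triplet seen in one frame.
Triplet = namedtuple("Triplet", ["frame", "subject", "predicate", "obj"])


def build_tracklets(triplets, max_gap=1):
    """Group frame-level triplets with identical labels into tracklets.

    Assumed rule: a triplet extends an existing tracklet when its
    (subject, predicate, object) labels match and it appears at most
    `max_gap` frames after that tracklet's last frame. Otherwise it
    starts a new tracklet.
    """
    tracklets = []    # each tracklet is a list of Triplet, ordered by frame
    latest = {}       # labels -> index of the most recent tracklet with them
    for t in sorted(triplets, key=lambda t: t.frame):
        key = (t.subject, t.predicate, t.obj)
        idx = latest.get(key)
        if idx is not None and t.frame - tracklets[idx][-1].frame <= max_gap:
            tracklets[idx].append(t)
        else:
            tracklets.append([t])
            latest[key] = len(tracklets) - 1
    return tracklets


def fused_score(global_sim, triplet_sim, alpha=0.5):
    """Assumed fusion: weighted sum of coarse and fine similarities."""
    return alpha * global_sim + (1.0 - alpha) * triplet_sim


detections = [
    Triplet(0, "man", "ride", "horse"),
    Triplet(1, "man", "ride", "horse"),
    Triplet(1, "dog", "chase", "ball"),
    Triplet(4, "man", "ride", "horse"),  # too far from frame 1: new tracklet
]
tracklets = build_tracklets(detections, max_gap=1)
```

In this toy input the first two "man ride horse" detections form one tracklet, while the detection at frame 4 starts another because the gap exceeds `max_gap`. A real system would presumably also use bounding boxes or appearance features for association, which the sketch leaves out.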