Abstract: Video-text retrieval is a crucial task with powerful applications in multimedia data analysis, and it has attracted tremendous research interest. The core steps are feature representation and alignment, which together bridge the heterogeneity gap between videos and texts. Existing methods not only take advantage of multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods are deficient in three respects: a) The semantic correlations between features of different modalities are not considered, which introduces irrelevant noise into the feature representations. b) Cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation. c) There is no training signal to optimize the semantic topic assignment for local alignment. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between residuals belonging to the same topic are minimized. To optimize the assignments, a minimum entropy-based regularization term is proposed for training the overall framework. Experiments are carried out on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
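
The local alignment idea described in the abstract (soft assignment to shared topics, per-topic residuals, and an entropy penalty on the assignments) can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the dot-product similarity, the softmax assignment, the squared Euclidean distance, and all function names (`assign`, `mean_entropy`, `topic_residuals`, `local_distance`) are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def assign(features, topics):
    """Soft-assign each feature to the shared topics (assumed: dot-product similarity)."""
    return [
        softmax([sum(f_d * t_d for f_d, t_d in zip(f, t)) for t in topics])
        for f in features
    ]

def mean_entropy(assignments, eps=1e-12):
    """Average Shannon entropy of the assignment distributions.
    Minimizing this value pushes each feature toward a single topic."""
    entropies = [-sum(p * math.log(p + eps) for p in a) for a in assignments]
    return sum(entropies) / len(entropies)

def topic_residuals(features, topics, assignments):
    """Per-topic sum of assignment-weighted residuals (feature minus topic center)."""
    dim = len(topics[0])
    residuals = [[0.0] * dim for _ in topics]
    for f, a in zip(features, assignments):
        for k, t in enumerate(topics):
            for d in range(dim):
                residuals[k][d] += a[k] * (f[d] - t[d])
    return residuals

def local_distance(video_residuals, text_residuals):
    """Sum of squared distances between video and text residuals of the same topic."""
    return sum(
        (v_d - t_d) ** 2
        for v, t in zip(video_residuals, text_residuals)
        for v_d, t_d in zip(v, t)
    )
```

Under these assumptions, a training objective would combine `local_distance` for matching video-text pairs with a weighted `mean_entropy` term, so that the assignments become sharper while same-topic residuals are pulled together.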

CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

UATVR: Uncertainty-Adaptive Text-Video Retrieval

RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning

Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

Contrastive Transformer Hashing for Compact Video Representation

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

Fine-grained Text-Video Retrieval with Frozen Image Encoders

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data