Abstract: Video-text retrieval is a crucial task and a powerful application of multimedia data analysis that has attracted tremendous research interest. Its core steps are feature representation and alignment, which together bridge the heterogeneous gap between videos and texts. Existing methods not only take advantage of multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods are deficient in three respects: a) the semantic correlations between features of different modalities are not considered, which introduces irrelevant noise into the feature representations; b) cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation; c) there is no training signal to optimize the semantic topic assignment used for local alignment. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between the residuals belonging to the same topic are minimized. To optimize the assignments, a minimum entropy-based regularization term is proposed for training the overall framework. Experiments are conducted on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
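The local alignment and entropy regularization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how assignments are computed, so the softmax over negative squared distances, the `temperature` parameter, the squared-distance alignment loss, and every function name below are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assign(features, centers, temperature=1.0):
    """Soft-assign each feature to every shared topic center.

    Assumption: softmax over negative squared Euclidean distances.
    """
    assignments = []
    for x in features:
        scores = [
            -sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / temperature
            for c in centers
        ]
        assignments.append(softmax(scores))
    return assignments


def topic_residuals(features, centers, assignments):
    """Per-topic residual: weighted sum of (feature - center)."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for x, a in zip(features, assignments):
        for k, c in enumerate(centers):
            for d in range(dim):
                residuals[k][d] += a[k] * (x[d] - c[d])
    return residuals


def local_alignment_distance(video_res, text_res):
    """Squared distance between residuals of the same topic, summed over topics."""
    return sum(
        (v - t) ** 2
        for vr, tr in zip(video_res, text_res)
        for v, t in zip(vr, tr)
    )


def assignment_entropy(assignments, eps=1e-12):
    """Mean entropy of the assignment distributions (the quantity to minimize)."""
    total = 0.0
    for a in assignments:
        total -= sum(p * math.log(p + eps) for p in a)
    return total / len(assignments)


# Toy usage: two shared topics, two video features, two text features.
centers = [[0.0, 0.0], [1.0, 1.0]]
video = [[0.1, 0.0], [0.9, 1.0]]
text = [[0.0, 0.1], [1.0, 0.9]]

a_video = soft_assign(video, centers)
a_text = soft_assign(text, centers)
loss_local = local_alignment_distance(
    topic_residuals(video, centers, a_video),
    topic_residuals(text, centers, a_text),
)
loss_entropy = assignment_entropy(a_video) + assignment_entropy(a_text)
```

In this sketch, a low entropy means each feature commits to essentially one topic, which is one plausible reading of why a minimum-entropy term gives the assignments a training signal; how the two terms are weighted against the global retrieval loss is not stated in the abstract.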

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

TransVOS: Video Object Segmentation with Transformers

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Spatial-temporal Graphs for Cross-modal Text2Video Retrieval

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

Video Relation Detection with Spatio-Temporal Graph

Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Fine-grained Text-Video Retrieval with Frozen Image Encoders

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Context-Aware Hierarchical Transformer for Fine-Grained Video-Text Retrieval

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval