Abstract: Video-text retrieval is a crucial task with powerful applications in multimedia data analysis, and it has attracted tremendous research interest. The core steps are feature representation and alignment, which together bridge the heterogeneity gap between videos and texts. Existing methods not only exploit the multi-modal information in videos but also explore local alignment to improve retrieval accuracy. Although they perform well, these methods are deficient in three respects: a) The semantic correlations between features of different modalities are not considered, which introduces irrelevant noise into the feature representations. b) Cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation. c) There is no training signal to optimize the semantic topic assignment used for local alignment. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information among multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between residuals belonging to the same topic are minimized. To optimize the assignments, a minimum entropy-based regularization term is proposed for training the overall framework. Experiments are carried out on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
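The local alignment idea described in the abstract (soft assignment to shared topics, per-topic residuals, and an entropy penalty on the assignments) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product similarity, the softmax assignment, the squared Euclidean distance, and every function name below are assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def assign(features, topics):
    """Soft-assign each feature vector to the shared topic centers.
    Assumption: similarity is a plain dot product."""
    return [
        softmax([sum(a * b for a, b in zip(f, c)) for c in topics])
        for f in features
    ]

def topic_residuals(features, topics, weights):
    """For each topic, sum the assignment-weighted residuals (feature - center)."""
    residuals = []
    for k, center in enumerate(topics):
        r = [0.0] * len(center)
        for f, w in zip(features, weights):
            for d in range(len(center)):
                r[d] += w[k] * (f[d] - center[d])
        residuals.append(r)
    return residuals

def local_alignment_distance(video_res, text_res):
    """Sum of squared distances between video and text residuals of the same topic."""
    return sum(
        sum((v - t) ** 2 for v, t in zip(vr, tr))
        for vr, tr in zip(video_res, text_res)
    )

def mean_assignment_entropy(weights, eps=1e-12):
    """Average Shannon entropy of the assignment distributions.
    Adding this term to a loss pushes assignments toward a single topic."""
    return sum(-sum(p * math.log(p + eps) for p in w) for w in weights) / len(weights)

# Toy usage with hypothetical 2-D features and two shared topics.
topics = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.2, 0.8]]
text = [[1.0, 0.2]]
wv, wt = assign(video, topics), assign(text, topics)
dist = local_alignment_distance(
    topic_residuals(video, topics, wv), topic_residuals(text, topics, wt)
)
reg = mean_assignment_entropy(wv + wt)
print(dist, reg)
```

In a trainable model the topic centers and the encoders would be learned jointly, with the entropy term weighted against the retrieval loss; that weighting is likewise left unspecified here.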

Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

A cross-modal conditional mechanism based on attention for text-video retrieval

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval

Multiple cross-attention for video-subtitle moment retrieval

Spatial-temporal Graphs for Cross-modal Text2Video Retrieval

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Iterative graph attention memory network for cross-modal retrieval

Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval