Abstract: Video-text retrieval is a crucial task and a powerful application of multimedia data analysis that has attracted tremendous research interest. The core steps are feature representation and alignment, which together bridge the heterogeneous gap between videos and texts. Existing methods not only take advantage of multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods appear deficient in three respects: a) The semantic correlations between different modal features are not considered, which introduces irrelevant noise into the feature representations. b) The cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation. c) The training signal to optimize the semantic topic assignment for local alignment is missing. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between residuals belonging to the same topic are minimized. To optimize the assignments, a minimum entropy-based regularization term is proposed for training the overall framework. Experiments are conducted on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
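
The local alignment step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the softmax assignment over dot-product similarities, the cosine distance between per-topic residuals, the Shannon-entropy form of the regularizer, and all function and parameter names (`soft_assign`, `topic_residuals`, `local_alignment_distance`, `assignment_entropy`, `temperature`). It is not the authors' implementation.

```python
import numpy as np


def soft_assign(features, topics, temperature=1.0):
    """Softly assign each feature to the shared topics.

    features: (n, d), topics: (k, d) -> assignment weights (n, k), rows sum to 1.
    Assumption: softmax over dot-product similarity.
    """
    logits = features @ topics.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def topic_residuals(features, topics, assign):
    """Aggregate assignment-weighted residuals (feature - topic) per topic.

    Returns an L2-normalized (k, d) array, one residual vector per topic.
    """
    residual = features[:, None, :] - topics[None, :, :]   # (n, k, d)
    agg = (assign[:, :, None] * residual).sum(axis=0)      # (k, d)
    norm = np.linalg.norm(agg, axis=1, keepdims=True)
    return agg / np.maximum(norm, 1e-12)


def local_alignment_distance(video_feats, text_feats, topics):
    """Mean cosine distance between video and text residuals of the same topic."""
    v = topic_residuals(video_feats, topics, soft_assign(video_feats, topics))
    t = topic_residuals(text_feats, topics, soft_assign(text_feats, topics))
    return float(np.mean(1.0 - (v * t).sum(axis=1)))


def assignment_entropy(assign, eps=1e-12):
    """Mean Shannon entropy of the per-feature assignment distributions.

    Assumption: minimizing this term pushes each feature toward a single topic.
    """
    return float(-(assign * np.log(assign + eps)).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    topics = rng.normal(size=(4, 8))        # 4 shared topics, 8-dim features
    video = rng.normal(size=(10, 8))        # 10 encoded video features
    text = rng.normal(size=(6, 8))          # 6 encoded text features
    loss = local_alignment_distance(video, text, topics)
    reg = assignment_entropy(soft_assign(video, topics))
    print("alignment distance:", loss, "entropy regularizer:", reg)
```

In a training setup along these lines, the distance term and a weighted entropy term would be added to the global retrieval loss; the weighting between them is likewise not specified in the abstract.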

Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

Learning Video-Text Aligned Representations for Video Captioning

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

MGSGA: Multi-grained and Semantic-Guided Alignment for Text-Video Retrieval

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

Video-Language Alignment via Spatio-Temporal Graph Transformer

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Text-Video Retrieval with Global-Local Semantic Consistent Learning

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

A Hybird Alignment Loss for Temporal Moment Localization with Natural Language