Abstract: Video-text retrieval is a crucial task in multimedia data analysis and has attracted tremendous research interest. Its core steps are feature representation and alignment, which together bridge the heterogeneity gap between videos and texts. Existing methods not only take advantage of the multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods remain deficient in three respects: (a) the semantic correlations between features of different modalities are not considered, which introduces irrelevant noise into the feature representations; (b) cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation; (c) there is no training signal to optimize the semantic topic assignment used for local alignment. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied to the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between video and text residuals belonging to the same topic are minimized. To optimize the assignments, a minimum entropy-based regularization term is proposed for training the overall framework. Experiments are conducted on the MSR-VTT, LSMDC, and ActivityNet Captions datasets. Our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
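The local alignment step described in the abstract can be illustrated with a small NumPy sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the softmax assignment over dot-product similarities, the assignment-weighted residual aggregation, the squared Euclidean distance, and all function names (`soft_assign`, `topic_residuals`, `local_alignment_distance`, `assignment_entropy`) are hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def soft_assign(features, topics, temperature=1.0):
    """Softly assign each feature to the shared topics (assumed: softmax of dot products).

    features: (n, d) array, topics: (k, d) array. Returns an (n, k) array whose rows sum to 1.
    """
    logits = features @ topics.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def topic_residuals(features, topics, assign):
    """Aggregate assignment-weighted residuals (feature - topic) per topic; returns (k, d)."""
    diff = features[:, None, :] - topics[None, :, :]  # (n, k, d) via broadcasting
    return (assign[:, :, None] * diff).sum(axis=0)


def local_alignment_distance(video_res, text_res):
    """Mean squared Euclidean distance between residuals of the same topic (assumed metric)."""
    return float(((video_res - text_res) ** 2).sum(axis=1).mean())


def assignment_entropy(assign, eps=1e-12):
    """Mean Shannon entropy of the per-feature assignment distributions.

    Minimizing this value pushes each feature toward a confident, near one-hot assignment.
    """
    return float(-(assign * np.log(assign + eps)).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    topics = rng.normal(size=(4, 8))   # 4 shared topics, 8-dim features (toy sizes)
    video = rng.normal(size=(10, 8))   # 10 encoded video features
    text = rng.normal(size=(6, 8))     # 6 encoded text features

    a_video = soft_assign(video, topics)
    a_text = soft_assign(text, topics)
    align = local_alignment_distance(
        topic_residuals(video, topics, a_video),
        topic_residuals(text, topics, a_text),
    )
    # The 0.1 weight on the entropy term is an arbitrary illustrative choice.
    reg = assignment_entropy(a_video) + assignment_entropy(a_text)
    print("local alignment:", align, "entropy regularizer:", reg)
```

In this sketch the entropy term is what provides a training signal for the assignments themselves: without it, nothing discourages near-uniform assignments that spread every feature across all topics.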

MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Multimodal Graph for Unaligned Multimodal Sequence Analysis via Graph Convolution and Graph Pooling

MLGAT: multi-layer graph attention networks for multimodal emotion recognition in conversations

Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

Multimodal Transformer for Unaligned Multimodal Language Sequences

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

MMGA: Multimodal Learning with Graph Alignment

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

Multimodal Sentiment Analysis with Temporal Modality Attention

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Cross-modality reinforcement for unaligned sequences sentiment analysis

GraphMFT: A Graph Network based Multimodal Fusion Technique for Emotion Recognition in Conversation

Graph Capsule Aggregation for Unaligned Multimodal Sequences

Multi-Channel Attentive Graph Convolutional Network with Sentiment Fusion for Multimodal Sentiment Analysis

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Target and Source Modality Co-Reinforcement for Emotion Understanding from Asynchronous Multimodal Sequences