Abstract: Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, existing strategies are often sub-optimal because they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a single limited-length sentence. In pursuit of better alignment, a natural idea is therefore to enhance the video modality so that it filters out query-irrelevant semantics, and to enhance the text modality so that it captures more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework that achieves more balanced alignment by enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions of the frame-level features associated with query words while suppressing irrelevant parts. The enhanced video therefore contains fewer redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With this knowledge added to the query, the textual modality carries more meaningful semantics and is more balanced with the video modality. With both levels of enhancement in place, the semantic information of the two modalities is better balanced for alignment, thereby bridging the modality gap. Experiments on three widely used benchmarks, including out-of-distribution settings, show that the proposed framework achieves new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains in R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
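The abstract reports gains in R1@0.7 without defining the metric. In the moment retrieval literature this usually denotes Recall@1 at a temporal IoU threshold of 0.7: the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with IoU of at least 0.7. The following is a minimal sketch under that assumption; the function names and the (start, end) tuple representation are illustrative and are not taken from the MESM code.

```python
from typing import Sequence, Tuple

# A temporal segment as (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: Sequence[Segment],
                gts: Sequence[Segment],
                iou_threshold: float = 0.7) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(2.0, 9.0), (0.0, 4.0)]   # top-1 prediction per query
    gts = [(2.0, 10.0), (3.0, 10.0)]   # ground-truth segment per query
    print(recall_at_1(preds, gts, iou_threshold=0.7))
```

In this toy input the first prediction has an IoU of 7/8 with its ground truth and counts as a hit, while the second overlaps by only one second out of a ten-second union and does not.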

Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training Towards Retrieval

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

A Multi-level Alignment Training Scheme for Video-and-Language Grounding

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Text-Video Retrieval with Global-Local Semantic Consistent Learning

Unified Video-Language Pre-training with Synchronized Audio

Multilevel Semantic Interaction Alignment for Video–Text Cross-Modal Retrieval

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Temporal Perceiving Video-Language Pre-training

MGSGA: Multi-grained and Semantic-Guided Alignment for Text-Video Retrieval

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division

Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Efficient Transfer Learning for Video-language Foundation Models

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Learning Video-Text Aligned Representations for Video Captioning