Abstract: Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, existing strategies are often sub-optimal since they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is to enhance the video modality to filter out query-irrelevant semantics, and to enhance the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. The enhanced video therefore contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With this knowledge added to the query, the textual modality carries more meaningful semantics and is more balanced with the video modality. By implementing both levels of MESM, the semantic information from the two modalities is more balanced for alignment, thereby bridging the modality gap. Experiments on three widely used benchmarks, including out-of-distribution settings, show that the proposed framework achieves new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains in R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
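
For readers unfamiliar with the metric quoted above: R1@0.7 is conventionally the percentage of queries whose top-ranked predicted segment has a temporal IoU of at least 0.7 with the ground-truth segment. The following is a minimal sketch of that computation, assuming segments are given as (start, end) pairs in seconds. The function names and sample values are illustrative assumptions and are not taken from the MESM code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                iou_threshold: float = 0.7) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Hypothetical example: the first prediction overlaps well, the second does not.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(2.0, 12.0), (4.0, 9.0)]
score = recall_at_1(preds, gts, iou_threshold=0.7)
```

In this toy case the first pair has IoU 0.8 and the second roughly 0.11, so one of the two queries counts as a hit.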

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Weakly-Supervised Video Moment Retrieval Via Semantic Completion Network

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

Video Relation Detection with Spatio-Temporal Graph

Attentive Moment Retrieval in Videos

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Improving semantic video retrieval models by training with a relevance-aware online mining strategy

Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Structured Multi-Level Interaction Network for Video Moment Localization via Language Query

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

Exploiting Semantic And Visual Context For Effective Video Annotation

MGSGA: Multi-grained and Semantic-Guided Alignment for Text-Video Retrieval