Abstract:Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the modality imbalance problem, \textit{i.e.}, the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is enhancing the video modality to filter out query-irrelevant semantics, and enhancing the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. Therefore, the enhanced video contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With the knowledge added to the query, the textual modality thus maintains more meaningful semantics and is more balanced with the video modality. By implementing two levels of MESM, the semantic information from both modalities is more balanced to align, thereby bridging the modality gap. Experiments on three widely used benchmarks, including the out-of-distribution settings, show that the proposed framework achieves a new start-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code will be available at <a class="link-external link-https" href="https://github.com/lntzm/MESM" rel="external noopener nofollow">this https URL</a>.

Support-Set Based Multi-Modal Representation Enhancement for Video Captioning

Multimodality-guided Visual-Caption Semantic Enhancement

Multi-level video captioning method based on semantic space

Semantic-Driven Saliency-Context Separation for Video Captioning

Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Center-enhanced video captioning model with multimodal semantic alignment

Learning Video-Text Aligned Representations for Video Captioning

Multimodal Semantic Attention Network for Video Captioning

Multi-Modal interpretable automatic video captioning

Richer Semantic Visual and Language Representation for Video Captioning

Enhanced Video Caption Generation Based on Multimodal Features.

Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Multimodal Memory Modelling for Video Captioning

VMSG: a video caption network based on multimodal semantic grouping and semantic attention

Motion Guided Region Message Passing for Video Captioning

Multi-View Feature Fusion and Visual Prompt for Remote Sensing Image Captioning

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Video Captioning With Attention-Based LSTM and Semantic Consistency

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Action-Driven Semantic Representation and Aggregation for Video Captioning