Abstract: Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, existing strategies are often sub-optimal because they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is to enhance the video modality to filter out query-irrelevant semantics, and to enhance the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions of the frame-level features associated with query words while suppressing irrelevant parts. The enhanced video therefore contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With this knowledge added to the query, the textual modality carries more meaningful semantics and is more balanced with the video modality. By implementing the two levels of MESM, the semantic information from both modalities is more balanced for alignment, thereby bridging the modality gap. Experiments on three widely used benchmarks, including out-of-distribution settings, show that the proposed framework achieves new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
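The abstract reports gains in R1@0.7 without defining the metric. In the VMR literature this is commonly read as Recall@1 at a temporal IoU threshold of 0.7: the percentage of queries whose top-ranked predicted segment overlaps the ground-truth segment with IoU of at least 0.7. The sketch below illustrates that reading; the function names and the (start, end) segment representation are hypothetical choices for illustration, not taken from the MESM code.

```python
from typing import List, Tuple

# Assumption: a segment is a (start, end) pair in seconds.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                threshold: float = 0.7) -> float:
    """Percentage of queries whose top-ranked segment has IoU >= threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)
```

For example, with two queries where the first prediction matches its ground truth exactly and the second only reaches an IoU of 0.6, `recall_at_1` returns 50.0 at the default threshold.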

Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

Improving Fine-grained Understanding for Retrieval in Human Motion and Text

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

Motion Generation from Fine-grained Textual Descriptions

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Cross-Modal Retrieval for Motion and Text via DropTriple Loss

TEMOS: Generating diverse human motions from textual descriptions

SemanticBoost: Elevating Motion Generation with Augmented Textual Cues

Retrieval-Based Natural 3D Human Motion Generation

Cross-Modal Retrieval for Motion and Text Via DropTriple Loss

A 3D Human Motion Refinement Method Based on Sparse Motion Bases Selection

FTMoMamba: Motion Generation with Frequency and Text State Space Models

Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Contact-aware Human Motion Generation from Textual Descriptions

Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

CoMA: Compositional Human Motion Generation with Multi-modal Agents