Abstract: Video moment localization is a core computer vision task: given a natural language query, identify the temporal moments in an untrimmed video that are semantically relevant to it. This work examines a relatively unexplored aspect of the task, the transferability of video moment localization models. We evaluate moment localization models in a cross-domain transfer setting, for which we curate multiple datasets separated by substantial domain gaps. A model is trained on one of these datasets, and validation and testing are carried out on the remaining datasets. To address the challenges of this setting, we draw on recently introduced large-scale pre-trained vision-language models and explore how they can strengthen a model designed for video moment localization. However, the language queries in video moment localization usually differ in distribution from the text used by the pre-trained models, for example in length, content, and expression. To narrow this gap, we propose a Moment-Guided Query Prompting (MGQP) method for video moment localization. Our key idea is to generate multiple distinct and complementary prompt primitives by stratifying the original queries. The approach comprises a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We conduct extensive experiments on the Charades-STA, TACoS, DiDeMo, and YouCookII datasets and investigate the efficacy of the proposed method with various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of the proposed method.
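
The cross-domain transfer protocol described in the abstract (train on one dataset, validate and test on the rest) can be illustrated with a minimal sketch. The function name `cross_domain_splits` is hypothetical, and treating each of the four named datasets in turn as the training source is an assumption; the abstract does not state which source and target pairs were actually used.

```python
# Minimal sketch of a cross-domain transfer split, assuming every dataset
# named in the abstract may serve as the training source in turn.
DATASETS = ["Charades-STA", "TACoS", "DiDeMo", "YouCookII"]


def cross_domain_splits(datasets):
    """Yield (source, targets): train on source, evaluate on all other datasets."""
    for source in datasets:
        # The held-out targets are every dataset except the training source.
        targets = [name for name in datasets if name != source]
        yield source, targets


splits = list(cross_domain_splits(DATASETS))
for source, targets in splits:
    print(f"train on {source}; validate/test on {', '.join(targets)}")
```

Each split keeps the source dataset out of its own evaluation set, so any reported score would reflect transfer across the domain gap rather than in-domain performance.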

Transferable Video Moment Localization by Moment-Guided Query Prompting
