Abstract: Video moment localization is a key task in computer vision: given an untrimmed video and a natural language query, identify the temporal moments that are semantically relevant to the query. This work examines a relatively unexplored aspect of the task, the transferability of video moment localization models. We evaluate moment localization models in a cross-domain transfer setting, for which we curate multiple datasets separated by substantial domain gaps. A model is trained on one of these datasets, while validation and testing are performed on the remaining datasets. To address the challenges of this setting, we draw on recently introduced large-scale pre-trained vision-language models and explore how they can be used to strengthen a model designed for video moment localization. However, the distribution of language queries in video moment localization usually diverges from the text used by pre-trained models, differing in aspects such as length, content, and expression. To narrow this gap, we propose a Moment-Guided Query Prompting (MGQP) method for video moment localization. The key idea is to generate multiple distinct and complementary prompt primitives by stratifying the original queries. The approach comprises a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We carry out extensive experiments on the Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method with various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of the proposed method.
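
As a rough illustration of the cross-domain transfer setting described in the abstract, the sketch below enumerates train/evaluation pairings over the four named datasets. It assumes that each dataset serves as the training source in turn and that all remaining datasets are used for evaluation; the abstract does not specify the exact pairings, and the function name `cross_domain_splits` is hypothetical.

```python
from typing import Dict, List

# Dataset names are taken from the abstract; the pairing logic is an assumption.
DATASETS = ["Charades-STA", "TACoS", "DiDeMo", "YouCookII"]


def cross_domain_splits(datasets: List[str]) -> List[Dict[str, object]]:
    """Pair each source (training) dataset with the remaining target datasets."""
    splits = []
    for source in datasets:
        # Every dataset other than the source is held out for validation/testing.
        targets = [d for d in datasets if d != source]
        splits.append({"train": source, "eval": targets})
    return splits


if __name__ == "__main__":
    for split in cross_domain_splits(DATASETS):
        print(f"train on {split['train']}, evaluate on {', '.join(split['eval'])}")
```

Keeping the source dataset out of its own evaluation list is what makes the measurement a transfer result and not an in-domain one.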

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

Attentive Moment Retrieval in Videos

Query As Supervision: Towards Low-Cost and Robust Video Moment and Highlight Retrieval

Multi-Level Query Interaction for Temporal Language Grounding

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction

QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval

Transferable Video Moment Localization by Moment-Guided Query Prompting

Heterogeneous Interactive Graph Network for Audio-Visual Question Answering

DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization

Video Moment Retrieval Via Comprehensive Relation-Aware Network

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Query-Biased Self-Attentive Network for Query-Focused Video Summarization

Hierarchical Video-Moment Retrieval and Step-Captioning

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

Multi-Modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query

Collaborative Spatial-Temporal Interaction for Language-Based Moment Retrieval

Long-Term Video Question Answering Via Multimodal Hierarchical Memory Attentive Networks

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos