Abstract: Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video that a natural language query describes. Several fully supervised methods have been proposed for VMR. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is labor-intensive. This paper explores performing VMR in a weakly-supervised manner (wVMR): training uses no temporal moment labels, only the text query that describes a segment of the video. Existing wVMR methods generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which trains the model to predict higher scores for correct video-query pairs than for incorrect pairs. However, a large number of candidate proposals, a coarse query representation, and a one-way attention mechanism have been observed to produce blurry attention maps, which limit localization performance. To address this issue, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning spurious candidate proposals and applying a multi-directional attention mechanism with a fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, substantially reducing the number of candidate proposals, which lowers the computational load and sharpens attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flow to learn the multi-modal alignment. VLANet is trained end-to-end with a contrastive loss that pulls semantically similar videos and queries together. Experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets.
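The two ideas the abstract leans on most, picking a proposal by its proximity to the query and training with a contrastive objective, can be illustrated with a minimal NumPy sketch. This is not the VLANet implementation: the abstract does not say how proximity is measured or what form the loss takes, so cosine similarity, the hinge-style margin loss, the margin value, and the function names `select_surrogate_proposal` and `contrastive_margin_loss` are all assumptions made here for illustration.

```python
import numpy as np


def select_surrogate_proposal(proposal_embs, query_emb):
    """Return (index, similarities) of the proposal closest to the query.

    Assumption: proximity in the joint embedding space is measured with
    cosine similarity; the abstract does not specify the metric.
    proposal_embs: (num_proposals, dim), query_emb: (dim,)
    """
    p = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = p @ q  # cosine similarity of each proposal to the query
    return int(np.argmax(sims)), sims


def contrastive_margin_loss(scores, margin=0.1):
    """Hinge-style contrastive loss over a batch score matrix.

    Assumption: scores[i, j] is the score of video i paired with query j,
    so the diagonal holds the correct (matched) pairs. The loss is zero
    once every matched pair beats every mismatched pair by `margin`.
    """
    pos = np.diag(scores)
    # Fix the video, compare against the wrong queries in the batch.
    cost_query = np.maximum(0.0, margin + scores - pos[:, None])
    # Fix the query, compare against the wrong videos in the batch.
    cost_video = np.maximum(0.0, margin + scores - pos[None, :])
    off_diagonal = 1.0 - np.eye(scores.shape[0])
    return float(((cost_query + cost_video) * off_diagonal).sum() / scores.shape[0])


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    proposals = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query = np.array([0.9, 0.1])
    best, sims = select_surrogate_proposal(proposals, query)
    print("selected proposal:", best)

    aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
    print("loss for aligned scores:", contrastive_margin_loss(aligned))
```

Keeping only the single closest proposal is the simplest reading of "selects a proposal"; a variant that keeps the top few would follow the same pattern with `np.argsort` in place of `np.argmax`.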

Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Boundary Proposal Network for Two-Stage Natural Language Video Localization

Scene-robust Natural Language Video Localization Via Learning Domain-invariant Representations

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization

UnLoc: A Unified Framework for Video Localization Tasks

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

Natural Language Video Localization with Learnable Moment Proposals

Jointly Modeling Embedding and Translation to Bridge Video and Language

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework

Unified Lexical Representation for Interpretable Visual-Language Alignment

Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks

WSLLN: Weakly Supervised Natural Language Localization Networks

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Unified Video-Language Pre-training with Synchronized Audio

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation