Abstract: Natural language moment localization aims to localize the target moment that matches a given natural language query in an untrimmed video. The key to this challenging task is to capture fine-grained video-language correlations to establish the alignment between the query and target moment. Most existing works establish a single-pass interaction schema to capture correlations between queries and moments. Considering the complex feature space of lengthy videos and the diverse information between frames, the weight distribution of the information interaction flow is prone to dispersion or misalignment, which leads to redundant information flow affecting the final prediction. We address this issue by proposing a capsule-based approach to model the query-video interactions, termed the Multimodal, Multichannel, and Dual-step Capsule Network (M2DCapsN), which is derived from the intuition that "multiple people viewing multiple times is better than one person viewing one time." First, we introduce a multimodal capsule network, replacing the single-pass interaction schema of "one person viewing one time" with the iterative interaction schema of "one person viewing multiple times," which cyclically updates cross-modal interactions and modifies potentially redundant interactions via its routing-by-agreement. Then, considering that the conventional routing mechanism only learns a single iterative interaction schema, we further propose a multichannel dynamic routing mechanism to learn multiple iterative interaction schemas, where each channel performs independent routing iteration to collectively capture cross-modal correlations from multiple subspaces, that is, "multiple people viewing." Moreover, we design a dual-step capsule network structure based on the multimodal, multichannel capsule network, bringing together the query and query-guided key moments to jointly enhance the original video, so as to select the target moments according to the enhanced part. Experimental results on three public datasets demonstrate the superiority of our approach in comparison with state-of-the-art methods, and comprehensive ablation and visualization analysis validate the effectiveness of each component of the proposed model.
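
The abstract describes routing-by-agreement run independently in several channels. As a rough illustration only, the sketch below implements the generic dynamic routing loop known from capsule networks and repeats it per channel, then concatenates the channel outputs. It is not the authors' implementation: the function names (`route`, `multichannel_route`, `toy_channel`), the choice of concatenation as the fusion step, the random toy inputs, and all sizes are assumptions made for this example, and the abstract does not specify how the prediction vectors are computed from video and query features.

```python
import math
import random

def squash(s):
    """Capsule squashing: keeps the direction, maps the norm into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """Routing-by-agreement over prediction vectors u_hat[i][j] (input i -> output j).

    Returns the output capsules and the coupling coefficients
    that were used in the final iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # each input spreads weight over outputs
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Agreement step: raise the logit where a prediction matches the output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, c

def multichannel_route(channels, iterations=3):
    """Route each channel independently, then concatenate per output capsule.

    Concatenation is an assumed fusion choice for this sketch.
    """
    outputs = [route(u_hat, iterations)[0] for u_hat in channels]
    n_out = len(outputs[0])
    return [[x for out in outputs for x in out[j]] for j in range(n_out)]

def toy_channel(n_in=4, n_out=2, dim=3):
    """Hypothetical random prediction vectors standing in for real features."""
    return [[[random.gauss(0, 1) for _ in range(dim)]
             for _ in range(n_out)] for _ in range(n_in)]

random.seed(0)
channels = [toy_channel() for _ in range(2)]  # two independent "viewers"
fused = multichannel_route(channels)
print(len(fused), len(fused[0]))
```

Each channel here plays the role of one "viewer" that refines its own coupling weights over several iterations; with two channels of 3-dimensional capsules, every fused output capsule has 6 dimensions.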

Dual Path Interaction Network for Video Moment Localization

Structured Multi-Level Interaction Network for Video Moment Localization via Language Query

Interaction-Integrated Network for Natural Language Moment Localization

Boundary Proposal Network for Two-Stage Natural Language Video Localization

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language

Temporal Textual Localization in Video Via Adversarial Bi-Directional Interaction Networks

Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction

M2DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization

Cross-modal Moment Localization in Videos

Rethinking the Bottom-Up Framework for Query-Based Video Localization

Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization

Dual relation network for temporal action localization

Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query

A Hybird Alignment Loss for Temporal Moment Localization with Natural Language

Semantic Collaborative Learning for Cross-Modal Moment Localization

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction

Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification

Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

Disentangle and denoise: Tackling context misalignment for video moment retrieval

Natural Language Video Localization with Learnable Moment Proposals