Abstract: Natural language moment localization aims to localize the target moment that matches a given natural language query in an untrimmed video. The key to this challenging task is to capture fine-grained video-language correlations that establish the alignment between the query and the target moment. Most existing works establish a single-pass interaction schema to capture correlations between queries and moments. Given the complex feature space of lengthy videos and the diverse information across frames, the weight distribution of the information interaction flow is prone to dispersion or misalignment, so redundant information flow affects the final prediction. We address this issue by proposing a capsule-based approach to model the query-video interactions, termed the Multimodal, Multichannel, and Dual-step Capsule Network (M2DCapsN), which is derived from the intuition that "multiple people viewing multiple times is better than one person viewing one time." First, we introduce a multimodal capsule network, replacing the single-pass interaction schema of "one person viewing one time" with the iterative interaction schema of "one person viewing multiple times," which cyclically updates cross-modal interactions and corrects potentially redundant interactions via routing-by-agreement. Then, considering that the conventional routing mechanism learns only a single iterative interaction schema, we further propose a multichannel dynamic routing mechanism to learn multiple iterative interaction schemas, where each channel performs an independent routing iteration so that the channels collectively capture cross-modal correlations from multiple subspaces, that is, "multiple people viewing." Moreover, we design a dual-step capsule network structure based on the multimodal, multichannel capsule network, bringing together the query and the query-guided key moments to jointly enhance the original video, so that the target moments are selected according to the enhanced part. Experimental results on three public datasets demonstrate the superiority of our approach in comparison with state-of-the-art methods, and comprehensive ablation and visualization analyses validate the effectiveness of each component of the proposed model.
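
To make the routing idea in the abstract more concrete, the following is a minimal sketch of routing-by-agreement, plus one possible reading of "multichannel" routing in which the feature dimension is split into equal subspaces that are routed independently and then concatenated. It is not the authors' implementation. The function names (`squash`, `dynamic_routing`, `multichannel_routing`), the channel-splitting scheme, the toy sizes, and the use of random prediction vectors in place of learned cross-modal features are all assumptions made for illustration.

```python
import math
import random


def squash(s):
    """Squash nonlinearity: keeps the direction, maps the norm into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement over prediction vectors.

    u_hat[i][j] is the prediction of input capsule i for output capsule j.
    Returns (v, c): the output capsules and the coupling coefficients
    that were used to compute them in the last iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v, c = [], []
    for _ in range(num_iters):
        # Each input capsule distributes its vote across output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Agreement (dot product) raises the logit of consistent routes.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, c


def multichannel_routing(u_hat, num_channels=2, num_iters=3):
    """Hypothetical multichannel variant: route each subspace independently."""
    dim = len(u_hat[0][0])
    if dim % num_channels != 0:
        raise ValueError("feature dimension must be divisible by num_channels")
    width = dim // num_channels
    n_out = len(u_hat[0])
    outputs = [[] for _ in range(n_out)]
    for ch in range(num_channels):
        lo, hi = ch * width, (ch + 1) * width
        sub = [[vec[lo:hi] for vec in row] for row in u_hat]
        v, _ = dynamic_routing(sub, num_iters)
        for j in range(n_out):
            outputs[j].extend(v[j])  # concatenate channel outputs
    return outputs


if __name__ == "__main__":
    random.seed(0)
    # Toy input: 5 input capsules, 2 output capsules, 8-dimensional votes.
    toy = [[[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2)]
           for _ in range(5)]
    out = multichannel_routing(toy, num_channels=4)
    print(len(out), len(out[0]))  # 2 8
```

In this sketch each channel keeps its own routing logits, so a route that is down-weighted in one subspace can still be strongly coupled in another. That is one way to read the "multiple people viewing" intuition; the paper's actual channel construction may differ.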

M2DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization
