Abstract: Video temporal grounding is a challenging computer vision task: given a video and a natural-language query, localize the video segment that is semantically related to the query. In this paper, we propose a novel weakly-supervised model called Multi-level Attentional Reconstruction Networks (MARN), which is trained on video-sentence pairs only. During training, we leverage the idea of attentional reconstruction to learn an attention map that can reconstruct the given query. At inference time, proposals are ranked by their attention scores to localize the most suitable segment. In contrast to previous methods, MARN aligns video-level supervision with proposal scoring, thereby reducing the training-inference discrepancy. In addition, we incorporate a multi-level framework that comprises a proposal-level process and a clip-level process. The proposal-level process generates and scores variable-length time sequences, while the clip-level process generates and scores fixed-length time sequences to refine the predicted proposal scores in both training and testing. To improve the video feature representation, we propose a novel representation mechanism that utilizes intra-proposal information and adopts 2D convolution to extract inter-proposal clues for learning reliable attention maps. Accurately representing the proposals lets us better align them with the textual modality and thus facilitates the learning of the model. The proposed MARN is evaluated on two benchmark datasets, and extensive experiments demonstrate its superiority over existing methods.
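
The inference step described in the abstract (rank variable-length proposals by attention score, refined by fixed-length clip scores) can be illustrated with a minimal sketch. The abstract does not say how the two levels are combined, so the weighted average used here, along with the names `generate_proposals`, `rank_proposals`, and `weight`, are assumptions made for illustration and not the authors' method.

```python
from typing import Dict, List, Tuple

Proposal = Tuple[int, int]  # (start_clip, end_clip), end inclusive


def generate_proposals(num_clips: int) -> List[Proposal]:
    """Enumerate every variable-length span over fixed-length clips."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]


def rank_proposals(
    proposal_scores: Dict[Proposal, float],
    clip_scores: List[float],
    weight: float = 0.5,  # hypothetical fusion weight, not from the paper
) -> List[Tuple[Proposal, float]]:
    """Rank proposals by a fused score, highest first.

    Assumption: the clip-level refinement is the mean clip score inside
    the proposal, blended with the proposal-level attention score.
    """
    ranked = []
    for (start, end), score in proposal_scores.items():
        clip_mean = sum(clip_scores[start:end + 1]) / (end - start + 1)
        fused = (1.0 - weight) * score + weight * clip_mean
        ranked.append(((start, end), fused))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

In this sketch, the first element of the returned list is the predicted segment; in the actual model, both score sets would come from the learned attention maps rather than being supplied by hand.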

Multi-attention Networks for Temporal Localization of Video-level Labels

Multi-Group Multi-Attention

Exploring the Consistency of Segment-level and Video-level Predictions for Improved Temporal Concept Localization in Videos

Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

Learning to Localize Temporal Events in Large-scale Video Data

Temporal Textual Localization in Video Via Adversarial Bi-Directional Interaction Networks

Video Question Answering Via Multi-Granularity Temporal Attention Network Learning

A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos

Exploiting Temporal Relationships in Video Moment Localization with Natural Language

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Multi-stage Aggregated Transformer Network for Temporal Language Localization in Videos

Context-aware focal alignment network for micro-video multi-label classification

Motion-Guided Spatial Time Attention for Video Object Segmentation

MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Integrating Temporal and Spatial Attention for Video Action Recognition

Weakly-Supervised Video Re-Localization with Multiscale Attention Model

Multi-Scale Temporal Relations and Segmented Channel Attention for Video Anomaly Detection

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language