Abstract: Temporal action proposal (TAP) aims to locate the start and end times of action instances in untrimmed videos, a task that is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods model time by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global methods do not focus on learning from action frames and are prone to background distractions. In this paper, we argue that learning semantic-level affinities captures more useful information. By modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates the foreground mask, which represents the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that uses the foreground mask to guide the self-attention mechanism to focus on, and compute semantic affinities with, the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
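
The abstract does not specify how the foreground mask enters the self-attention computation, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' MGT. It assumes a soft per-frame foreground mask in [0, 1] and adds its logarithm to the attention logits, so that every frame attends mostly to foreground frames. The function name `mask_guided_attention`, the parameters `fg_mask` and `eps`, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def mask_guided_attention(features, fg_mask, eps=1e-6):
    """Illustrative mask-biased self-attention (hypothetical, not the paper's MGT).

    features: (T, D) array of per-frame features.
    fg_mask:  (T,) array of foreground scores in [0, 1].
    Returns the attended features (T, D) and the attention weights (T, T).
    """
    T, D = features.shape
    # Assumption: queries, keys, and values are the raw features
    # (a real transformer layer would use learned projections).
    logits = features @ features.T / np.sqrt(D)  # (T, T) frame affinities
    # Adding log(mask) to the logits scales each key's softmax weight by its
    # foreground score, steering attention toward foreground frames.
    logits = logits + np.log(fg_mask + eps)[None, :]
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights


# Usage: 6 frames with 4-dim features; frames 2 and 3 are marked as foreground.
rng = np.random.default_rng(0)
features = rng.uniform(-1.0, 1.0, size=(6, 4))
fg_mask = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
attended, weights = mask_guided_attention(features, fg_mask)
```

In this sketch both foreground and background frames act as queries, which loosely mirrors the abstract's description: foreground frames aggregate cues from other foreground frames, while background frames are compared against foreground frames only.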

End to End Alignment Learning of Instructional Videos with Spatiotemporal Hybrid Encoding and Decoding Space Reduction

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos

End-to-end Neural Video Coding Using a Compound Spatiotemporal Representation

Alignment-guided Temporal Attention for Video Action Recognition

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

WINNER: Weakly-Supervised Hierarchical Decomposition and Alignment for Spatio-tEmporal Video Grounding

Video alignment using unsupervised learning of local and global features

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Context-Aware Sequence Alignment using 4D Skeletal Augmentation

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

End-to-End Spatio-Temporal Action Localisation with Video Transformers

Video-Language Alignment via Spatio-Temporal Graph Transformer

Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding

Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

You Only Align Once: Bidirectional Interaction for Spatial-Temporal Video Super-Resolution

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning