Abstract: Temporal action proposal (TAP) aims to detect the start and end times of action instances in untrimmed videos, a task that is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global methods do not focus on action frames and therefore absorb background distractions. In this paper, we propose that learning semantic-level affinities captures more useful information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates the foreground mask, which represents the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that uses the foreground mask to guide the self-attention mechanism to focus on, and compute semantic affinities with, the foreground frames. Finally, these two modules are combined in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
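
To make the idea of mask-guided self-attention concrete, the following is a minimal sketch in plain Python. It is not the MGT formulation from the paper, which the abstract does not specify. It assumes, purely for illustration, that the foreground mask is a per-frame score in [0, 1] and that it enters the attention as an additive log-mask bias on the keys; the function name `mask_guided_attention`, the `eps` parameter, and the absence of learned query/key/value projections are all assumptions of this sketch.

```python
import math


def mask_guided_attention(features, fg_mask, eps=1e-6):
    """Self-attention whose keys are re-weighted by a foreground mask.

    Hypothetical illustration only, not the paper's MGT.
    features: list of T frame vectors (each a list of D floats).
    fg_mask:  list of T foreground scores in [0, 1].
    Returns (outputs, weights): T attended vectors and the T x T
    attention matrix.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in features:
        # Scaled dot-product affinity plus a log-mask bias: keys with a
        # low foreground score receive almost no attention weight.
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) + math.log(m + eps)
            for key, m in zip(features, fg_mask)
        ]
        # Numerically stable softmax over the key axis.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Aggregate the values (here simply the input features).
        outputs.append([
            sum(w * value[d] for w, value in zip(row, features))
            for d in range(dim)
        ])
    return outputs, weights


# Toy input: frames 0 and 1 are marked as foreground, frames 2 and 3 as background.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fg_mask = [1.0, 1.0, 0.0, 0.0]
outputs, weights = mask_guided_attention(features, fg_mask)
```

Under these assumptions every frame, foreground or background, attends almost exclusively to the foreground frames: foreground queries aggregate cues from other action frames, while background queries are represented relative to the action content. A real model would add learned projections, multiple heads, and a mask predicted by a module such as FMG rather than supplied by hand.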

Searching Action Proposals Via Spatial Actionness Estimation And Temporal Path Inference And Tracking

Detecting Action Tubes Via Spatial Action Estimation and Temporal Path Inference

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos

Detecting Temporal Proposal for Action Localization with Tree-structured Search Policy

An Active Action Proposal Method Based on Reinforcement Learning

Search Video Action Proposal with Recurrent and Static YOLO

Unsupervised Action Proposal Ranking through Proposal Recombination

Action Extraction in Continuous Unconstrained Video for Cloud-Based Intelligent Service Robot

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Active Temporal Action Detection in Untrimmed Videos Via Deep Reinforcement Learning

CTAP: Complementary Temporal Action Proposal Generation

ProposalVLAD with Proposal-Intra Exploring for Temporal Action Proposal Generation

Cascaded Boundary Network for High-Quality Temporal Action Proposal Generation

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

Deep Point-Wise Prediction for Action Temporal Proposal

Play and rewind: Context-aware video temporal action proposals

SAP: Self-Adaptive Proposal Model for Temporal Action Detection Based on Reinforcement Learning

A Proposal-Based Solution to Spatio-Temporal Action Detection in Untrimmed Videos

YoTube: Searching Action Proposal Via Recurrent and Static Regression Networks

Multi-Level Content-Aware Boundary Detection for Temporal Action Proposal Generation

Superframe-Based Temporal Proposals for Weakly Supervised Temporal Action Detection