Abstract: Temporal action proposal (TAP) aims to detect the start and end times of action instances in untrimmed videos, which is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal and global methods do not focus on action frames and therefore admit background distractions. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. Specifically, we first propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, which represents the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that uses the foreground mask to guide the self-attention mechanism to focus on, and calculate semantic affinities with, the foreground frames. Finally, these two modules are jointly explored in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
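
The abstract does not give the exact formulation of the Mask-Guided Transformer, so the following is only a minimal sketch of one way a foreground mask could steer self-attention. It assumes the mask enters as an additive log-bias on the attention logits and that queries, keys, and values are all the raw frame features. The function names, the `eps` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mask_guided_attention(features, fg_mask, eps=1e-6):
    """Self-attention over frames, biased toward foreground frames.

    features: list of T frame vectors (each a list of d floats).
    fg_mask:  list of T foreground scores in [0, 1] (assumed given here;
              in the paper this would come from the FMG module).
    Assumption: the mask is applied as log(mask + eps) added to the logits.
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    # Background keys (mask near 0) receive a large negative bias.
    bias = [math.log(m + eps) for m in fg_mask]
    attn = []
    for q in features:
        logits = [scale * sum(a * b for a, b in zip(q, k)) + bias[j]
                  for j, k in enumerate(features)]
        attn.append(softmax(logits))
    # Every frame, foreground or background, aggregates foreground features.
    out = [[sum(w * v[c] for w, v in zip(row, features)) for c in range(d)]
           for row in attn]
    return out, attn

# Toy input: frames 0-1 are foreground, frames 2-3 are background.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fg_mask = [1.0, 1.0, 0.0, 0.0]
out, attn = mask_guided_attention(features, fg_mask)
```

Under these assumptions, every query frame places almost all of its attention on the two foreground frames: foreground frames aggregate cues from each other, while background frames are compared only against foreground content.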

TSRN: Two-Stage Refinement Network for Temporal Action Segmentation

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos.

Temporal Segment Transformer for Action Segmentation

Temporal Segment Networks for Action Recognition in Videos

Temporal Action Detection with Structured Segment Networks

Involving Distinguished Temporal Graph Convolutional Networks for Skeleton-Based Temporal Action Segmentation

SG-TCN: Semantic Guidance Temporal Convolutional Network for Action Segmentation.

Temporal Distinct Representation Learning for Action Recognition

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Temporal Transformer Networks with Self-Supervision for Action Recognition.

Boundary Information Matters More: Accurate Temporal Action Detection with Temporal Boundary Network

Gated forward refinement network for action segmentation

Modeling Long-Term Video Semantic Distribution for Temporal Action Proposal Generation

Streaming Video Temporal Action Segmentation In Real Time

A motion-aware and temporal-enhanced Spatial–Temporal Graph Convolutional Network for skeleton-based human action segmentation

Sequential Segment Networks for Action Recognition

Neighbor-Guided Pseudo-Label Generation and Refinement for Single-Frame Supervised Temporal Action Localization

C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation

A Temporal-Aware Relation and Attention Network for Temporal Action Localization

Truncated Attention-Aware Proposal Networks with Multi-Scale Dilation for Temporal Action Detection

Scale Matters: Temporal Scale Aggregation Network for Precise Action Localization in Untrimmed Videos