Abstract: Temporal action proposal (TAP) aims to detect the starting and ending times of action instances in untrimmed videos, a task that is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global methods do not focus on learning action frames and are distracted by background content. In this paper, we propose that learning semantic-level affinities can capture more practical information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module to adaptively generate the foreground mask, which represents the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that exploits the foreground mask to guide the self-attention mechanism to focus on, and calculate semantic affinities with, the foreground frames. Finally, these two modules are integrated in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps to suppress false positives and distractions.
Extensive experiments are conducted on two challenging datasets, ActivityNet-1.3 and THUMOS14, and the results demonstrate that our method achieves superior performance.
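To make the idea of mask-guided self-attention more concrete, the following is a minimal sketch and not the MGT module itself. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function name `mask_guided_attention`, the rule of adding the log of the foreground score to the attention logits, the `eps` constant, and the omission of learned query/key/value projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mask_guided_attention(features, fg_mask, eps=1e-6):
    """Self-attention over frames, biased toward foreground frames.

    features: list of T frame vectors, each of length D.
    fg_mask:  list of T foreground scores in [0, 1].
    Returns (outputs, weights), where weights[i][j] is how much
    frame i attends to frame j.
    """
    dim = len(features[0])
    # Assumed guidance rule (not from the paper): bias every key frame
    # by the log of its foreground score, so background keys are
    # strongly down-weighted after the softmax.
    bias = [math.log(m + eps) for m in fg_mask]

    weights = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) + b
            for key, b in zip(features, bias)
        ]
        weights.append(softmax(scores))

    # Each output frame is a weighted average of all frame features.
    outputs = [
        [sum(w * key[d] for w, key in zip(row, features)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Toy example: five frames, with frames 2 and 3 marked as foreground.
features = [[0.2, 0.1], [0.0, 0.3], [1.0, 0.9], [0.9, 1.1], [0.1, 0.0]]
fg_mask = [0.0, 0.0, 1.0, 1.0, 0.0]
outputs, weights = mask_guided_attention(features, fg_mask)
for row in weights:
    print([round(w, 3) for w in row])
```

In this sketch every frame, foreground or background, attends almost entirely to the foreground frames. That mirrors the described behavior: foregrounds aggregate cues from other action frames, while backgrounds are compared against action frames.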

META: Motion Excitation with Temporal Attention for Compressed Video Action Recognition

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos.

Compressed Video Action Recognition Using Motion Vector Representation.

Temporal Interaction and Excitation for Action Recognition

Representation Learning for Compressed Video Action Recognition Via Attentive Cross-modal Interaction with Motion Enhancement.

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

TEINet: Towards an Efficient Architecture for Video Recognition.

Human Action Recognition Method Based on Motion Excitation and Temporal Aggregation Module.

An Efficient Motion Visual Learning Method for Video Action Recognition

Compressed Video Action Recognition with Dual-Stream and Dual-Modal Transformer

Joint Feature Optimization and Fusion for Compressed Action Recognition

TSI: Temporal Saliency Integration for Video Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Integrating Temporal and Spatial Attention for Video Action Recognition

Action Recognition with a Multi-View Temporal Attention Network

LAE-Net: Light and Efficient Network for Compressed Video Action Recognition

Alignment-guided Temporal Attention for Video Action Recognition

Self-supervised Compressed Video Action Recognition via Temporal-Consistent Sampling.

Multipath Attention and Adaptive Gating Network for Video Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

MTRFN: Multiscale Temporal Receptive Field Network for Compressed Video Action Recognition at Edge Servers