Abstract: Temporal action proposal (TAP) aims to detect the starting and ending times of action instances in untrimmed videos, a task that is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods model temporal structure by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global methods do not focus on action frames and are distracted by background content. In this paper, we propose that learning semantic-level affinities captures more useful information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework, the Mask-Guided Network (MGNet), to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates the foreground mask, which represents the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that uses the foreground mask to guide the self-attention mechanism to focus on, and compute semantic affinities with, the foreground frames. Finally, the two modules are combined in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, demonstrate that our method achieves superior performance.
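
To make the idea of mask-guided self-attention more concrete, the following is a minimal NumPy sketch, not the MGNet implementation. It assumes a single attention head with no learned projections, a per-frame foreground mask with values in [0, 1] (such as an FMG-like module might produce), and one simple way of applying the mask: adding its logarithm to the attention logits so that every frame attends mainly to foreground frames. The function name `mask_guided_attention` and the `eps` parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mask_guided_attention(features, fg_mask, eps=1e-6):
    """Self-attention whose keys are biased by a foreground mask.

    features: (T, D) array of per-frame features.
    fg_mask:  (T,) array in [0, 1]; higher means more likely foreground.
    Returns the attended features (T, D) and the attention weights (T, T).

    Assumption: the mask enters as an additive log-bias on the key axis.
    The paper's abstract does not specify how the mask is applied.
    """
    T, D = features.shape
    # Scaled dot-product affinities between every pair of frames.
    logits = features @ features.T / np.sqrt(D)
    # Down-weight background keys; foreground keys (mask near 1) are unchanged.
    logits = logits + np.log(fg_mask + eps)[None, :]
    weights = softmax(logits, axis=-1)
    return weights @ features, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((6, 8))          # 6 frames, 8-dim features
    mask = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
    out, w = mask_guided_attention(feats, mask)
    print(out.shape, w.shape)
```

In this sketch both foreground and background frames act as queries, but almost all attention mass falls on foreground keys. That loosely mirrors the abstract's description: foreground frames aggregate cues from other action frames, while background frames are compared against action frames.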

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos.
