Abstract: Temporal action proposal (TAP) aims to localize the starting and ending times of action instances in untrimmed videos, a task that is fundamental to large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods model temporal structure by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global methods do not focus on action frames and therefore absorb background distractions. In this paper, we propose that learning semantic-level affinities captures more useful information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn what distinguishes them from action frames. To this end, we propose a novel framework, the Mask-Guided Network (MGNet), to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates a foreground mask representing the locations of the action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that uses the foreground mask to guide the self-attention mechanism to focus on, and compute semantic affinities with, the foreground frames. Finally, these two modules are combined in a unified framework. MGNet models the intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances for backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, demonstrate that our method achieves superior performance.
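The idea of letting a foreground mask guide self-attention can be illustrated with a small sketch. This is not the authors' MGT implementation, which the abstract does not specify. It assumes one plausible mechanism: adding the log of a soft foreground probability to each key's attention score, so that background frames receive little weight. The function name `masked_attention`, the toy features, and the mask values are all hypothetical.

```python
import math


def masked_attention(features, fg_mask, eps=1e-6):
    """Self-attention whose keys are biased by a soft foreground mask.

    Assumed mechanism for illustration only (not taken from the paper).
    features: list of T frame vectors, used as queries, keys, and values.
    fg_mask:  list of T foreground probabilities in [0, 1].
    Returns (outputs, weights); weights[i][j] is frame i's attention on frame j.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in features:
        # Scaled dot-product score plus a log-mask bias: keys with a low
        # foreground probability are pushed toward zero attention weight.
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) + math.log(m + eps)
            for k, m in zip(features, fg_mask)
        ]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, features)) for d in range(dim)]
        )
    return outputs, weights


# Toy input: frames 0-1 are "action" frames, frames 2-3 are background.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fg_mask = [0.95, 0.9, 0.05, 0.1]
outputs, weights = masked_attention(features, fg_mask)
```

In this toy setting every frame, including the background ones, places most of its attention on the two foreground frames. That mirrors the abstract's description: foregrounds aggregate cues from other action frames, while backgrounds are compared against action frames rather than against each other.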

Concept Grounding with Modular Action-Capsules in Semantic Video Prediction

Modular Action Concept Grounding in Semantic Video Prediction

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Object-centric Video Representation for Long-term Action Anticipation

Action-conditioned video data improves predictability

Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos

Look Before you Speak: Visually Contextualized Utterances

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos

Video Action Recognition with Attentive Semantic Units

Enhancing early action prediction in videos through temporal composition of sub-actions

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Multi-Label Action Anticipation for Real-World Videos with Scene Understanding

Dynamic Context Removal: A General Training Strategy for Robust Models on Video Action Predictive Tasks

Object-centric Video Prediction without Annotation

Multi-modal Capsule Routing for Actor and Action Video Segmentation Conditioned on Natural Language Queries

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos