Abstract: Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, the dense multi-label scenario allows actions to overlap. Addressing this challenging task requires simultaneously learning (i) temporal dependencies and (ii) co-occurrence relationships between actions. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with the multiple sub-sampling steps of hierarchical designs can lead to further loss of positional information, and that preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence relationships between actions, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pairwise action-class relations. We overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without incurring their additional computational cost during inference. We evaluate our proposed approach on two challenging dense multi-label benchmark datasets and show that it improves on the current state-of-the-art results.
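The claim that self-attention loses temporal positional information, and that a relative positional term restores it, can be illustrated with a toy example. The sketch below is not the network proposed in the paper: it assumes a single attention head with identity query/key/value projections, and it models relative position as a scalar bias per temporal offset. The names `self_attention`, `rel_bias`, and `close` are hypothetical and chosen only for this illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, rel_bias=None):
    """Toy single-head self-attention with identity Q/K/V projections.

    x: list of T feature vectors (each a list of d floats).
    rel_bias: optional dict mapping a relative offset (j - i) to a scalar
              added to the logit between positions i and j (an assumed,
              simplified form of relative positional encoding).
    """
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        logits = []
        for j in range(T):
            score = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if rel_bias is not None:
                score += rel_bias.get(j - i, 0.0)
            logits.append(score)
        w = softmax(logits)
        out.append([sum(w[j] * x[j][k] for j in range(T)) for k in range(d)])
    return out

def close(a, b, tol=1e-9):
    # Element-wise comparison of two lists of vectors.
    return all(abs(p - q) <= tol for r, s in zip(a, b) for p, q in zip(r, s))

# Four "frames" with 2-d features, and a shuffled copy of the same frames.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
perm = [2, 0, 3, 1]
x_perm = [x[p] for p in perm]

# Without positional information, shuffling the frames only shuffles the
# outputs: each frame gets the same representation wherever it appears.
plain = self_attention(x)
plain_perm = self_attention(x_perm)
order_blind = close([plain[p] for p in perm], plain_perm)

# With an offset-dependent bias, a frame's output depends on its neighbours'
# relative positions, so the shuffled sequence yields different outputs.
bias = {-1: 1.0, 0: 0.0, 1: -1.0}
rel = self_attention(x, bias)
rel_perm = self_attention(x_perm, bias)
order_aware = not close([rel[p] for p in perm], rel_perm)
```

In this sketch `order_blind` holds because plain attention weights depend only on feature content, while `order_aware` holds because the bias depends on the offset `j - i`, which shuffling the frames changes.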
