Abstract: The aim of temporal action localization (TAL) is to determine the start and end frames of an action in a video. In recent years, TAL has attracted considerable attention because of its increasing applications in video understanding and retrieval. However, precisely estimating the duration of an action in the temporal dimension is still a challenging problem. In this paper, we propose an effective one‐stage TAL method based on a self‐defined motion data structure, called a dense joint motion matrix (DJMM), and a novel temporal detection strategy. Our method makes three main contributions. First, compared with mainstream motion images, DJMMs preserve more pre‐processed motion features and provide more precise detail representations. Furthermore, DJMMs solve the temporal information loss problem caused by motion trajectory overlaps within a certain time period. Second, a spatial pyramid pooling (SPP) layer, which is widely used in the object detection and tracking fields, is incorporated into the proposed method for multi‐scale feature learning. Moreover, the SPP layer enables the backbone convolutional neural network (CNN) to receive DJMMs of any size in the temporal dimension. Third, a large‐scale‐first temporal detection strategy, inspired by a well‐developed Chinese text segmentation algorithm, is proposed to address long‐duration videos. Our method is evaluated on two benchmark data sets and one self‐collected data set: Florence‐3D, UTKinect‐Action3D and HanYue‐3D. The experimental results show that our method achieves competitive action recognition accuracy and high TAL precision, and that its time efficiency and few‐shot learning capabilities enable it to be utilized for real‐time surveillance.
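To illustrate why pyramid pooling lets a network accept inputs of any temporal length, the following is a minimal sketch of max pooling over a pyramid of temporal bins. It is not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the plain list-of-lists input (T frames by C channels) are assumptions made for illustration only, and the pooling is applied along the temporal axis alone.

```python
def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Max-pool a T x C feature sequence into a vector of length C * sum(levels).

    `frames` is a list of T equally sized feature vectors (hypothetical input
    layout). Each pyramid level splits the sequence into `n_bins` windows and
    keeps the per-channel maximum of every window.
    """
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    pooled = []
    for n_bins in levels:
        for i in range(n_bins):
            start = (i * t) // n_bins               # floor(i * T / n_bins)
            end = -((-(i + 1) * t) // n_bins)       # ceil((i + 1) * T / n_bins)
            window = frames[start:end]              # never empty, even if T < n_bins
            pooled.extend(max(channel) for channel in zip(*window))
    return pooled


# Sequences of different lengths map to vectors of the same size.
short_clip = [[float(t), float(-t)] for t in range(5)]
long_clip = [[float(t), float(-t)] for t in range(37)]
assert len(temporal_pyramid_pool(short_clip)) == len(temporal_pyramid_pool(long_clip))
```

Because the number of bins is fixed by the pyramid levels rather than by the input length, the output size depends only on the channel count, which is what allows a fixed-size classifier head to follow a variable-length input.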

Cross Time-Frequency Transformer for Temporal Action Localization

Gated Multi-Scale Transformer for Temporal Action Localization

Temporal Deformable Transformer for Action Localization

TALLFormer: Temporal Action Localization with a Long-memory Transformer

Temporal Action Localization with Cross Layer Task Decoupling and Refinement

Multi-granularity transformer fusion for temporal action localization

MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization

A Novel Temporal Channel Enhancement and Contextual Excavation Network for Temporal Action Localization

Efficient Temporal Action Localization with Temporal Attention and Gaussian Weight

Multi‐scale feature learning and temporal probing strategy for one‐stage temporal action localization

HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers

PCG-TAL: Progressive Cross-Granularity Cooperation for Temporal Action Localization

Action Sensitivity Learning for Temporal Action Localization

Learnable Feature Augmentation Framework for Temporal Action Localization

Advancing Temporal Action Localization with a Boundary Awareness Network

ConvTransformer Attention Network for temporal action detection

ActionFormer: Localizing Moments of Actions with Transformers

BRMR: TAL Based on Boundary Refinement and Multi-scale Regression

Enriching Local and Global Contexts for Temporal Action Localization

Learning Disentangled Classification and Localization Representations for Temporal Action Localization

Gaussian Temporal Awareness Networks for Action Localization