Abstract: The aim of temporal action localization (TAL) is to determine the start and end frames of an action in a video. In recent years, TAL has attracted considerable attention because of its increasing applications in video understanding and retrieval. However, precisely estimating the duration of an action in the temporal dimension is still a challenging problem. In this paper, we propose an effective one‐stage TAL method based on a self‐defined motion data structure, called a dense joint motion matrix (DJMM), and a novel temporal detection strategy. Our method makes three main contributions. First, compared with mainstream motion images, DJMMs preserve more pre‐processed motion features and provide more precise detail representations. Furthermore, DJMMs solve the temporal information loss problem caused by motion trajectory overlaps within a certain time period. Second, a spatial pyramid pooling (SPP) layer, which is widely used in the object detection and tracking fields, is incorporated into the proposed method for multi‐scale feature learning. Moreover, the SPP layer enables the backbone convolutional neural network (CNN) to receive DJMMs of any size in the temporal dimension. Third, a large‐scale‐first temporal detection strategy inspired by a well‐developed Chinese text segmentation algorithm is proposed to address long‐duration videos. Our method is evaluated on two benchmark data sets and one self‐collected data set: Florence‐3D, UTKinect‐Action3D and HanYue‐3D. The experimental results show that our method achieves competitive action recognition accuracy and high TAL precision, and its time efficiency and few‐shot learning capabilities enable it to be utilized for real‐time surveillance.
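To illustrate why pyramid pooling lets a network accept inputs of any temporal length, the following is a minimal sketch of max pooling over a pyramid of temporal bins. It is not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the layout of the input (one row per frame, one column per motion feature) are assumptions made for illustration, and in the proposed method the SPP layer operates on CNN feature maps rather than on the raw matrix.

```python
def temporal_pyramid_pool(matrix, levels=(1, 2, 4)):
    """Max-pool a T x D matrix into a fixed-length vector.

    matrix: list of T rows (frames), each a list of D feature values.
            This layout is an assumption for illustration only.
    levels: number of temporal bins at each pyramid level (hypothetical choice).
    Returns a flat list of length D * sum(levels), independent of T.
    """
    num_frames = len(matrix)
    if num_frames == 0:
        raise ValueError("matrix must contain at least one frame")
    num_features = len(matrix[0])

    pooled = []
    for bins in levels:
        for i in range(bins):
            # Bin boundaries: floor for the start, ceiling for the end,
            # so every bin covers at least one frame even when T < bins.
            start = (i * num_frames) // bins
            end = ((i + 1) * num_frames + bins - 1) // bins
            window = matrix[start:end]
            # Max over time, separately for each feature column.
            for d in range(num_features):
                pooled.append(max(row[d] for row in window))
    return pooled


# Two inputs with different temporal lengths give equally sized outputs.
short_clip = [[float(t), float(t % 3)] for t in range(5)]
long_clip = [[float(t), float(t % 3)] for t in range(9)]
print(len(temporal_pyramid_pool(short_clip)))  # 14
print(len(temporal_pyramid_pool(long_clip)))   # 14
```

Because the output length depends only on the number of features and the pyramid levels, a fully connected layer placed after such a pooling step can be shared across inputs of different durations.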

Localizing and recognizing action unit using position information of local feature

A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams.

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Localizing Volumetric Motion for Action Recognition in Realistic Videos

Robust 3D Action Recognition Through Sampling Local Appearances and Global Distributions.

Local Feature Analysis for real-time Action Recognition.

Exploring Probabilistic Localized Video Representation for Human Action Recognition

Fusing $\mathcal{R}$ Features and Local Features with Context-Aware Kernels for Action Recognition

Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling

Human Action Recognition under Log-Euclidean Riemannian Metric.

Action Recognition and Localization with Instance FCNN

Com-STAL: Compositional Spatio-Temporal Action Localization

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Learning facial expression-aware global-to-local representation for robust action unit detection

Action Recognition by Exploring Data Distribution and Feature Correlation

A distribution based video representation for human action recognition

An Approach to Pose-Based Action Recognition

Temporal Action Localization by Structured Maximal Sums

Multi‐scale feature learning and temporal probing strategy for one‐stage temporal action localization

Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram.

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation