Abstract: With the maturing of intelligent technologies such as human–computer interaction, human action recognition (HAR) has been widely used in virtual reality, video surveillance, and other fields. However, current video-based HAR methods still cannot fully extract abstract action features, and action collection and recognition for special groups such as prisoners and elderly people living alone is still lacking. To address these problems, this paper proposes a multidimensional feature fusion network called P-MTSC3D, a parallel network based on context modeling and a temporal adaptive attention module. It consists of three branches. The first serves as the basic network branch and extracts basic feature information. The second consists of a feature pre-extraction layer and two multiscale-convolution-based global context modeling combined squeeze and excitation (MGSE) modules, which extract spatial and channel features. The third consists of two temporal adaptive attention units based on convolution (TAAC), which extract temporal features. To verify the validity of the proposed network, experiments are conducted on the University of Central Florida (UCF) 101 dataset and the human motion database (HMDB) 51 dataset. The P-MTSC3D network achieves a recognition accuracy of 97.92% on the UCF101 dataset and 75.59% on the HMDB51 dataset. Its computational cost is 30.85 GFLOPs, and its test time is 2.83 s per 16 samples on the UCF101 dataset. The experimental results demonstrate that the P-MTSC3D network has better overall performance than state-of-the-art networks. In addition, a prison action (PA) dataset is constructed to verify how well the proposed network performs in real application scenarios.
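The squeeze and excitation step that the MGSE module builds on can be illustrated with a minimal sketch. This is the generic squeeze-and-excitation channel gating only, not the MGSE module itself, whose multiscale convolutions and global context modeling are not specified in the abstract. The function name `squeeze_excite`, the weight matrices `w1` and `w2`, and the toy input values are all assumptions made for illustration.

```python
import math

def squeeze_excite(features, w1, w2):
    """Generic squeeze-and-excitation gating (illustrative, not MGSE).

    features: list of C channels, each a flat list of activations.
    w1: bottleneck weights, shape [C_reduced][C] (hypothetical).
    w2: expansion weights, shape [C][C_reduced] (hypothetical).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expand back to C channels and apply a sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every activation in a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates

# Toy input: 2 channels with 4 positions each (made-up values).
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
w1 = [[1.0, 1.0]]    # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [0.0]]  # zero weights yield a neutral gate of 0.5 per channel
scaled, gates = squeeze_excite(features, w1, w2)
```

In a trained network the weights would be learned, so the gates would differ per channel and emphasize the channels most useful for recognition; the zero weights here only make the behavior easy to check by hand.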

Spatiotemporal Multi-Task Network for Human Activity Understanding

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

A fast human action recognition network based on spatio-temporal features

Spatio-Temporal Attention Networks for Action Recognition and Detection

A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition

Hierarchical Multi-View Aggregation Network for Sensor-Based Human Activity Recognition

Temporal-Spatial Mapping for Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Spatio-Temporal Fusion Networks for Action Recognition

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Real-time spatiotemporal action localization algorithm using improved CNNs architecture

A Hierarchical Spatio-Temporal Model for Human Activity Recognition