Abstract: As intelligent technologies such as human–computer interaction have matured, human action recognition (HAR) has become widely used in virtual reality, video surveillance, and other fields. However, current video-based HAR methods still cannot fully extract abstract action features, and action collection and recognition remain lacking for special groups such as prisoners and elderly people living alone. To address these problems, this paper proposes a multidimensional feature fusion network called P-MTSC3D, a parallel network based on context modeling and a temporal adaptive attention module. It consists of three branches. The first branch is the basic network branch, which extracts basic feature information. The second branch consists of a feature pre-extraction layer and two multiscale-convolution-based global context modeling combined squeeze and excitation (MGSE) modules, which extract spatial and channel features. The third branch consists of two temporal adaptive attention units based on convolution (TAAC), which extract temporal features. To verify the validity of the proposed network, experiments are conducted on the University of Central Florida (UCF) 101 dataset and the human motion database (HMDB) 51 dataset. The proposed P-MTSC3D network achieves a recognition accuracy of 97.92% on UCF101 and 75.59% on HMDB51. The network requires 30.85 GFLOPs, and its test time on UCF101 is 2.83 s per 16 samples. The experimental results demonstrate that P-MTSC3D achieves better overall performance than state-of-the-art networks. In addition, a prison action (PA) dataset is constructed to verify how well the proposed network works in real-world scenarios.
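The MGSE module named in the abstract combines multiscale convolution, global context modeling, and squeeze and excitation. As a rough illustration of the last ingredient only, the sketch below shows generic squeeze-and-excitation channel gating in plain Python. It is not the authors' implementation: the function name `squeeze_excite`, the weight matrices `w_reduce` and `w_expand`, and the toy input are all hypothetical, and the multiscale convolution and global context parts of MGSE are not reproduced.

```python
import math


def squeeze_excite(features, w_reduce, w_expand):
    """Generic squeeze-and-excitation gating (illustrative, hypothetical weights).

    features: list of C channels, each a flat list of activations.
    w_reduce: R x C weight matrix for the bottleneck layer.
    w_expand: C x R weight matrix mapping back to one gate per channel.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: expand to C values; sigmoid keeps gates in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight every activation in a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Toy input (assumed for illustration): 2 channels, 4 positions each.
features = [[1.0, 2.0, 3.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
w_reduce = [[1.0, 1.0]]      # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel gates
scaled, gates = squeeze_excite(features, w_reduce, w_expand)
```

With these toy weights the active channel receives the larger gate, so its activations are attenuated less than those of the silent channel. In a trained network the two weight matrices would be learned rather than set by hand.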

Multi-scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Mutually Reinforced Spatio-Temporal Convolutional Tube for Human Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

Spatio-Temporal Attention Networks for Action Recognition and Detection

TEINet: Towards an Efficient Architecture for Video Recognition

Action Recognition with Multi-Scale Trajectory-Pooled 3D Convolutional Descriptors

STCA: an action recognition network with spatio-temporal convolution and attention

Multi-scale residual network model combined with Global Average Pooling for action recognition

Spatio-Temporal Collaborative Module for Efficient Action Recognition

STM: SpatioTemporal and Motion Encoding for Action Recognition

Spatiotemporal Multi-Task Network for Human Activity Understanding

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation