Abstract: Two-stream models are widely used in human action recognition (HAR). However, traditional two-stream networks disregard the inter-frame sequence characteristics of video, which reduces model robustness when local sequence information and long-term motion information interact. To address this, a novel three-stream neural network is proposed that combines the long-term and short-term characteristics of a frame sequence with spatio-temporal information. First, optical flow frames and RGB frames are extracted from the video to obtain its motion information and spatial information, respectively. These are fed into the corresponding temporal network and spatial network, and the spatial information is also fed into a sequence feature processing network; the three networks are then pretrained. After training, features are extracted from each network and combined by a weighted parallel fusion algorithm, and the action categories are classified with a multi-layer perceptron (MLP). Experimental results on the UCF11, UCF50, and HMDB51 datasets demonstrate that our model effectively integrates the spatio-temporal and frame-sequence information of human actions, resulting in a significant improvement in recognition accuracy. Its classification accuracy on the three datasets is 99.17%, 97.40%, and 96.88%, respectively, significantly enhancing generalization capability and validity relative to conventional two-stream or three-stream models.
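
The fusion and classification step described in the abstract can be illustrated with a minimal sketch. It assumes that "parallel fusion by adding weights" means a weighted element-wise sum of equal-length feature vectors from the spatial, temporal, and sequence streams. The abstract does not give the exact formula, so this reading is an assumption. All names, dimensions, stream weights, and the randomly initialized MLP parameters below are hypothetical placeholders, not the authors' trained model.

```python
import math
import random


def weighted_fusion(features, weights):
    """Fuse equal-length feature vectors as a weighted element-wise sum.

    Assumption: this is one plausible reading of "parallel fusion by
    adding weights"; the abstract does not specify the formula.
    """
    if len(features) != len(weights):
        raise ValueError("one weight per stream is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all streams must share one feature dimension")
    return [sum(w * f[i] for f, w in zip(features, weights)) for i in range(dim)]


def dense(x, weight_rows, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight_rows, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_classify(x, w1, b1, w2, b2):
    """Two-layer perceptron: dense -> ReLU -> dense -> softmax."""
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical per-stream features; real ones would come from the
# pretrained spatial, temporal, and sequence networks.
spatial = [0.2, 0.1, 0.7, 0.0]
temporal = [0.5, 0.3, 0.1, 0.9]
sequence = [0.4, 0.6, 0.2, 0.3]
stream_weights = [0.4, 0.3, 0.3]  # hypothetical fusion weights

fused = weighted_fusion([spatial, temporal, sequence], stream_weights)

# Untrained, randomly initialized MLP parameters (illustration only).
rng = random.Random(0)
feature_dim, hidden_dim, num_classes = 4, 8, 3
w1 = random_matrix(hidden_dim, feature_dim, rng)
b1 = [0.0] * hidden_dim
w2 = random_matrix(num_classes, hidden_dim, rng)
b2 = [0.0] * num_classes

probs = mlp_classify(fused, w1, b1, w2, b2)
predicted = max(range(num_classes), key=probs.__getitem__)
print("fused features:", fused)
print("predicted class index:", predicted)
```

In practice the fusion weights would be tuned or learned and the MLP trained on labeled clips; the sketch only shows how the three feature vectors flow into a single classifier.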

Action Recognition and Localization with Instance FCNN

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Learning Comprehensive Motion Representation for Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Local Feature Analysis for real-time Action Recognition

Exploring Frame Segmentation Networks for Temporal Action Localization

Joint Network based Attention for Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Real-time spatiotemporal action localization algorithm using improved CNNs architecture

Action Recognition with Joint Attention on Multi-Level Deep Features

Deep Concept-wise Temporal Convolutional Networks for Action Localization

Mining Spatial and Spatio-Temporal ROIs for Action Recognition

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

A fast human action recognition network based on spatio-temporal features

Action recognition using three dimension convolution and long short term memory

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition