Abstract: We propose an action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism to address three limitations of 3D convolutional neural networks: spatiotemporal features are extracted at a single scale, feature maps carry redundant information, and frequency-domain information in the channels is under-exploited. First, building on a 3D CNN, we design a multilevel spatiotemporal feature fusion (MSF) structure and embed it in the network. Through multilevel separation, splicing, and fusion of spatiotemporal features, MSF fuses spatial receptive fields and short-, medium-, and long-range temporal information at different scales while reducing the number of network parameters. Second, we introduce a multi-frequency channel and spatiotemporal attention module (FSAM) that assigns corresponding weights to the different frequency features in the channels and to the spatiotemporal features, reducing the information redundancy of the feature maps. Finally, we embed the proposed method into the R3D model, which replaces the 2D convolutional filters of the 2D ResNet with 3D convolutional filters, and conduct extensive experiments on the small-to-medium-sized UCF101 dataset and the large-scale Kinetics-400 dataset. Our model improves recognition accuracy on both datasets. On UCF101 in particular, it improves recognition accuracy over R3D by up to 7.2% while using 34.2% fewer parameters. We also transfer MSF and FSAM to C3D, another traditional 3D action recognition model; on UCF101, recognition accuracy improves by 8.9%, which indicates that the method generalizes well across models.
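The abstract does not spell out the internal layout of MSF, so the following is only an illustrative sketch of why separating channels before convolving can reduce parameters. It assumes, hypothetically, that the channels are split into equal groups and that each group gets its own 3x3x3 convolution; the function names and the channel counts are made up for illustration, and the resulting numbers are not the paper's reported 34.2% figure.

```python
def conv3d_params(c_in, c_out, kernel=(3, 3, 3), bias=False):
    """Number of learnable weights in one 3D convolution layer."""
    kt, kh, kw = kernel
    n = c_in * c_out * kt * kh * kw
    return n + (c_out if bias else 0)


def split_conv3d_params(channels, n_splits, kernel=(3, 3, 3)):
    """Weights when channels are split into equal groups, each convolved separately.

    Assumption: equal-width groups, same kernel per group (hypothetical layout).
    """
    if channels % n_splits != 0:
        raise ValueError("channels must be divisible by n_splits")
    width = channels // n_splits
    return n_splits * conv3d_params(width, width, kernel)


# Illustrative sizes only: 64 channels, 4 groups.
full = conv3d_params(64, 64)           # one dense 3x3x3 convolution
split = split_conv3d_params(64, 4)     # four independent 16-channel convolutions
reduction = 1 - split / full
print(full, split, f"{reduction:.0%}")
```

Under these assumptions the weight count scales as 1/n_splits relative to the dense layer, which is the general reason a separate-then-fuse design can widen the range of receptive fields without a matching growth in parameters.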

3D Residual Networks with Channel-Spatial Attention Module for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Spatio-Temporal Attention Networks for Action Recognition and Detection

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

An efficient attention module for 3d convolutional neural networks in action recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Spatiotemporal Residual Networks for Video Action Recognition

Cascading Spatio-Temporal Attention Network for Real-Time Action Detection

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network

Select and Focus: Action Recognition with Spatial-Temporal Attention

MSF-Net: A Multilevel Spatiotemporal Feature Fusion Network Combines Attention for Action Recognition

Joint Network based Attention for Action Recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network