Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

Wang Y., Shen X. J., Chen H. P., Sun J. X.
DOI: https://doi.org/10.1134/s105466182103024x
2021-01-01
Pattern Recognition and Image Analysis
Abstract: Traditional human action recognition algorithms based on feature extraction are complicated and yield low recognition accuracy. We present an algorithm for recognizing human actions in videos based on spatio-temporal fusion using 3D convolutional neural networks (3D CNNs). The algorithm contains two subnetworks, which extract deep spatial information and temporal information, respectively, and a bilinear fusion policy is applied to obtain the final fused spatio-temporal information. Spatial information is represented by a gradient feature, and temporal information is represented by optical flow. By constructing a new 3D CNN, the fused spatio-temporal information can be used to retrieve deep features from multiple angles. The proposed algorithm is compared with current mainstream algorithms on the KTH and UCF101 datasets, and the results show that it is effective and achieves high recognition accuracy.
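The abstract does not spell out the form of the bilinear fusion policy. As a rough illustration only, the sketch below assumes the common bilinear pooling formulation: at each spatio-temporal position, take the outer product of the spatial-stream and temporal-stream feature vectors, then average over all positions. The function name `bilinear_fuse`, the `(channels, frames, height, width)` layout, and the averaging step are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def bilinear_fuse(spatial, temporal):
    """Fuse two feature maps by averaged outer products (hypothetical helper).

    spatial:  array of shape (C1, T, H, W) from the spatial subnetwork
    temporal: array of shape (C2, T, H, W) from the temporal subnetwork
    Returns a flat vector of length C1 * C2.
    """
    if spatial.shape[1:] != temporal.shape[1:]:
        raise ValueError("both streams must share the same (T, H, W) grid")

    c1 = spatial.shape[0]
    c2 = temporal.shape[0]

    # Flatten every spatio-temporal position into one axis: (C, N).
    a = spatial.reshape(c1, -1)
    b = temporal.reshape(c2, -1)
    n = a.shape[1]

    # Sum of per-position outer products, averaged over the N positions.
    fused = (a @ b.T) / n  # shape (C1, C2)

    return fused.reshape(-1)


# Toy inputs: 2 spatial channels, 3 temporal channels, 2 positions.
spatial_feat = np.ones((2, 1, 1, 2))
temporal_feat = 2.0 * np.ones((3, 1, 1, 2))
fused_vec = bilinear_fuse(spatial_feat, temporal_feat)
```

In a two-stream setup of this kind, the fused vector (or the unflattened `(C1, C2)` map) would typically be passed to further layers for classification; how the paper's new 3D CNN consumes it is not stated in the abstract.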