Skeleton-based Human Action Recognition by Fusing Attention Based Three-Stream Convolutional Neural Network and SVM.
Fang Ren, Chao Tang, Anyang Tong, Wenjian Wang
DOI: https://doi.org/10.1007/s11042-023-15334-9
IF: 2.577
2023-01-01
Multimedia Tools and Applications
Abstract: This work proposes a method for human action recognition from 3D skeleton sequences that fuses an attention-based three-stream convolutional neural network with a support vector machine. Traditional action recognition methods primarily take RGB video as input. However, RGB video suffers from large data volume and low semantic content, and it leaves the model prone to interference from irrelevant information such as the background. The 3D skeleton sequence, by contrast, carries compact, high-level information about human motion, which facilitates action recognition. First, three kinds of information are extracted from the raw skeleton data: the 3D coordinates of the joints, their temporal differences, and their spatial differences. Each is fed into its own convolutional neural network for pre-training. Then, the pre-trained networks extract features containing spatial-temporal information. Finally, the fused feature vectors are input into the support vector machine for training and classification. On the public NTU RGB+D dataset, the method reaches accuracies of 92.6% and 86.7% under the X-View and X-Sub benchmarks, respectively, indicating that combining multi-stream feature learning, feature fusion, and a hybrid model can improve recognition accuracy.
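As a rough illustration of the three input streams named in the abstract, the sketch below derives them from a toy skeleton sequence. The abstract does not define the differences precisely, so two assumptions are made here: the temporal difference is taken as the displacement of each joint between consecutive frames, and the spatial difference as the offset of each joint from a parent joint. The function names and the `parents` list are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Frame = List[Joint]                  # all joints in one frame
Sequence = List[Frame]               # all frames of one action clip


def sub(a: Joint, b: Joint) -> Joint:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def temporal_difference(seq: Sequence) -> Sequence:
    """Assumed definition: displacement of each joint from frame t to t+1.

    The result has one frame fewer than the input.
    """
    return [
        [sub(nxt[j], cur[j]) for j in range(len(cur))]
        for cur, nxt in zip(seq, seq[1:])
    ]


def spatial_difference(seq: Sequence, parents: List[int]) -> Sequence:
    """Assumed definition: offset of each joint from its parent joint.

    parents[j] is the index of joint j's parent; a root joint points to itself.
    """
    return [
        [sub(frame[j], frame[parents[j]]) for j in range(len(frame))]
        for frame in seq
    ]


def three_streams(seq: Sequence, parents: List[int]) -> Dict[str, Sequence]:
    """Bundle the three inputs, one per convolutional stream."""
    return {
        "coords": seq,
        "temporal": temporal_difference(seq),
        "spatial": spatial_difference(seq, parents),
    }


# Toy clip: 2 frames, 3 joints in a chain (joint 0 is the root).
toy_seq: Sequence = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 3.0, 0.0)],
]
toy_parents = [0, 0, 1]
streams = three_streams(toy_seq, toy_parents)
```

In the pipeline the abstract describes, each of these streams would be passed to its own pre-trained network, and the extracted features would then be fused and classified by the support vector machine. Those stages are not sketched here, since the abstract gives no architectural details.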