ST-TGR: Spatio-Temporal Representation Learning for Skeleton-Based Teaching Gesture Recognition

Zengzhao Chen, Wenkai Huang, Hai Liu, Zhuo Wang, Yuqun Wen, Shengming Wang
DOI: https://doi.org/10.3390/s24082589
IF: 3.9
2024-04-19
Sensors
Abstract: Teaching gesture recognition is a technique used to recognize the hand movements of teachers in classroom teaching scenarios. This technology is widely used in education, including for classroom teaching evaluation, enhancing online teaching, and assisting special education. However, current research on gesture recognition in teaching mainly focuses on detecting the static gestures of individual students and analyzing their classroom behavior. To analyze the teacher's gestures and mitigate the difficulty of single-target dynamic gesture recognition in multi-person teaching scenarios, this paper proposes skeleton-based teaching gesture recognition (ST-TGR), which learns through spatio-temporal representation. This method mainly uses the human pose estimation technique RTMPose to extract the coordinates of the keypoints of the teacher's skeleton and then inputs the recognized sequence of the teacher's skeleton into the MoGRU action recognition network for classifying gesture actions. The MoGRU action recognition module mainly learns the spatio-temporal representation of target actions by stacking a multi-scale bidirectional gated recurrent unit (BiGRU) and using improved attention mechanism modules. To validate the generalization of the action recognition network model, we conducted comparative experiments on datasets including NTU RGB+D 60, UT-Kinect Action3D, SBU Kinect Interaction, and Florence 3D. The results indicate that, compared with most existing baseline models, the model proposed in this article exhibits better performance in recognition accuracy and speed.
engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation
What problem does this paper attempt to address?
This paper addresses the difficulty of single-target dynamic gesture recognition in multi-person teaching scenarios. Current gesture recognition research in teaching mainly focuses on detecting the static gestures of individual students and analyzing their classroom behavior, and it overlooks the potential impact of teachers' gestures on classroom teaching. To analyze teachers' gestures and reduce the difficulty of single-target dynamic gesture recognition in multi-person scenes, the paper proposes skeleton-based teaching gesture recognition (ST-TGR), which recognizes teachers' gestures through spatio-temporal representation learning.

### Main Contributions

1. **Gesture Recognition Algorithm Based on Skeleton Keypoints**: The method extracts the skeleton keypoint coordinates of the target through human pose estimation and then feeds information sequences at different scales into the subsequent gesture recognition module for gesture action classification.
2. **Efficient Action Recognition Network Module MoGRU**: The paper proposes MoGRU, a simple and efficient action recognition network module that integrates a multi-scale bidirectional GRU module with an improved attention mechanism module. Using only the target's skeleton information, it achieves good action classification performance on different benchmark action datasets, especially small-sample ones. The module also balances recognition speed against recognition accuracy, which makes it a candidate for practical applications.
3. **Construction of the Teaching Gesture Action Dataset (TGAD)**: To promote the application of gesture recognition in teaching, the paper constructs a teaching gesture action dataset (TGAD) from real classroom teaching scenarios. It covers four teaching gesture actions captured from different perspectives, with 400 samples in total. The proposed method reaches a recognition accuracy of 93.5% on this dataset.

### Method Overview

1. **Skeleton Keypoint Extraction**:
   - The high-performance human pose estimation detector RTMPose, from the MMPose algorithm library, identifies the teacher's skeleton keypoints in classroom teaching videos.
   - RTMPose works in a "top-down" mode: a pre-trained detector produces bounding boxes, and the pose of each person is then estimated separately. This approach gives higher recognition accuracy in complex classroom environments.
   - To speed up inference, RTMPose uses CSPNeXt as its backbone and applies a skip-frame detection strategy, non-maximum suppression, and smoothing filtering to make pose processing more robust.
2. **Gesture Action Classification**:
   - A new MoGRU action recognition network model is built from a three-layer bidirectional GRU module, a multi-layer CNN module, and an improved multi-head self-attention module.
   - The multi-layer GRU module transforms the per-frame keypoint information produced by pose estimation into a feature vector that carries the temporal information of the teacher's gesture.
   - Convolutional neural network modules at different scales extract the spatial information between keypoints within the same frame, which strengthens the model's grasp of how keypoints relate to one another.
   - The improved multi-head self-attention module further enhances the spatio-temporal features, and the fused spatio-temporal feature vector is passed to a fully connected layer for softmax classification.

### Experimental Results and Analysis

1. **Datasets**:
   - **NTU RGB+D**: A large-scale RGB-D human action recognition dataset containing 60 actions and 56,880 samples in total.
   - **UT-Kinect Action3D**: A dataset for 3D action recognition.
   - **SBU Kinect Interaction**: A dataset for interactive action recognition.
   - **Florence 3D**: A dataset for 3D action recognition.
2. **Experimental Results**:
   - The proposed model shows better recognition accuracy and speed than most existing baseline models on multiple benchmark datasets.
   - On small-sample datasets in particular, the MoGRU module shows a good balance between recognition speed and accuracy.

Together, these methods and results address single-target dynamic gesture recognition in multi-person teaching scenarios and support the application of teaching gesture recognition in education.
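
The summary above says the bidirectional GRU module turns per-frame keypoint information into features that carry temporal context, but it gives no implementation details. The following is a minimal pure-Python sketch of a single-layer bidirectional GRU forward pass, not the authors' MoGRU code. It uses one common GRU formulation, omits biases, and runs with small random, untrained weights. The class and function names (`GRUCell`, `run_gru`, `bigru`), the hidden size, and the toy input dimensions are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class GRUCell:
    """One GRU cell with small random, untrained weights (biases omitted)."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # W* act on the input frame, U* act on the previous hidden state.
        self.Wz, self.Uz = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wr, self.Ur = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wn, self.Un = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    def step(self, x, h):
        # Update gate z and reset gate r, both in (0, 1).
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        # Candidate state computed from the input and the reset-gated history.
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, gated_h))]
        # Blend the previous state with the candidate state.
        return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def run_gru(cell, frames):
    """Return the hidden state after each frame, in the order the frames were given."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    return states


def bigru(forward_cell, backward_cell, frames):
    """Concatenate forward and backward hidden states for every frame."""
    fwd = run_gru(forward_cell, frames)
    bwd = run_gru(backward_cell, frames[::-1])[::-1]  # re-align to frame order
    return [f + b for f, b in zip(fwd, bwd)]


# Toy input (assumed): 5 frames, 3 keypoints with (x, y) each, flattened to 6 values.
rng = random.Random(0)
frames = [[rng.random() for _ in range(6)] for _ in range(5)]
fwd_cell = GRUCell(6, 4, rng)
bwd_cell = GRUCell(6, 4, rng)
features = bigru(fwd_cell, bwd_cell, frames)
print(len(features), len(features[0]))  # 5 8
```

Each frame ends up with a feature vector of twice the hidden size, because the forward state summarizes the frames before it and the backward state summarizes the frames after it. In a trained model these per-frame features would feed the later convolution and attention stages. A practical implementation would more likely rely on a deep learning framework's built-in GRU layer than on hand-written loops.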