Multimodal Gesture Recognition with Spatio-Temporal Features Fusion Based on YOLOv5 and MediaPipe

Wenyi Cao, Peiqi Lu, Wenxin Cao
DOI: https://doi.org/10.1142/s0218001424550073
IF: 1.261
2024-06-30
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: As a natural, intuitive and easy-to-learn mode of interaction, gesture plays an important role in communication. Hand detection, which carries multimodal information, includes static and dynamic detection and involves intricate spatial relationship problems such as different hand sizes, complex joints, occlusion and self-occlusion. This study focused on a multimodal hand gesture recognition system based on YOLOv5 and MediaPipe with fused spatio-temporal features. First, the MediaPipe and OpenCV libraries were employed to implement hand keypoint detection. Subsequently, human–computer interaction (HCI) for volume control was realized by measuring the distance between the thumb and index finger. Finally, model training was conducted based on the YOLOv5 algorithm, and the recognition of different gesture categories was realized. The performance of YOLOv5s, YOLOv5m, and YOLOv5l was evaluated and compared. The gesture recognition system interface was visualized with PyQt5. Experiments show that the average detection accuracy of the model is 99.4% and the recognition speed is around 0.2 s.
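The volume-control step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes hand landmarks are supplied as 21 normalized (x, y) pairs in the MediaPipe Hands ordering, where index 4 is the thumb tip and index 8 is the index fingertip. The function names and the calibration bounds `min_px` and `max_px` are hypothetical and would need tuning for a given camera and hand distance.

```python
import math

THUMB_TIP = 4          # MediaPipe Hands landmark index of the thumb tip
INDEX_FINGER_TIP = 8   # MediaPipe Hands landmark index of the index fingertip


def pinch_distance_px(landmarks, frame_w, frame_h):
    """Pixel distance between thumb tip and index fingertip.

    `landmarks` is a sequence of 21 (x, y) pairs normalized to [0, 1],
    so each axis is scaled by the frame size before measuring.
    """
    tx, ty = landmarks[THUMB_TIP]
    ix, iy = landmarks[INDEX_FINGER_TIP]
    return math.hypot((tx - ix) * frame_w, (ty - iy) * frame_h)


def distance_to_volume(dist_px, min_px=30.0, max_px=250.0):
    """Map a pinch distance linearly to a volume percentage in [0, 100].

    min_px and max_px are assumed calibration bounds (hypothetical values);
    distances outside the range are clamped.
    """
    clamped = max(min_px, min(max_px, dist_px))
    return 100.0 * (clamped - min_px) / (max_px - min_px)


# Example: a synthetic hand whose fingertips are 140 px apart on a 640x480 frame.
hand = [(0.0, 0.0)] * 21
hand[THUMB_TIP] = (0.5, 0.5)
hand[INDEX_FINGER_TIP] = (0.71875, 0.5)
volume = distance_to_volume(pinch_distance_px(hand, 640, 480))
```

In a live system the landmark list would come from a hand keypoint detector on each video frame, and the resulting percentage would be passed to whatever audio interface the platform provides.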
computer science, artificial intelligence