Word-level Sign Language Recognition with Multi-stream Neural Networks Focusing on Local Regions and Skeletal Information

Mizuki Maruyama,Shrey Singh,Katsufumi Inoue,Partha Pratim Roy,Masakazu Iwamura,Michifumi Yoshioka
DOI: https://doi.org/10.1109/ACCESS.2024.3494878
2024-11-20
Abstract:Word-level sign language recognition (WSLR) has attracted attention because it is expected to overcome the communication barrier between people with speech impairment and those who can hear. In the WSLR problem, a method designed for action recognition has achieved the state-of-the-art accuracy. Indeed, it sounds reasonable for an action recognition method to perform well on WSLR because sign language is regarded as an action. However, a careful evaluation of the tasks reveals that the tasks of action recognition and WSLR are inherently different. Hence, in this paper, we propose a novel WSLR method that takes into account information specifically useful for the WSLR problem. We realize it as a multi-stream neural network (MSNN), which consist of three streams: 1) base stream, 2) local image stream, and 3) skeleton stream. Each stream is designed to handle different types of information. The base stream deals with quick and detailed movements of the hands and body, the local image stream focuses on handshapes and facial expressions, and the skeleton stream captures the relative positions of the body and both hands. This approach allows us to combine various types of data for more comprehensive gesture analysis. Experimental results on the WLASL and MS-ASL datasets show the effectiveness of the proposed method; it achieved an improvement of approximately 10\%--15\% in Top-1 accuracy when compared with conventional methods.
Computer Vision and Pattern Recognition,Multimedia
What problem does this paper attempt to address?