Abstract: Sign language recognition (SLR) aims to establish communication between hearing-impaired and non-hearing-impaired communities without relying on traditional interpreter-based approaches. Most existing work on automatic sign recognition relies on hand skeleton joint information rather than image pixels, which avoids problems such as partial occlusion and redundant backgrounds. However, for large-scale sign word datasets, body motion and facial expression also play an essential role alongside hand information, because they increase the inner-gesture variance used to express emotion in sign language. Some researchers have recently developed multi-gesture-based SLR systems, but their accuracy and efficiency remain unsatisfactory for real-time deployment. To address these limitations, we propose a two-stream multistage graph convolution network with attention and residual connections (GCAR), designed to extract spatial-temporal contextual information. Each GCAR stage incorporates a channel attention module that dynamically increases attention, particularly for skeleton points that are not directly connected, during specific events within the spatial-temporal features. The method captures joint skeleton points and their motion, which together describe a person's entire body movement during sign language gestures, and feeds this information into two streams. In the first stream, joint key features are processed by sep-TCN, graph convolution, a deep learning layer, and a channel attention module across multiple stages, producing detailed spatial-temporal features of the sign language gestures. In the second stream, the joint motion is processed through the same steps as the first. The features from the two streams are fused into the final feature vector, which is then fed into the classification module. By projecting unified joint features onto a high-dimensional space, the model captures discriminative structural displacements and short-range dependencies. With these features, the proposed method achieved accuracies of 90.31%, 94.10%, 99.75%, and 34.41% on the WLASL, PSL, MSL, and ASLLVD large-scale datasets, respectively, with 0.69 million parameters. The high accuracy, coupled with stable computational complexity, demonstrates the advantage of the proposed model. We expect this approach to set a new standard for sign language recognition.
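The two-stream data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: every function name is hypothetical, learned weights and the sep-TCN step are omitted, and both the neighbour-averaging graph convolution and the mean-based sigmoid gate are simplified stand-ins chosen only to show how joint and joint-motion features pass through repeated stages with attention and a residual connection before fusion.

```python
import math
from typing import List

Frame = List[List[float]]   # joints x channels
Clip = List[Frame]          # frames x joints x channels


def joint_motion(clip: Clip) -> Clip:
    """Second-stream input: displacement of each joint between consecutive frames.
    The last frame gets zero motion so both streams keep the same length (assumption)."""
    motion = []
    for t in range(len(clip)):
        nxt = clip[min(t + 1, len(clip) - 1)]
        motion.append([[b - a for a, b in zip(j_now, j_next)]
                       for j_now, j_next in zip(clip[t], nxt)])
    return motion


def graph_conv(frame: Frame, adj: List[List[float]]) -> Frame:
    """Average each joint's features over itself and its skeleton neighbours."""
    out = []
    for i, row in enumerate(adj):
        nbrs = [j for j, w in enumerate(row) if w or j == i]
        out.append([sum(frame[j][c] for j in nbrs) / len(nbrs)
                    for c in range(len(frame[0]))])
    return out


def channel_attention(clip: Clip) -> Clip:
    """Gate each channel by a sigmoid of its global mean (learned weights omitted)."""
    n_ch = len(clip[0][0])
    count = len(clip) * len(clip[0])
    gates = []
    for c in range(n_ch):
        mean = sum(joint[c] for frame in clip for joint in frame) / count
        gates.append(1.0 / (1.0 + math.exp(-mean)))
    return [[[v * g for v, g in zip(joint, gates)] for joint in frame]
            for frame in clip]


def stage(clip: Clip, adj: List[List[float]]) -> Clip:
    """One stage: graph convolution, channel attention, then a residual connection."""
    attended = channel_attention([graph_conv(f, adj) for f in clip])
    return [[[x + y for x, y in zip(j_in, j_out)]
             for j_in, j_out in zip(f_in, f_out)]
            for f_in, f_out in zip(clip, attended)]


def two_stream_features(clip: Clip, adj: List[List[float]],
                        n_stages: int = 2) -> List[float]:
    """Run the joint and motion streams through the stages, pool, and concatenate."""
    fused: List[float] = []
    for stream in (clip, joint_motion(clip)):
        for _ in range(n_stages):
            stream = stage(stream, adj)
        n_ch = len(stream[0][0])
        count = len(stream) * len(stream[0])
        # Global average pooling over frames and joints, one value per channel.
        fused += [sum(j[c] for f in stream for j in f) / count for c in range(n_ch)]
    return fused
```

In this sketch, a clip of 2D joints yields a fused vector with twice the channel count (one half per stream), which would then be passed to a classifier.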
