Abstract: Automatic sign language recognition (SLR) is a vital part of human–computer interaction and computer vision, converting the hand signs used by people with significant hearing and speech impairments into equivalent text or voice. Researchers have recently used hand skeleton joint information instead of raw image pixels, because pixel-based methods are sensitive to illumination changes and complex backgrounds. However, besides hand information, body motion and facial gestures play an essential role in expressing emotion in sign language. A few researchers have built SLR systems on multi-gesture datasets, but their accuracy and time complexity remain unsatisfactory. To address these limitations, we introduce a spatial and temporal attention model combined with a general neural network for SLR. The main idea of our architecture is to first construct a fully connected graph onto which the skeleton information is projected. We then employ self-attention mechanisms to extract information from node and edge features across the spatial and temporal domains. The architecture has three branches: a graph-based spatial branch, a graph-based temporal branch, and a general neural network branch, whose outputs are combined in the final feature integration. Specifically, the spatial branch captures spatial dependencies, while the temporal branch captures temporal dependencies within the sequential hand skeleton data. The general neural network branch improves the architecture's generalization and robustness. We evaluate the model on the Mexican Sign Language (MSL) and Pakistani Sign Language (PSL) datasets and on the American Sign Language Lexicon Video Dataset (ASLLVD), using 3D joint coordinates for the face, body, and hands, with experiments on individual gesture types and on their combinations. The model achieves an accuracy of 99.96% on MSL, 92.00% on PSL, and 26.00% on ASLLVD, which includes more than 2700 classes. These results, together with the model's computational efficiency, demonstrate its advantage over contemporary methods in the field.
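The three-branch idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the projections, edge features, or fusion rule, so the code below assumes identity query/key/value projections, ignores edge features, uses a pointwise scaled ReLU as a stand-in for the general neural network branch, and fuses by concatenation. All function names (`self_attention`, `spatial_branch`, `temporal_branch`, `general_branch`, `fuse`) are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(nodes):
    # Scaled dot-product self-attention over a fully connected set of nodes.
    # Assumption: identity Q/K/V projections (a trained model would learn them).
    d = len(nodes[0])
    out = []
    for q in nodes:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in nodes]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, nodes)) for c in range(d)])
    return out

def spatial_branch(seq):
    # Attend across joints within each frame (spatial dependencies).
    return [self_attention(frame) for frame in seq]

def temporal_branch(seq):
    # Attend across frames for each joint (temporal dependencies).
    T, J = len(seq), len(seq[0])
    per_joint = [self_attention([seq[t][j] for t in range(T)]) for j in range(J)]
    return [[per_joint[j][t] for j in range(J)] for t in range(T)]

def general_branch(seq, weight=0.5):
    # Hypothetical stand-in for the general neural network branch:
    # a pointwise scaled ReLU applied to every coordinate.
    return [[[max(0.0, weight * x) for x in joint] for joint in frame] for frame in seq]

def fuse(*branches):
    # Assumed fusion rule: concatenate the per-joint features of all branches.
    T, J = len(branches[0]), len(branches[0][0])
    return [[sum((b[t][j] for b in branches), []) for j in range(J)] for t in range(T)]

# Toy skeleton sequence: 2 frames, 3 joints, 3D coordinates per joint.
seq = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.1, 0.0, 0.0], [1.0, 0.2, 0.0], [0.0, 1.0, 0.3]],
]
fused = fuse(spatial_branch(seq), temporal_branch(seq), general_branch(seq))
print(len(fused), len(fused[0]), len(fused[0][0]))  # 2 3 9
```

Each joint ends up with a 9-dimensional feature: three coordinates from each branch. In a trainable model, these fused features would typically feed a classifier over the sign classes.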

Skeleton-Aware Neural Sign Language Translation

Skeleton Aware Multi-modal Sign Language Recognition

Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble

TMS-Net: A multi-feature multi-stream multi-level information sharing network for skeleton-based sign language recognition

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

Sign Language Recognition Based On Facial Expression and Hand Skeleton

Dynamic Spatial-Temporal Aggregation for Skeleton-Aware Sign Language Recognition

Natural Language-Assisted Sign Language Recognition

Two-Stream Network for Sign Language Recognition and Translation

StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition

SKIM: Skeleton-Based Isolated Sign Language Recognition With Part Mixing

Hierarchical LSTM for Sign Language Translation

Sign Language Recognition with Long Short-Term Memory

Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

SF-Net: Structured Feature Network for Continuous Sign Language Recognition

Spatial–temporal attention with graph and general neural network-based sign language recognition

SLTUNET: A Simple Unified Model for Sign Language Translation

Sign Language Translation with Hierarchical Spatio-Temporal Graph Neural Network