Abstract: Automatic sign language recognition (SLR) is a key task in human–computer interaction and computer vision: it converts the hand signs used by people with significant hearing and speech impairments into equivalent text or voice. Researchers have recently used hand skeleton joint information instead of image pixels, because pixel-based methods are sensitive to illumination and complex backgrounds. However, besides the hands, body motion and facial gestures play an essential role in conveying emotion in sign language. A few researchers have developed SLR systems on multi-gesture datasets, but their accuracy and computational efficiency remain insufficient. To address these limitations, we introduce a spatial and temporal attention model combined with a general neural network for SLR. The main idea of our architecture is to first construct a fully connected graph onto which the skeleton information is projected. We then apply self-attention to extract node and edge features across the spatial and temporal domains. The architecture has three branches: a graph-based spatial branch, a graph-based temporal branch, and a general neural network branch, whose outputs are combined in the final feature integration. Specifically, the spatial branch captures spatial dependencies, while the temporal branch captures temporal dependencies in the sequential hand skeleton data. The general neural network branch improves the architecture's generalization and robustness. We evaluate the model on the Mexican Sign Language (MSL) and Pakistani Sign Language (PSL) datasets and on the American Sign Language Large Video Dataset (ASLLVD), drawing on 3D joint coordinates for the face, body, and hands, with experiments on individual gestures and their combinations. Our model achieves an accuracy of 99.96% on the MSL dataset, 92.00% on PSL, and 26.00% on ASLLVD, which includes more than 2700 classes. These results, together with the model's computational efficiency, indicate that it compares favorably with existing methods in the field.
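The three-branch design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the 21-joint layout, the feature width, the random weights, and the fusion by average pooling and concatenation are all assumptions made for illustration, and the sketch covers node features only, not edge features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Scaled dot-product self-attention over the second-to-last axis of x.
    # Every element attends to every other one, which corresponds to a
    # fully connected graph over those elements.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def init_params(c, d, seed=0):
    # Hypothetical random weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    w = lambda: rng.normal(scale=0.1, size=(c, d))
    return {"spatial": (w(), w(), w()),
            "temporal": (w(), w(), w()),
            "general": w()}

def three_branch_features(seq, params):
    # seq has shape (T, J, C): frames, joints, coordinates per joint.
    # Spatial branch: within each frame, joints attend to all joints.
    spatial = self_attention(seq, *params["spatial"])          # (T, J, D)
    # Temporal branch: for each joint, frames attend to all frames.
    per_joint = np.swapaxes(seq, 0, 1)                         # (J, T, C)
    temporal = self_attention(per_joint, *params["temporal"])  # (J, T, D)
    # General branch: a plain dense layer with ReLU (assumed form).
    general = np.maximum(seq @ params["general"], 0.0)         # (T, J, D)
    # Assumed fusion: average-pool each branch, then concatenate.
    pooled = [b.mean(axis=(0, 1)) for b in (spatial, temporal, general)]
    return np.concatenate(pooled)

# 16 frames, 21 joints, 3D coordinates (synthetic data).
seq = np.random.default_rng(1).normal(size=(16, 21, 3))
features = three_branch_features(seq, init_params(c=3, d=8))
print(features.shape)  # (24,)
```

The fused vector would feed a classifier over the sign classes. Attending over joints and over frames separately keeps each attention matrix small (J×J and T×T) compared with attending over all joint-frame pairs at once.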

Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Pose-Guided Fine-Grained Sign Language Video Generation

Sign Language Production with Latent Motion Transformer

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

Connectionist Temporal Fusion For Sign Language Translation

Dynamical semantic enhancement network for continuous sign language recognition

Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition

Towards Fast and High-Quality Sign Language Production

Spatial–temporal transformer for end-to-end sign language recognition

Modeling the Speed and Timing of American Sign Language to Generate Realistic Animations

Real-Time Vision-Based Chinese Sign Language Recognition with Pose Estimation and Attention Network

Skeleton-Aware Neural Sign Language Translation

Spatial–temporal attention with graph and general neural network-based sign language recognition

DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

Dynamic Spatial-Temporal Aggregation for Skeleton-Aware Sign Language Recognition

Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation

TMS-Net: A multi-feature multi-stream multi-level information sharing network for skeleton-based sign language recognition

Learning to Score Sign Language with Two-stage Method