Abstract: Sign Language Recognition (SLR) has attracted significant research attention in recent years, particularly Continuous Sign Language Recognition (CSLR), which is more complex than Isolated Sign Language Recognition (ISLR). A prominent challenge in CSLR is accurately detecting the boundaries of isolated signs within a continuous video stream. In addition, the reliance of existing models on handcrafted features limits their accuracy. To overcome these challenges, we propose a novel Transformer-based approach that improves accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The model is trained on isolated sign videos: hand keypoint features extracted from the input video are enriched by the Transformer model and then forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. Evaluation on two distinct datasets, which include both continuous signs and their corresponding isolated signs, demonstrates promising results.

What problem does this paper attempt to address?

This paper focuses on accurately detecting the boundaries of isolated signs in Continuous Sign Language Recognition (CSLR). Current CSLR systems face two main challenges: recognizing where each isolated sign starts and ends in a continuous video stream, and a reliance on handcrafted features that limits accuracy. To address these issues, the paper proposes a Transformer-based model that requires no handcrafted features. The model is trained only on isolated sign videos but is applied to continuous sign videos, so a single model serves both ISLR and CSLR.

In the training phase, the model takes hand keypoint features extracted from the input videos and enriches them with the Transformer model. The enriched features are then fed into the final classification layer. In the inference phase, the trained model is combined with a post-processing method to detect the boundaries of isolated signs in continuous sign videos. The proposed architecture consists of four steps: 3D hand keypoint estimation, feature enrichment, classification, and post-processing prediction.

The paper also compares related work, including pre-trained and non-pretrained models in SLR, and emphasizes the advantages of the Transformer model in handling long-range dependencies and improving recognition accuracy. Experiments on two datasets support the effectiveness and accuracy of the Transformer model in boundary detection.

In summary, the paper aims to solve the boundary detection problem in CSLR by using a Transformer model to improve accuracy, reduce reliance on handcrafted features, and process isolated and continuous signs in a unified way. The paper also discusses the limitations of the datasets and the challenges posed by differences in signing among users, and notes that the model's generalizability still needs to be verified on real-world datasets in future work.
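The summary does not spell out the post-processing method, so the following is only a minimal sketch of one plausible scheme, not the paper's algorithm. It assumes an isolated-sign classifier has already been run over sliding windows of the continuous video, and it merges consecutive confident windows that share a predicted class into one segment. The function name `detect_sign_boundaries` and the parameters `window_size`, `stride`, and `threshold` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence, Tuple


def detect_sign_boundaries(
    window_probs: Sequence[Sequence[float]],
    window_size: int,
    stride: int,
    threshold: float = 0.5,
) -> List[Tuple[int, int, int]]:
    """Merge confident sliding-window predictions into segments.

    Assumption (not from the paper): window_probs[i] holds the class
    probabilities an isolated-sign classifier assigned to the window
    that starts at frame i * stride and spans window_size frames.
    Returns (start_frame, end_frame, class_id) with inclusive frames.
    """
    segments: List[Tuple[int, int, int]] = []
    current = None  # [start_frame, end_frame, class_id] of the open segment

    for i, probs in enumerate(window_probs):
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        confident = probs[best_class] >= threshold
        start = i * stride
        end = start + window_size - 1

        if confident and current is not None and current[2] == best_class:
            # The same sign continues: extend the open segment.
            current[1] = end
            continue

        # Otherwise close any open segment, then maybe open a new one.
        if current is not None:
            segments.append((current[0], current[1], current[2]))
            current = None
        if confident:
            current = [start, end, best_class]

    if current is not None:
        segments.append((current[0], current[1], current[2]))
    return segments


# Hypothetical probabilities for five windows over two sign classes.
example_probs = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.1, 0.9], [0.2, 0.8]]
example_segments = detect_sign_boundaries(
    example_probs, window_size=4, stride=2, threshold=0.6
)
```

In this sketch a low-confidence window acts as the separator between signs. Because the windows overlap, adjacent segments of different classes can also overlap by a few frames, so a real system would need an additional rule to assign the shared frames.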

A Transformer Model for Boundary Detection in Continuous Sign Language

Word separation in continuous sign language using isolated signs and post-processing

Continuous Sign Language Recognition Using Intra-inter Gloss Attention

Continuous Sign Language Recognition Via Reinforcement Learning

Full transformer network with masking future for word-level sign language recognition

Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Spatial–temporal transformer for end-to-end sign language recognition

Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining

Towards Online Continuous Sign Language Recognition and Translation

A Transformer-Based Multi-Stream Approach for Isolated Iranian Sign Language Recognition

ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos

Korean Sign Language Recognition Using Transformer-Based Deep Neural Network

A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

Isolated Arabic Sign Language Recognition Using A Transformer-based Model and Landmark Keypoints

Improving Continuous Sign Language Recognition with Cross-Lingual Signs

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Improving Continuous Sign Language Recognition with Adapted Image Models

Semantic Boundary Detection with Reinforcement Learning for Continuous Sign Language Recognition

Continuous Sign Language Recognition Based on Motor attention mechanism and frame-level Self-distillation