A Transformer Model for Boundary Detection in Continuous Sign Language

Razieh Rastgoo, Kourosh Kiani, Sergio Escalera
2024-02-23
Abstract: Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly in the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model, conducted on two distinct datasets that include both continuous signs and their corresponding isolated signs, demonstrates promising results.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of accurately detecting the boundaries of isolated signs in Continuous Sign Language Recognition (CSLR). Current CSLR systems face two main challenges: identifying where each isolated sign starts and ends in a continuous video stream, and a reliance on handcrafted features that limits accuracy. To address both, the paper proposes a Transformer-based model that requires no handcrafted features. The model is trained only on isolated sign videos but is applied to continuous sign videos at inference time, so a single model serves both ISLR and CSLR.

The proposed architecture consists of four steps: 3D hand keypoint estimation, feature enrichment, classification, and post-processing prediction. During training, hand keypoint features extracted from the input video are enriched by the Transformer and passed to the final classification layer. During inference, the trained model is combined with a post-processing method to detect the boundaries of isolated signs in continuous sign videos.

The paper also reviews related work, including pre-trained and non-pretrained models for SLR, and emphasizes the Transformer's advantages in handling long-range dependencies and improving recognition accuracy. Experiments on two datasets are reported to validate the model's effectiveness and accuracy in boundary detection.
Despite some success, the paper also discusses the limitations of the datasets and challenges posed by differences in hand gestures among different users, pointing out the need for further verification of the model's generalizability on real-world datasets in future work.
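
The summary above does not spell out how the post-processing step turns classifier outputs into sign boundaries. As a rough illustration only, the sketch below assumes the trained isolated-sign classifier yields one (label, confidence) pair per frame of the continuous video, and it groups consecutive confident frames with the same label into segments. The names `Segment` and `detect_boundaries`, and the thresholds `min_conf` and `min_len`, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    label: int  # predicted sign class
    start: int  # first frame index of the sign (inclusive)
    end: int    # last frame index of the sign (inclusive)


def detect_boundaries(
    frame_preds: Sequence[Tuple[int, float]],
    min_conf: float = 0.6,  # assumed confidence threshold, not from the paper
    min_len: int = 3,       # assumed minimum segment length in frames
) -> List[Segment]:
    """Group consecutive confident frames with the same label into segments."""
    segments: List[Segment] = []
    current = None  # segment currently being extended, if any

    def flush(seg):
        # Keep a finished segment only if it is long enough.
        if seg is not None and seg.end - seg.start + 1 >= min_len:
            segments.append(seg)

    for i, (label, conf) in enumerate(frame_preds):
        confident = conf >= min_conf
        if confident and current is not None and current.label == label:
            current.end = i  # same sign continues
        else:
            flush(current)   # a boundary: close the previous segment
            current = Segment(label, i, i) if confident else None
    flush(current)
    return segments


# Hypothetical per-frame predictions for a short continuous clip.
preds = [(0, 0.2), (1, 0.9), (1, 0.8), (1, 0.95), (2, 0.3),
         (2, 0.9), (2, 0.9), (2, 0.7), (2, 0.4)]
for seg in detect_boundaries(preds):
    print(seg)
```

Low-confidence frames act as gaps between signs in this sketch, and the minimum length filters out spurious one-frame detections. The method actually used in the paper may differ.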