Isolated Video-Based Sign Language Recognition Using a Hybrid CNN-LSTM Framework Based on Attention Mechanism

Diksha Kumari, Radhey Shyam Anand
DOI: https://doi.org/10.3390/electronics13071229
IF: 2.9
2024-03-27
Electronics
Abstract: Sign language is a complex language that uses hand gestures, body movements, and facial expressions and is primarily used by the deaf community. Sign language recognition (SLR) is a popular research domain as it provides an efficient and reliable solution to bridge the communication gap between people who are hard of hearing and those with good hearing. Recognizing isolated sign language words from video is a challenging research area in computer vision. This paper proposes a hybrid SLR framework that combines a convolutional neural network (CNN) and an attention-based long short-term memory (LSTM) neural network. We used MobileNetV2 as a backbone model due to its lightweight structure, which reduces the complexity of the model architecture for deriving meaningful features from the video frame sequence. The spatial features are fed to an LSTM optimized with an attention mechanism to select the significant gesture cues from the video frames and focus on salient features from the sequential data. The proposed method is evaluated on the benchmark WLASL dataset with 100 classes based on precision, recall, F1-score, and 5-fold cross-validation metrics. Our methodology acquired an average accuracy of 84.65%. The experiment results illustrate that our model performed effectively and computationally efficiently compared to other state-of-the-art methods.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the problem of Isolated Sign Language Recognition (ISLR). Specifically, the authors propose a new framework that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM) with an attention mechanism to recognize isolated sign language vocabulary from videos.

#### Main Contributions:

1. **Hybrid CNN-LSTM Framework**: A model based on CNN and LSTM is proposed, with an attention mechanism applied to the LSTM output layer to detect spatiotemporal features.
2. **Attention Mechanism**: The attention layer assigns different weights through a probability distribution, focusing on relevant cues in the sequence to improve the accuracy of sign language gesture recognition.
3. **Lightweight Architecture**: The designed model has a lightweight structure with a moderate number of parameters, outperforming existing methods in computational efficiency.
4. **Performance Evaluation**: The effectiveness and robustness of the model are evaluated through multiple performance metrics and K-fold cross-validation.

### Experimental Setup

- **Dataset**: A subset of the Word Level American Sign Language (WLASL) dataset is used, containing 100 classes of sign language vocabulary with a total of 2038 video samples.
- **Data Preprocessing**: Video frames are padded, scaled, and normalized so that all frame sequences have a consistent length and pixel values lie in the range [0, 1].
- **Model Architecture**: The model uses MobileNetV2 as a feature extractor to extract spatial features; these features are then fed into an LSTM with an attention mechanism to further learn temporal relationships.

With this approach, the proposed model achieves an average accuracy of 84.65% on the WLASL dataset, and the authors report that it is computationally efficient compared with other state-of-the-art methods.
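
The preprocessing and attention steps described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The summary does not state where padding is applied or how attention scores are computed, so zero-padding at the end of a clip and a dot product with a score vector (`score_vector`, standing in for learned parameters) are assumptions made here. All function names are hypothetical, and spatial resizing and the MobileNetV2 and LSTM stages are omitted, so plain lists of numbers stand in for frames and LSTM outputs.

```python
import math
from typing import List, Tuple

Vector = List[float]


def pad_or_truncate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Force a clip to exactly target_len frames by cropping or zero-padding the end."""
    if not frames:
        raise ValueError("clip has no frames")
    width = len(frames[0])
    clipped = frames[:target_len]
    padding = [[0.0] * width for _ in range(target_len - len(clipped))]
    return clipped + padding


def normalize(frames: List[Vector]) -> List[Vector]:
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return [[p / 255.0 for p in frame] for frame in frames]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: a probability distribution over time steps."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: List[Vector],
                   score_vector: Vector) -> Tuple[Vector, Vector]:
    """Weight each time step's hidden state and sum them into one context vector."""
    # Assumed scoring function: dot product of each hidden state with score_vector.
    scores = [sum(h * w for h, w in zip(state, score_vector))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * state[i] for a, state in zip(weights, hidden_states))
               for i in range(dim)]
    return context, weights


if __name__ == "__main__":
    clip = [[0, 128, 255], [255, 255, 0]]  # two tiny "frames" of three pixels each
    frames = normalize(pad_or_truncate(clip, target_len=4))
    # Stand-in for LSTM outputs: the normalized frames play that role here.
    context, weights = attention_pool(frames, score_vector=[1.0, 0.5, -0.5])
    print(len(frames), [round(a, 3) for a in weights])
```

Because the weights come from a softmax, they are non-negative and sum to 1, which matches the summary's description of attention assigning weights "through a probability distribution". Time steps with higher scores contribute more to the context vector that a classifier would then consume.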