American Sign Language to Text Translation using Transformer and Seq2Seq with LSTM
Gregorius Guntur Sunardi Putra, Adifa Widyadhani Chanda D'Layla, Dimas Wahono, Riyanarto Sarno, Agus Tri Haryono
2024-09-17
Abstract: Sign language translation is an important issue in communication between deaf and hearing people, as sign language expresses words through hand, body, and mouth movements. American Sign Language is one such sign language, and it includes signs for the alphabet. Neural machine translation technology is now being applied to sign language translation, and the Transformer has become the state of the art in natural language processing. This study compares the Transformer with the Sequence-to-Sequence (Seq2Seq) model in translating sign language to text. In addition, an experiment was conducted by adding Residual Long Short-Term Memory (ResidualLSTM) to the Transformer. Adding ResidualLSTM reduces the performance of the Transformer model by 23.37% based on the BLEU score. In comparison, the Transformer itself increases the BLEU score by 28.14 compared to the Seq2Seq model.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the translation of American Sign Language (ASL) into text. Specifically, the authors aim to improve the accuracy of sign language translation by using the Transformer model and the sequence-to-sequence (Seq2Seq) model with LSTM. They also test whether introducing Residual LSTM into the Transformer further improves model performance.
### Main research questions:
1. **Accuracy of sign language translation**: How to improve the accuracy of sign-language-to-text translation, especially when dealing with long-term dependencies and complex gestures.
2. **Performance comparison of different models**: Compare the performance of the Transformer model, the Seq2Seq model, and the Transformer model with Residual LSTM in the sign language translation task.
3. **Model optimization**: Explore whether adding Residual LSTM in the Transformer can further improve the model performance.
### Research background:
- Sign language is an important means of communication between deaf and hearing people, but its diversity and complexity make automatic translation a challenge.
- The development of neural machine translation technology provides new solutions for sign language translation, especially the successful application of the Transformer model in natural language processing.
### Research methods:
- An ASL dataset containing facial, right-hand, and left-hand landmarks was used.
- The landmarks in the sign language video were converted into vectors through the embedding layer and used as the input of the Transformer encoder.
- Three models were compared: the Seq2Seq model, the Transformer model, and the Transformer model with Residual LSTM.
- Model performance was evaluated using metrics such as BLEU score and Character Error Rate (CER).
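To make the Character Error Rate metric named above concrete, here is a minimal sketch of the usual definition: the Levenshtein edit distance between the reference and the predicted text, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and this is not necessarily the exact implementation used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of five.
    print(cer("hello", "hallo"))
```

Lower is better: a CER of 0 means the predicted string matches the reference exactly, and the value can exceed 1 when the prediction is much longer than the reference.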
### Research results:
- The Transformer model improved the BLEU score by 28.14 compared to the Seq2Seq model, showing better translation quality.
- Adding Residual LSTM to the Transformer reduced its BLEU score by 23.37%, indicating that Residual LSTM may have a negative impact on the sign language translation task.
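Since the results above are reported in BLEU, a simplified sketch of a sentence-level BLEU computation may help in reading them. It assumes character-level tokens, uniform weights over 1- to 4-grams, a single reference, and no smoothing. These choices are assumptions made for illustration, as the summary does not state the tokenization or BLEU variant the paper used, and the name `sentence_bleu` is hypothetical here rather than a library call.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Tokens are characters (an assumption)."""
    ref, hyp = list(reference), list(hypothesis)
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(sentence_bleu("hello world", "hello world"))
    print(sentence_bleu("hello world", "hello word"))
```

Higher is better, with a maximum of 1.0 for an exact match. Scores are often reported multiplied by 100, so it matters whether a reported change is an absolute difference in score or a relative percentage.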
### Conclusions:
- The Transformer model performs well in the sign-language-to-text translation task and is significantly better than the traditional Seq2Seq model.
- Adding Residual LSTM to the Transformer did not help in this task; it lowered the BLEU score relative to the plain Transformer.
- Future research can further optimize the model, especially in dealing with low-frequency characters and unstructured sign language.
Through this study, the authors hope to provide a useful reference for the development of sign language translation technology, thereby promoting effective communication between deaf and hearing people.