Abstract: Many sign languages are bona fide natural languages with grammatical rules and lexicons, and can therefore benefit from machine translation methods. Similarly, because sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning). Researchers have therefore begun extending these learning methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality in which meaning often depends on context. The focus of this article, therefore, is to examine various deep learning-based methods for encoding sign language as input, and to analyze the efficacy of several machine translation methods over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, and vector representations (embeddings) from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches as well as more complex networks that use attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages: German (GSL), American (ASL), and Chinese (CSL).
From our analysis, the transformer model combined with input embeddings from ResNet50 or with pose-based landmark features outperformed all the other sequence-to-sequence models, achieving higher BLEU-2 through BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the less controlled ASL and CSL datasets.
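For readers unfamiliar with the metric, the following is a minimal sketch of how a cumulative BLEU-n score (n = 2 to 4) can be computed for one translated sentence against one reference. It is an illustration only, not the evaluation code used in the article: the function names `ngram_counts` and `bleu` and the example sentences are hypothetical, no smoothing is applied, and published results are normally computed at corpus level with an established toolkit.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Cumulative sentence-level BLEU up to order max_n (one reference, no smoothing)."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    # Geometric mean of the modified n-gram precisions.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences (not taken from any of the datasets).
reference = "the weather will be cloudy tomorrow".split()
hypothesis = "the weather will be sunny tomorrow".split()
scores = {n: bleu(hypothesis, reference, max_n=n) for n in (2, 3, 4)}
```

Because each higher order adds a stricter n-gram precision to the geometric mean, BLEU-4 is typically lower than BLEU-2 for the same output, which is why the scores are reported as a range.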

Heterogeneous Attention Based Transformer for Sign Language Translation

SignAttention: On the Interpretability of Transformer Models for Sign Language Translation

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

Hierarchical LSTM for Sign Language Translation

Prior Knowledge and Memory Enriched Transformer for Sign Language Translation

An Improved Sign Language Translation Model with Explainable Adaptations for Processing Long Sign Sentences

Full transformer network with masking future for word-level sign language recognition

SLTUNET: A Simple Unified Model for Sign Language Translation

Spatial–temporal transformer for end-to-end sign language recognition

Sign Language Translation with Hierarchical Spatio-Temporal Graph Neural Network

Better Sign Language Translation with STMC-Transformer

SLGTformer: An Attention-Based Approach to Sign Language Recognition

SimulSLT: End-to-End Simultaneous Sign Language Translation

Multi-channel Transformers for Multi-articulatory Sign Language Translation

Attentional bias for hands: Cascade dual‐decoder transformer for sign language production

Two-Stream Network for Sign Language Recognition and Translation

Hierarchical Recurrent Deep Fusion Using Adaptive Clip Summarization for Sign Language Translation

Deep Learning Methods for Sign Language Translation

Connectionist Temporal Fusion For Sign Language Translation

Sign Language Production with Latent Motion Transformer