Abstract: Many sign languages are bona fide natural languages with their own grammatical rules and lexicons, and hence can benefit from machine translation methods. Similarly, since sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning methods in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning). Researchers have therefore begun extending these learning methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality in which meaning often depends on context. The focus of this article, therefore, is to examine various deep learning-based methods for encoding sign language as input, and to analyze the efficacy of several machine translation methods on three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. The input features include body and finger joints, facial points, and vector representations (embeddings) from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches, as well as more complex networks using attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages: German (GSL), American (ASL), and Chinese (CSL) sign languages.
In our analysis, the transformer model combined with input embeddings from ResNet50 or with pose-based landmark features outperformed all other sequence-to-sequence models, achieving higher BLEU-2 through BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the less controlled ASL and CSL datasets.
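For readers unfamiliar with the metric, the following is a minimal sketch of how BLEU-n scores of the kind reported above are commonly computed: clipped n-gram precisions for orders 1 to n, combined by a geometric mean with uniform weights and scaled by a brevity penalty. The abstract does not say which BLEU implementation or tokenization was used, so this is an illustration under assumptions rather than the evaluation code behind the reported results. The function names `ngram_counts` and `corpus_bleu`, the single-reference setup, the whitespace tokenization, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU-max_n, assuming one reference per hypothesis
    and uniform weights over n-gram orders (an illustrative assumption)."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref, n)
            hyp_counts = ngram_counts(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Made-up example sentences, tokenized on whitespace.
reference = ["the weather will be cold tomorrow".split()]
hypothesis = ["the weather will be warm tomorrow".split()]
for n in (2, 3, 4):
    print(f"BLEU-{n}: {corpus_bleu(reference, hypothesis, max_n=n):.4f}")
```

Because higher orders demand longer exact matches, BLEU-4 is never larger than BLEU-2 for the same output in this unsmoothed form, which is one reason the scores are usually reported side by side.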
