Abstract: Many sign languages are bona fide natural languages with their own grammatical rules and lexicons, and hence can benefit from machine translation methods. Similarly, because sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning). Researchers have therefore begun extending these learning methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality in which meaning often depends on context. The focus of this article, therefore, is to examine various deep learning-based methods for encoding sign language as input, and to analyze the efficacy of several machine translation methods over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, and vector representations (embeddings) from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches, as well as more complex networks using attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages: German (GSL), American (ASL), and Chinese (CSL) sign languages.
From our analysis, the transformer model combined with input embeddings from ResNet50 or with pose-based landmark features outperformed all other sequence-to-sequence models, achieving higher BLEU-2 through BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the other, less controlled ASL and CSL datasets.
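As a rough illustration of the BLEU-2 through BLEU-4 metric used for comparison, the sketch below computes a sentence-level BLEU-n score against a single reference, using clipped n-gram precision, uniform weights, and the standard brevity penalty. The abstract does not say which BLEU implementation, tokenization, or smoothing was used, so the function names (`ngrams`, `bleu`) and the single-reference, unsmoothed setup are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-max_n with one reference and no smoothing.

    Illustrative only: published results are normally corpus-level and may
    use multiple references or smoothing.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    hyp = "the cat sat on mat".split()
    ref = "the cat sat on the mat".split()
    for n in (2, 3, 4):
        print(f"BLEU-{n}: {bleu(hyp, ref, max_n=n):.4f}")
```

Because higher-order n-grams are harder to match, BLEU-4 is typically lower than BLEU-2 for the same output, which is why a range of BLEU-n scores is reported rather than a single number.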
