Abstract: An efficient cascade dual‐decoder Transformer model is presented, which heuristically optimises mappings among text, hand pose, and full‐articulatory pose for sign language production (SLP). In addition, a novel spatio‐temporal loss is introduced to provide more efficacious guidance for SLP models. Both quantitative and qualitative results show that the proposed SLP model with the spatio‐temporal loss function achieves state‐of‐the‐art results on both German and Korean SLP tasks.

Sign Language Production (SLP) refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning by means of multiple asynchronous articulators, including manual and non‐manual information channels. Recent deep learning‐based SLP models directly generate the full‐articulatory sign sequence from the text input in an end‐to‐end manner. However, these models largely down‐weight the importance of subtle differences in the manual articulation due to the effect of regression to the mean. To explore these neglected aspects, an efficient cascade dual‐decoder Transformer (CasDual‐Transformer) for SLP is proposed to learn, successively, two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, utilising an attention‐based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more efficacious guidance, a novel spatio‐temporal loss is introduced to penalise shape dissimilarity and temporal distortions of produced sequences. Experimental studies are performed on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that the authors' model demonstrates competitive performance compared to state‐of‐the‐art models, and in some cases, achieves considerable improvements over them.
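
The abstract does not give the formula for the spatio‐temporal loss, so the following is only a minimal sketch of the general idea of combining a shape penalty with a temporal penalty, not the authors' formulation. It assumes pose sequences of equal length, with each frame stored as a flat list of joint coordinates. The shape term is taken to be the mean squared error on positions and the temporal term the mean squared error on frame‐to‐frame differences. The function names and the weight `alpha` are hypothetical.

```python
from typing import Sequence

Frame = Sequence[float]  # one pose frame as flattened joint coordinates (assumed layout)


def _mean_sq_diff(a: Frame, b: Frame) -> float:
    """Mean squared difference between two frames of equal, non-zero size."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def shape_term(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Assumed shape penalty: per-frame position MSE, averaged over time."""
    return sum(_mean_sq_diff(p, t) for p, t in zip(pred, target)) / len(pred)


def _velocities(seq: Sequence[Frame]) -> list:
    """Frame-to-frame differences, a simple proxy for motion over time."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]


def temporal_term(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Assumed temporal penalty: MSE between the two velocity sequences."""
    return shape_term(_velocities(pred), _velocities(target))


def spatio_temporal_loss(
    pred: Sequence[Frame], target: Sequence[Frame], alpha: float = 0.5
) -> float:
    """Weighted sum of the two terms; alpha is a hypothetical trade-off weight."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("sequences must have equal length of at least 2 frames")
    return alpha * shape_term(pred, target) + (1 - alpha) * temporal_term(pred, target)
```

Under these assumptions, a prediction that is offset from the target by a constant is penalised only by the shape term, while a prediction with the right poses at the wrong moments is also penalised by the temporal term.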

DualSign: Semi-Supervised Sign Language Production with Balanced Multi-Modal Multi-Task Dual Transformation

Sign Language Production with Latent Motion Transformer

Attentional bias for hands: Cascade dual‐decoder transformer for sign language production

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Towards Fast and High-Quality Sign Language Production

Cross-modality Data Augmentation for End-to-End Sign Language Translation

MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

SignLLM: Sign Language Production Large Language Models

SimulSLT: End-to-End Simultaneous Sign Language Translation

G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Semi-Supervised Spoken Language Glossification

SLTUNET: A Simple Unified Model for Sign Language Translation

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

Improving Sign Language Translation with Monolingual Data by Sign Back-Translation

Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation

Sign Stitching: A Novel Approach to Sign Language Production

Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation

Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks