Abstract: The aim of end-to-end sign language translation (SLT) is to translate continuous sign language (SL) video sequences into coherent natural language sentences without any intermediary annotations, i.e., glosses. However, end-to-end SLT suffers from two intractable issues: (i) the loss of the temporal correspondence constraint between SL videos and glosses, and (ii) the weakly supervised sequence labeling problem between long SL videos and sentences. To address these issues, we propose an adaptive video representation enhanced Transformer (AVRET) with three extra modules: adaptive masking (AM), local clip self-attention (LCSA), and adaptive fusion (AF). Specifically, we use the first AM module to generate a special mask that adaptively drops out temporally important SL video frame representations to enhance the SL video features. Then, we pass the masked video features to the Transformer encoder, which consists of LCSA and masked self-attention, to learn clip-level and continuous video-level feature information. Finally, the output feature of the encoder is fused with the temporal feature of the AM module via the AF module, and a second AM module is used to generate more robust feature representations. In addition, we add weakly supervised loss terms to constrain the two AM modules. To promote Chinese SLT research, we further construct CSL-FocusOn, a Chinese continuous SLT dataset, and share its collection method. The dataset covers many common scenarios and provides SL sentence annotations and multi-cue images of signers. Our experiments on the CSL-FocusOn, PHOENIX14T, and CSL-Daily datasets show that the proposed method achieves competitive performance on the end-to-end SLT task without using glosses in training. The code is available at https://github.com/LzDddd/AVRET.
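
To make the idea of clip-level attention concrete, the following is a minimal sketch of self-attention restricted to local clips. It is not the AVRET implementation (see the repository above for that). It assumes non-overlapping clips of a fixed length and omits learned projections and multiple heads; the names `local_clip_mask`, `masked_softmax`, `local_clip_self_attention`, and the parameter `clip_len` are hypothetical and chosen only for illustration.

```python
import math


def local_clip_mask(num_frames, clip_len):
    """Return mask[i][j] = True when frames i and j fall in the same clip.

    Assumption: clips are non-overlapping windows of `clip_len` frames.
    """
    return [
        [(i // clip_len) == (j // clip_len) for j in range(num_frames)]
        for i in range(num_frames)
    ]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed positions set to 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)  # numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def local_clip_self_attention(features, clip_len):
    """Single-head scaled dot-product self-attention within each clip.

    `features` is a list of per-frame vectors. Queries, keys, and values
    are the raw features (no learned projections in this sketch).
    """
    n = len(features)
    dim = len(features[0])
    mask = local_clip_mask(n, clip_len)
    output = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
            for j in range(n)
        ]
        weights = masked_softmax(scores, mask[i])
        output.append(
            [sum(weights[j] * features[j][d] for j in range(n)) for d in range(dim)]
        )
    return output


# Four frames, two clips of two frames each.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
attended = local_clip_self_attention(frames, clip_len=2)
```

In this toy input each frame attends only to the two frames of its own clip, so information from the other clip never leaks into its output. A full encoder would follow such a layer with attention over the whole sequence to capture video-level context, as the abstract describes.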

Spatial-Temporal Consistency Constraints for Chinese Sign Language Synthesis

Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Pose-Guided Fine-Grained Sign Language Video Generation

Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Connectionist Temporal Fusion For Sign Language Translation

Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition

Coherent Image Animation Using Spatial-Temporal Correspondence

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Research on the improved gesture tracking algorithm in sign language synthesis

Contrastive Learning for Sign Language Recognition and Translation

Text-To-Visual Speech in Chinese Based on Data-Driven Approach

Hierarchical Recurrent Deep Fusion Using Adaptive Clip Summarization for Sign Language Translation

Sign Language Production with Latent Motion Transformer

Spatial–temporal transformer for end-to-end sign language recognition

Hierarchical LSTM for Sign Language Translation

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video

Real-Time Vision-Based Chinese Sign Language Recognition with Pose Estimation and Attention Network

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism