Abstract: Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Second, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone: we impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. Third, to enhance its robustness in the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks: PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily.
Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
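
As a rough illustration of the sentence embedding consistency idea described in the abstract, the following minimal sketch adds an auxiliary term to a CTC loss that pulls the visual and sequential features of the same sentence together. The abstract does not specify how sentence embeddings are extracted or compared, so everything here is an assumption made for illustration: the mean pooling over time, the cosine distance, the `weight` parameter, and all function names are hypothetical and may differ from the released code.

```python
import math

def mean_pool(frames):
    """Average a T x D list of per-frame feature vectors into one D-dim vector.
    Mean pooling is an assumption; the abstract does not name an extractor."""
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]

def cosine_distance(u, v, eps=1e-12):
    """Return 1 - cosine similarity, guarding against zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)

def sentence_consistency_loss(visual_feats, sequential_feats):
    """Hypothetical consistency term: distance between the pooled embeddings
    of the visual-module and sequential-module outputs for one sentence."""
    return cosine_distance(mean_pool(visual_feats), mean_pool(sequential_feats))

def total_loss(ctc_loss, visual_feats, sequential_feats, weight=1.0):
    """Combine a precomputed CTC loss value with the auxiliary term.
    `weight` is an illustrative hyperparameter, not taken from the paper."""
    return ctc_loss + weight * sentence_consistency_loss(visual_feats, sequential_feats)
```

In this sketch the auxiliary term is zero when both modules produce pooled embeddings pointing in the same direction and grows as they diverge, so it only ever adds to the CTC loss.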

Gloss Prior Guided Visual Feature Learning for Continuous Sign Language Recognition

Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition

Prior-aware Cross Modality Augmentation Learning for Continuous Sign Language Recognition

Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining

Denoising-Contrastive Alignment for Continuous Sign Language Recognition

Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation

Gloss Attention for Gloss-Free Sign Language Translation

Continuous Sign Language Recognition Using Intra-inter Gloss Attention

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

Improving Gloss-free Sign Language Translation by Reducing Representation Density

Gloss-Free End-to-End Sign Language Translation

Improving Continuous Sign Language Recognition with Adapted Image Models

Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Natural Language-Assisted Sign Language Recognition

Global-local Enhancement Network for NMFs-aware Sign Language Recognition

A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

Improving Continuous Sign Language Recognition with Cross-Lingual Signs

Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization