Abstract: Continuous sign language recognition (CSLR) aims to identify a sequence of glosses from a sign language video in a weakly supervised way, with only a sentence-level label provided. In sign language videos, the transitions among actions are naturally fluent, and the video clips corresponding to different glosses, or even to the same gloss, vary in temporal scale. These factors make it challenging to extract complex temporal information effectively. However, most previous deep learning-based CSLR methods employ temporal modeling with a fixed temporal receptive field, which is simple and effective but does not cope well with video clips of varying temporal scales. To alleviate this problem, we propose a dual-stage temporal perception module (DTPM) that leverages the strengths of both temporal convolutions and transformers and follows a hierarchical two-stage structure aimed at capturing richer and more comprehensive temporal features. Specifically, each stage of DTPM is composed of two parts: a multi-scale local temporal module (MS-LTM), followed by a set of global–local temporal modules (GLTMs), where each GLTM can be further decomposed into a global temporal relational module (GTRM) and a local temporal relational module (LTRM). At each stage, an MS-LTM first models multi-scale local temporal relations, and a set of GLTMs then models global temporal relations and strengthens local temporal relations. Finally, we aggregate the output features of each stage to form a video feature representation with rich semantic information. Extensive experiments on three CSLR benchmarks, PHOENIX14 (Koller et al., Comput Vis Image Underst 141:108–125, 2015), PHOENIX14-T (Camgoz et al., in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7784–7793, 2018), and CSL (Huang et al., in: Proceedings of the AAAI conference on artificial intelligence, pp 32, 2018), validate the effectiveness of our proposed method.
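
The data flow described in the abstract (two stages, each an MS-LTM followed by a set of GLTMs, with the stage outputs aggregated at the end) can be traced with a small, parameter-free sketch. This is not the authors' implementation, and everything in it beyond the module names is an assumption made for illustration: moving averages stand in for learned temporal convolutions, plain dot-product attention stands in for the transformer, the window sizes (3, 5, 7), the number of GLTMs per stage (2), the ordering GTRM then LTRM inside a GLTM, the unchanged sequence length between stages, and the element-wise sum used for aggregation are all hypothetical choices.

```python
import math

def local_average(seq, k):
    """Average each frame with its neighbours in a window of odd size k (truncated at the edges)."""
    half = k // 2
    out = []
    for t in range(len(seq)):
        window = seq[max(0, t - half): t + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def ms_ltm(seq, kernel_sizes=(3, 5, 7)):
    """MS-LTM stand-in: run several window sizes in parallel and average the branches."""
    branches = [local_average(seq, k) for k in kernel_sizes]
    return [[sum(vals) / len(vals) for vals in zip(*frames)]
            for frames in zip(*branches)]

def gtrm(seq):
    """GTRM stand-in: each frame attends to all frames (softmax dot-product attention)."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in seq]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [sum(w * v[d] for w, v in zip(weights, seq)) / total for d in range(dim)]
        # Mix the frame with its attended summary (assumed residual-style mixing).
        out.append([0.5 * (x + a) for x, a in zip(query, attended)])
    return out

def ltrm(seq, k=3):
    """LTRM stand-in: strengthen local relations by mixing in a short-window average."""
    smoothed = local_average(seq, k)
    return [[0.5 * (x + s) for x, s in zip(frame, smooth)]
            for frame, smooth in zip(seq, smoothed)]

def gltm(seq):
    """One GLTM: global relations first, then local relations (assumed ordering)."""
    return ltrm(gtrm(seq))

def stage(seq, num_gltm=2):
    """One stage: an MS-LTM followed by a set of GLTMs."""
    seq = ms_ltm(seq)
    for _ in range(num_gltm):
        seq = gltm(seq)
    return seq

def dtpm(seq):
    """Two stages; their outputs are aggregated by element-wise sum (assumed)."""
    first = stage(seq)
    second = stage(first)
    return [[a + b for a, b in zip(f1, f2)] for f1, f2 in zip(first, second)]

# Toy input: 8 frames, each a 2-dimensional feature vector.
frames = [[float(t), 1.0] for t in range(8)]
features = dtpm(frames)
print(len(features), len(features[0]))
```

In a trained model the averages would be replaced by learned temporal convolutions and the attention by transformer layers with learned projections; the sketch only shows how features pass from the multi-scale local step through the global and local steps of each stage and are combined at the end.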

Connectionist Temporal Fusion For Sign Language Translation

Temporal superimposed crossover module for effective continuous sign language

Spatial–temporal transformer for end-to-end sign language recognition

Hierarchical Recurrent Deep Fusion Using Adaptive Clip Summarization for Sign Language Translation

Towards Online Continuous Sign Language Recognition and Translation

Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

Hierarchical LSTM for Sign Language Translation

Multi-View Spatial-Temporal Network for Continuous Sign Language Recognition

Collaborative Multilingual Continuous Sign Language Recognition: A Unified Framework

Dual-stage temporal perception network for continuous sign language recognition

SlowFast Network for Continuous Sign Language Recognition

Leveraging Graph-based Cross-modal Information Fusion for Neural Sign Language Translation

Contrastive Learning for Sign Language Recognition and Translation

An Improved Sign Language Translation Model with Explainable Adaptations for Processing Long Sign Sentences

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition

Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Prior-aware Cross Modality Augmentation Learning for Continuous Sign Language Recognition

Full transformer network with masking future for word-level sign language recognition

Two-Stream Network for Sign Language Recognition and Translation