RNN-Transducer Based Chinese Sign Language Recognition

Liqing Gao, Haibo Li, Zhijian Liu, Zekang Liu, Liang Wan, Wei Feng
DOI: https://doi.org/10.1016/j.neucom.2020.12.006
IF: 6
2020-01-01
Neurocomputing
Abstract: Sign Language Recognition (SLR) aims to interpret sign language video as natural language, which greatly facilitates communication between the deaf and the general public. SLR is usually formulated as a sequence alignment problem, in which connectionist temporal classification (CTC) plays an important role in building an effective alignment between the video sequence and sentence-level labels. However, CTC-based SLR methods tend to fail if the output label sequence is longer than the input video sequence. They also ignore the interdependencies between output predictions. This paper addresses these two issues and proposes a new RNN-Transducer based SLR framework, the visual hierarchy to lexical sequence alignment network (H2SNet). In this framework, we design a visual hierarchy transcription network to capture the spatial appearance and temporal motion cues of sign video at multiple levels. Meanwhile, we use a lexical prediction network to extract effective contextual information from output predictions. The RNN-Transducer is applied to learn the mapping between sequential video features and sentence-level labels. Extensive experiments validate the effectiveness and superiority of our approach over state-of-the-art methods.
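The length limitation of CTC mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard CTC and RNN-Transducer formulations rather than anything specific to H2SNet, and the function names are hypothetical, chosen only for illustration. CTC emits at most one label per input frame and needs a blank between adjacent repeated labels, whereas an RNN-Transducer alignment interleaves one blank per frame with any number of label emissions.

```python
def ctc_min_input_length(labels):
    """Smallest number of input frames for which CTC can emit `labels`.

    CTC outputs at most one symbol per frame, and two identical adjacent
    labels must be separated by a blank frame.
    """
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def ctc_can_align(num_frames, labels):
    """True if at least one valid CTC alignment exists."""
    return num_frames >= ctc_min_input_length(labels)


def rnnt_path_length(num_frames, labels):
    """Length of any RNN-Transducer alignment path.

    Each path contains exactly one blank per frame (advance in time) and
    one step per label (emit), so a path exists even when the label
    sequence is longer than the input.
    """
    return num_frames + len(labels)


if __name__ == "__main__":
    glosses = ["I", "GO", "SCHOOL"]           # hypothetical gloss sequence
    print(ctc_can_align(2, glosses))          # two frames, three labels
    print(rnnt_path_length(2, glosses))
```

With two frames and three labels, the CTC check fails while the transducer lattice still contains valid paths, which is the first of the two issues the abstract raises.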