Full transformer network with masking future for word-level sign language recognition

Yao Du, Pan Xie, Mingye Wang, Xiaohui Hu, Zheng Zhao, Jiaqi Liu
DOI: https://doi.org/10.1016/j.neucom.2022.05.051
IF: 6
2022-08-21
Neurocomputing
Abstract: Word-level sign language recognition (SLR) is the task of transcribing a sign language video into a word. Current deep-learning frameworks mostly combine spatial feature extractors based on convolutional neural networks (CNNs) with sequence learners. These methods either lack the capacity to capture high-level visual semantics together with fine image details, or they model the video frame sequence poorly. Focusing on gestures and facial expressions is essential to interpreting sign language; however, it is challenging to crop these elements from images and learn from them end-to-end. In this paper, a full self-attention framework for word-level SLR is proposed to tackle these issues. It integrates a Vision Transformer as the spatial encoder with an improved temporal Transformer, which we strengthen by introducing a masking future operation. The Vision Transformer first extracts latent high-level semantic feature sequences from sign language videos and feeds them into the temporal module. The masking future Transformer then enhances this sequence by hiding all subsequent frames at each time step and generates the final recognition result. This approach integrates global and local spatial information and can also distinguish the latent semantic features contained in sign language action sequences. To validate the proposed approach, we perform extensive experiments on two datasets. The results and ablation studies demonstrate the effectiveness of the method, which achieves new state-of-the-art performance on the WLASL dataset using RGB images alone.
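The masking future operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the masking is a standard causal (lower-triangular) attention mask, uses a single head with no learned projections, and runs in pure Python for readability. The names `causal_mask` and `masked_self_attention` are hypothetical.

```python
import math


def causal_mask(num_frames):
    """mask[t][s] is True when frame t may attend to frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def masked_self_attention(features):
    """Single-head self-attention over per-frame features, future frames hidden.

    features: list of T vectors (lists of floats), one per video frame.
    Returns (outputs, weights). Queries, keys and values are the raw
    features themselves, a simplification made for illustration only.
    """
    num_frames = len(features)
    dim = len(features[0])
    mask = causal_mask(num_frames)
    outputs, weights = [], []
    for t in range(num_frames):
        # Scaled dot-product scores; masked (future) positions get -inf.
        scores = [
            sum(q * k for q, k in zip(features[t], features[s])) / math.sqrt(dim)
            if mask[t][s]
            else float("-inf")
            for s in range(num_frames)
        ]
        peak = max(scores)  # always finite: frame t can attend to itself
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the visible frames' features.
        outputs.append(
            [sum(row[s] * features[s][j] for s in range(num_frames)) for j in range(dim)]
        )
    return outputs, weights
```

Calling `masked_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` returns attention weights that are zero above the diagonal, so the representation of each frame depends only on that frame and the ones before it.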
computer science, artificial intelligence