Continuous Sign Language Recognition Via Temporal Super-Resolution Network

Zhu Qidan, Li Jing, Yuan Fei, Gan Quan
DOI: https://doi.org/10.1007/s13369-023-07718-8
IF: 2.807
2023-01-01
Arabian Journal for Science and Engineering
Abstract: Spatial-temporal hierarchical continuous sign language recognition (CSLR) models that take video as input are computationally intensive, which limits their real-time application. To address this problem, this paper proposes a temporal super-resolution network (TSRNet) that reduces model computation while keeping the loss of accuracy to a minimum, achieving the best compromise between real-time performance and accuracy. The TSRNet-based CSLR model constructed in this paper consists of three main parts: frame-level feature extraction, temporal feature extraction, and the proposed TSRNet. The TSRNet sits between the other two parts and consists of two branches: a detail descriptor and a coarse descriptor. The extracted frame-level features are first sparsified and then passed through the two branches for feature reconstruction; the fused dense sequence is then subjected to temporal feature extraction. To better recover semantic-level information, this paper also proposes a self-generating adversarial network training method, which treats the TSRNet as the generator and the frame-level and temporal processing parts as the discriminator. In addition, to unify the criterion for judging the loss of model accuracy across different benchmarks, this paper proposes the word error rate deviation (WERD), defined as the error rate between the estimated WER and the reference WER, which are obtained from the reconstructed frame-level feature sequence and the complete original frame-level feature sequence, respectively. Experiments on two large-scale sign language datasets demonstrate the effectiveness of the model. The proposed method is not only applicable to CSLR but also generalizes to spatial-temporal hierarchical models whose input is video data. Code is available at https://github.com/woshisad159/CSLR.git.
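To make the WERD idea concrete, the following is a minimal sketch rather than the authors' implementation. The `wer` function uses the standard edit-distance definition of word error rate. The `werd` function is an assumption: the abstract does not give the exact formula, so the deviation is taken here to be the absolute difference between the estimated and reference WER, relative to the reference WER. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate: total edit errors divided by total reference tokens."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


def werd(wer_estimated, wer_reference):
    """ASSUMED form of WERD: relative deviation of the estimated WER
    (reconstructed feature sequence) from the reference WER
    (complete original feature sequence)."""
    return abs(wer_estimated - wer_reference) / wer_reference


# Toy gloss sequences, invented purely for illustration.
refs = [["A", "B", "C", "D"]]
hyps = [["A", "X", "C"]]   # one substitution and one deletion
print(wer(refs, hyps))
```

Under this assumed definition, a value near zero would mean that the model running on reconstructed features loses almost no accuracy relative to the same model running on the full feature sequence, regardless of the absolute WER of the benchmark.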