Aligning accumulative representations for sign language recognition

Ahmet Alp Kındıroglu,Oğulcan Özdemir,Lale Akarun
DOI: https://doi.org/10.1007/s00138-022-01367-x
IF: 2.983
2022-12-27
Machine Vision and Applications
Abstract:Accumulative representations provide a method for representing variable-length videos with constant length features. In this study, we present aligned temporal accumulative features (ATAF), a skeleton heatmap-based feature for efficient representation and modeling of isolated sign language videos. Inspired by the movement-hold model in sign linguistics, we extract keyframes, align them using temporal transformer networks (TTNs) and extract descriptors using convolutional neural networks (CNNs). In the proposed approach, the use of aligned keyframes increases the recognition power of accumulative features as linguistically significant parts of signs are represented uniquely. Since we detect keyframes using hand movement, there can be differences from signer to signer. To overcome this challenge, ATAF has been implemented with both alignment of sampled frames and keyframe alignment approaches, using both finger speed differences and hand joint heatmaps to perform end-to-end alignment during classification. Results demonstrate that the proposed method achieves state-of-the-art recognition performance on the public BosphorusSign22k (BSign22k) dataset in combination with 3D-CNNs.
computer science, cybernetics, artificial intelligence,engineering, electrical & electronic
What problem does this paper attempt to address?