Abstract: Sign language recognition (SLR) is one of the crucial applications of hand gesture recognition and computer vision research. Many researchers have worked on hand gesture-based SLR applications for English, Turkish, Arabic, and other sign languages. However, few studies have addressed Korean sign language (KSL) classification because few KSL datasets are publicly available. In addition, existing KSL recognition work still struggles to perform efficiently, because illumination variation and background complexity are major problems in this field. In the last decade, researchers have successfully applied vision transformers to sign language recognition by extracting long-range dependencies within the image. Moreover, there is a significant gap between CNNs and transformers in terms of model performance and efficiency. We have also not yet found a KSL recognition model that combines a CNN with a transformer. To overcome these challenges, we propose a convolution and transformer-based multi-branch network that takes advantage of the transformer's long-range dependency computation and the CNN's local feature calculation for sign language recognition. We extract initial features with the grained model and then extract features from the transformer and the CNN in parallel. After concatenating the local and long-range dependency features, a new classification module performs the classification. We evaluated the proposed model on a KSL benchmark dataset and our lab dataset, where it achieved 89.00% accuracy on the 77-label KSL dataset and 98.30% accuracy on the lab dataset. The higher performance suggests that the proposed model generalizes well at considerably lower computational cost.
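The data flow described above (initial features, two parallel branches, concatenation, classification) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, dimension, and the random untrained weights are hypothetical. A single linear projection stands in for the initial feature extractor, a 3-tap 1-D convolution for the CNN branch, and one self-attention head for the transformer branch. Only the 77-class output size is taken from the abstract.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def stem(patches, w_stem):
    """Initial features: project each flattened patch to a token (assumed form)."""
    return matmul(patches, w_stem)


def cnn_branch(tokens, kernel):
    """Local features: 3-tap convolution over neighboring tokens, ReLU, mean pool."""
    n, d = len(tokens), len(tokens[0])
    padded = [[0.0] * d] + tokens + [[0.0] * d]  # zero padding on both ends
    conv = [
        [max(0.0, sum(kernel[k] * padded[i + k][j] for k in range(3))) for j in range(d)]
        for i in range(n)
    ]
    return [sum(col) / n for col in zip(*conv)]


def transformer_branch(tokens, wq, wk, wv):
    """Long-range features: single-head scaled dot-product self-attention, mean pool."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    context = matmul(attn, v)  # every token attends to every other token
    return [sum(col) / len(context) for col in zip(*context)]


def classify(patches, params):
    """Run both branches on the same tokens, concatenate, and classify."""
    tokens = stem(patches, params["stem"])
    local_feat = cnn_branch(tokens, params["kernel"])
    global_feat = transformer_branch(tokens, params["wq"], params["wk"], params["wv"])
    fused = local_feat + global_feat  # list concatenation: length 2 * d
    logits = matmul([fused], params["head"])[0]
    return softmax(logits)


# Hypothetical sizes: 16 patches of 12 values each, 8-dim tokens, 77 classes.
rng = random.Random(0)
num_patches, patch_dim, d_model, num_classes = 16, 12, 8, 77
params = {
    "stem": rand_matrix(patch_dim, d_model, rng),
    "kernel": [rng.uniform(-0.5, 0.5) for _ in range(3)],
    "wq": rand_matrix(d_model, d_model, rng),
    "wk": rand_matrix(d_model, d_model, rng),
    "wv": rand_matrix(d_model, d_model, rng),
    "head": rand_matrix(2 * d_model, num_classes, rng),
}
patches = rand_matrix(num_patches, patch_dim, rng)  # stand-in for image patches
probs = classify(patches, params)
print(len(probs), round(sum(probs), 6))
```

The point of the sketch is the structure: both branches consume the same initial tokens, and the classifier sees their concatenation, so local and long-range cues are available together. A real model would use a deep learning framework, 2-D convolutions, multi-head attention, and trained weights.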
