Abstract: Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer under degraded recording conditions, such as motion blur from fast hand movements and a textured signer appearance. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks. In this work, we explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect an event-based benchmark, EvSign, for these tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Temporal coherence is then exploited through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost (0.84G FLOPs per video) and 44.2% of the network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.

TIM-SLR: a Lightweight Network for Video Isolated Sign Language Recognition

StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition

TMS-Net: A multi-feature multi-stream multi-level information sharing network for skeleton-based sign language recognition

Two-Stream Network for Sign Language Recognition and Translation

Multi-View Spatial-Temporal Network for Continuous Sign Language Recognition

Sign Language Recognition with Long Short-Term Memory

A new system for Chinese sign language recognition

Video-Based Sign Language Recognition Without Temporal Segmentation

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Natural Language-Assisted Sign Language Recognition

Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition

Skeleton Aware Multi-modal Sign Language Recognition

Towards Online Continuous Sign Language Recognition and Translation

EvSign: Sign Language Recognition and Translation with Streaming Events

Video-Based Sign Language Recognition via ResNet and LSTM Network

Enhancing Signer-Independent Recognition of Isolated Sign Language through Advanced Deep Learning Techniques and Feature Fusion

Sign language recognition using real-sense

Combinational sign language recognition

Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

Isolated Video-Based Sign Language Recognition Using a Hybrid CNN-LSTM Framework Based on Attention Mechanism