Abstract: Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer under degraded recording conditions, such as motion blur from fast hand movement and the signer's textured appearance. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks. In this work, we explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect an event-based benchmark, EvSign, for these tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Temporal coherence is then exploited through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost (0.84G FLOPS per video) and 44.2% of the network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.

Progressive Sign Language Video Translation Model for Real-World Complex Background Environments

A Chinese Continuous Sign Language Dataset Based on Complex Environments

Towards Online Continuous Sign Language Recognition and Translation

Contrastive Learning for Sign Language Recognition and Translation

Improving Continuous Sign Language Recognition with Adapted Image Models

Improving End-to-end Sign Language Translation with Adaptive Video Representation Enhanced Transformer

Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

Improving Sign Language Translation with Monolingual Data by Sign Back-Translation

Real-Time Vision-Based Chinese Sign Language Recognition with Pose Estimation and Attention Network

A two-way translation system of Chinese sign language based on computer vision

EvSign: Sign Language Recognition and Translation with Streaming Events

CSLNSpeech: solving the extended speech separation problem with the help of Chinese Sign Language

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

SimulSLT: End-to-End Simultaneous Sign Language Translation

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

CSLNSpeech: solving extended speech separation problem with the help of Chinese sign language

SCOPE: Sign Language Contextual Processing with Embedding from LLMs

An Improved Sign Language Translation Model with Explainable Adaptations for Processing Long Sign Sentences

Temporal superimposed crossover module for effective continuous sign language

Video-Based Sign Language Recognition Without Temporal Segmentation

A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars