Scaling up Multimodal Pre-training for Sign Language Understanding

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li

2024-08-16

Abstract: Sign language serves as the primary means of communication for the deaf-mute community. Unlike spoken language, it conveys information through the collaboration of manual features, i.e., hand gestures and body movements, and non-manual features, i.e., facial expressions and mouth cues. To facilitate communication between deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied in recent years, including isolated/continuous sign language recognition (ISLR/CSLR), gloss-free sign language translation (GF-SLT), and sign language retrieval (SL-RT). Sign language recognition and translation aim to understand the semantic meaning conveyed by sign languages at the gloss level and the sentence level, respectively. In contrast, SL-RT focuses on retrieving sign videos or corresponding texts from a closed set under the query-by-example search paradigm. These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representations of sign language videos. To advance the development of sign language understanding, exploring a generalized model that is applicable across various SLU tasks is a profound research direction.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily aims to address two core issues in Sign Language Understanding (SLU):

1. **Data Scarcity**: Existing sign language pre-training methods are typically constrained by small-scale, task-specific datasets, which limits how well the models generalize across tasks. Although some studies introduce more available data, its diversity is still restricted by the requirements of the specific task.
2. **Insufficient Utilization of Multimodal Information**: Most existing methods rely solely on the visual modality to extract effective information, neglecting the importance of textual information. This creates a bottleneck in the semantic understanding and performance of the pre-trained models.

To address these issues, the authors propose a multimodal sign language pre-training framework. The framework leverages large-scale sign language-text paired data (approximately 1.5 million pairs) and combines visual contextual cues with visual-text semantic consistency to enhance the representational capacity of sign language videos. Additionally, they have collected a large-scale labeled sign language pose dataset (SL-1.5M) and employ a multi-task pre-training strategy (including sign language-text contrastive learning and masked pose modeling) to improve the model's representation ability. Experimental results demonstrate that this method achieves state-of-the-art performance in various SLU tasks.
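
The two pre-training objectives named above, sign language-text contrastive learning and masked pose modeling, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the exact losses, so the symmetric InfoNCE form, the temperature of 0.07, the mean-squared-error reconstruction, and every function name below (`contrastive_loss`, `mask_frames`, `masked_pose_loss`) are assumptions chosen for illustration. Embeddings and poses are plain Python lists so that the example has no dependencies.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def log_softmax_at(logits, idx):
    """Numerically stable log-softmax evaluated at one index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[idx] - log_z


def contrastive_loss(sign_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch in which sign i is paired with text i.

    Assumed form: the temperature and the symmetric averaging are
    illustrative choices, not values taken from the paper.
    """
    s = [l2_normalize(v) for v in sign_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarity matrix, scaled by the temperature.
    sim = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sign_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    text_to_sign = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (sign_to_text + text_to_sign)


def mask_frames(num_frames, mask_ratio, rng):
    """Choose which frame indices to hide from the encoder (at least one)."""
    k = max(1, int(num_frames * mask_ratio))
    return sorted(rng.sample(range(num_frames), k))


def masked_pose_loss(pred, target, masked_idx):
    """Mean squared error over pose coordinates of the masked frames only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, q in zip(pred[i], target[i]):
            total += (p - q) ** 2
            count += 1
    return total / count
```

In a multi-task setup the two terms would typically be summed, possibly with a weighting factor, and minimized jointly; how the paper balances them is not stated in this summary.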

Scaling Sign Language Translation

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding

Multimodal Pretraining from Monolingual to Multilingual

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Video-Based Sign Language Recognition Without Temporal Segmentation

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

Sign Language Recognition with Multi-modal Features

Hand-Model-Aware Sign Language Recognition

Two-Stream Network for Sign Language Recognition and Translation

Collaborative Multilingual Continuous Sign Language Recognition: A Unified Framework

Deep Learning Methods for Sign Language Translation

Difference-guided multi-scale spatial-temporal representation for sign language recognition

Conditional Sentence Generation and Cross-modal Reranking for Sign Language Translation

SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition

SCOPE: Sign Language Contextual Processing with Embedding from LLMs

SLTUNET: A Simple Unified Model for Sign Language Translation

Prior-aware Cross Modality Augmentation Learning for Continuous Sign Language Recognition

Multi-Modal Zero-Shot Sign Language Recognition

Natural Language-Assisted Sign Language Recognition