Scaling up Multimodal Pre-training for Sign Language Understanding

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li
2024-08-16
Abstract: Sign language serves as the primary means of communication for the deaf-mute community. Unlike spoken language, it conveys information through the combination of manual features, i.e., hand gestures and body movements, and non-manual features, i.e., facial expressions and mouth cues. To facilitate communication between deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied in recent years, including isolated/continuous sign language recognition (ISLR/CSLR), gloss-free sign language translation (GF-SLT) and sign language retrieval (SL-RT). Sign language recognition and translation aim to understand the semantic meaning conveyed by sign languages at the gloss level and the sentence level, respectively. In contrast, SL-RT focuses on retrieving sign videos or corresponding texts from a closed set under the query-by-example search paradigm. These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representations of sign language videos. To advance the development of sign language understanding, exploring a generalized model that is applicable across various SLU tasks is a profound research direction.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily aims to address two core issues in Sign Language Understanding (SLU):

1. **Data Scarcity**: Existing sign language pre-training methods are typically constrained by small-scale datasets specific to certain tasks, which limits the model's ability to generalize across tasks. Although some studies attempt to introduce more available data, the diversity of that data is still restricted by the requirements of the specific task.
2. **Insufficient Utilization of Multimodal Information**: Most existing methods rely solely on the visual modality to extract effective information and neglect textual information. This creates bottlenecks in the semantic understanding and performance of the pre-trained models.

To address these issues, the authors propose a multimodal sign language pre-training framework. The framework leverages large-scale sign language-text paired data (approximately 1.5 million pairs) and combines visual contextual cues with visual-text semantic consistency to enhance the representational capacity of sign language videos. The authors also collect a large-scale labeled sign language pose dataset (SL-1.5M) and employ a multi-task pre-training strategy, comprising sign language-text contrastive learning and masked pose modeling, to improve the model's representation ability. Experimental results demonstrate that the method achieves state-of-the-art performance on various SLU tasks.
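
To make the two pre-training objectives named above more concrete, the following is a minimal NumPy sketch of how a sign-text contrastive loss and a masked pose modeling loss might be combined. It is an illustration under stated assumptions, not the authors' implementation: the function names (`contrastive_loss`, `masked_pose_loss`, `multitask_loss`), the symmetric InfoNCE form, the temperature of 0.07, zero-filling of masked frames, the mean-squared-error reconstruction target, and the weighting factor `lam` are all hypothetical choices that the summary does not specify.

```python
import numpy as np


def contrastive_loss(sign_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (sign, text) embeddings.

    Row i of each (B, D) array is assumed to describe the same sample.
    The temperature value is an assumption, not taken from the paper.
    """
    s = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (B, B) scaled cosine similarities

    def diag_cross_entropy(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the sign-to-text and text-to-sign directions.
    return float(0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T)))


def masked_pose_loss(pose, reconstruct, mask_ratio=0.5, rng=None):
    """Hide a fraction of frames and score reconstruction on hidden frames only.

    pose: (T, K) array of per-frame keypoint features.
    reconstruct: callable mapping the corrupted (T, K) array to a prediction
    of the same shape; it stands in for the pose encoder/decoder.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    num_frames = pose.shape[0]
    num_masked = max(1, int(num_frames * mask_ratio))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)

    corrupted = pose.copy()
    corrupted[masked_idx] = 0.0  # zero-filling is an assumed masking scheme
    pred = reconstruct(corrupted)
    return float(np.mean((pred[masked_idx] - pose[masked_idx]) ** 2))


def multitask_loss(sign_emb, text_emb, pose, reconstruct, lam=1.0):
    """Weighted sum of both objectives; lam is a hypothetical weight."""
    return contrastive_loss(sign_emb, text_emb) + lam * masked_pose_loss(pose, reconstruct)
```

In this sketch the contrastive term pulls matching sign and text embeddings together while pushing apart mismatched pairs in the batch, and the masked term rewards recovering hidden pose frames from their visible context. In a real system `reconstruct` would be a learned network and both terms would be minimized jointly by gradient descent.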