Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
2024-11-27
Abstract: Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at https://2000zrl.github.io/soke/.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses three limitations of current sign language generation (SLG, text-to-sign) methods:

1. **Existing methods overlook the linguistic characteristics of sign languages**: Most existing SLG methods treat sign language generation as a visual content generation task and use techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars from text input. These methods often ignore the semantic and grammatical properties of sign languages, which degrades the quality of the generated signs.
2. **Lack of multilingual support**: Current SLG research mostly focuses on a single sign language, which limits how far the technology can be applied across languages.
3. **Insufficient modeling of hand movements**: Hand movements carry much of the information in sign languages, but existing methods model them poorly, especially for complex hand shapes.

To overcome these problems, the paper proposes a multilingual sign language generation model named "Signs as Tokens (SOKE)", which addresses them as follows:

- **Introducing a discrete token space**: A decoupled tokenizer (DETO) discretizes continuous sign motions into token sequences, so that a language model can capture the structure of sign languages.
- **Multilingual support**: By building on a pretrained multilingual language model (such as mBART), the model can handle multiple sign languages and generalize across them.
- **Enhancing hand movement modeling**: DETO models hand and body movements separately, which improves the expressiveness of the hand motion and yields more natural and more accurate signs.
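The tokenization idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simple nearest-neighbour codebook lookup per body part, and the part names (`body`, `hand`), the toy codebooks, the token format `<part_index>`, and the helpers `quantize`, `tokenize_sign`, and `extend_vocab` are all hypothetical, chosen only to show how continuous features could become discrete tokens that are appended to a text vocabulary.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def quantize(feature, codebook):
    """Return the index of the codebook entry closest to `feature`."""
    return min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))


def tokenize_sign(frames, codebooks):
    """Turn per-frame continuous features into one token sequence per body part.

    frames:    list of dicts mapping part name -> feature vector (hypothetical layout)
    codebooks: dict mapping part name -> list of code vectors (assumed already learned)
    """
    tokens = {part: [] for part in codebooks}
    for frame in frames:
        for part, codebook in codebooks.items():
            idx = quantize(frame[part], codebook)
            tokens[part].append(f"<{part}_{idx}>")
    return tokens


def extend_vocab(vocab, codebooks):
    """Return a copy of a text vocabulary with one new token per codebook entry."""
    extended = dict(vocab)
    for part, codebook in codebooks.items():
        for idx in range(len(codebook)):
            extended.setdefault(f"<{part}_{idx}>", len(extended))
    return extended


# Toy data: two body parts with separate (decoupled) codebooks.
codebooks = {
    "body": [(0.0, 0.0), (1.0, 1.0)],
    "hand": [(0.0,), (5.0,), (10.0,)],
}
frames = [
    {"body": (0.1, 0.2), "hand": (4.0,)},
    {"body": (0.9, 0.8), "hand": (9.0,)},
]

sign_tokens = tokenize_sign(frames, codebooks)
vocab = extend_vocab({"hello": 0, "world": 1}, codebooks)
print(sign_tokens)
print(vocab)
```

Keeping a separate codebook per body part means a coarse body pose and a fine hand shape never compete for the same codes, which is one plausible reading of why a decoupled design could help with hand detail. A real system would learn the codebooks from data and fine-tune the language model on the extended vocabulary.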
In conclusion, the paper aims to improve both the quality and the multilingual coverage of sign language generation, thereby supporting communication between the deaf and hard-of-hearing communities and hearing people.