LLMs are Good Sign Language Translators

Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, Jun Liu
2024-04-01
Abstract: Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to use off-the-shelf large language models (LLMs) for sign language translation (SLT). SLT converts sign language videos into spoken language, which requires cross-modal understanding of both visual and linguistic cues. The scarcity of paired sign-text data makes the task harder still.
The paper proposes a framework called SignLLM, which converts sign videos into a language-like representation that a pretrained LLM can read more easily. SignLLM consists of two key modules: (1) the Vector-Quantized Visual Sign (VQ-Sign) module, which converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further narrows the gap between sign and text tokens and improves their semantic compatibility. Experiments show that SignLLM achieves state-of-the-art gloss-free results on two widely used SLT benchmarks.
The main contributions of the paper are: (1) the SignLLM framework, which uses a pretrained and frozen LLM for sign language translation for the first time; (2) the VQ-Sign module, which quantizes sign videos into discrete character-level sign tokens, and the CRA module, which converts those tokens into word-level sign representations; and (3) state-of-the-art gloss-free results on two widely used SLT datasets. Overall, the paper investigates how to give sign videos language-like characteristics so that the translation ability of LLMs can be applied to SLT.
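The quantization step described above can be illustrated with a minimal sketch of generic vector quantization: each feature vector is replaced by the index of its nearest codebook entry, yielding a sequence of discrete tokens. This is not the authors' implementation. The function names, the toy three-entry codebook, and the hand-written feature vectors are all assumptions made for illustration; in the paper, the features would come from a visual encoder applied to video clips and the codebook would be learned.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in features:
        distances = [squared_distance(vec, code) for code in codebook]
        # The index of the closest code becomes the discrete token.
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook: 3 codes of dimension 3 (a real codebook is learned).
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical per-clip features standing in for visual encoder outputs.
features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.1],
    [0.1, 0.0, 1.1],
    [0.0, 0.9, 0.2],
]

tokens = quantize(features, codebook)
print(tokens)  # [0, 0, 2, 1]
```

Each index in the output plays the role of one character-level token. Grouping such tokens into word-level units, which the paper handles with an optimal transport formulation, is not covered by this sketch.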