LLMs are Good Sign Language Translators

Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, Jun Liu

2024-04-01

Abstract: Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) the Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely used SLT benchmarks.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) for sign language translation (SLT), the task of converting sign language videos into spoken language. SLT requires cross-modal understanding of visual and linguistic cues, and the scarcity of paired sign-text data makes it harder still. The paper proposes SignLLM, a framework that converts sign videos into a language-like representation that a pretrained, off-the-shelf LLM can read more easily. SignLLM consists of two key modules: (1) the Vector-Quantized Visual Sign (VQ-Sign) module, which converts sign videos into a sequence of discrete character-level sign tokens; and (2) the Codebook Reconstruction and Alignment (CRA) module, which converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further narrows the gap between sign and text tokens, improving semantic compatibility. Experiments show that SignLLM achieves state-of-the-art gloss-free results on two widely used SLT benchmarks.

The main contributions of the paper are: (1) the SignLLM framework, which the summary describes as the first to use pretrained, frozen LLMs for sign language translation; (2) the VQ-Sign module, which quantizes sign videos into discrete character-level sign tokens, and the CRA module, which converts these tokens into word-level sign representations; and (3) state-of-the-art gloss-free results on two popular SLT datasets obtained with these designs. Overall, the paper investigates how to give sign videos language-like structure so that the translation ability of LLMs can be applied to this challenging task.
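The character-level tokenization described above rests on vector quantization: each continuous feature vector is replaced by the index of its nearest entry in a learned codebook. The following is a minimal sketch of that lookup step only, not the authors' VQ-Sign implementation. The function name `quantize`, the 2-dimensional toy features, and the 3-entry codebook are all hypothetical; in the paper the features come from a visual encoder and the codebook is learned during training.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: array of shape (num_clips, dim), one vector per short video clip
    codebook: array of shape (codebook_size, dim)
    Returns an integer array of shape (num_clips,) holding discrete token ids.
    """
    # Pairwise differences via broadcasting: (num_clips, codebook_size, dim)
    diffs = features[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every feature to every codebook entry
    dists = np.sum(diffs ** 2, axis=-1)
    # The token id is the index of the closest codebook entry
    return np.argmin(dists, axis=1)

# Hypothetical toy data (assumption): 3 codebook entries, 4 clip features
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [1.1, 0.9]])

tokens = quantize(features, codebook)
print(tokens.tolist())  # [0, 1, 2, 1]
```

The resulting sequence of integer ids plays the role of the discrete "character-level" tokens; grouping such tokens into word-level units is the job the summary assigns to the CRA module and is not shown here.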

LLaVA-SLT: Visual Language Tuning for Sign Language Translation

An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

SignLLM: Sign Language Production Large Language Models

Using an LLM to Turn Sign Spottings into Spoken Language Sentences

Diverse Sign Language Translation

SimulSLT: End-to-End Simultaneous Sign Language Translation

Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

SignLLM: Sign Languages Production Large Language Models

Deep Learning Methods for Sign Language Translation

Hierarchical LSTM for Sign Language Translation

Gloss-Free End-to-End Sign Language Translation

MLSLT: Towards Multilingual Sign Language Translation

Linguistics-Vision Monotonic Consistent Network for Sign Language Production

Scaling Sign Language Translation

MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation

Sign Language Production with Latent Motion Transformer

Natural Language-Assisted Sign Language Recognition

An Explicit Multi-Modal Fusion Method for Sign Language Translation