Abstract: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, **TransFace**, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model to convert audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, to re-synthesize synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, ensuring isometric talking head translation and preventing duplicate reference frames. Experiments demonstrate that our proposed Unit2Lip model significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and 100% isochronous translations.

What problem does this paper attempt to address?

The paper addresses several key challenges in cross-lingual talking head translation:

1. **Slow inference and cascading errors**: Existing methods rely on cascaded models that synthesize the talking head via both text and audio, which slows inference and lets errors accumulate from stage to stage.
2. **Lack of parallel corpus data**: Visual corpora are harder to obtain than audio corpora, and the parallel audio-visual corpus pairs required for direct talking head translation are especially hard to construct.
3. **Fixed number of reference frames**: The translated speech may be longer than the original, so reference frames must be repeated, which produces unnatural video transitions.

To solve these problems, the paper proposes a direct talking head translation framework, **TransFace**, with the following components:

- **Speech-to-Unit Translation Model (S2UT)**: Converts source-language audio speech into target-language discrete units.
- **Unit-Based Audio-Visual Speech Synthesizer (Unit2Lip)**: Synthesizes synchronized audio-visual speech from discrete units in parallel.
- **Bounded Duration Predictor**: Ensures that the generated audio-visual speech has the same length as the original, avoiding the unnatural transitions caused by repeating reference frames.

The main contributions of TransFace are:

- **The first direct talking head translation framework**, which does not cascade through audio speech and text and so avoids the slowdown and cumulative errors of model cascading.
- **The first unit-based audio-visual speech synthesizer**, which keeps audio and video synchronized while synthesizing them in parallel, achieving a 4.35-fold inference speedup.
- **A bounded duration predictor**, which achieves 100% isochronous (equal-length) translation. This is especially important for streaming translation scenarios, and it also avoids unnatural video transitions and makes the translated result more acceptable to viewers.

Experimental results show that Unit2Lip significantly improves synchronization on LRS2 (LSE-C gains of 1.601 and 0.982 for the original and generated audio speech, respectively), and that TransFace reaches BLEU scores of 61.93 and 47.55 on LRS3-T for Spanish-to-English and French-to-English, respectively.
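
To make the equal-length constraint concrete, the sketch below rescales per-unit durations so that they sum exactly to the number of frames in the source video. This is only an illustration of the constraint, not the paper's Bounded Duration Predictor, whose internals are not described in this summary. The function name `bound_durations`, its arguments, and the largest-remainder rounding rule are all assumptions made for the example.

```python
def bound_durations(predicted, total_frames):
    """Rescale per-unit durations so they sum exactly to total_frames.

    Hypothetical helper, not taken from the paper.
    predicted: positive float durations (in frames), one per discrete unit.
    total_frames: frame count of the source video, used as the bound.
    """
    if total_frames < 0 or not predicted or any(d <= 0 for d in predicted):
        raise ValueError("need a non-negative bound and positive durations")

    # Stretch or compress all durations by one common factor.
    scale = total_frames / sum(predicted)
    scaled = [d * scale for d in predicted]

    # Largest-remainder rounding: floor every duration, then give the
    # leftover frames to the units with the largest fractional parts.
    bounded = [int(s) for s in scaled]
    leftover = total_frames - sum(bounded)
    by_fraction = sorted(
        range(len(scaled)),
        key=lambda i: scaled[i] - bounded[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        bounded[i] += 1
    return bounded


# Three units predicted to span 10 frames, but the source video has only 8.
durations = bound_durations([2.0, 3.0, 5.0], 8)
print(durations, sum(durations))
```

Because the result always sums to the source frame count, no reference frame has to be repeated. One limitation of this simple rule is that a unit with a very short predicted duration can be rounded down to zero frames.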

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
