Abstract: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents the delays and error propagation associated with model cascading. However, talking head translation, which converts audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model that converts audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, that re-synthesizes synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, ensuring isometric talking head translation and preventing duplicate reference frames. Experiments demonstrate that our proposed Unit2Lip model significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and produces 100% isochronous translations.
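
The abstract does not say how the Bounded Duration Predictor is implemented. As a rough illustration of the constraint it enforces (translated speech that fits the source video's frame budget without repeating reference frames), the sketch below rescales hypothetical per-unit duration predictions so that they sum exactly to a target frame count. The function name `bound_durations`, the frame-based duration representation, and the largest-remainder rounding are all assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def bound_durations(predicted: Sequence[float], target_frames: int) -> List[int]:
    """Rescale per-unit durations so they sum exactly to target_frames.

    Hypothetical helper: `predicted` holds unbounded duration estimates
    (in frames) for each discrete unit; `target_frames` is the number of
    frames available in the source video.
    """
    if not predicted:
        raise ValueError("predicted must be non-empty")
    if any(d <= 0 for d in predicted):
        raise ValueError("all predicted durations must be positive")
    if target_frames < 0:
        raise ValueError("target_frames must be non-negative")

    # Scale every duration by the same factor so the total matches the budget.
    scale = target_frames / sum(predicted)
    scaled = [d * scale for d in predicted]

    # Round down first, then hand the leftover frames to the units whose
    # scaled durations had the largest fractional parts (largest-remainder
    # rounding). Ties go to the earlier unit because sorted() is stable.
    frames = [math.floor(x) for x in scaled]
    leftover = target_frames - sum(frames)
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: -(scaled[i] - frames[i])
    )
    for i in by_fraction[:leftover]:
        frames[i] += 1
    return frames


if __name__ == "__main__":
    # Three units whose raw predictions total 4 frames, squeezed into 5.
    print(bound_durations([1.5, 1.5, 1.0], 5))
```

Under these assumptions the output always has the same length as the input and sums to `target_frames`, so a synthesizer consuming it would never need more video frames than the source provides. A learned predictor could enforce the same bound in other ways, for example by normalizing inside the network.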

Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters

Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

IsoChronoMeter: A simple and effective isochronic translation evaluation metric

Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

Duration Modeling of Neural TTS for Automatic Dubbing

Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Learning to Count Words in Fluent Speech enables Online Speech Recognition

Isometric MT: Neural Machine Translation for Automatic Dubbing

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Combining Many Alignments for Speech to Speech Translation

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

On Efficient Coupling of ASR and SMT for Speech Translation

Accelerating Transducers through Adjacent Token Merging

Streaming Punctuation for Long-form Dictation with Transformers