Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) We can hold lifelike conversations with individuals worldwide in a virtual meeting while each using our own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, the proposed AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By exploiting the complementary information in audio-visual speech, the system can translate spoken language effectively even in the presence of acoustic noise. To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system on the audio-only A2A dataset. This is done by learning unified audio-visual speech representations through self-supervised learning before training the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, so the speaker in the source audio-visual speech is preserved in the translated target audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. A demo page is available at https://choijeongsoo.github.io/av2av.
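
To make the idea of training on audio-only data more concrete, here is a minimal sketch of one way an encoder front end can accept audio, video, or both: frame-aligned features are concatenated, and a missing stream is zero-filled, in the style of modality dropout. This is an illustration under assumptions, not the authors' implementation; the function name `fuse`, its parameters, and the toy feature sizes are hypothetical.

```python
def fuse(audio, video, audio_dim, video_dim):
    """Concatenate per-frame audio and video features.

    A missing stream (None) is replaced by zeros, so audio-only,
    video-only, and audio-visual inputs all share one feature layout.
    Hypothetical helper for illustration only.
    """
    if audio is None and video is None:
        raise ValueError("at least one modality is required")
    n_frames = len(audio) if audio is not None else len(video)
    if audio is None:
        audio = [[0.0] * audio_dim for _ in range(n_frames)]
    if video is None:
        video = [[0.0] * video_dim for _ in range(n_frames)]
    if len(audio) != len(video):
        raise ValueError("streams must be frame-aligned")
    return [list(a) + list(v) for a, v in zip(audio, video)]


# Toy features: 2 frames, 2 audio dimensions, 1 video dimension.
audio_feats = [[1.0, 2.0], [3.0, 4.0]]
video_feats = [[5.0], [6.0]]

av_input = fuse(audio_feats, video_feats, 2, 1)   # audio-visual input
a_only_input = fuse(audio_feats, None, 2, 1)      # audio-only input
```

Because both calls return frames of the same width, a downstream model trained on the audio-only form can, in principle, be fed the audio-visual form at inference time.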

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two main problems in cross-language audio-visual dialogue:

1. **Improvement of the multimodal dialogue experience**:
   - Current automatic speech-to-speech translation (A2A) systems handle only the audio modality and ignore visual information, resulting in problems such as audio-video asynchrony in scenarios like video conferencing. For example, when an A2A system is used for video calls, viewers may see a mismatch between facial movements and the voice they hear. This inconsistency affects the naturalness of the dialogue and the user experience.
   - By introducing direct audio-visual-to-audio-visual translation (AV2AV), the system can generate synchronized lip movements and translated speech, thus providing a more realistic face-to-face dialogue experience.
2. **Enhancement of system robustness**:
   - In noisy environments, traditional A2A systems may not be able to translate speech accurately. By leveraging the complementarity of audio and visual information, the AV2AV system can translate more accurately in the presence of background noise, improving system robustness.

### Specific solutions

To achieve the above goals, the paper proposes the following solutions:

1. **Unified audio-visual representation**:
   - Due to the lack of a direct AV2AV parallel dataset, the authors propose using self-supervised learning methods (such as AV-HuBERT) to learn a unified audio-visual representation. In this way, even with only audio data, a model capable of handling multimodal input can be trained.
   - Specifically, the authors introduce a multilingually trained AV-HuBERT (mAV-HuBERT) and pre-train it on approximately 7,000 hours of multilingual audio-visual data.
2. **Multilingual speech translation model**:
   - A Transformer encoder-decoder architecture is used to build a multilingual speech translation model. This model achieves direct translation between multiple languages by extracting audio-visual units of the source language and translating them into audio-visual units of the target language.
   - The training data of the model comes from a large A2A parallel dataset, which contains 19 languages and has a total duration of approximately 12,000 hours.
3. **Zero-shot audio-visual renderer**:
   - To generate the final audio and video output from the translated audio-visual units, the authors design a zero-shot audio-visual renderer (AV-Renderer). This renderer can maintain the speaker's voice and facial features without additional training.
   - The renderer consists of a length predictor, a vocoder, and a facial renderer, which are respectively responsible for predicting the duration of each audio-visual unit, generating an audio waveform, and synthesizing a facial video.

### Experimental verification

- **Dataset**:
  - The datasets used to train mAV-HuBERT include LRS2, LRS3, VoxCeleb2, mTEDx, and AVSpeech, with a total duration of approximately 7,011 hours.
  - The datasets used to train the AV2AV language translation model include VoxPopuli and mTEDx, with a total duration of approximately 12,000 hours.
- **Evaluation metrics**:
  - Translation quality is evaluated by BLEU scores, using off-the-shelf automatic speech recognition (ASR) models to transcribe audio.
  - Video generation quality is evaluated by metrics such as FID, LSE-C, and LSE-D, which measure visual quality and audio-video synchronization respectively.
  - Subjective evaluations, such as the mean opinion score (MOS) test, are also carried out to evaluate the naturalness of each modality and the authenticity of the video.

### Conclusion

The AV2AV framework proposed in this paper performs well in multilingual audio-visual translation tasks. It not only improves the naturalness of the dialogue and the user experience but also enhances system robustness. By using unified audio-visual representation and zero-shot rendering techniques, this framework can achieve high-quality multimodal translation in the absence of direct parallel data.
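
The audio-visual units and the length predictor mentioned in the solutions above can be illustrated with a small sketch. It assumes, as is common in unit-based speech translation, that each feature frame is assigned to its nearest cluster centroid and that runs of identical units are merged before translation; the renderer then needs a duration per unit to re-expand the sequence. The function names, the toy centroids, and the use of stored run lengths in place of a learned length predictor are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import groupby


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units):
    """Merge runs of identical units; return the units and their run lengths."""
    reduced, durations = [], []
    for unit, run in groupby(units):
        reduced.append(unit)
        durations.append(len(list(run)))
    return reduced, durations


def expand(units, durations):
    """Repeat each unit by its duration (here given, in practice predicted)."""
    frame_level = []
    for unit, duration in zip(units, durations):
        frame_level.extend([unit] * duration)
    return frame_level


# Toy 2-D features and a 3-entry codebook (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0), (1.0, 0.8), (1.1, 1.0), (0.1, 0.9)]

frame_units = quantize(frames, centroids)         # [0, 0, 1, 1, 1, 2]
units, durations = collapse_repeats(frame_units)  # [0, 1, 2], [2, 3, 1]
restored = expand(units, durations)               # [0, 0, 1, 1, 1, 2]
```

In a real system the translated target units have no known durations, which is why a length predictor is needed before the vocoder and the facial renderer can produce frame-level output.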

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
