Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, in which both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) It enables lifelike conversations with people worldwide in virtual meetings, with each participant using their own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting lip movements synchronized with the translated speech. 2) It improves the robustness of the spoken language translation system. By exploiting the complementary information in audio-visual speech, the system can translate spoken language effectively even in the presence of acoustic noise. Because no parallel AV2AV translation dataset exists, we propose to train our spoken language translation system on audio-only A2A data. This is made possible by learning unified audio-visual speech representations through self-supervised learning before training the translation system. Moreover, we propose an AV-Renderer that generates raw audio and video in parallel. It is designed with zero-shot speaker modeling, so that the speaker in the source audio-visual speech is preserved in the translated target audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. A demo page is available at https://choijeongsoo.github.io/av2av.

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Jointly Learning Visual and Auditory Speech Representations from Raw Data

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Deep Audio-visual System for Closed-set Word-level Speech Recognition

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

For end-to-end audio-visual speech recognition

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition