Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, in which both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) It enables lifelike conversations with individuals worldwide in a virtual meeting, with each participant using their own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting lip movements synchronized with the translated speech. 2) It improves the robustness of the spoken language translation system. By exploiting the complementary information in audio-visual speech, the system can translate spoken language effectively even in the presence of acoustic noise. To address the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system on the audio-only A2A dataset. This is made possible by learning unified audio-visual speech representations through self-supervised learning before training the translation system. Moreover, we propose an AV-Renderer that generates raw audio and video in parallel. It is designed with zero-shot speaker modeling, so the speaker in the source audio-visual speech is preserved in the translated target audio-visual speech. The effectiveness of AV2AV is evaluated through extensive experiments in a many-to-many language translation setting. A demo page is available at https://choijeongsoo.github.io/av2av.

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Sequential Contrastive Audio-Visual Learning

EquiMod: An Equivariance Module to Improve Self-Supervised Learning

Contrastive Learning Via Equivariant Representation

The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Audio-Visual Class-Incremental Learning

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

Equivariant Representation Learning for Augmentation-based Self-Supervised Learning via Image Reconstruction

Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Siamese Vision Transformers are Scalable Audio-visual Learners

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations

Enhancing Contrastive Learning Inspired by the Philosophy of "The Blind Men and the Elephant"

Audio-Visual Contrastive Learning with Temporal Self-Supervision

Equivariant Similarity for Vision-Language Foundation Models

Contrastive Audio-Visual Masked Autoencoder

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Rethinking the visual cues in audio-visual speaker extraction