Abstract: In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system combines multiple component models to produce a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, the voice characteristics, and the face video of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model generates frames of adapted lip movements conditioned on the input face image and the output of the voice conversion model. In the end, the system combines the generated video with the converted audio to produce the final output. The result is a video of a speaker speaking in another language without actually knowing that language. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the individual components. Since no dataset is available for evaluating the whole system, we collect a test set and evaluate our system on it. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.
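The stage ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Transcript`, `DubbingPipeline`, `demo_pipeline`, and the five stage callables) is hypothetical, audio and frames are represented as plain `bytes` placeholders, and the demo stages are trivial stand-ins that only tag their input so the data flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = bytes  # placeholder for a waveform (assumption for illustration)
Frame = bytes  # placeholder for one video frame (assumption for illustration)


@dataclass
class Transcript:
    words: List[str]
    emphasized: List[int]  # indices into `words` that carry emphasis


@dataclass
class DubbingPipeline:
    # Each field is a hypothetical stage; real models would be plugged in here.
    recognize: Callable[[Audio], Transcript]        # ASR with emphasis detection
    translate: Callable[[Transcript], Transcript]   # translation, emphases mapped
    synthesize: Callable[[Transcript], Audio]       # text-to-speech with emphases
    convert_voice: Callable[[Audio, Audio], Audio]  # (synthetic, reference) -> converted
    sync_lips: Callable[[Sequence[Frame], Audio], List[Frame]]  # lip generation

    def run(self, frames: Sequence[Frame], audio: Audio) -> Tuple[List[Frame], Audio]:
        source = self.recognize(audio)
        target = self.translate(source)
        synthetic = self.synthesize(target)
        # The original audio serves as the reference for the speaker's voice.
        converted = self.convert_voice(synthetic, audio)
        # Lip frames depend on the face frames and the converted audio.
        new_frames = self.sync_lips(frames, converted)
        return new_frames, converted


def demo_pipeline() -> DubbingPipeline:
    """Stand-in stages that only tag their input, to show the data flow."""
    return DubbingPipeline(
        recognize=lambda audio: Transcript(audio.decode().split(), [0]),
        translate=lambda t: Transcript([w.upper() for w in t.words], t.emphasized),
        synthesize=lambda t: ("tts:" + " ".join(t.words)).encode(),
        convert_voice=lambda synthetic, reference: b"vc:" + synthetic,
        sync_lips=lambda frames, audio: [f + b"+lips" for f in frames],
    )
```

Calling `demo_pipeline().run([b"f0", b"f1"], b"hello world")` passes the input through all five stages in the order given in the abstract. Keeping each stage behind a plain callable reflects the modular design the abstract describes, where components can also be evaluated separately.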

A Modularized Neural Network with Language-Specific Output Layers for Cross-lingual Voice Conversion

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Building Multi lingual TTS using Cross Lingual Voice Conversion

Cross-lingual Voice Conversion with Disentangled Universal Linguistic Representations

Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Voice Conversion Challenge 2020: Intra-lingual Semi-Parallel and Cross-Lingual Voice Conversion

Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

The NUS & NWPU System for Voice Conversion Challenge 2020

Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Spectro-Temporal Modelling with Time-Frequency Lstm and Structured Output Layer for Voice Conversion

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

A noise-robust voice conversion method with controllable background sounds

Phone-aware LSTM-RNN for Voice Conversion

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion