Abstract: Visual dubbing uses visual computing and deep learning to alter an actor's lip and mouth articulations so that they sync with dubbed speech. It has the potential to greatly improve the content produced by the dubbing industry, where the quality of the dubbed result is paramount. An important requirement is that lip-sync changes be localized to the mouth region and not affect the rest of the actor's face or the rest of the video frame. Current methods can create realistic-looking fake faces with expressions. However, many fail to localize lip-sync and suffer from quality problems such as identity loss, low resolution, blur, loss of facial skin features or colour, and temporal jitter. These problems arise mainly because it is very difficult to train networks end-to-end to correctly disentangle the different visual dubbing parameters (pose, skin colour, identity, lip movements, etc.). Our main contribution is a new visual dubbing pipeline in which, instead of end-to-end training, we incrementally apply a different disentangling technique for each parameter. The pipeline is composed of three main steps: pose alignment, identity transfer, and video reassembly. The expert models in each step are fine-tuned for the actor. We propose an identity transfer network with an added style block which, with pre-training, is able to decouple face components, specifically identity and expression, and also works with short video clips such as TV ads. The pipeline also includes novel stages for temporal smoothing of the reenacted face, actor-specific super-resolution to retain fine facial details, and a second pass through the identity transfer network to preserve actor identity. Localization of lip-sync is achieved by restricting changes in the original video frame to just the actor's mouth region. The results are convincing, and a user survey confirms their quality. Relevant quantitative metrics are included.
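The localization requirement stated in the abstract, that changes to the original frame be restricted to the actor's mouth region, can be illustrated with a simple mask-based composite. This is a minimal sketch and not the authors' implementation: the function names `composite_mouth_region` and `rectangular_mask`, the rectangular mask shape, and the use of plain alpha blending are all assumptions made for illustration. The abstract does not say how the mouth region is detected or blended.

```python
import numpy as np


def rectangular_mask(height, width, top, bottom, left, right):
    """Hypothetical helper: a binary mask that is 1 inside a rectangle.

    A real pipeline would likely derive the mouth region from facial
    landmarks; a rectangle is used here only to keep the sketch small.
    """
    mask = np.zeros((height, width), dtype=np.float64)
    mask[top:bottom, left:right] = 1.0
    return mask


def composite_mouth_region(original, reenacted, mouth_mask):
    """Copy pixels from `reenacted` into `original` only where the mask is set.

    original, reenacted: float arrays of shape (H, W, 3) with values in [0, 1].
    mouth_mask: float array of shape (H, W) with values in [0, 1].
    Pixels where the mask is 0 are returned unchanged from `original`.
    """
    if original.shape != reenacted.shape:
        raise ValueError("frames must have the same shape")
    if mouth_mask.shape != original.shape[:2]:
        raise ValueError("mask must match the frame height and width")
    alpha = mouth_mask[..., None]  # add a channel axis so it broadcasts over RGB
    return alpha * reenacted + (1.0 - alpha) * original


# Toy example: a black original frame and a white "reenacted" frame.
original = np.zeros((8, 8, 3))
reenacted = np.ones((8, 8, 3))
mask = rectangular_mask(8, 8, top=4, bottom=7, left=2, right=6)
result = composite_mouth_region(original, reenacted, mask)
```

In this toy example only the pixels inside the rectangle take values from the reenacted frame, and every other pixel is identical to the original. A mask with fractional values near its border would give a softer transition, at the cost of slightly widening the modified area.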

Visual dubbing pipeline with localized lip-sync and two-pass identity transfer