Abstract: Automatic Video Dubbing (AVD) aims to generate speech from a given script that aligns with lip motion and carries expressive prosody. Current AVD models mainly use visual information from the current sentence to enhance the prosody of the synthesized speech. However, because the dubbing is combined with the original context in the final video, it is also crucial that the prosody of the generated dubbing is consistent with the multimodal context. Previous studies have overlooked this aspect. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, which changes the modeling object from a single sentence to a longer sequence with context information, ensuring prosody consistency across the global context. MCDubber comprises three main components: (1) a context duration aligner that learns the context-aware alignment between the text and lip frames; (2) a context prosody predictor that reads the global context visual sequence and predicts context-aware global energy and pitch; (3) a context acoustic decoder that predicts the global context mel-spectrogram with the assistance of the ground-truth mel-spectrograms adjacent to the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The final dubbing audio is obtained from the portion of the output context mel-spectrogram that belongs to the target sentence. Extensive experiments on the Chem benchmark dataset demonstrate that MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
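
The final extraction step described in the abstract can be illustrated with a minimal sketch. It is not taken from the MCDubber code: the function name `extract_target_mel`, the parameters `prev_len` and `target_len`, and the assumed layout of the context (previous sentence, then target sentence, then following sentence along the time axis) are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # one mel frame: a vector of mel-bin values


def extract_target_mel(
    context_mel: Sequence[Frame],
    prev_len: int,
    target_len: int,
) -> List[Frame]:
    """Return the target-sentence frames from a context-level mel-spectrogram.

    Assumption (hypothetical): the context is laid out along time as
    [previous sentence | target sentence | following sentence], and the
    frame counts of the previous and target sentences are known.
    """
    if prev_len < 0 or target_len < 0:
        raise ValueError("frame counts must be non-negative")
    end = prev_len + target_len
    if end > len(context_mel):
        raise ValueError("target span exceeds the context length")
    # Copy each frame so the caller cannot mutate the context by accident.
    return [list(frame) for frame in context_mel[prev_len:end]]


# Toy context: 3 previous + 4 target + 2 following frames, 2 mel bins each.
context = [[float(t), float(t) + 0.5] for t in range(9)]
target = extract_target_mel(context, prev_len=3, target_len=4)
print(len(target))  # 4
print(target[0])    # [3.0, 3.5]
```

In this sketch only the sliced frames would be passed on to a vocoder, while the surrounding context frames serve purely to condition the prosody of the target sentence.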

MCDubber: Multimodal Context-Aware Expressive Video Dubbing
