Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing

William Brannon, Yogesh Virkar, Brian Thompson
DOI: https://doi.org/10.1162/tacl_a_00551
2022-12-23
Abstract: We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are aware of. The results challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing, arguing for the importance of vocal naturalness and translation quality over commonly emphasized isometric (character length) and lip-sync constraints, and for a more qualified view of the importance of isochronic (timing) constraints. We also find substantial influence of the source-side audio on human dubs through channels other than the words of the translation, pointing to the need for research on ways to preserve speech characteristics, as well as semantic transfer such as emphasis/emotion, in automatic dubbing systems.
Computation and Language
What problem does this paper attempt to address?
This paper asks what human professionals actually do when dubbing video content from one language into another. Using a large-scale empirical study, it tests assumptions that are common in the existing literature on both human and automatic dubbing, and draws out guidance for future research on automatic dubbing systems. Specifically, the paper examines six questions:

1. **Isochrony**: Do dubbers respect the timing constraints imposed by the video and the original audio?
2. **Isometry**: Do the original and dubbed texts have approximately the same number of characters?
3. **Speech Tempo**: Do voice actors change their speaking rate to meet timing constraints, possibly at the cost of vocal naturalness?
4. **Lip Sync**: How closely do the dubbed lines match the visible mouth movements of the original actors?
5. **Translation Quality**: How much translation quality (i.e., adequacy and fluency) do dubbers sacrifice to meet the other constraints?
6. **Source Influence**: Do characteristics of the source speech influence the dubbed speech through channels other than the words of the translation, indicating the importance of emotion transfer?

By answering these questions, the paper aims to provide new insights for improving automatic dubbing systems, especially on how to better preserve speech characteristics and transfer semantic cues such as emphasis and emotion.
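To make the first three questions more concrete, here is a minimal Python sketch of how isometry, isochrony, and speech tempo could be quantified for one aligned pair of source and dubbed lines. It is an illustration only, not the paper's methodology: the `Line` type, the function names, the 10% length tolerance, the interval-overlap measure, and the characters-per-second rate are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of dialogue with its timing (hypothetical structure)."""
    text: str
    start: float  # onset in seconds
    end: float    # offset in seconds


def char_length_ratio(source: Line, target: Line) -> float:
    """Isometry proxy: target character count divided by source count."""
    if not source.text:
        raise ValueError("source text must be non-empty")
    return len(target.text) / len(source.text)


def is_isometric(source: Line, target: Line, tolerance: float = 0.10) -> bool:
    """True if the lengths differ by at most `tolerance` (an assumed threshold)."""
    return abs(char_length_ratio(source, target) - 1.0) <= tolerance


def timing_overlap(source: Line, target: Line) -> float:
    """Isochrony proxy: intersection over union of the two speech intervals."""
    intersection = max(0.0, min(source.end, target.end) - max(source.start, target.start))
    span = max(source.end, target.end) - min(source.start, target.start)
    return intersection / span if span > 0 else 0.0


def speaking_rate(line: Line) -> float:
    """Speech tempo proxy: characters per second (a crude assumption)."""
    duration = line.end - line.start
    if duration <= 0:
        raise ValueError("line must have positive duration")
    return len(line.text) / duration


if __name__ == "__main__":
    src = Line("Where are you going?", start=1.0, end=2.5)
    tgt = Line("A donde vas ahora?", start=1.1, end=2.3)
    print("length ratio:", char_length_ratio(src, tgt))
    print("isometric:", is_isometric(src, tgt))
    print("timing overlap:", timing_overlap(src, tgt))
    print("rate change:", speaking_rate(tgt) / speaking_rate(src))
```

Comparing `speaking_rate` for the dubbed line against the source line (or against the same voice actor's typical rate) gives a rough signal of whether timing constraints are being met by speeding up or slowing down delivery. A real analysis would more likely count syllables or phonemes than characters.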