Puppet Dubbing

Ohad Fried, Maneesh Agrawala
DOI: https://doi.org/10.48550/arXiv.1902.04285
2019-02-12
Abstract: Dubbing puppet videos to make the characters (e.g. Kermit the Frog) convincingly speak a new speech track is a popular activity with many examples of well-known puppets speaking lines from films or singing rap songs. But manually aligning puppet mouth movements to match a new speech track is tedious as each syllable of the speech must match a closed-open-closed segment of mouth movement for the dub to be convincing. In this work, we present two methods to align a new speech track with puppet video, one semi-automatic appearance-based and the other fully-automatic audio-based. The methods offer complementary advantages and disadvantages. Our appearance-based approach directly identifies closed-open-closed segments in the puppet video and is robust to low-quality audio as well as misalignments between the mouth movements and speech in the original performance, but requires some manual annotation. Our audio-based approach assumes the original performance matches a closed-open-closed mouth segment to each syllable of the original speech. It is fully automatic, robust to visual occlusions and fast puppet movements, but does not handle misalignments in the original performance. We compare the methods and show that both improve the credibility of the resulting video over simple baseline techniques, via quantitative evaluation and user ratings.
Graphics, Sound
What problem does this paper attempt to address?
The paper asks how to align a new audio track with an existing puppet video, automatically or semi-automatically, so that the puppet's mouth movements match the new speech. The authors propose two methods that reduce or eliminate the manual effort this alignment otherwise requires.

### Problem Background

When a puppet video (for example, one featuring Kermit the Frog) is dubbed by hand, each syllable of the new speech must be aligned with a closed-open-closed (COC) segment of the puppet's mouth movement. The process is tedious because every syllable of every new line has to be matched to a corresponding COC segment in the video. If the alignment is off, viewers perceive the mouth movements and the voice as out of sync, and the dub is no longer convincing.

### Paper Goals

To solve this problem, the paper proposes two methods:

1. **Appearance-based semi-automatic method**: This method directly identifies COC segments in the puppet video and aligns them with the syllables of the new audio. It is robust to low-quality audio and to misalignments between mouth movements and speech in the original performance, but it requires some manual annotation.
2. **Audio-based fully automatic method**: This method assumes that the original performance matches a COC segment to each syllable of the original speech, and it aligns the new audio by analyzing the syllable boundaries in the original audio. It is fully automatic and robust to visual occlusions and fast puppet movements, but it does not handle misalignments in the original performance.

### Main Contributions

1. **Identified the guiding principles of puppet performance**: These are the principles that make a puppet performance convincing.
2. **Implemented two novel puppet dubbing methods**: One is the appearance-based semi-automatic method, and the other is the audio-based fully automatic method.
3. **Proposed a new method of combining video and audio retiming**: This method takes the rate of change of the audio speed into account to minimize the artifacts caused by retiming.

Together, these methods reduce the manual work of aligning puppet video with new audio. The paper reports that both improve the credibility of the resulting video over simple baseline techniques, measured by quantitative evaluation and user ratings.
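
The core idea of matching each syllable to a COC segment can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the COC segments and the syllable intervals are already known, sorted, and non-overlapping, pairs them in order, and uses plain piecewise-linear interpolation, which ignores the rate-of-change consideration in the paper's retiming method. All function names here (`pair_segments`, `build_warp`, `warp_time`, `segment_speeds`) are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)
Point = Tuple[float, float]    # (new_audio_time, source_video_time)


def pair_segments(coc: List[Segment], syllables: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Pair the i-th COC segment with the i-th syllable (assumption: in-order matching)."""
    n = min(len(coc), len(syllables))
    return list(zip(coc[:n], syllables[:n]))


def build_warp(pairs: List[Tuple[Segment, Segment]]) -> List[Point]:
    """Build control points mapping syllable boundaries to COC boundaries."""
    points: List[Point] = []
    for (v_start, v_end), (a_start, a_end) in pairs:
        points.append((a_start, v_start))  # syllable onset  -> mouth starts opening
        points.append((a_end, v_end))      # syllable offset -> mouth closed again
    return points


def warp_time(points: List[Point], t: float) -> float:
    """Map a time in the new audio to a time in the source video (piecewise linear)."""
    times = [a for a, _ in points]
    if t <= times[0]:
        return points[0][1]
    if t >= times[-1]:
        return points[-1][1]
    i = bisect_right(times, t) - 1  # guarantees times[i] <= t < times[i + 1]
    (a0, v0), (a1, v1) = points[i], points[i + 1]
    return v0 + (t - a0) * (v1 - v0) / (a1 - a0)


def segment_speeds(points: List[Point]) -> List[float]:
    """Video playback speed on each interval (above 1 means the video is sped up)."""
    return [
        (v1 - v0) / (a1 - a0)
        for (a0, v0), (a1, v1) in zip(points, points[1:])
        if a1 > a0
    ]


# Hypothetical data: two mouth cycles in the video, two syllables in the new audio.
coc_segments = [(0.0, 0.4), (1.0, 1.6)]
syllable_intervals = [(0.0, 0.2), (0.5, 0.8)]
control_points = build_warp(pair_segments(coc_segments, syllable_intervals))
```

To render an output frame at time `t` of the new audio, one would sample the source video at `warp_time(control_points, t)`. Large jumps between consecutive values in `segment_speeds(control_points)` indicate where such a naive warp is likely to produce visible artifacts, which is the kind of problem the paper's retiming contribution is meant to reduce.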