DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo
2024-09-12
Abstract: Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several limitations of existing audio-driven talking video generation methods. Specifically:

1. **Existing methods rely on video-to-video translation techniques**: They often use traditional generative networks such as GANs and typically generate the talking head and the co-speech gestures separately, resulting in less coherent output.
2. **The generated gestures are overly smooth or lack diversity**: Gestures often appear subdued, and many gesture-centric methods do not integrate talking head generation.
3. **One-shot generation cannot be achieved**: Existing methods rely on the specific people in the training data and cannot flexibly generate videos for an arbitrary person.

To overcome these limitations, the paper proposes DiffTED, a new one-shot audio-driven TED-style talking video generation method based on a diffusion model. It addresses the problems above in the following ways:

- **Generate keypoint sequences with a diffusion model**: The keypoints drive a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures.
- **Do not rely on pre-trained classifiers**: Through classifier-free guidance, gestures flow naturally with the audio input.
- **Achieve one-shot generation**: A complete talking video is generated from a single image and the driving audio, without retraining the model for each new person.

With these contributions, DiffTED produces temporally coherent talking videos with diverse co-speech gestures, yielding more natural and realistic results.
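To make the classifier-free guidance step more concrete, here is a minimal Python sketch of the commonly used guidance rule, in which the denoiser is evaluated once with the audio condition and once without it, and the two noise estimates are blended. This is an illustration under stated assumptions, not DiffTED's actual implementation: the names `cfg_noise_estimate`, `toy_denoiser`, and `guidance_scale` are hypothetical, the toy denoiser is only a stand-in for a trained keypoint diffusion network, and the paper may use a different parameterization of the guidance weight.

```python
from typing import Callable, List, Optional

Vector = List[float]
# A denoiser maps (noisy keypoints, timestep, optional audio features)
# to a predicted noise vector. Passing None means "unconditional".
Denoiser = Callable[[Vector, int, Optional[Vector]], Vector]


def cfg_noise_estimate(
    denoiser: Denoiser,
    x_t: Vector,
    t: int,
    audio: Vector,
    guidance_scale: float,
) -> Vector:
    """Blend conditional and unconditional noise predictions.

    Uses the common form: eps = eps_uncond + s * (eps_cond - eps_uncond).
    s = 0 gives the unconditional estimate, s = 1 the conditional one,
    and s > 1 pushes the sample further toward the audio condition.
    """
    eps_cond = denoiser(x_t, t, audio)
    eps_uncond = denoiser(x_t, t, None)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def toy_denoiser(x_t: Vector, t: int, audio: Optional[Vector]) -> Vector:
    """Hypothetical stand-in for a trained network (assumption, not the paper's model)."""
    base = [0.1 * x for x in x_t]
    if audio is None:
        return base
    return [b + a for b, a in zip(base, audio)]


if __name__ == "__main__":
    noisy_keypoints = [1.0, 2.0]
    audio_features = [0.5, -0.5]
    print(cfg_noise_estimate(toy_denoiser, noisy_keypoints, 10, audio_features, 2.0))
```

Because the same network supplies both predictions, no separately trained classifier is needed; during training this is typically enabled by randomly dropping the condition for a fraction of the samples.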