DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo

2024-09-12

Abstract: Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses several limitations of existing audio-driven talking video generation methods:

1. **Existing methods rely on video-to-video translation techniques**: These methods usually use traditional generative networks such as GANs and typically generate the talking head and the accompanying gestures separately, resulting in less coherent output.
2. **The generated gestures are overly smooth or lack diversity**: In addition, many gesture-centric methods do not integrate talking head generation.
3. **One-shot generation is not supported**: Existing methods depend on the specific people present in the training data and cannot flexibly generate videos for an arbitrary person.

To overcome these limitations, the paper proposes DiffTED, a new one-shot, audio-driven, TED-style talking video generation method based on a diffusion model. It addresses the problems above in the following ways:

- **Generate keypoint sequences with a diffusion model**: The keypoints drive a Thin-Plate Spline motion model, giving precise control over the avatar's animation while keeping the gestures temporally coherent and diverse.
- **Do not rely on pre-trained classifiers**: With classifier-free guidance, the gestures follow the audio input naturally.
- **Achieve one-shot generation**: A complete talking video is generated from a single image and the driving audio, without retraining the model for each new person.

With these contributions, DiffTED produces videos with temporally coherent and diverse co-speech gestures, resulting in more natural and realistic talking videos.
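The classifier-free guidance step mentioned above can be illustrated with a minimal sketch. It shows only the standard guidance rule, which blends an audio-conditioned noise prediction with an unconditional one. The function name `cfg_combine`, the flat list representation of keypoint noise, and the toy values are assumptions made for illustration and are not taken from the DiffTED code.

```python
from typing import List, Sequence


def cfg_combine(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend two noise predictions with the classifier-free guidance rule.

    eps_cond:   noise predicted with the audio condition (hypothetical values)
    eps_uncond: noise predicted with the condition dropped
    guidance_scale: 0 ignores the audio, 1 uses the conditional prediction
                    as is, values above 1 push further toward the condition
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # eps = eps_uncond + w * (eps_cond - eps_uncond), element by element
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    # Toy predictions for one keypoint (x, y); real models output many keypoints.
    conditional = [1.0, 2.0]
    unconditional = [0.0, 1.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, cfg_combine(conditional, unconditional, scale))
```

In a sampler, both predictions would come from the same denoising network, evaluated once with the audio features and once with them replaced by a null condition. No separate pre-trained classifier is involved, which is the property the summary highlights.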

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

Text-based Talking Video Editing with Cascaded Conditional Diffusion

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation