Abstract: Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address these challenges, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

What problem does this paper attempt to address?

This paper addresses the fact that existing methods usually treat co-speech gesture generation and expressive talking head generation as two separate tasks. Most methods either focus on only one of the tasks or use independent models or network modules to handle both, which increases training complexity and ignores the intrinsic connection between facial and body movements. Specifically:

1. **Task Separation**: Existing methods usually generate co-speech gestures and talking heads separately, so two independent networks must be trained and deployed, increasing hardware costs and development time.
2. **Lack of Correlation Modeling**: A model that handles only one task may ignore the weak correlation between gestures and facial movements, which affects the authenticity and expressiveness of the generated results.
3. **Redundant Parameters**: Using two independent or connected networks increases the number of parameters and occupies more memory.

To solve these problems, the paper proposes a novel model architecture that jointly generates facial and body movements with a single network. The method uses shared weights and adapter modules that let each modality adapt to a common latent space, thereby reducing the number of parameters and improving generation quality.

### Main Contributions

1. **Introduction of Adapter Modules**: Adapter modules combine co-speech gesture generation and expressive talking head generation in a single network, significantly reducing the required number of parameters.
2. **Cross-Modal Information Sharing**: The adapter modules allow gestures and facial movements to influence each other, taking advantage of the weak correlation between them.
3. **Application of Diffusion Models**: The diffusion-based method achieves state-of-the-art generation results both quantitatively and qualitatively, and user studies confirm its advantage in authenticity and credibility.

Through these improvements, the method can generate high-quality co-speech gestures and expressive talking heads while also reducing the number of model parameters and improving computational efficiency.
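To make the parameter argument concrete, the following is a rough back-of-the-envelope sketch rather than the paper's actual architecture. It assumes a stand-in backbone built from plain linear layers and a bottleneck-style adapter (a down-projection followed by an up-projection); the summary does not specify the adapter design, and every layer size and function name here is hypothetical. It compares two independent networks against one shared backbone with small per-modality adapters.

```python
# Illustrative parameter count only. All layer sizes and the bottleneck
# adapter design are assumptions, not details taken from the paper.

def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def backbone_params(d_model: int, n_layers: int) -> int:
    """Stand-in backbone: n_layers blocks of two d_model x d_model linear layers."""
    return n_layers * 2 * linear_params(d_model, d_model)


def adapter_params(d_model: int, d_bottleneck: int) -> int:
    """Hypothetical bottleneck adapter: down-projection, then up-projection."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)


def separate_networks(d_model: int, n_layers: int) -> int:
    """Two independent backbones, one for the face and one for the body."""
    return 2 * backbone_params(d_model, n_layers)


def shared_with_adapters(d_model: int, n_layers: int, d_bottleneck: int) -> int:
    """One shared backbone plus one adapter per layer for each of the two modalities."""
    adapters = 2 * n_layers * adapter_params(d_model, d_bottleneck)
    return backbone_params(d_model, n_layers) + adapters


D_MODEL, N_LAYERS, D_BOTTLENECK = 512, 8, 64  # hypothetical sizes

separate = separate_networks(D_MODEL, N_LAYERS)
shared = shared_with_adapters(D_MODEL, N_LAYERS, D_BOTTLENECK)
print(f"separate networks: {separate:,} parameters")
print(f"shared + adapters: {shared:,} parameters")
print(f"ratio: {shared / separate:.2f}")
```

Under these assumed sizes the shared design is smaller because the adapters are much narrower than the backbone layers they replace. The actual saving reported in the paper depends on its real architecture and is not reproduced here.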

Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

Audio-driven Talking Face Video Generation with Natural Head Pose

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model

Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation