Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
2024-12-19
Abstract: Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address the challenges, in this paper, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
Existing methods for generating co-speech gestures and expressive talking heads usually treat the two as separate tasks. Most methods either focus on only one task or use independent models or network modules to handle both, which increases training complexity and ignores the intrinsic connection between facial and body movements. Specifically:

1. **Task Separation**: Existing methods usually generate co-speech gestures and talking heads separately, so two independent networks must be trained and deployed, increasing hardware costs and development time.
2. **Lack of Correlation Modeling**: A model that covers only one task cannot exploit the weak correlation between gestures and facial movements, which can reduce the realism and expressiveness of the generated results.
3. **Redundant Parameters**: Using two independent or connected networks increases the number of parameters and the memory they occupy.

To solve these problems, the paper proposes a novel model architecture that jointly generates facial and body movements with a single network. The method uses shared weights together with adapter modules that let each modality adapt to a common latent space, reducing the number of parameters while improving generation quality.

### Main Contributions

1. **Introduction of Adapter Modules**: Adapter modules combine co-speech gesture generation and expressive talking head generation in a single network, significantly reducing the number of parameters required.
2. **Cross-Modal Information Sharing**: The adapter modules allow gestures and facial movements to influence each other, taking advantage of the weak correlation between them.
3. **Application of Diffusion Models**: The diffusion-based method achieves state-of-the-art generation results both quantitatively and qualitatively, and user studies confirm its advantage in realism and credibility.

With these improvements, the method generates high-quality co-speech gestures and expressive talking heads while reducing the number of model parameters and improving computational efficiency.
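
The parameter saving from sharing one backbone across modalities can be illustrated with a rough count. The sketch below is not the paper's architecture: the feed-forward layer shape, the bottleneck adapter design, and every size used (`d_model`, `d_hidden`, `d_bottleneck`, `n_layers`) are assumptions chosen only to show why small per-modality adapters cost far less than duplicating a whole network.

```python
# Hypothetical parameter count: two separate networks vs. one shared
# backbone with small per-modality adapters. All sizes are assumptions.

def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return d_in * d_out + d_out


def backbone_params(d_model: int, d_hidden: int, n_layers: int) -> int:
    """A stack of two-layer feed-forward blocks (assumed backbone shape)."""
    per_layer = linear_params(d_model, d_hidden) + linear_params(d_hidden, d_model)
    return n_layers * per_layer


def adapter_params(d_model: int, d_bottleneck: int) -> int:
    """A bottleneck adapter: project down, then back up (assumed design)."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)


def separate_total(d_model: int, d_hidden: int, n_layers: int,
                   n_modalities: int = 2) -> int:
    """One full backbone per modality (face and body trained separately)."""
    return n_modalities * backbone_params(d_model, d_hidden, n_layers)


def shared_total(d_model: int, d_hidden: int, n_layers: int,
                 d_bottleneck: int, n_modalities: int = 2) -> int:
    """One shared backbone plus one adapter per layer per modality."""
    shared = backbone_params(d_model, d_hidden, n_layers)
    adapters = n_modalities * n_layers * adapter_params(d_model, d_bottleneck)
    return shared + adapters


if __name__ == "__main__":
    sizes = dict(d_model=256, d_hidden=1024, n_layers=8)
    separate = separate_total(**sizes)
    shared = shared_total(**sizes, d_bottleneck=32)
    print(f"separate networks: {separate:,} parameters")
    print(f"shared + adapters: {shared:,} parameters")
    print(f"ratio: {shared / separate:.2f}")
```

Under these assumed sizes the adapters add only a small fraction of the backbone's parameters, so the shared design stays close to the cost of a single network. The actual saving reported in the paper depends on its real architecture, which this count does not model.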