INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, Zhipeng Ge
2024-12-05
Abstract: Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method. Project Page: https://grisoon.github.io/INFP/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses how to generate interactive head animations in dyadic conversations. Existing audio-driven head generation methods mainly focus on one-sided communication, either speaking or listening, and ignore the two-way, interactive nature of human conversation. They usually require manual role assignment and explicit role switching, which results in unnatural transitions and inconsistent interaction states.

To solve these problems, the paper proposes a new framework, **INFP (Interactive, Natural, Flash and Person-generic)**, for audio-driven interactive head generation in dyadic conversations. Unlike existing methods, INFP does not need preset roles (Speaker or Listener); it switches dynamically and naturally between speaking and listening states according to the conversation audio. This makes the generated head animations more realistic and lets them adapt to the changing communication states of multi-round conversations.

### Main contributions

1. **Natural role switching**: For the first time, the model automatically generates natural transitions between an individual's speaking and listening states in multi-round conversations, without manual role assignment or explicit role switching.
2. **Innovative motion feature extractor**: An interactive motion guider with a learnable memory bank adaptively extracts interaction information from dual-track audio and constructs mixed speaking-listening motion features.
3. **Large-scale, high-quality dataset DyConv**: A large-scale dataset of rich dyadic conversations is introduced to promote research on head generation in dyadic interaction scenarios.
4. **Real-time and general-purpose face generation**: Extensive experiments and visualizations demonstrate the method's strong real-time performance and its ability to generalize to any individual.

### Method overview

The INFP framework is divided into two stages:

1. **Motion-Based Head Imitation**: Learns from a large number of real-life conversation videos, compresses the observed conversational behaviors into a low-dimensional motion latent space, and uses the resulting latent codes to animate static portrait images.
2. **Audio-Guided Motion Generation**: Maps the dual-track conversation audio to the motion latent space pre-trained in the first stage through denoising, which yields audio-driven interactive head generation.

### Dataset

To support this research, the authors introduce a large-scale dataset named DyConv, which contains more than 200 hours of multi-round dyadic conversation videos covering a wide range of emotions and expressions. Compared with existing datasets, DyConv is reported to improve significantly on both scale and quality.

### Experimental results

Quantitative and qualitative evaluations show that INFP outperforms existing methods on multiple metrics, especially audio-lip synchronization, identity preservation, and motion diversity. User studies further confirm its advantage in naturalness, motion diversity, and audio-visual alignment.

In conclusion, INFP provides a new paradigm in which the generated head animations are not only more realistic but also adapt naturally to the different communication states of a dyadic conversation.
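
The interactive motion guider listed under the main contributions is described only at a high level: it uses a learnable memory bank to turn dual-track audio into mixed speaking-listening motion features. One plausible way to read that is as attention over the bank, sketched below in plain Python. This is an illustrative assumption, not the authors' implementation: the function names, the dimensions, the concatenation of the two audio tracks into a single query, and the random "learned" parameters are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_memory_bank(audio_query, keys, values):
    """Attend over a memory bank (hypothetical reading of the motion guider).

    Weights come from scaled query-key similarity; the output is the
    weighted sum of the stored motion vectors.
    """
    scale = math.sqrt(len(audio_query))
    weights = softmax([dot(audio_query, k) / scale for k in keys])
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return mixed, weights


# Hypothetical sizes, chosen only for illustration.
AUDIO_DIM, MOTION_DIM, BANK_SIZE = 8, 4, 6
rng = random.Random(0)

# In a trained model these would be learned parameters; here they are random.
keys = [[rng.gauss(0, 1) for _ in range(2 * AUDIO_DIM)] for _ in range(BANK_SIZE)]
values = [[rng.gauss(0, 1) for _ in range(MOTION_DIM)] for _ in range(BANK_SIZE)]

# One frame of dual-track audio features: the agent's track and the partner's track.
agent_audio = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
partner_audio = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
audio_query = agent_audio + partner_audio  # assumption: tracks are concatenated

mixed_motion, weights = query_memory_bank(audio_query, keys, values)
print(len(mixed_motion), [round(w, 3) for w in weights])
```

Because the weights are a softmax, the output is a convex blend of the stored motion vectors. That is one way a single feature could sit between "speaking" and "listening" without an explicit role label, with the blend determined by whichever track dominates the audio query.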