Abstract: Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows multiple rounds of conversation to flow smoothly and naturally. In pursuit of this goal, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that focus only on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method. Project Page: https://grisoon.github.io/INFP/

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses how to generate interactive head animations in dyadic conversations. Existing audio-driven head generation methods mainly focus on one-sided communication, either speaking or listening, and ignore the two-way interactive nature of human conversation. These methods usually require manual role assignment and explicit role switching, which results in unnatural transitions and inconsistent interaction states.

To solve these problems, the paper proposes a new framework, **INFP (Interactive, Natural, Flash and Person-generic)**, for audio-driven interactive head generation in dyadic conversations. Unlike existing methods, INFP does not need preset roles (Speaker or Listener); it switches dynamically and naturally between speaking and listening states according to the conversation audio. This makes the generated head animations more realistic and natural, and lets them adapt to the various communication states that occur in multi-round conversations.

### Main contributions

1. **Natural role switching**: For the first time, natural transitions between an individual's speaking and listening states are generated automatically in multi-round conversations, without manual role assignment or explicit role switching.
2. **Innovative motion feature extractor**: An interactive motion guider with a learnable memory bank is designed. It adaptively extracts interaction information from dual-track audio and constructs mixed speaking-listening motion features.
3. **Large-scale high-quality dataset DyConv**: A large-scale dataset of rich dyadic conversations is introduced to promote research on head generation in dyadic interaction scenarios.
4. **Real-time and general-purpose face generation**: Extensive experiments and visualizations show that the method runs in real time and generalizes to arbitrary individuals.

### Method overview

The INFP framework is divided into two stages:

1. **Motion-Based Head Imitation**: Learns from a large number of real-life conversation videos, compresses diverse conversational behaviors into a low-dimensional motion latent space, and uses these latent codes to animate static portrait images.
2. **Audio-Guided Motion Generation**: Maps the dual-track conversation audio to the motion latent space pre-trained in the first stage through denoising, which yields audio-driven interactive head generation.

### Dataset

To support this research, the authors introduce a large-scale dataset named DyConv, which contains more than 200 hours of multi-round dyadic conversation videos covering a wide range of emotions and expressions. Compared with existing datasets, DyConv is significantly improved in both scale and quality.

### Experimental results

In quantitative and qualitative evaluations, INFP outperforms existing methods on multiple metrics, especially audio-lip synchronization, identity preservation, and motion diversity. User studies further confirm its advantage in naturalness, motion diversity, and audio-visual alignment.

In conclusion, INFP provides a new paradigm in which the generated head animations are not only more realistic but also adapt naturally to the different communication states of a dyadic conversation.
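To make the "interactive motion guider with a learnable memory bank" from the main contributions more concrete, the following is a minimal sketch of one way a memory bank can be queried with dual-track audio features, using plain scaled dot-product attention. It is not the authors' implementation: the function names (`query_memory_bank`, `make_bank`), the feature dimensions, the choice to concatenate the two audio tracks into one query, and the random values standing in for learned parameters are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_memory_bank(query, keys, values):
    """Attend over a memory bank with a single query vector.

    Returns the attention-weighted mix of the value slots and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return mixed, weights


def make_bank(n_slots, dim, rng):
    """Random placeholder for parameters that would be learned in training."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_slots)]


# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, MOTION_DIM, N_SLOTS = 4, 3, 8
rng = random.Random(0)

# Keys live in the concatenated dual-track audio space, values in motion space.
keys = make_bank(N_SLOTS, 2 * AUDIO_DIM, rng)
values = make_bank(N_SLOTS, MOTION_DIM, rng)

# Stand-ins for per-frame audio features of the agent and the partner.
agent_audio = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
partner_audio = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]

# Assumption: both tracks are concatenated into one query vector.
query = agent_audio + partner_audio
motion_feature, weights = query_memory_bank(query, keys, values)

print("attention weights:", [round(w, 3) for w in weights])
print("mixed motion feature:", [round(m, 3) for m in motion_feature])
```

Because the output is a convex combination of the memory slots, the same bank can in principle blend speaking-like and listening-like motion entries depending on which audio track dominates, which is the intuition behind mixed speaking-listening motion features. In the second stage described above, a feature of this kind would presumably condition the denoising model, but that wiring is not specified in this summary.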

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
