MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Yunfei Liu,Lijian Lin,Fei Yu,Changyin Zhou,Yu Li

DOI: https://doi.org/10.48550/arXiv.2307.10008

2023-07-19

Abstract: Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, lacking correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages, i.e., 1) Mapping-Once network with Dual Attentions (MODA) generates talking representation from given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities. 2) Facial composer network generates dense and detailed face landmarks, and 3) temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

The paper addresses audio-driven portrait animation: synthesizing high-fidelity, multimodal portrait videos that are synchronized with a given audio signal. Previous methods capture different motion patterns either by training separate models or by sampling signals from existing videos. Because they do not learn the correlation between lip synchronization and other movements (such as head pose and eye blinking), their results often look unnatural.

To close this gap, the paper proposes a unified system for multi-person, diverse, and high-fidelity talking portrait generation. The method comprises three stages:

1. **Mapping-Once Network with Dual Attentions (MODA)**: Generates a talking representation from the given audio. Its dual-attention module encodes accurate mouth movements and diverse modalities.
2. **Facial Composer Network**: Generates dense and detailed face landmarks.
3. **Temporal-Guided Renderer**: Synthesizes stable videos.

Extensive evaluations show that the proposed system produces more natural and realistic video portraits than previous methods.
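The three-stage data flow described above can be illustrated with a minimal sketch. Everything in it is a hypothetical placeholder rather than the authors' implementation: the function names, the `TalkingRepresentation` fields, the number of landmarks, and the sliding window in the renderer are assumptions chosen only to show how one audio-to-motion mapping could feed a landmark stage and then a temporally aware rendering stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# NOTE: all names, shapes, and constants here are hypothetical placeholders.
# None of them are taken from the paper or its code.


@dataclass
class TalkingRepresentation:
    lip: List[List[float]]        # per-frame lip-related motion
    head_pose: List[List[float]]  # per-frame head pose
    eye: List[List[float]]        # per-frame eye/blink signal


def moda_stage(audio_features: List[List[float]]) -> TalkingRepresentation:
    """Stage 1 stand-in: produce all motion signals from audio in one pass."""
    n = len(audio_features)
    lip = [[sum(f) / len(f)] for f in audio_features]  # toy lip signal
    head_pose = [[0.0, 0.0, 0.0] for _ in range(n)]    # toy constant pose
    eye = [[1.0] for _ in range(n)]                    # toy "eyes open"
    return TalkingRepresentation(lip, head_pose, eye)


def facial_composer_stage(
    rep: TalkingRepresentation, num_landmarks: int = 4
) -> List[List[Tuple[float, float]]]:
    """Stage 2 stand-in: expand the representation into per-frame landmarks."""
    return [
        [(float(i), lip[0]) for i in range(num_landmarks)]
        for lip in rep.lip
    ]


def renderer_stage(
    landmarks: List[List[Tuple[float, float]]], window: int = 3
) -> List[Dict[str, int]]:
    """Stage 3 stand-in: each frame sees a short window of past landmarks."""
    frames = []
    for t in range(len(landmarks)):
        start = max(0, t - window + 1)
        context = landmarks[start:t + 1]
        frames.append({"index": t, "context_len": len(context)})
    return frames


def animate(audio_features: List[List[float]]) -> List[Dict[str, int]]:
    """Chain the three stages: audio -> representation -> landmarks -> frames."""
    rep = moda_stage(audio_features)
    landmarks = facial_composer_stage(rep)
    return renderer_stage(landmarks)


if __name__ == "__main__":
    toy_audio = [[0.1, 0.3], [0.2, 0.4], [0.0, 0.0], [1.0, 1.0]]
    for frame in animate(toy_audio):
        print(frame)
```

The point of the sketch is the structure: lip, head-pose, and eye signals come out of a single mapping step rather than separate models, and the renderer consumes neighbouring frames rather than one frame in isolation.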

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions

LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

GMTalker: Gaussian Mixture-based Audio-Driven Emotional talking video Portraits

Photorealistic Audio-driven Video Portraits

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Audio-Driven Emotional Video Portraits

LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Audio-Driven Emotional 3D Talking-Head Generation