MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li
DOI: https://doi.org/10.48550/arXiv.2307.10008
2023-07-19
Abstract: Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, lacking correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages, i.e., 1) Mapping-Once network with Dual Attentions (MODA) generates talking representation from given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities. 2) Facial composer network generates dense and detailed face landmarks, and 3) temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

The paper addresses audio-driven portrait animation: synthesizing high-fidelity, multimodal portrait videos that are synchronized with a given audio signal. Previous methods have tried to capture different motion patterns and generate high-fidelity portrait videos by training separate models or by sampling signals from existing videos. However, these methods do not learn the correlation between lip synchronization and other movements (such as head pose and eye blinking), which usually leads to unnatural results.

To this end, the paper proposes a unified system, built around the Mapping-Once network with Dual Attentions (MODA), for multi-person, diverse, and high-fidelity talking portrait generation. The method comprises three stages:

1. **Mapping-Once network with Dual Attentions (MODA)**: Generates a talking representation from the given audio. Its dual-attention module is designed to encode accurate mouth movements and diverse modalities.
2. **Facial Composer Network**: Generates dense and detailed facial landmarks.
3. **Temporal-Guided Renderer**: Synthesizes stable videos.

Extensive evaluations show that the proposed system produces more natural and realistic video portraits than previous methods.
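The three stages above can be read as a simple data flow, sketched below in Python. This is only an illustration of how the stages hand data to one another, not the authors' implementation: the names `TalkingRepresentation`, `PortraitPipeline`, and the `toy_*` functions are hypothetical, the fields of the representation are guessed from the movements the abstract mentions (lips, head pose, eye blinking), and each stage is replaced by a trivial stand-in rather than a neural network.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class TalkingRepresentation:
    # Hypothetical container; the abstract only says "talking representation".
    lip: Vector
    head_pose: Vector
    eye_blink: Vector


@dataclass
class PortraitPipeline:
    # Stage 1: a single mapping from audio to all motion types at once.
    audio_to_motion: Callable[[Vector], TalkingRepresentation]
    # Stage 2: turn the representation into landmarks (here: a list of 2D points).
    compose_face: Callable[[TalkingRepresentation], List[Vector]]
    # Stage 3: render from the whole landmark sequence, not frame by frame.
    render: Callable[[List[List[Vector]]], List[str]]

    def run(self, audio_frames: List[Vector]) -> List[str]:
        motions = [self.audio_to_motion(a) for a in audio_frames]
        landmarks = [self.compose_face(m) for m in motions]
        return self.render(landmarks)


# Toy stand-ins (assumptions for illustration only).
def toy_audio_to_motion(audio: Vector) -> TalkingRepresentation:
    energy = sum(abs(x) for x in audio) / max(len(audio), 1)
    return TalkingRepresentation(
        lip=[energy], head_pose=[0.0, 0.0, 0.0], eye_blink=[0.0]
    )


def toy_compose_face(m: TalkingRepresentation) -> List[Vector]:
    # Three placeholder 2D points, one per motion type.
    return [m.lip + [0.0], m.head_pose[:2], m.eye_blink + [0.0]]


def toy_render(landmark_seq: List[List[Vector]]) -> List[str]:
    return [
        f"frame {i}: {len(points)} landmarks"
        for i, points in enumerate(landmark_seq)
    ]


pipeline = PortraitPipeline(toy_audio_to_motion, toy_compose_face, toy_render)
frames = pipeline.run([[0.1, -0.2], [0.3, 0.4]])
```

In this sketch, stage 1 returns lip, head-pose, and blink signals from one call, which mirrors the "mapping once" idea of producing all motions jointly rather than with separate models. Stage 3 receives the full sequence so that a real renderer could use temporal context.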