Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, Siyu Zhu
2024-06-16
Abstract: The field of portrait image animation, driven by speech audio input, has experienced significant advancements in the generation of realistic and dynamic portraits. This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations within the framework of diffusion-based methodologies. Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion. Our proposed network architecture seamlessly integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities. Through a comprehensive evaluation that incorporates both qualitative and quantitative analyses, our approach demonstrates obvious enhancements in image and video quality, lip synchronization precision, and motion diversity. Further visualization and access to the source code can be found at: https://fudan-generative-vision.github.io/hallo
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two key issues:

1. **Synchronization and Coordination of Facial Movements Driven by Audio**: How to precisely synchronize facial movements (lip movements, facial expressions, and head poses) with the input audio signal while keeping these movements coordinated with one another.
2. **Creation of High-Quality and Visually Appealing Animations**: How to generate animations that are both high-fidelity and temporally coherent, so that they are visually pleasing and also accurately reflect the speaker's characteristics.

To tackle these challenges, the research team proposes **Hallo**, a hierarchical audio-driven visual synthesis method. The approach builds on end-to-end diffusion models and introduces a new module that improves the alignment accuracy between audio inputs and visual outputs (especially lip, expression, and pose movements). The method can also adaptively control the diversity of expressions and poses, which allows more personalized results.

Specifically, the study employs the following techniques:

- **End-to-End Diffusion Model**: Generates high-quality dynamic portrait videos directly from image and audio segments without the need for intermediate representations or complex preprocessing steps.
- **Hierarchical Audio-Driven Visual Synthesis Module**: Establishes correspondences between audio and visual features through a cross-attention mechanism, followed by adaptively weighted fusion of these cross-attention results.
- **Network Architecture Integration**: Combines diffusion-based generative models, UNet denoisers, temporal alignment techniques for sequence coherence, and reference networks for guiding visual generation.

Through a series of quantitative and qualitative analyses, the study demonstrates that the proposed method improves image and video quality, lip synchronization accuracy, and motion diversity.
Furthermore, the study emphasizes that its code and sample data will be made available to the open-source community to facilitate further research in the field.
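
To make the hierarchical audio-driven visual synthesis module described above more concrete, here is a minimal NumPy sketch of its two stages: cross-attention from visual tokens to audio features, followed by a weighted fusion of the attended features over lip, expression, and pose regions. This is an illustration under stated assumptions, not the authors' implementation: the function names, the random projection matrices, the disjoint region masks, and the hand-set scalar weights are all hypothetical. This summary does not say how the method actually computes its adaptive weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, audio, w_q, w_k, w_v):
    """Visual tokens (N, d_vis) attend to audio features (T, d_aud)."""
    q = visual @ w_q                      # queries from visual tokens, (N, d)
    k = audio @ w_k                       # keys from audio features, (T, d)
    v = audio @ w_v                       # values from audio features, (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # audio-conditioned features, (N, d)


def hierarchical_fusion(attended, masks, weights):
    """Weighted sum of region-masked attention outputs.

    masks:   region name -> (N,) array in [0, 1] (hypothetical region masks)
    weights: region name -> scalar (hand-set here; assumed learned or
             user-controlled in a real system)
    """
    fused = np.zeros_like(attended)
    for name, mask in masks.items():
        fused += weights[name] * mask[:, None] * attended
    return fused


# Toy dimensions and random inputs, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, n_audio, d_vis, d_aud, d = 16, 5, 8, 12, 8
visual = rng.standard_normal((n_tokens, d_vis))
audio = rng.standard_normal((n_audio, d_aud))
w_q = rng.standard_normal((d_vis, d))
w_k = rng.standard_normal((d_aud, d))
w_v = rng.standard_normal((d_aud, d))

attended = cross_attention(visual, audio, w_q, w_k, w_v)

# Hypothetical disjoint masks assigning each visual token to one region.
masks = {name: np.zeros(n_tokens) for name in ("lip", "expression", "pose")}
masks["lip"][:4] = 1.0
masks["expression"][4:10] = 1.0
masks["pose"][10:] = 1.0

weights = {"lip": 1.0, "expression": 0.5, "pose": 0.25}
fused = hierarchical_fusion(attended, masks, weights)
print(fused.shape)
```

In this toy setup, lowering the `expression` or `pose` weight damps how strongly audio drives those regions while leaving the lip region unchanged, which is one plausible reading of the "adaptive control over expression and pose diversity" mentioned in the abstract.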