Abstract: Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lip-sync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. In the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage takes the 3D facial representation and the generated motion sequences as inputs and renders high-quality animations. Because the facial representation is decoupled and motion generation is identity-independent, JoyVASA extends beyond human portraits and can also animate animal faces. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code will be available at: https://jdhalgo.github.io/JoyVASA
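
The data flow described in the abstract can be illustrated with a minimal Python sketch. This is not the JoyVASA implementation: `StaticFace`, `generate_motion`, `render`, and the `window` parameter are hypothetical names introduced here, the diffusion transformer and the generator are replaced by toy stand-ins, and processing the audio window by window is an assumption made only to show that output length follows the audio length. The point it shows is that motion is produced from audio alone, so the same motion sequence can be paired with any static representation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StaticFace:
    """Hypothetical placeholder for a static 3D facial representation."""
    identity: str


# Hypothetical placeholder for one frame of expression and head-motion parameters.
Motion = List[float]


def generate_motion(audio_features: Sequence[float], window: int = 4) -> List[Motion]:
    """Toy stand-in for the stage-two motion generator.

    It receives only audio features, never an identity, and emits one
    motion vector per audio frame. Windowed processing is an assumption
    of this sketch, not a detail taken from the abstract.
    """
    motions: List[Motion] = []
    for start in range(0, len(audio_features), window):
        chunk = audio_features[start:start + window]
        # A real system would run a learned model conditioned on this chunk.
        motions.extend([a, -a] for a in chunk)
    return motions


def render(face: StaticFace, motions: Sequence[Motion]) -> List[str]:
    """Toy stand-in for the stage-one generator: static face + motion -> frames."""
    return [f"{face.identity}:frame{i}" for i, _ in enumerate(motions)]


# The same audio-derived motion drives two different static representations.
audio = [0.1 * i for i in range(10)]
motions = generate_motion(audio)
human_frames = render(StaticFace("human"), motions)
animal_frames = render(StaticFace("animal"), motions)
```

In this sketch the motion sequence is computed once and reused for both faces, which mirrors the identity-independent design the abstract describes; everything about the models themselves is left out.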

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Photorealistic Audio-driven Video Portraits

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Audio-driven Talking Face Video Generation with Natural Head Pose

GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

Animating Portrait Line Drawings from a Single Face Photo and a Speech Signal

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement

PIRenderer: Controllable Portrait Image Generation Via Semantic Neural Rendering

Deep video portraits