Abstract: Latent diffusion models have made great strides in generating expressive portrait videos with accurate lip-sync and natural motion from a single reference image and audio input. However, these models are far from real-time, often requiring many sampling steps that take minutes to generate even one second of video, significantly limiting practical use. We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars. Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster. To accomplish this, we propose a novel avatar discriminator design that guides lip-audio consistency and motion expressiveness to enhance video quality in limited sampling steps. Additionally, we employ a second-stage training architecture using an editing fine-tuned method (EFT), transforming video generation into an editing task during training to effectively address the temporal gap challenge in single-step generation. Experiments demonstrate that OSA-LCM outperforms existing open-source portrait video generation models while operating more efficiently with a single sampling step.

What problem does this paper attempt to address?

This paper addresses the gap between quality and speed in diffusion-based portrait video generation. Existing diffusion methods produce expressive portrait videos with accurate lip synchronization and natural motion, but they are far from real-time: they usually need many sampling steps and can take several minutes to generate one second of video, which greatly limits practical use. The paper proposes OSA-LCM (One-Step Avatar Latent Consistency Model), which aims to generate high-quality portrait video with a single sampling step. Compared with existing methods, OSA-LCM is more than 10 times faster while maintaining comparable video quality.

### Main problems:

1. **Slow generation speed**: Existing diffusion models need many sampling steps to generate a portrait video, so generation takes too long.
2. **Poor real-time performance**: Because generation is slow, these models cannot meet the requirements of real-time applications.
3. **Video quality degradation**: Reducing the number of sampling steps to gain speed significantly degrades video quality. Single-step generation in particular is prone to blurring and artifacts.

### Solutions:

The paper proposes the following solutions:

1. **Introducing the OSA-LCM model**: A new avatar discriminator guides lip-audio consistency and motion expressiveness, enhancing video quality within a limited number of sampling steps.
2. **Two-stage training architecture**:
   - **First stage**: Train with the Adversarial Latent Consistency Model (Adv-LCM). Combining the consistency loss with the adversarial loss lets the model generate high-quality portrait videos within two sampling steps.
   - **Second stage**: Apply the editing fine-tuned method (EFT), which recasts video generation as an editing task during training to address the temporal gap challenge in single-step generation, so that OSA-LCM can generate high-quality videos with only one sampling step.

### Experimental results:

Experiments show that with single-step diffusion sampling, OSA-LCM generates a one-second video in roughly one second while achieving quantitative and qualitative results similar to those of multi-step generation methods. This gives OSA-LCM a significant advantage in practical applications, especially where real-time performance matters.

### Formula representation:

Some of the key formulas in the paper are:

- The forward process of the diffusion model:
  \[ x_t = q(x_0, \epsilon, t) = \alpha_t x_0 + \beta_t \epsilon, \quad \epsilon \sim N(0, I) \]
- The parameterized form of the consistency model:
  \[ f_\theta(x_t, t) = c_{\text{skip}}(t)x_t + c_{\text{out}}(t)F_\theta(x_t, t) \]
- The adversarial loss function:
  \[ L_{\text{adv}}(\phi, \hat{x}_{\Delta t}; \varphi) = \text{ReLU}(1 - D_\varphi(\epsilon_\phi(\hat{x}_{\Delta t}))) \]

Through these improvements, OSA-LCM addresses the speed and quality limitations of existing methods and offers a new approach to real-time portrait video generation.
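
To make the three formulas above concrete, here is a minimal pure-Python sketch on a toy vector. It is an illustration under stated assumptions, not the paper's implementation: the noise schedule (`alpha_t = cos(pi*t/2)`, `beta_t = sin(pi*t/2)`), the `c_skip`/`c_out` coefficients, the `SIGMA_DATA` constant, and the `toy_backbone` stand-in for \(F_\theta\) are all hypothetical choices. The coefficients are picked only so that the usual consistency-model boundary condition \(f_\theta(x_0, 0) = x_0\) holds.

```python
import math
import random

SIGMA_DATA = 0.5  # hypothetical data scale, not taken from the paper


def forward_diffuse(x0, t, rng):
    """Forward process: x_t = alpha_t * x0 + beta_t * eps, eps ~ N(0, I).

    Assumed schedule for t in [0, 1]: alpha_t = cos(pi*t/2), beta_t = sin(pi*t/2).
    """
    alpha_t = math.cos(0.5 * math.pi * t)
    beta_t = math.sin(0.5 * math.pi * t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [alpha_t * xi + beta_t * ei for xi, ei in zip(x0, eps)]
    return x_t, eps


def c_skip(t):
    """Assumed skip coefficient; equals 1 at t = 0."""
    return SIGMA_DATA ** 2 / (t ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Assumed output coefficient; equals 0 at t = 0."""
    return t / math.sqrt(t ** 2 + SIGMA_DATA ** 2)


def consistency_fn(backbone, x_t, t):
    """f(x_t, t) = c_skip(t) * x_t + c_out(t) * F(x_t, t)."""
    f_out = backbone(x_t, t)
    return [c_skip(t) * xi + c_out(t) * fi for xi, fi in zip(x_t, f_out)]


def hinge_adv_loss(disc_score):
    """Hinge-style adversarial term: ReLU(1 - D(.)) for one discriminator score."""
    return max(0.0, 1.0 - disc_score)


def toy_backbone(x_t, t):
    """Hypothetical stand-in for the network F_theta; it just negates its input."""
    return [-xi for xi in x_t]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [1.0, -2.0, 0.5]
    x_t, _ = forward_diffuse(x0, 0.8, rng)
    one_step = consistency_fn(toy_backbone, x_t, 0.8)  # single "sampling step"
    print("noised sample:", x_t)
    print("one-step output:", one_step)
    print("loss for a low discriminator score:", hinge_adv_loss(-0.5))
```

With these coefficient choices, the output at `t = 0` is exactly the input regardless of what the backbone returns, which is the property that lets a consistency model map a noisy latent to a clean one in a single evaluation. The hinge term is zero once the discriminator score reaches 1 and grows linearly below that.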

Real-time One-Step Diffusion-based Expressive Portrait Videos Generation

LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

OSV: One Step is Enough for High-Quality Image to Video Generation

StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion Models

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion