Real-time One-Step Diffusion-based Expressive Portrait Videos Generation

Hanzhong Guo, Hongwei Yi, Daquan Zhou, Alexander William Bergman, Michael Lingelbach, Yizhou Yu
2024-12-18
Abstract: Latent diffusion models have made great strides in generating expressive portrait videos with accurate lip-sync and natural motion from a single reference image and audio input. However, these models are far from real-time, often requiring many sampling steps that take minutes to generate even one second of video, significantly limiting practical use. We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars. Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster. To accomplish this, we propose a novel avatar discriminator design that guides lip-audio consistency and motion expressiveness to enhance video quality in limited sampling steps. Additionally, we employ a second-stage training architecture using an editing fine-tuned method (EFT), transforming video generation into an editing task during training to effectively address the temporal gap challenge in single-step generation. Experiments demonstrate that OSA-LCM outperforms existing open-source portrait video generation models while operating more efficiently with a single sampling step.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the gap between quality and speed in diffusion-based portrait video generation. Existing diffusion methods achieve high-quality lip synchronization and natural motion when generating expressive portrait videos, but they are far from real-time: they usually require many sampling steps and may take several minutes to generate one second of video, which greatly limits practical use. The paper proposes OSA-LCM (One-Step Avatar Latent Consistency Model), which aims at real-time, high-quality portrait video generation with only one sampling step. Compared with existing methods, OSA-LCM is more than 10 times faster while maintaining comparable video quality.

### Main problems:

1. **Slow generation speed**: Existing diffusion models need many sampling steps to generate portrait videos, so generation takes too long.
2. **Poor real-time performance**: Because of the long generation time, these models cannot meet the requirements of real-time applications.
3. **Video quality degradation**: Reducing the number of sampling steps to gain speed degrades video quality significantly; in single-step generation in particular, blurring and artifacts are likely.

### Solutions:

To overcome these problems, the paper proposes the following:

1. **Introducing the OSA-LCM model**: A new avatar discriminator guides lip-audio consistency and motion expressiveness, enhancing video quality within a limited number of sampling steps.
2. **Two-stage training architecture**:
   - **First stage**: Train with the Adversarial Latent Consistency Model (Adv-LCM). By combining the consistency loss with the adversarial loss, the model can generate high-quality portrait videos within two sampling steps.
   - **Second stage**: Adopt the Editing Fine-Tuning method (EFT), which turns the video generation task into an editing task during training. This addresses the temporal gap challenge in single-step generation, so that OSA-LCM can generate high-quality videos with only one sampling step.

### Experimental results:

Experiments show that with single-step diffusion sampling, OSA-LCM can generate a one-second video in nearly one second while remaining comparable to multi-step generation methods in both quantitative and qualitative evaluations. This gives OSA-LCM a significant advantage in practical applications, especially in scenarios that demand real-time performance.

### Formula representation:

Some of the key formulas in the paper are as follows:

- The forward process of the diffusion model:
\[ x_t = q(x_0, \epsilon, t) = \alpha_t x_0 + \beta_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
- The parameterized form of the consistency model:
\[ f_\theta(x_t, t) = c_{\text{skip}}(t)\, x_t + c_{\text{out}}(t)\, F_\theta(x_t, t) \]
- The adversarial loss function:
\[ L_{\text{adv}}(\phi, \hat{x}_{\Delta t}; \varphi) = \text{ReLU}(1 - D_\varphi(\epsilon_\phi(\hat{x}_{\Delta t}))) \]

Through these improvements, OSA-LCM addresses the speed and quality problems of existing methods and offers a new approach to real-time portrait video generation.
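To make the three formulas above concrete, here is a minimal sketch in plain Python on small vectors. It is not the paper's implementation. The `c_skip` and `c_out` schedules, the `SIGMA_DATA` constant, and `toy_network` are assumptions chosen for illustration; the summary does not specify them. The schedules are picked only so that the boundary condition holds, meaning the consistency function returns its input unchanged at t = 0.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data scale; illustrative only, not from the paper


def forward_diffuse(x0, alpha_t, beta_t, rng):
    """Sample x_t = alpha_t * x0 + beta_t * eps, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [alpha_t * x + beta_t * e for x, e in zip(x0, eps)]
    return x_t, eps


def c_skip(t):
    # Assumed schedule, chosen so that c_skip(0) == 1.
    return SIGMA_DATA ** 2 / (t ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    # Assumed schedule, chosen so that c_out(0) == 0.
    return SIGMA_DATA * t / math.sqrt(t ** 2 + SIGMA_DATA ** 2)


def consistency_fn(network, x_t, t):
    """f_theta(x_t, t) = c_skip(t) * x_t + c_out(t) * F_theta(x_t, t)."""
    raw = network(x_t, t)
    return [c_skip(t) * x + c_out(t) * r for x, r in zip(x_t, raw)]


def hinge_adv_loss(disc_score):
    """ReLU(1 - D(.)) for a single scalar discriminator score."""
    return max(0.0, 1.0 - disc_score)


def toy_network(x_t, t):
    # Hypothetical stand-in for F_theta; a real model would be a neural network.
    return [0.0 for _ in x_t]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [1.0, -2.0, 0.5]
    x_t, _ = forward_diffuse(x0, alpha_t=0.8, beta_t=0.6, rng=rng)
    print(consistency_fn(toy_network, x_t, t=0.0))  # equals x_t at t = 0
    print(hinge_adv_loss(0.3))
```

The hinge form gives zero loss once the discriminator score reaches 1, so the generator is pushed only on samples the discriminator does not yet rate as clearly real.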