StyleLipSync: Style-based Personalized Lip-sync Video Generation

Taekyung Ki, Dongchan Min
2024-02-12
Abstract: In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing video from arbitrary audio. To generate a video of arbitrary identities, we leverage expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design a video consistency with a linear transformation. In contrast to the previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve the naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing the person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even with the zero-shot setting and enhance characteristics of an unseen face using a few seconds of target video through the proposed adaptation method.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### The Problem This Paper Attempts to Solve

This paper aims to generate high-fidelity, naturally lip-synced videos. Specifically, it targets four problems:

1. **Natural Lip Sync**: Existing lip-sync methods produce unnatural jaw movements and visual artifacts when the head pose changes dynamically. This paper proposes pose-aware masking to address this issue.
2. **High-Resolution Video Generation**: Many existing methods generate videos at low resolutions (e.g., 96×96), resulting in poor image quality. The proposed method directly generates high-fidelity lip-synced videos at a resolution of 256×256.
3. **Temporal Consistency**: Existing methods that encode each frame independently produce incoherent mouth movements in the final video. This paper introduces a Moving-average based Latent Smoothing (MaLS) module to enhance temporal consistency.
4. **Personalized Adaptation**: For unseen faces, existing models may fail to preserve identity features well. This paper proposes a few-shot adaptation method that enhances person-specific information by fine-tuning the decoder while applying sync regularization on the training data audio.

With these improvements, the paper aims to generate high-quality lip-synced videos in a zero-shot setting and to further enhance the person-specific features of an unseen face using a small amount of target video.
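
The temporal-consistency idea in item 3 can be illustrated with a small sketch. The summary above only names the Moving-average based Latent Smoothing (MaLS) module and does not specify how it is computed, so everything in the code below is an assumption made for illustration: the function name `smooth_latents`, the centered window, the equal weights, and the handling of the first and last frames are hypothetical and may differ from the paper's module. The sketch shows only the general technique of averaging per-frame latent vectors over neighboring frames so that consecutive frames change more gradually.

```python
from typing import List


def smooth_latents(latents: List[List[float]], window: int = 3) -> List[List[float]]:
    """Centered moving average over per-frame latent vectors.

    Illustrative only: window shape and weights are assumptions,
    not the paper's MaLS definition.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    half = window // 2
    num_frames = len(latents)
    smoothed = []
    for t in range(num_frames):
        # Clip the window at the sequence boundaries, so edge frames
        # are averaged over fewer neighbors.
        lo = max(0, t - half)
        hi = min(num_frames, t + half + 1)
        neighbors = latents[lo:hi]
        dim = len(latents[t])
        # Average each latent dimension across the frames in the window.
        smoothed.append(
            [sum(vec[d] for vec in neighbors) / len(neighbors) for d in range(dim)]
        )
    return smoothed


if __name__ == "__main__":
    # Three frames with one-dimensional latents, for readability.
    frames = [[0.0], [3.0], [6.0]]
    print(smooth_latents(frames, window=3))
```

A larger window smooths more aggressively but can blur fast mouth movements, so the window size trades temporal coherence against lip-sync sharpness in a scheme like this one.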