SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo
2024-12-05
Abstract: Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the performance of existing talking face video generation models when applied to singing. We observe that the differences between singing and talking audio manifest in frequency and amplitude. To address this, we design a multi-scale spectral module that helps the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that helps the model learn the human behaviors associated with singing audio. These two modules are integrated into a diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.
Computer Vision and Pattern Recognition, Machine Learning, Sound
What problem does this paper attempt to address?
The main problem this paper addresses is the poor performance of current audio-driven face generation techniques when generating singing videos. Specifically, existing talking face video generation models are limited on singing tasks because singing and ordinary speech differ significantly in audio characteristics and behavioral expressions. As a result, existing models produce unsatisfactory singing videos and fail to capture the complex patterns and rich expression changes unique to singing. The paper therefore proposes a new multi-scale spectral diffusion model, SINGER, which aims to improve the quality of generated singing videos through specially designed modules, making them more vivid and realistic.

### Problems Solved in the Paper

1. **Differences between Singing Audio and Dialogue Audio**:
   - Singing audio is more complex in frequency and amplitude than dialogue audio, which makes existing talking face generation models perform poorly on singing tasks.
   - The paper captures and processes these complex spectral features by introducing the Multi-scale Spectral Module (MSM) and the Self-adaptive Filter Module (SFM).
2. **Lack of High-Quality Singing-Video Datasets**:
   - The lack of high-quality singing-video datasets has seriously hindered the development of singing video generation techniques.
   - The paper collects a high-quality in-the-wild singing-video dataset named SingingHead Videos (SHV), which contains more than 200 subjects and has a total duration of about 20 hours, providing a valuable resource for research.
3. **Diversity and Synchronization in Generating Singing Videos**:
   - Existing methods often produce singing videos that lack diversity and are poorly synchronized with the audio.
   - SINGER improves both the diversity of the generated videos and their synchronization with the audio through the multi-scale spectral module and the self-adaptive filter module, so the generated videos are more vivid and natural.

### Main Contributions

1. **Multi-scale Spectral Module (MSM)**:
   - It uses a wavelet transform to decompose singing audio into multiple sub-bands, each representing a different frequency level.
   - By assigning adjustable weights to these sub-bands, it highlights key frequency patterns, leading to more realistic singing videos.
2. **Self-adaptive Filter Module (SFM)**:
   - It dynamically identifies and enhances the behavioral patterns extracted from the audio, keeping the generated video consistent with the input audio.
   - Through self-adaptive filtering, it improves the naturalness and coherence of the generated video.
3. **High-Quality Singing-Video Dataset (SHV)**:
   - The authors collected and organized a high-quality in-the-wild singing-video dataset, filling a gap in the field and providing an important resource for research on singing video generation.

### Experimental Results

The paper evaluates SINGER with multiple metrics (such as FVD, CPBD, PSNR, SSIM, LMD, LSE-D, LSE-C, diversity, and BAS) and compares it with multiple baseline methods. The experimental results show that SINGER outperforms the baselines in the quality of the generated singing videos, lip synchronization, head movement, and diversity.

In conclusion, by proposing the SINGER model and collecting a high-quality singing-video dataset, this paper addresses the deficiencies of existing techniques in generating singing videos and lays a foundation for further development in this field.
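
The sub-band idea described for the Multi-scale Spectral Module (wavelet decomposition followed by per-band weights) can be illustrated with a minimal sketch. This is not the paper's implementation: the Haar wavelet, the fixed hand-picked weights (the summary describes them only as adjustable), and the function names `haar_decompose`, `haar_reconstruct`, and `reweight_subbands` are all assumptions chosen to keep the example dependency-free.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_decompose(signal, levels):
    """Split a 1-D signal into sub-bands with a Haar wavelet transform.

    Returns [coarsest approximation, coarsest detail, ..., finest detail].
    (Hypothetical helper; the wavelet family is an assumption.)
    """
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) * _S for a, b in pairs])  # high-frequency part
        approx = [(a + b) * _S for a, b in pairs]         # low-frequency part
    return [approx] + details[::-1]


def haar_reconstruct(bands):
    """Invert haar_decompose, rebuilding the signal from its sub-bands."""
    approx = list(bands[0])
    for detail in bands[1:]:
        merged = []
        for a, d in zip(approx, detail):
            merged.append((a + d) * _S)
            merged.append((a - d) * _S)
        approx = merged
    return approx


def reweight_subbands(signal, weights):
    """Scale each sub-band by its weight, then reconstruct the signal.

    `weights` has one entry per sub-band (len(weights) - 1 levels). In a
    trained model these would be learned; here they are plain floats.
    """
    bands = haar_decompose(signal, len(weights) - 1)
    scaled = [[w * c for c in band] for w, band in zip(weights, bands)]
    return haar_reconstruct(scaled)
```

For example, calling `reweight_subbands(samples, [1.0, 1.0, 0.5, 2.0])` on a signal whose length is a multiple of 8 would keep the two coarsest bands, attenuate the middle band, and emphasize the finest detail band. Setting every weight to 1.0 returns the input unchanged, which is a quick sanity check that the decomposition is invertible.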