Audio–video syncing with lip movements using generative deep neural networks

Amal Mathew, Aaryl Saldanha, C. Narendra Babu
DOI: https://doi.org/10.1007/s11042-024-18695-x
IF: 2.577
2024-03-12
Multimedia Tools and Applications
Abstract: As the metaverse unfolds, real-time synchronization of audio with video becomes critical. Many models, such as Wav2Lip, SyncNet, and LipGAN, have been developed to sync audio and video and render high-impact content. The choice of loss function has a direct impact on the results and accuracy of audio–video syncing, and models like Wav2Lip, enhanced by the Huber Loss function, are emerging as frontrunners. This paper presents a comprehensive comparative analysis, demonstrating that Huber Loss outperforms L1, L2, and SmoothL1 losses in both convergence efficiency and synchronization quality. The empirical results support integrating Huber Loss into the Wav2Lip model, highlighting its capacity to yield a more coherent and natural integration of lip movements with audio. Experimental results show that Huber Loss achieves an average training loss of 0.00091 and an evaluation loss of 0.00141 over 61,500 steps, alongside a markedly lower sync loss of 2.20669. These results represent a substantial enhancement in synchronization accuracy, with improvements ranging from 20 to 30% over contemporary loss functions.
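The four losses compared in the abstract can be sketched in plain Python to show how they differ on small and large residuals. This is only an illustrative sketch, not the authors' training code: the definitions follow the commonly used conventions (the same piecewise forms as PyTorch's HuberLoss and SmoothL1Loss), and the threshold values `delta` and `beta`, as well as the sample `pred` and `target` lists, are assumptions chosen for illustration. The abstract does not state which threshold the paper used.

```python
def l1_loss(pred, target):
    """Mean absolute error: linear in the residual everywhere."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error: quadratic, so large residuals dominate."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear (slope delta) beyond.

    delta is an assumed threshold, not a value taken from the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(pred)


def smooth_l1_loss(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic (scaled by 1/beta) for |r| < beta, else linear.

    beta is an assumed threshold, not a value taken from the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        if r < beta:
            total += 0.5 * r * r / beta
        else:
            total += r - 0.5 * beta
    return total / len(pred)


# Hypothetical data: three small residuals and one outlier.
pred = [0.1, 0.4, 0.9, 5.0]
target = [0.0, 0.5, 1.0, 0.0]

for name, fn in [("L1", l1_loss), ("L2", l2_loss),
                 ("Huber", huber_loss), ("SmoothL1", smooth_l1_loss)]:
    print(f"{name:9s} {fn(pred, target):.5f}")
```

Under these definitions, Huber loss with threshold `delta` equals `delta` times Smooth L1 loss with `beta = delta`, so the two coincide at a threshold of 1 and differ only by a scale factor otherwise. Any practical difference between them therefore depends on the threshold chosen and how the loss is weighted against other terms.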
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?