TellMeTalk: Multimodal-driven talking face video generation
Pengfei Li, Huihuang Zhao, Qingyun Liu, Peng Tang, Lin Zhang
DOI: https://doi.org/10.1016/j.compeleceng.2023.109049
IF: 4.152
2024-01-22
Computers & Electrical Engineering
Abstract: In this paper, we present TellMeTalk, an innovative approach for generating expressive talking face videos from multimodal inputs. Our approach is robust across various identities, languages, expressions, and head movements. It overcomes four key limitations of existing talking face video generation methods: (1) reliance on single-modal learning from audio or text, which lacks the complementary nature of multimodal inputs; (2) use of traditional convolutional neural network generators, which leads to restricted capture of spatial features; (3) the absence of natural head movements and expressions; and (4) artifacts, prominent boundaries caused by image overlapping, and unclear mouth regions. To address these challenges, we propose a face motion network to imbue character images with facial expressions and head movements. We also take text and reference audio as input to generate personalized audio. Furthermore, we introduce a generator equipped with a cross-attention module and Fast Fourier Convolutional blocks to model spatial dependencies. Finally, a face restoration module is designed to reduce artifacts and prominent boundaries. Extensive experiments demonstrate that our method produces high-quality expressive talking face videos. Compared to state-of-the-art approaches, our method exhibits superior performance in terms of video quality and precise synchronization of lip movements. The source code is available at https://github.com/lifemo/TellMeTalk.
engineering, electrical & electronic,computer science, interdisciplinary applications, hardware & architecture
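Note on the cross-attention module named in the abstract: the abstract does not describe its internals, so the following is only a minimal sketch of generic scaled dot-product cross-attention, not the authors' implementation (see the linked repository for that). The function names `softmax` and `cross_attention`, the interpretation of queries as image features and keys/values as audio features, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of d-dimensional vectors from one modality
             (assumed here: image features).
    keys, values: equal-length lists of vectors from another modality
             (assumed here: audio features). Keys are d-dimensional.
    Learned query/key/value projections are omitted for brevity.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the values.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(value_dim)]
        outputs.append(out)
    return outputs
```

Each output vector is a weighted average of the value vectors, so a query that matches no key in particular receives an even blend of the values, while a query strongly aligned with one key receives essentially that key's value.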