EditHuman: Fine-Grained Text-Driven Human Video Editing

Kaiduo Zhang, Muyi Sun, Junxing Hu, Kunbo Zhang, Zhenan Sun
DOI: https://doi.org/10.1109/ijcb62174.2024.10744427
2024-01-01
Abstract: Video editing has recently made significant advances. Human characters, one of the core elements in video editing, have attracted considerable research attention. However, when editing characters with strong structural information, previous methods generally produce blurring and distortion in the limbs. In this paper, we present EditHuman, a model for fine-grained text-driven human video editing that achieves continuous pose movements and high-quality limb expression. Because body structures are complex and motion must remain continuous, more precise designs are needed to obtain practical performance. Specifically, we propose a Cascaded UNet (CAU) to realize a coarse-to-fine denoising process and refined noise estimation. Meanwhile, we introduce two Heatmap-Centric Attention Modules, Key-Element Attention (KEA) and Key-Temporal Attention (KTA), to enhance the quality of human limb expression and inter-frame continuity. Moreover, we use the estimated heatmap to guide the noise prediction, which further refines video quality. Extensive experiments show that EditHuman achieves state-of-the-art (SOTA) performance.
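
The abstract does not specify how the Heatmap-Centric Attention Modules use the heatmap. As an illustration only, the sketch below shows one generic way a per-token heatmap could steer attention: the log of each key's heatmap score is added to the scaled dot-product logits, so tokens near body keypoints receive more weight. The function name `heatmap_biased_attention`, the log-bias formulation, and the `eps` parameter are assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def heatmap_biased_attention(queries, keys, values, heatmap, eps=1e-6):
    """Scaled dot-product attention with a heatmap bias on the logits.

    queries, keys: lists of float vectors sharing one dimension d.
    values: list of float vectors, one per key.
    heatmap: one score in [0, 1] per key token (hypothetical keypoint
             confidence; how EditHuman derives or applies it is not
             described in the abstract).
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    # Assumption: bias each key by log(heatmap), so low-score tokens are
    # down-weighted multiplicatively after the softmax.
    bias = [math.log(h + eps) for h in heatmap]
    outputs, weights = [], []
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + b
            for k, b in zip(keys, bias)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values))
             for c in range(len(values[0]))]
        )
    return outputs, weights
```

With two identical keys and heatmap scores of 0.9 and 0.1, the attention weights come out at roughly 0.9 and 0.1, so the heatmap alone decides which token dominates. Other designs, such as multiplying the attention map after the softmax or gating the values, would be equally plausible readings of "heatmap-centric".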