DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

Haotian Hu, Xinjiao Zhou, Bin Jiang, Chao Yang, Xiaofei Huo
DOI: https://doi.org/10.1109/ICME55011.2023.00106
2023-07-01
Abstract: Inspired by CLIP’s excellent image/text representation capability and StyleGAN’s disentangled latent space, text-guided image editing techniques have made significant progress. However, because CLIP cannot perform local fine-grained image/text alignment, existing methods suffer from entanglement problems. Moreover, they lack deep interaction between textual tokens and visual features, which may lead to unfaithful editing results. In this paper, we propose DF-CLIP for Disentangled and Fine-grained text-guided image editing. Specifically, we design a novel dual-branch LatentMask module to generate more accurate editing directions in StyleGAN’s latent space, which avoids changes in text-unrelated areas. Furthermore, we present a Multi-modal Interaction module that associates the text embedding with the image embedding and performs a deep interaction between them, which greatly enhances the guidance of text in the image editing process and accelerates training convergence. Extensive experiments show that our models produce more disentangled and natural editing results with a shorter training time.
Computer Science